  1. Nov 2025
    1. Author Response

      Reviewer #1 (Public Review):

      In Figure 1A, the authors should show TEM images of control mock-treated samples to show the difference between infected and healthy tissue. The data shown in Figure 1B-E indicate that the overexpression of GFP-P in N. benthamiana leads to formation of liquid-like granules. Does this occur during virus infection? Since the authors have infectious clones, can they be used to show that the virally encoded P protein in infected cells does indeed exist as liquid-like granules? If the fusion of GFP to P protein affects its function, the authors could fuse just the spGFP11 and co-infiltrate with p35S-spGFP1-10. These experiments will show that the P protein, when delivered from the virus, does indeed form liquid-like granules in plant cells. The authors should include controls in Figure 1H to show that the interaction between P protein and ER is specific.

      We agree with the reviewer and appreciate the helpful suggestion. As suggested, we added TEM images of control mock-treated barley leaves. We also performed immunoelectron microscopy to show the presence of BYSMV P protein in the viroplasms. Please see Figure 1–Figure supplement 1.

      BYSMV is a negative-stranded RNA virus, and is strictly dependent on insect vector transmission for infecting barley plants. We have tried to fuse GFP to BYSMV P in the full-length infectious clones. Unfortunately, we could not rescue BYSMV-GFP-P into barley plants through insect transmission.

      In Figure 1H, we used a PM-localized membrane protein, LRR84A, as a negative control to show that LRR84A-GS and BYSMV P could not form granules, although they might associate at molecular distances. Therefore, the P granules were formed and tethered to the ER tubules. Please see Figure 1–Figure supplement 4.

      Data shown in Figure 2 do demonstrate that the purified P protein could undergo phase separation. Furthermore, it can recruit viral N protein and part of viral genomic RNA to P protein induced granules in vitro.

      Because the full-length BYSMV RNA has 12,706 nt and is difficult to transcribe in vitro, we cannot show whether the BYSMV genome is recruited into the droplets. We have softened the claim and now state that the P-N droplets can recruit the 5′ trailer of the BYSMV genome, as shown in Figure 3B. Please see lines 22, 177 and 190.

      Based on the data shown in Figure 4 using phospho-null and phospho-mimetic mutants of P protein, the authors conclude that phosphorylation inhibits P protein phase separation. It is unclear based on the experiments, why endogenous NbCK1 fails to phosphorylate GFP-P-WT and inhibit formation of liquid-like granules similar to that of GFP-P-S5D mutant? Is this due to overexpression of GFP-P-WT? To overcome this, the authors should perform these experiments as suggested above using infectious clones and these P protein mutants.

      As is well known, phosphorylation and dephosphorylation are reversible processes in eukaryotic cells. Accordingly, as shown in Figures 5B and 6B, the GFP-PWT protein has two bands, corresponding to P74 and P72, which represent the hyperphosphorylated and hypophosphorylated forms, respectively. Only overexpression of NbCK1 induced a high ratio of P74 to P72 in vivo, and thereby abolished phase separation of BYSMV P.

      In Figure 5, the authors overexpress NbCK1 in N. benthamiana or use an in vitro co-purification scheme to show that NbCK1 inhibits phase separation properties of P protein. These results show that overexpression of both GFP-P and NbCK1 proteins is required to induce liquid-like granules. Does this occur during normal virus infection? During normal virus infection, P protein is produced in the plant cells and the endogenous NbCK1 will regulate the phosphorylation state of P protein. These are reasons for authors to perform some of the experiments using infectious clones. Furthermore, the authors have antibodies to P protein and this could be used to show the level of P protein that is produced during the normal infection process.

      We detected that the P protein exists in two phosphorylation forms in BYSMV-infected barley leaves, and that λPPase treatment decreased the P44 phosphorylation form. Therefore, these results indicate that endogenous CK1 cannot phosphorylate BYSMV P completely.

      Based on the data shown in Figure 6, the authors conclude that phase separated P protein state promotes replication but inhibits transcription by overexpressing P-S5A and P-S5D mutants. To directly show that the NbCK1 controlled phosphorylation state of P regulates this process, authors should knockdown/knockout NbCK1 and see if it increases P protein condensates and promote recruitment of viral proteins and genomic RNA to increase viral replication.

      In our previous studies, BLAST searches showed that the N. benthamiana and barley genomes encode 14 CK1 orthologs, most of which can phosphorylate the SR region of BYSMV P. Therefore, it is difficult to make knockdown/knockout lines of all the CK1 orthologs. Accordingly, we generated a mutant of HvCK1.2 carrying two point mutations (K38R and D128N), in which the kinase activity was abolished. Overexpression of HvCK1.2DN inhibited endogenous CK1-mediated phosphorylation of BYSMV P, indicating that HvCK1.2DN is a dominant-negative mutant.

      It is important to note that both replication and transcription are required for efficient infection of negative-stranded RNA viruses. Consistent with this, our previous studies revealed that both PS5A and PS5D are required for BYSMV infection. Accordingly, expression of HvCK1.2DN in the BYSMV vector inhibits virus infection by impairing the balance of endogenous CK1-mediated phosphorylation of BYSMV P.

      Reviewer #2 (Public Review):

      The manuscript by Fang et al. details the ability of the P protein from Barley yellow striate mosaic virus (BYSMV) to form phase-separated droplets both in vitro and in vivo. The authors demonstrate P droplet formation using recombinant proteins and confocal microscopy, FRAP to demonstrate fluidity, and observed droplet fusion. The authors also used an elaborate split-GFP system to demonstrate that P droplets associate with the tubular ER network. Next, the authors demonstrate that the N protein and a short fragment of viral RNA can also partition into P droplets. Since Rhabdovirus P proteins have been shown to phase separate and form "virus factories" (see https://doi.org/10.1038/s41467-017-00102-9), the novelty from this work is the rigorous and conclusive demonstration that the P droplets only exist in the unphosphorylated form. The authors identify 5 critical serine residues in IDR2 of P protein that when hyper-phosphorylated cannot form droplets. Next, the authors conclusively demonstrate that the host kinase CK1 is responsible for P phosphorylation using both transient assays in N. benthamiana and a co-expression assay in E. coli. These findings will likely lead to future studies identifying cellular kinases that affect phase separation of viral and cellular proteins and increase our understanding of regulation of condensate formation. Next, the authors investigated whether P droplets regulated virus replication and transcription using a minireplicon system. The minireplicon system needs to be better described as the results were seemingly conflicting. The authors also used a full-length GFP-reporter virus to test whether phase separation was critical for virus fitness in both barley and the insect vector. The authors used 1,6-hexanediol, which broadly suppresses liquid-liquid phase separation, and concluded that phase separation is required for virus fitness (based on reduced virus accumulation with 1,6-HD). However, this conclusion is flawed since 1,6-hexanediol is known to cause cell toxicity and likely created a less favorable environment for virus replication, independent of P protein phase separation. These and other issues are detailed below:

      1. In Figure 3B, the authors display three types of P-N droplets including uniform, N hollow, and P-N hollow droplets. The authors do not state the proportion of droplets observed or any potential significance of the three types. Finally, as "hollow" droplets are not typically observed, is there a possibility that a contaminating protein (not fluorescent) from E. coli is a resident client protein in these droplets? The protein purity was not >95% based on the SDS-PAGE gels presented in the supplementary figures. Do these abnormalities arise from the droplets being imaged in different focal planes? Unless some explanation is given for these observations, this reviewer does not see any significance in the findings pertaining to "hollow" droplets.

      Thanks for your constructive suggestions. We removed the "hollow" droplets as suggested. We think that the hollow droplets might be an intermediate form of LLPS. Please see pages 7 and 8 of the revised manuscript.

      1. Pertaining to the sorting of "genomic" RNA into the P-N droplets, it is unlikely that RNA sorting is specific for BYSMV RNA. In other words, if you incubate a non-viral RNA with P-N droplets, is it sorted? The authors conclusion that genomic RNA is incorporated into droplets is misleading in a sense that a very small fragment of RNA was used. Cy5 can be incorporated into full-length genomic RNAs during in vitro transcription and would be a more suitable approach for the conclusions reached.

      Thanks for your constructive suggestions. Unfortunately, we could not obtain in vitro transcripts of the full-length genomic RNA (12,706 nucleotides). We have softened the claim and now state that the P-N droplets can recruit the 5′ trailer of the BYSMV genome, as shown in Figure 3B. Please see lines 22, 177 and 190.

      According to previous studies (Ivanov et al., 2011), the rhabdovirus P protein can bind to nascent N molecules, forming a soluble N/P complex that prevents N from encapsidating cellular RNAs. Therefore, we suppose that the P-N droplets can incorporate viral genomic RNA specifically.

      Reference: Ivanov I, Yabukarski F, Ruigrok RW, Jamin M. 2011. Structural insights into the rhabdovirus transcription/replication complex. Virus Research 162:126–137. DOI: https://doi.org/10.1016/j.virusres.2011.09.025

      1. In Figure 4C, it is unclear how the "views" were selected for granule counting. The methods should be better described as this reviewer would find it difficult to select fields of view in an unbiased manner. This is especially true as expression via agroinfiltration can vary between cells in agroinfiltrated regions. The methods described for granule counting and granule sizes are not suitable for publication. These should be expanded (i.e. what ImageJ tools were used?).

      We agree with the reviewer that it is important to select fields of view in an unbiased manner. We selected representative views and provided larger views in the new supplementary figures. In addition, we added detailed methods in the revision. Please see Figure 4–Figure supplement 1, Figure 5–Figure supplement 1, and the Methods (lines 489-498).

      1. In Figure 4F, the authors state that they expected P-S5A to only be present in the pellet fraction since it existed in the condensed state. However, WT P also forms condensates and was not found in the pellet, but rather exclusively in the supernatant. Therefore, the assumption of condensed droplets only being found in the pellet appears to be incorrect.

      Many thanks for pointing this out. This method is based on a previous study (Hubstenberger et al., 2017). The centrifugation method might precipitate large granules more efficiently than small granules. As shown in Figure 4B, GFP-PS5A formed large granules; therefore, GFP-PS5A was mainly present in the pellet. In contrast, GFP-PWT existed only as small granules and in the fusion state; thus, most of the GFP-PWT protein was in the supernatant, and only a little was in the pellet. These results also indicate the increased phase separation activity of GFP-PS5A compared with GFP-PWT. Please see the new Figure 4F.

      Reference: Hubstenberger A, Courel M, Benard M, Souquere S, Ernoult-Lange M, Chouaib R, Yi Z, Morlot JB, Munier A, Fradet M, et al. 2017. P-Body Purification Reveals the Condensation of Repressed mRNA Regulons. Molecular Cell 68(1): 144-157 e145.

      1. The authors conclude that P-S5A has enhanced phase separation based on confocal microscopy data (Fig S6A). The data presented is not convincing. Microscopy alone is difficult for comparing phase separation between two proteins. Quantitative data should be collected in the form of turbidity assays (a common assay for phase separation). If P-S5A has enhanced phase separation compared to WT, then S5A should have increased turbidity (OD600) under identical phase separation conditions. The microscopy data presented was not quantified in any way and the authors could have picked fields of view in a biased manner.

      Thanks for your constructive suggestions. As suggested, turbidity assays were performed to show both GFP-PWT and GFP-PS5A had increased turbidity (OD600) compared with GFP. Please see Figure 4–Figure supplement 3.

      1. The authors constructed minireplicons to determine whether mutant P proteins influence RNA replication using trans N and L proteins. However, this reviewer finds the minireplicon design confusing. How is DsRFP translated from the replicon? If a frameshift mutation was introduced into RsGFP, wouldn't this block DsRFP translation as well? Or is start/stop transcription used? Second, the use of the 2x35S promoter makes it difficult to differentiate between 35S-driven transcription and replication by L. How do you know the increased DsRFP observed with P5A is not due to increased transcription from the 35S promoter? The RT-qPCR data is also very confusing. It is not clear that panel D is only examining the transcription of RFP (I assume via start/stop transcription) whereas panel C is targeting the minireplicon.

      Thank you for your questions; we are sorry for the lack of clarity regarding the minireplicon vectors. We updated Figure supplement 14 to show replication and transcription of the BYSMV minireplicon, a derivative of a negative-stranded RNA virus. In addition, we inserted an A after the start codon to abolish the translation of GFP mRNA, which allows us to observe phase separation of GFP-PWT, GFP-PS5A, and GFP-PS5D during virus replication. Using this system, we wanted to show the localization and phase separation of GFP-PWT, GFP-PS5A, and GFP-PS5D during replication and transcription of BYS-agMR. Please see Figure 6–Figure supplement 1.

      1. Pertaining to the replication assay in Fig. 6, transcription of RFP mRNA was reduced by S5A and increased by S5D. However, the RFP translation (via Panel A microscopy) is reversed. How do you explain increased RFP mRNA transcription by S5D but very low RFP fluorescence? The data between Panels A, C, and D do not support one another.

      Many thanks for pointing this out! We also noticed these interesting results, which have been repeated independently. As shown in the illustration of the BYSMV-agMR system in Figure 6–Figure supplement 1, the relative transcriptional activities of the different GFP-P mutants were calculated from the normalized RFP transcript levels relative to the gMR replication template (RFP mRNA/gMR), because replicating minigenomes are the templates for viral transcription.

      Since GFP-PS5D supported decreased replication, the ratio of RFP mRNA/gMR increased even though the RFP mRNA level of GFP-PS5D did not increase. In addition, the number of foci in GFP-PS5D samples is much lower than in GFP-PWT and GFP-PS5A samples, indicating that mRNAs in GFP-PS5D samples may contain aberrant transcripts that cannot be translated into RFP protein. In contrast, mRNAs in GFP-PS5A samples are translated efficiently. These results are consistent with our previous studies using free PWT, PS5A, and PS5D.

      Reference: Gao Q, et al. 2020. Casein kinase 1 regulates cytorhabdovirus replication and transcription by phosphorylating a phosphoprotein serine-rich motif. The Plant Cell 32(9): 2878-2897.
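
      The normalization described in this response (RFP mRNA divided by the gMR template level) can be illustrated with a minimal Python sketch. The function name and all numbers are hypothetical and are not values from the study; they only show how the ratio can rise when replication falls more steeply than mRNA abundance.

```python
def relative_transcription(rfp_mrna, gmr, rfp_mrna_wt, gmr_wt):
    """Return the (RFP mRNA / gMR) ratio of a mutant, normalized to the wild-type ratio."""
    return (rfp_mrna / gmr) / (rfp_mrna_wt / gmr_wt)


# Hypothetical RT-qPCR levels in arbitrary units (NOT data from the study).
wt = {"rfp_mrna": 1.0, "gmr": 1.0}
s5d = {"rfp_mrna": 0.8, "gmr": 0.4}  # less mRNA than WT, but far less template

ratio_s5d = relative_transcription(
    s5d["rfp_mrna"], s5d["gmr"], wt["rfp_mrna"], wt["gmr"]
)
print(f"S5D transcription per template, relative to WT: {ratio_s5d:.2f}")
```

      In this made-up case the absolute RFP mRNA level is lower than wild type, yet the per-template ratio is higher, which is the pattern the response describes for GFP-PS5D.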

      1. The authors relied on 1,6-hexanediol to suppress phase separation in both insect vectors and barley. However, the authors disregarded several publications demonstrating cellular toxicity by 1,6-hexanediol and a report that 1,6-HD impairs kinase and phosphatase activities (see below). doi: 10.1016/j.jbc.2021.100260,

      We agree with the reviewer that 1,6-hexanediol induces cellular toxicity. Therefore, we removed these results, which does not affect the main conclusions of our study.

      1. The authors state that reduced accumulation of BYSMV-GFP in insects and barley under HEX treatment "indicate that phase separation is important for cross-kingdom infection of BYSMV in insect vectors and host plants." The above statement is confounded by many factors, the most obvious being that HEX treatment is most likely toxic to cells and as a result cannot support efficient virus accumulation. Also, since HEX treatment interferes with phosphorylation (see REF above) its use here should be avoided since P phase separation is regulated by phosphorylation.

      We agree with the reviewer that 1,6-hexanediol induces cellular toxicity and thereby affects infections by BYSMV and other viruses. In addition, 1,6-hexanediol would inhibit LLPS of cellular membraneless organelles, such as P-bodies, stress granules, Cajal bodies, and the nucleolus, which also affect different virus infections directly or indirectly. Therefore, we removed these results, which does not affect the main conclusions of our study.

      Reviewer #3 (Public Review):

      Membrane-less organelles formed through liquid-liquid phase separation (LLPS) provide spatiotemporal control of host immunity responses and other cellular processes. Viruses are obligate pathogens that proliferate in host cells, so their RNAs and proteins are likely to be targeted by immune-related membrane-less organelles. To successfully infect and proliferate in host cells, viruses need to efficiently suppress the immune function of those immune-related membrane-less organelles. Moreover, viruses also generate exogenous membrane-less organelles/RNA granules to facilitate their proliferation. Accordingly, host cells also need to target and suppress the functions of exogenous membrane-less organelles/RNA granules generated by viruses, the underlying mechanisms of which are still mysterious.

      In this study, Fang et al. investigated how a plant kinase confers resistance against viruses by modulating the phosphorylation and phase separation of BYSMV P protein. They first characterized the phase separation feature of BYSMV P protein. They also discovered that droplets formed by P protein recruit viral RNA and other viral proteins in vivo. The phase separation activity of P protein is inhibited by phosphorylation of its intrinsically disordered region. Combined with their previous study, this study demonstrated that host casein kinase (CK1) decreases the phase separation of P protein by increasing the phosphorylation of P protein. Finally, the authors claimed that the phase separation of P protein facilitates BYSMV replication but decreases its transcription. Taken together, this study uncovered the molecular mechanism by which plants regulate viral proliferation by decreasing the formation of exogenous RNA granules/membraneless organelles. Overall, this paper tells an interesting story about host immunity targeting viruses by modulating the dynamics of exogenous membraneless organelles, and uncovers the modulation of viral protein phase separation by a host protein, which is a hotspot in plant immunity, and the writing is logical.

      Thanks for your positive comment on our studies.

    1. Author Response:

      Reviewer #1 (Public Review):

      Here the authors use a variety of sophisticated approaches to assess the contribution of synaptic parameters to dendritic integration across neuronal maturation. They provide high-quality data identifying cellular parameters that underlie differences in AMPAR-mediated synaptic currents measured between adolescent and adult cerebellar stellate cells, and conclude that differences are attributed to an increase in the complexity of the dendritic arbor. This conclusion relies primarily on the ability of a previously described model for adult stellate cells to recapitulate the age-dependent changes in EPSCs by a change in dendritic branching with no change in synapse density. These rigorous results have implications for understanding how changing structure during neuronal development affects integration of AMPAR-mediated synaptic responses.

      The data showing that younger SCs have smaller dendritic arbors but similar synapse density is well-documented and provides compelling evidence that these structural changes affect dendritic integration. But the main conclusion also relies on the assumption that the biophysical model built for adult SCs applies to adolescent SCs, and there are additional relevant variables related to synaptic function that have not been fully assessed. Thus, the main conclusions would be strengthened and broadened by additional experimental validation.

      We thank the reviewer for the positive assessment of the quality and importance of our manuscript. Below we address the reviewer’s comments directly but would like to stress that the goal of the manuscript was to understand the cellular mechanisms underlying developmental slowing of mEPSCs in SCs and the consequent implication for developmental changes in dendritic integration, which have rarely been examined to date, and not to establish a detailed biophysical model of cerebellar SCs. The latter would require dual-electrode recordings (one on 0.5 μm dendrites), a detailed description of the expression and dendritic localization of the gap junction protein connexin 36 (as done in Szoboszlay et al., Neuron 2016), and a detailed description of parameter variability across the SC population (e.g. variations in AMPAR content at synapses, Rm, and dendritic morphology). Such experiments are well beyond the scope of the manuscript. Here we use biophysical simulations to support conclusions derived from specific experiments, more as a proof of principle than as a strict quantitative prediction.

      Nevertheless, we would like to clarify our selection of parameters for the biophysical models for immature and adult SCs. We did not simply “assume” that the biophysical models were the same at the two developmental stages. We either used evidence from the literature or our own measured parameters to establish an immature SC model. As compared to adult SCs, we found that immature SCs had 1) an identical membrane time constant, 2) an only slightly larger dendrite diameter, 3) decreased dendritic branching and maximum lengths, 4) a comparable synapse density, and 5) a homogeneous synapse distribution. Taken together, we concluded that increased dendritic branching during SC maturation resulted in a larger fraction of synapses at longer electrotonic distances in adult SCs. These experimental findings were incorporated into two distinct biophysical models representing immature and adult SCs. Evidence from the literature suggests that voltage-gated channel expression is not altered between the two developmental stages studied here. Therefore, as in the adult SC model, we considered only the passive membrane properties and the dendritic morphology. The simulation results supported our conclusion that the increased apparent dendritic filtering of mEPSCs resulted from a change in the distribution of synapse distance to the soma rather than cable properties. Some of the measured parameters (e.g., membrane time constant) were not clearly stated in the manuscript, which we have corrected in the revised manuscript.

      We are not sure what the reviewer meant by suggesting that we did not examine “other relevant variables related to synaptic function.” Later, the reviewer refers to alterations in AMPAR subunit composition or changes in cleft glutamate concentration (low-affinity AMPAR antagonist experiments). We performed experiments to directly examine both possible contributions by comparing qEPSC kinetics and performing low-affinity antagonist experiments, respectively, but we found that neither mechanism could account for the developmental slowing of mEPSCs. We, therefore, did not further explore possible developmental changes in AMPAR subunits. See below for a more specific response and above for newly added text.

      While many exciting questions could be examined in the future, we do not think the present study requires additional experiments. Nevertheless, we recognize that perhaps we can improve the description of the results to justify our conclusions better (see specifics below).

      Reviewer #2 (Public Review):

      This manuscript investigates the cellular mechanisms underlying the maturation of synaptic integration in molecular layer interneurons in the cerebellar cortex. The authors use an impressive combination of techniques to address this question: patch-clamp recordings, 2-photon and electron microscopy, and compartmental modelling. The study builds conceptually and technically on previous work by these authors (Abrahamsson et al. 2012) and extends the principles described in that paper to investigate how developmental changes in dendritic morphology, synapse distribution and strength combine to determine the impact of synaptic inputs at the soma.

      1) Models are constructed to confirm the interpretation of experimental results, mostly repeating the simulations from Abrahamsson et al. (2012) using 3D reconstructed morphologies. The results are as expected from cable theory, given the (passive) model assumptions. While this confirmation is welcome and important, it is disappointing to see the opportunity missed to explore the implications of the experimental findings in greater detail. For instance, with the observed distributions of synapses, are there more segregated subunits available for computation in adult vs immature neurons?

      As described in our response to reviewer 1, this manuscript intends to identify the cellular mechanisms accounting for the developmental slowing of mEPSCs and its implication for dendritic integration. The modeling was designed to support the most plausible explanation that increased branching resulted in more synapses at longer electrotonic distances. This finding is novel and merits more in-depth examination at a computational level in future studies.

      Quantifying dendritic segregation is non-trivial due to dendritic nonlinearities and the difficulties in setting criteria for electrical “isolation” of inputs. However, because the space constant does not change with development, while both dendrite length and branching increase, it is rather logical to conclude qualitatively that the number of computational segments increases with development.

      We have added the following sentence to the Discussion (line 579):

      “Moreover, since the space constant does not change significantly with development and the dendritic tree complexity increases, the number of computational segments is expected to increase with development.”
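
      As a rough illustration of why a small change in dendritic diameter has little effect on the space constant, the following Python sketch uses the textbook steady-state cable expression, lambda = sqrt((Rm / Ri) * (d / 4)). The Rm and Ri values are hypothetical placeholders, not parameters taken from the manuscript; the two diameters are the mean reconstructed values quoted in this response. Because Rm and Ri are held equal for both ages (an assumption), they cancel in the ratio.

```python
import math


def space_constant_um(diameter_um, rm_ohm_cm2, ri_ohm_cm):
    """Steady-state space constant of a passive cylindrical cable, in micrometers."""
    d_cm = diameter_um * 1e-4  # micrometers to centimeters
    lam_cm = math.sqrt((rm_ohm_cm2 / ri_ohm_cm) * (d_cm / 4.0))
    return lam_cm * 1e4  # centimeters back to micrometers


RM = 20000.0  # ohm * cm^2, hypothetical specific membrane resistance
RI = 150.0    # ohm * cm, hypothetical axial resistivity

lam_immature = space_constant_um(0.42, RM, RI)
lam_adult = space_constant_um(0.36, RM, RI)
ratio = lam_adult / lam_immature  # depends only on sqrt(0.36 / 0.42)

print(f"lambda adult / lambda immature = {ratio:.3f}")
```

      Under these assumptions the space constant scales with the square root of the diameter, so the reduction in mean diameter changes it by well under 10%, whatever values are chosen for Rm and Ri.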

      How do SCs respond at different developmental stages with in vivo-like patterns of input, rather than isolated activation of synapses? Answering these sorts of questions would provide quantitative support for the conclusion that computational properties evolve with development.

      While this is indeed a vital question, the in vivo patterns of synaptic activity are not known, so it is difficult to devise experiments to arrive at definitive conclusions.

      2) From a technical perspective, the modeling appears to be well-executed, though more methodological detail is required for it to be reproducible. The AMPA receptor model and reversal potential are unspecified, as is the procedure for fitting the kinetics to data.

      We did not use an explicit channel model to generate synaptic conductances. We simply used the default multiexponential function of Neuron (single exponential rise and single exponential decay) and adjusted the parameters tauRise and tauDecay such that simulated EPSCs matched somatic quantal EPSC amplitude, rise time and τdecay (Figure 4).

      We added the following text to the methods (line 708):

      “The peak and kinetics of the AMPAR-mediated synaptic conductance waveforms (gsyn) were set to simulate qEPSCs that matched the amplitude and kinetics of experimental somatic quantal EPSCs and evoked EPSCs. Immature quantal gsyn had a peak amplitude of 0.00175 μS, a 10-90 % RT of 0.0748 ms and a half-width of 0.36 ms (NEURON synaptic conductance parameters Tau0 = 0.073 ms, Tau1 = 0.26 ms and Gmax = 0.004 μS) while mature quantal gsyn had a peak amplitude of 0.00133 μS, a 10-90 % RT of 0.072 ms and a half-width of 0.341 ms (NEURON synaptic conductance parameters Tau0 = 0.072 ms, Tau1 = 0.24 ms and Gmax = 0.0032 μS). For all simulations, the reversal potential was set to 0 mV and the holding membrane potential was set to – 70 mV. Experimental somatic PPR for EPSCs were reproduced with a gsyn 2/ gsyn 1 of 2.25.”
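
      The relationship between the quoted Tau0, Tau1 and Gmax values and the quoted peak conductances can be checked with a short Python sketch. It assumes the waveform is an unnormalized difference of exponentials, g(t) = Gmax * (exp(-t/Tau1) - exp(-t/Tau0)), with Tau0 as the rise and Tau1 as the decay time constant. That functional form is an assumption made here, not something stated in the response, and the function name is hypothetical.

```python
import math


def gsyn_peak(gmax_us, tau_rise_ms, tau_decay_ms):
    """Peak time (ms) and peak value (uS) of Gmax * (exp(-t/tau_decay) - exp(-t/tau_rise))."""
    # Setting dg/dt = 0 gives the analytical time of the peak.
    t_peak = (
        tau_rise_ms * tau_decay_ms / (tau_decay_ms - tau_rise_ms)
    ) * math.log(tau_decay_ms / tau_rise_ms)
    g_peak = gmax_us * (
        math.exp(-t_peak / tau_decay_ms) - math.exp(-t_peak / tau_rise_ms)
    )
    return t_peak, g_peak


# Parameter values quoted in the Methods text above.
t_immature, g_immature = gsyn_peak(0.004, 0.073, 0.26)
t_mature, g_mature = gsyn_peak(0.0032, 0.072, 0.24)

print(f"immature peak: {g_immature:.5f} uS at {t_immature:.3f} ms")
print(f"mature peak:   {g_mature:.5f} uS at {t_mature:.3f} ms")
```

      Under this assumed form, the computed peaks land close to the reported 0.00175 μS and 0.00133 μS, which suggests that Gmax is a scaling factor rather than the peak conductance itself.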

      Were simulations performed at resting potential, and if yes, what was the value?

      The membrane potential was set at – 70 mV to match that of experimental recordings and has been updated in the Methods section.

      How was the quality of the morphological reconstructions assessed? Accurate measurement of dendritic diameters is crucial to the simulations in this study, so providing additional morphometrics would be helpful for assessing the results. Will the models and morphologies be deposited in ModelDB or similar?

      For the two reconstructions imported into NEURON for simulations, we manually curated the dendritic diameters to verify that the estimated diameter matched that of the fluorescence image using NeuronStudio, which uses a robust subpixel estimation algorithm (Rayburst diameter, Rodriguez et al. 2008). The reconstructions include all variations in diameter throughout the dendritic tree (see, as an example, the reconstruction result in the image below for the immature SC presented in Figure 2D). The mean diameter across the entire dendritic tree of the reconstructed immature and adult SC was 0.42 and 0.36 μm, respectively, similar to the ratio of measured diameters estimated using confocal microscopy.

      We have updated the methods section to include how reconstructions were curated and analyzed (line 693).

      “An immature (P16) and an adult SC (P42) were patch loaded with 30 μM Alexa 594 in the pipette and imaged using 2PLSM. Both cells were reconstructed in 3D using NeuronStudio in a semiautomatic mode, which uses a robust subpixel estimation algorithm (calculation of the Rayburst diameter; Rodriguez et al., 2008). We manually curated the diameters to verify that they matched the fluorescence image and faithfully accounted for all variations in diameter throughout the dendritic tree. The measured diameter across the entire dendritic tree of the reconstructed immature and adult SCs was 0.42 and 0.36 μm, respectively. The 16% smaller diameter in adult was similar to the 13% obtained from confocal image analysis from many SCs (see Figure 2B).”

      We agree with the reviewer that accurate measurements of dendritic diameters are crucial for the simulations. We did not rely solely on the reconstructed SCs, but also performed high-resolution confocal microscopy analysis of 16 different dye-filled SCs. We examined differences in the FWHM of intensity line profiles drawn perpendicular to the dendrite between immature and adult SCs. The FWHM is a good approximation of dendritic diameter and was measured as previously done for adult SCs (Abrahamsson et al., 2012) to allow direct assessment of possible developmental differences. We confirmed that 98% of the estimated diameters are larger than the imaging resolution (0.27 μm). We observed only a small developmental difference in the mean FWHM (0.41 vs. 0.47 μm, 13% reduction) using this approach. Because the dendritic filtering is similar for diameters ranging from 0.3 to 0.6 μm (Figure 4G and 4H, Abrahamsson et al. 2012), we concluded that developmental changes in dendritic diameter cannot account for developmental differences in mEPSC time course.

      We added the following text to the methods (line 777):

      “The imaging resolution within the molecular layer was estimated from the width of intensity line profiles of SC axons. The FWHM was 0.30 +/- 0.01 μm (n = 57 measurements over 16 axons) and a mean of 0.27 +/- 0.01 μm (n = 16) when taking into account the thinnest section for each axon. Only 2% of all dendritic measurements are less than 270 nm, suggesting that the dendritic diameter estimation is hardly affected by the resolution of our microscope”

      Regarding additional morphometrics:

      1) We added two panels (H and I) to Figure 6 showing the number of primary dendrites and branch points for immature and adult SCs using the same estimation criteria as Myoga et al., 2009. We have updated the Results section (line 389). “Thus, the larger number of puncta located further from the soma in adult SCs is not due to increased puncta density with distance, but to larger dendritic lengths (Figure 6E and 6F) and many more distal dendritic branches (Figure 6G, Sholl analysis) due to a larger number of branch points (Figure 6H), but not a larger number of primary dendrites (Figure 6I). The similarity between the shapes of synapse (Figure 6B) and dendritic segment (Figure 6C) distributions was captured by a similarity in their skewness (0.38 vs. 0.32 for both distributions in immature and -0.10 and -0.08 for adult distributions). These data demonstrate that increased dendritic complexity during SC maturation is responsible for a prominent shift toward distal synapses in adult SCs.”

      2) As suggested by the reviewer, we estimated the dendritic width as a function of branch order and observed a small reduction in the diameter of dendritic segments with distance from the soma that does not significantly alter the dendritic filtering (0.35 to 0.6 μm): there is a tendency to observe smaller diameters for more distal segments.

      3) We also show the variability in dendritic diameter within single SCs and between different SCs, which can be very large. These results have been added to Figure 2B. See also point one below in response to “comment to authors.”

      We will upload the two SC reconstructions to ModelDB.

      3) The Discussion should justify the assumption of AMPA-only synapses in the model (by citing available experimental data) as well as the limitations of this assumption in the case of different spatiotemporal patterns of parallel fiber activation.

      NMDARs are extrasynaptic in immature and adult SCs. Therefore they do not contribute to postsynaptic strength in response to low-frequency synaptic activation. We therefore do not consider their contribution to synaptic integration in this study. Please see also our detailed response to reviewer’s point 4. We have updated the Results accordingly.

      4) What is the likely influence of gap junction coupling between SCs on the results presented here, and on synaptic integration in SCs more generally - and how does it change during development? This should also be discussed.

      Please see a detailed response to Editor’s point 2. In brief, all recordings were performed without perturbing gap junction coupling between cells, which have been shown to affect axial resistance and membrane capacitance in other cell types (Szoboszlay et al., 2016). While our simulations do not explicitly include gap junctions, their effect on passive membrane properties is implicitly included because we matched the simulated membrane time constant to experimental values. Moreover, gap junctions are more prominent in cerebellar basket cells than SCs in both p18 to p21 animals (Rieubland 2015) and adult mice (Hoehne et al., 2020). Ultimately, the impact of gap junctions also depends on their distance from the activated synapses (Szoboszlay et al., 2016). Unfortunately, the distribution of gap junctions in SCs and their conductance is not known at this time. We, therefore, did not explicitly consider gap junction in this study.

      Nevertheless, we have added a section in the Discussion (line 552):

      “We cannot rule out that developmental changes in gap junction expression could contribute to the maturation of SC dendritic integration, since they are thought to contribute to the axial resistivity and capacitance of neurons (Szoboszlay et al., 2016). All the recordings were made with gap junctions intact, including for membrane time constant measurements. However, their expression in SCs is likely to be lower than their basket cell counterparts (Hoehne et al., 2020; Rieubland et al., 2014).”

      5) All experiments and all simulations in the manuscript were done in voltage clamp (the Methods section should give further details, including the series resistance). What is the significance of the key results of the manuscript on synapse distribution and branching pattern of postsynaptic dendrites in immature and adult SCs for the typical mode of synaptic integration in vivo, i.e. in current clamp? What is their significance for neuronal output, considering that SCs are spontaneously active?

      It should be noted that not all simulations were done in voltage-clamp, see figure 8.

      Nevertheless, we have given additional details about the following experimental and simulation parameters:

      1) Description of the whole-cell voltage-clamp procedure.

      2) Series resistance values of experiments and used for simulations.

      Initial simulations with the idealized SC model were performed with a Rs of 20 MOhm. In the reconstructed model, Rs was set at 16 MOhm to match more precisely the experimental values obtained for the mEPSC experiments. We verified that there was no statistically significant difference in Rs between immature and adult recordings.

      Reviewer #3 (Public Review):

      1) Although the authors were thorough in their efforts to find the mechanism underlying the differences in the young and adult SC synaptic event time course, the authors should consider the possibility of inherently different glutamate receptors, either by alterations in the subunit composition or by an additional modulatory subunit. The literature actually suggests that this might be the case, as several publications described altered AMPA receptor properties (not just density) during development in stellate cells (Bureau, Mulle 2004; Sun, Liu 2007; Liu, Cull-Candy 2002). The authors need to address these possibilities, as modulatory subunits are known to alter receptor kinetics and conductance as well.

      Properties of synaptic AMPARs in SCs are known to change during development and in an activity-dependent manner. EPSCs in immature SCs have been shown to be mediated by calcium-permeable AMPARs, predominantly containing GluR3 subunits that are associated with TARP γ2 and γ7 (Soto et al. 2007; Bats et al., 2012). During development, GluR2 subunits are inserted into synaptic AMPARs in an activity-dependent manner (Liu et al., 2000), affecting the receptors’ calcium permeability (Liu et al., 2002). However, those developmental changes do not appear to affect EPSC kinetics (Liu et al., 2002) and have very little impact on AMPAR conductance (Soto et al., 2007). When we compared qEPSC kinetics for somatic synapses between immature and adult SCs, we did not observe changes in EPSC decay. In light of this observation, and consistent with the studies cited above, we concluded that differences in AMPAR composition could not account for the developmental differences observed in mEPSC kinetics.

      We have modified the manuscript to make this point clearer (see section starting line 332):

      “This reduction in synaptic conductance could be due to a reduction in the number of synaptic AMPARs activated and/or a developmental change in AMPAR subunits. SC synaptic AMPARs are composed of GluA2 and GluA3 subunits associated with TARP γ2 and γ7 (Bats et al., 2012; Liu and Cull-Candy, 2000; Soto et al., 2007; Yamazaki et al., 2015). During development, GluR2 subunits are inserted into synaptic AMPARs in an activity-dependent manner (Liu and Cull-Candy, 2002), affecting the receptors’ calcium permeability (Liu and Cull-Candy, 2000). However, those developmental changes have little impact on AMPAR conductance (Soto et al., 2007), nor do they appear to affect EPSC kinetics (Liu and Cull-Candy, 2002); the latter is consistent with our findings. Therefore the developmental reduction in postsynaptic strength most likely results from fewer AMPARs activated by the release of glutamate from the fusion of a single vesicle.”

      The authors correctly identify the relationship between local dendritic resistance and the reduction of driving force, but they assume the same relationship for young SCs as well in their model. This assumption is not supported by recordings, and there are several publications about the disparity of input impedance for young versus adult cells (Schmidt-Hieber, Bischofberger 2007).

      The input resistance of the dendrite will indeed determine local depolarization and loss of driving force. However, its impact on dendritic integration depends on its precise value, and perhaps the reviewer thought we “assumed” the input resistance to be the same between immature and adult SCs. This was not the case, and we have since clarified this in the manuscript. We performed three important measurements that support a loss of driving force in immature SCs (for reference, the input resistance of an infinite cable is described by the equation $R_n = \sqrt{R_m R_i/2}\,/\,(2\pi r^{3/2})$, where $r$ is the dendrite radius):

      1) The input resistance decreases with increasing dendritic diameter (as $r^{-3/2}$ for an infinite cable), and we measured the diameter to be only slightly larger in immature SCs (0.47 versus 0.41 μm). This result is described in Figure 2.

      2) We measured the membrane time constant, which provides an estimate of the total membrane resistance multiplied by the total capacitance. The values between the two ages were similar, suggesting a slightly larger membrane resistance to compensate for the smaller total membrane capacitance of the immature SCs. This was explicitly accounted for when performing the simulations using reconstructed immature and adult SCs (Figure 2 and 7 and 8) by adjusting the specific membrane resistance until the simulated membrane time constant matched experimental values. These values were not clearly mentioned and are now included on line 233 in the Results and 704 in the Methods.

      3) We directly examined paired-pulse facilitation of synapses onto immature SC dendrites versus that for somatic synapses. We previously showed in adult SCs that sublinear summation of synaptic responses, due to loss of synaptic current driving force (Tran-Van-Minh et al. 2016), manifests in decreased facilitation for dendritic synapses (Abrahamsson et al. 2012). Figure 8A shows that indeed dendritic facilitation was less than observed in the soma. We have now modified Figure 8 to include the results of the simulations showing that the biophysical model could reproduce this difference in short-term plasticity (Figure 8B).

      Together, we believe these measurements support the presence of similar sublinear summation mechanisms in immature SCs.
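
      The infinite-cable relation quoted above can be turned into a short numerical sketch to show how weakly the measured diameter difference affects input resistance. The two diameters (0.47 and 0.41 μm) come from the text; the membrane time constant, specific capacitance, and axial resistivity below are placeholder assumptions chosen for illustration, not the values used in the simulations, and the function names are hypothetical. The conversion from time constant to specific membrane resistance assumes a uniform passive membrane.

```python
import math


def specific_rm_from_tau(tau_ms, cm_uf_per_cm2):
    """Specific membrane resistance (ohm*cm^2) from tau = Rm * Cm.

    Assumes a uniform passive membrane (illustrative simplification).
    """
    return (tau_ms * 1e-3) / (cm_uf_per_cm2 * 1e-6)


def infinite_cable_input_resistance(diameter_um, rm_ohm_cm2, ri_ohm_cm):
    """R_n = sqrt(Rm * Ri / 2) / (2 * pi * r**1.5), returned in ohms."""
    r_cm = (diameter_um / 2.0) * 1e-4    # radius converted from um to cm
    return math.sqrt(rm_ohm_cm2 * ri_ohm_cm / 2.0) / (2.0 * math.pi * r_cm ** 1.5)


# Placeholder passive parameters (assumptions, not taken from the manuscript).
TAU_MS = 20.0
CM_UF_PER_CM2 = 1.0
RI_OHM_CM = 150.0

rm = specific_rm_from_tau(TAU_MS, CM_UF_PER_CM2)
rn_immature = infinite_cable_input_resistance(0.47, rm, RI_OHM_CM)
rn_adult = infinite_cable_input_resistance(0.41, rm, RI_OHM_CM)

# With identical Rm and Ri the ratio depends only on the diameters.
ratio = rn_adult / rn_immature
print(f"Rn immature: {rn_immature / 1e9:.2f} GOhm")
print(f"Rn adult:    {rn_adult / 1e9:.2f} GOhm")
print(f"adult / immature: {ratio:.3f}")
```

      Because the ratio reduces to (0.47/0.41)^(3/2) when the passive parameters are shared, the comparison between ages does not depend on the placeholder values; only the absolute resistances do.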

      2) The authors use extracellular stimulation of parallel fibers. The authors note that due to the orientation of the PF, and the slicing angle, they can restrict the spatial extent of the stimuli. However, this method does not guarantee that the stimulated fibers will all connect to the same dendritic branch. Whether two stimulated synapses connect to the same dendrite or not can heavily influence summation. This is especially a great concern for these cells as the Sholl analysis showed that young and adult SC cells have different numbers of distal dendrites. Therefore, if the stimulated axons connect to several different neighboring dendrites instead of the one or two in case of young SC cells, then the model calculations and the conclusions about the summation rules may be erroneous.

      We selected isolated dendrites and delivered voltage stimuli using small diameter glass electrodes (~ 1 μm) 10 - 15 V above threshold to stimulate single dendrites. This procedure excites GC axons in brain slices made from adult mice within less than 10 μm from the tip (Figure 2C, Tran-Van-Minh et al. 2016). It produces large dendritic depolarizations that are sufficient to decrease synaptic current driving force (Figure 1, Tran-Van-Minh et al. 2016). When we reproduced the conductance ratio using uncaging of single dendrites, we observed paired-pulse facilitation in the dendrites – suggesting that electrical stimulation activated synapses on common dendritic branches, or at least within close electrotonic distance to cause large dendritic depolarizations (Figure 7, Abrahamsson et al. 2012). Finally, we expect that the decreased branching in immature SCs further ensures that a majority of recorded synapses are contacting a common dendritic segment. We cannot rule out that occasionally some synaptic responses recorded at the soma are from synapses on different dendritic branches, but we do not see how this would alter our results and change our principal conclusions, particularly since this possible error only affects the interpretation of how many synapses are activated in paired-pulse experiments. The majority of the conclusions arise from the stimulation of single vesicle release events, and given the strikingly perpendicular orientation of GC axons, a 10 μm error in synapse location along a dendrite when we stimulated in the outer third would not alter our interpretations of the data.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Teplenin and coworkers assesses the combined effects of localized depolarization and excitatory electrical stimulation in myocardial monolayers. They study the electrophysiological behaviour of cultured neonatal rat ventricular cardiomyocytes expressing the light-gated cation channel Cheriff, allowing them to induce local depolarization of varying area and amplitude, the latter titrated by the applied light intensity. In addition, they used computational modeling to screen for critical parameters determining state transitions and to dissect the underlying mechanisms. Two stable states, thus bistability, could be induced upon local depolarization and electrical stimulation, one state characterized by a constant membrane voltage and a second, spontaneously firing, thus oscillatory state. The resulting 'state' of the monolayer was dependent on the duration and frequency of electrical stimuli, as well as the size of the illuminated area and the applied light intensity, determining the degree of depolarization as well as the steepness of the local voltage gradient. In addition to the induction of oscillatory behaviour, they also tested frequency-dependent termination of induced oscillations.

      Strengths:

      The data from optogenetic experiments and computational modelling provide quantitative insights into the parameter space determining the induction of spontaneous excitation in the monolayer. The most important findings can also be reproduced using a strongly reduced computational model, suggesting that the observed phenomena might be more generally applicable.

      Weaknesses:

      While the study is thoroughly performed and provides interesting mechanistic insights into scenarios of ventricular arrhythmogenesis in the presence of localized depolarized tissue areas, the translational perspective of the study remains relatively vague. In addition, the chosen theoretical approach and the way the data are presented might make it difficult for the wider community of cardiac researchers to understand the significance of the study.

      Reviewer #2 (Public review):

      In the presented manuscript, Teplenin and colleagues use both electrical pacing and optogenetic stimulation to create a reproducible, controllable source of ectopy in cardiomyocyte monolayers. To accomplish this, they use a careful calibration of electrical pacing characteristics (i.e., frequency, number of pulses) and illumination characteristics (i.e., light intensity, surface area) to show that there exists a "sweet spot" where oscillatory excitations can emerge proximal to the optogenetically depolarized region following electrical pacing cessation, akin to pacemaker cells. Furthermore, the authors demonstrate that a high-frequency electrical wave-train can be used to terminate these oscillatory excitations. The authors observed this oscillatory phenomenon both in vitro (using neonatal rat ventricular cardiomyocyte monolayers) and in silico (using a computational action potential model of the same cell type). These are surprising findings and provide a novel approach for studying triggered activity in cardiac tissue.

      The study is extremely thorough and one of the more memorable and grounded applications of cardiac optogenetics in the past decade. One of the benefits of the authors' "two-prong" approach of experimental preps and computational models is that they could probe the number of potential variable combinations much deeper than through in vitro experiments alone. The strong similarities between the real-life and computational findings suggest that these oscillatory excitations are consistent, reproducible, and controllable.

      Triggered activity, which can lead to ventricular arrhythmias and cardiac sudden death, has been largely attributed to sub-cellular phenomena, such as early or delayed afterdepolarizations, and thus to date has largely been studied in isolated single cardiomyocytes. However, these findings have been difficult to translate to tissue and organ-scale experiments, as well-coupled cardiac tissue has notably different electrical properties. This underscores the significance of the study's methodological advances: the use of a constant depolarizing current in a subset of (illuminated) cells to reliably result in triggered activity could facilitate the more consistent evaluation of triggered activity at various scales. The result is an experimental prep that is both repeatable and controllable (i.e., both initiated and terminated through the same means).

      The authors also substantially explored phase space and single-cell analyses to document how this "hidden" bi-stable phenomenon can be uncovered during emergent collective tissue behavior. Calibration and testing of different aspects (e.g., light intensity, illuminated surface area, electrical pulse frequency, electrical pulse count) and other deeper analyses, as illustrated in Appendix 2, Figures 3-8, are significant and commendable.

      Given that the study is computational, it is surprising that the authors did not replicate their findings using well-validated adult ventricular cardiomyocyte action potential models, such as ten Tusscher 2006 or O'Hara 2011. This may have felt out of scope, given the nice alignment of rat cardiomyocyte data between in vitro and in silico experiments. However, it would have been helpful peace-of-mind validation, given the significant ionic current differences between neonatal rat and adult ventricular tissue. It is not fully clear whether the pulse trains could have resulted in the same bi-stable oscillatory behavior, given the longer APD of humans relative to rats. The observed phenomenon certainly would be frequency-dependent and would have required tedious calibration for a new cell type, albeit partially mitigated by the relative ease of in silico experiments.

      For all its strengths, there are likely significant mechanistic differences between this optogenetically tied oscillatory behavior and triggered activity observed in other studies. This is because the constant light-elicited depolarizing current is disrupting the typical resting cardiomyocyte state, thereby altering the balance between depolarizing ionic currents (such as Na+ and Ca2+) and repolarizing ionic currents (such as K+ and Ca2+). The oscillatory excitations appear to later emerge at the border of the illuminated region and non-stimulated surrounding tissue, which is likely an area of high source-sink mismatch. The authors appear to acknowledge differences in this oscillatory behavior and previous sub-cellular triggered activity research in their discussion of ectopic pacemaker activity, which is canonically expected more so from genetic or pathological conditions. Regardless, it is exciting to see new ground being broken in this difficult-to-characterize experimental space, even if the method illustrated here may not necessarily be broadly applicable.

      We thank the reviewers for their thoughtful and constructive feedback, as well as for recognizing the conceptual and technical strengths of our work. We are especially pleased that our integrated use of optogenetics, electrical pacing, and computational modelling was seen as a rigorous and innovative approach to investigating spontaneous excitability in cardiac tissue.

      At the core of our study was the decision to focus exclusively on neonatal rat ventricular cardiomyocytes. This ensured a tightly controlled and consistent environment across experimental and computational settings, allowing for direct comparison and deeper mechanistic insight. While extending our findings to adult or human cardiomyocytes would enhance translational relevance, such efforts are complicated by the distinct ionic properties and action potential dynamics of these cells, as also noted by Reviewer #2. For this foundational study, we chose to prioritize depth and clarity over breadth.

      Our computational domain was designed to faithfully reflect the experimental system. The strong agreement between both domains is encouraging and supports the robustness of our framework. Although some degree of theoretical abstraction was necessary (thereby sometimes making it a bit harder to read), it reflects the intrinsic complexity of the collective behaviours we aimed to capture such as emergent bi-stability. To make these ideas more accessible, we included simplified illustrations, a reduced model, and extensive supplementary material.

      A key insight from our work is the emergence of oscillatory behaviour through interaction of illuminated and non-illuminated regions. Rather than replicating classical sub-cellular triggered activity, this behaviour arises from systems-level dynamics shaped by the imposed depolarizing current and surrounding electrotonic environment. By tuning illumination and local pacing parameters, we could reproducibly induce and suppress these oscillations, thereby providing a controllable platform to study ectopy as a manifestation of spatial heterogeneity and collective dynamics.

      Altogether, our aim was to build a clear and versatile model system for investigating how spatial structure and pacing influence the conditions under which bistability becomes apparent in cardiac tissue. We believe this platform lays strong groundwork for future extensions into more physiologically and clinically relevant contexts.

      In revising the manuscript, we carefully addressed all points raised by the reviewers. We have also responded to each of their specific comments in detail, which are provided below.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Please find my specific comments and suggestions below:

      (1) Line 64: When first introduced, the concept of 'emergent bi-stability' may not be clear to the reader.

      We concur that the full breadth of the concept of emergent bi-stability may not be immediately clear upon first mention. Nonetheless, its components have been introduced separately: “emergent” was linked to multicellular behaviour in line 63, while “bi-stability” was described in detail in lines 39–56. We therefore believe that readers could form an intuitive understanding of the combined term, which will be further clarified as the manuscript develops. To further aid the reader's comprehension, we have added the following clarification to line 64:

      “Within this dynamic system of cardiomyocytes, we investigated emergent bi-stability (a concept that will be explained more thoroughly later on) in cell monolayers under the influence of spatial depolarization patterns.”

      (2) Lines 67-80: While the introduction until line 66 is extremely well written, the introduction of both cardiac arrhythmia and cardiac optogenetics could be improved. It is especially surprising that miniSOG is first mentioned as a tool for optogenetic depolarisation of cardiomyocytes, as the authors would probably agree that Channelrhodopsins are by far the most commonly applied tools for optogenetic depolarisation (please also refer to the literature by others in this respect). In addition, miniSOG has side effects other than depolarisation, and thus cannot be the tool of choice when not directly studying the effects of oxidative stress or damage.

      The reviewer is absolutely correct in noting that channelrhodopsins are the most commonly applied tools for optogenetic depolarisation. We introduced miniSOG primarily for historical context: the effects of specific depolarization patterns on collective pacemaker activity were first observed with this tool (Teplenin et al., 2018). In that paper, we also reported ultralong action potentials, occurring as a side effect of cumulative miniSOG-induced ROS damage. In the following paragraph (starting at line 81), we emphasize that membrane potential can be controlled much better using channelrhodopsins, which is why we employed them in the present study.

      (3) Line 78: I appreciate the concept of 'high curvature', but please always state which parameter(s) you are referring to (membrane voltage in space/time, etc?).

      We corrected our statement to include the specification of space curvature of the depolarised region:

      “In such a system, it was previously observed that spatiotemporal illumination can give rise to collective behaviour and ectopic waves (Teplenin et al. (2018)) originating from illuminated/depolarised regions (with high spatial curvature).”

      (4) Line 79: 'bi-stable state' - not yet properly introduced in this context.

      The bi-stability mentioned here refers back to single cell bistability introduced in Teplenin et al. (2018), which we cited again for clarity.

      “These waves resulted from the interplay between the diffusion current and the single cell bi-stable state (Teplenin et al. (2018)) that was induced in the illuminated region.”

      (5) Line 84-85: 'these ion channels allow the cells to respond' - please describe the channel used; and please correct: the channels respond to light, not the cells. Re-ordering this paragraph may help, because first you introduce channels for depolarization, then you go back to both de- and hyperpolarization. On the same note, which channels can be used for hyperpolarization of cardiomyocytes? I am not aware of any, even WiChR shows depolarizing effects in cardiomyocytes during prolonged activation (Vierock et al. 2022). Please delete: 'through a direct pathway' (Channelrhodopsins are directly light-gated channels, there are no pathways involved).

      We realised that the confusion arose from our use of incorrect terminology: we mistakenly wrote hyperpolarisation instead of repolarisation. In addition to channelrhodopsins such as WiChR, other tools can also induce a repolarising effect, including light-activatable chloride pumps (e.g., JAWS). However, to improve clarity, we recognize that repolarisation is not relevant to our manuscript and therefore decided to remove its mention (see below). Regarding the reported depolarising effects of WiChR in Vierock et al. (2022), we speculate that these may arise either from the specific phenotype of the cardiomyocytes used in the study, i.e. human induced pluripotent stem cell-derived atrial myocytes (aCMs), or from the particular ionic conditions applied during patch-clamp recordings (e.g., a bath solution containing 1 mM KCl). Notably, even after prolonged WiChR activation, the aCMs maintained a strongly negative maximum diastolic potential of approximately –55 mV.

      “Although effects of illuminating miniSOG with light might lead to formation of depolarised areas, it is difficult to control the process precisely since it depolarises cardiomyocytes indirectly. Therefore, in this manuscript, we used light-sensitive ion channels to obtain more refined control over cardiomyocyte depolarisation. These ion channels allow the cells to respond to specific wavelengths of light, facilitating direct depolarisation (Ördög et al. (2021, 2023)). By inducing cardiomyocyte depolarisation only in the illuminated areas, optogenetics enables precise spatiotemporal control of cardiac excitability, an attribute we exploit in this manuscript (Appendix 2 Figure 1).”

      (6) Figure 1: What would be the y-axis of the 'energy-like curves' in B? What exactly did you plot here?

      The graphs in Figure 1B are schematic representations intended to clarify the phenomenon for the reader. They do not depict actual data from any simulation or experiment. We clarified this misunderstanding by specifying that Figure 1B is a schematic representation of the effects at play in this paper.

      “(B) Schematic representation showing how light intensity influences collective behaviour of excitable systems, transitioning between a stationary state (STA) at low illumination intensities and an oscillatory state (OSC) at high illumination intensities. Bi-stability occurs at intermediate light intensities, where transitions between states are dependent on periodic wave train properties. TR. OSC, transient oscillations.”

      To expand slightly beyond the paper: our schematic representation was inspired by a common visualization in dynamical systems used to illustrate bi-stability (for an example, see Fig. 3 in Schleimer, J. H., Hesse, J., Contreras, S. A., & Schreiber, S. (2021). Firing statistics in the bistable regime of neurons with homoclinic spike generation. Physical Review E, 103(1), 012407.). In this framework, the y-axis can indeed be interpreted as an energy landscape $E$, which is related to a probability measure through the Boltzmann distribution: $p \propto e^{-E/k_B T}$. Here, p denotes the probability of occupying a particular state (STA or OSC). This probability can be estimated from the area (BCL × number of pulses) falling within each state, as shown in Fig. 4C. Since an attractor corresponds to a high-probability state, it naturally appears as a potential well in the landscape.
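
      The link between state occupancy and an energy-like landscape can be sketched in a few lines. This is an illustration only: it assumes the standard Boltzmann relation in dimensionless form (E = -ln p, in units of k_B·T), and the area values and the function name `energy_landscape` are hypothetical rather than taken from Fig. 4C.

```python
import numpy as np


def energy_landscape(state_areas):
    """Convert per-state parameter-space areas to probabilities and energies.

    Assumes strictly positive areas and E = -ln(p) (dimensionless units);
    energies are shifted so that the deepest well sits at zero.
    """
    areas = np.asarray(state_areas, dtype=float)
    p = areas / areas.sum()              # occupancy probability of each state
    energy = -np.log(p)                  # Boltzmann relation, inverted
    return p, energy - energy.min()


# Hypothetical areas (BCL x number of pulses) falling within each state.
areas = {"STA": 30.0, "OSC": 10.0}
p, energy = energy_landscape(list(areas.values()))
for name, prob, e in zip(areas, p, energy):
    print(f"{name}: p = {prob:.2f}, E = {e:.2f}")
```

      The more probable state ends up with the lower energy, which is why an attractor appears as a well in this kind of schematic.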

      (7) Lines 92-93: 'this transition resulted for the interaction of an illuminated region with depolarized CM and an external wave train' - please consider rephrasing (it is not the region interacting with depolarized CM; and the external wave train could be explained more clearly).

      We rephrased our unclear sentence as follows:

      “This transition resulted from the interaction of depolarized cardiomyocytes in an illuminated region with an external wave train not originating from within the illuminated region.”

      (8) Figure 2 and elsewhere: When mentioning 'frequency', please state frequency values and not cycle lengths. Please also reconsider your distinction between high and low frequencies; 200 ms (5 Hz) is actually the normal heart rate for neonatal rats (300 bpm).

      In the revised version, we have clarified frequency values explicitly and included them alongside period values wherever frequency is mentioned, to avoid any ambiguity. We also emphasize that our use of "high" and "low" frequency is strictly a relative distinction within the context of our data, and not meant to imply a biological interpretation.
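
      For readers converting between the two conventions, the relation between cycle length, frequency, and beats per minute is simple arithmetic. The sketch below uses hypothetical helper names and reproduces the reviewer's example of a 200 ms cycle length.

```python
def cycle_length_to_frequency_hz(cycle_length_ms: float) -> float:
    """Pacing frequency in Hz for a given basic cycle length in ms."""
    if cycle_length_ms <= 0:
        raise ValueError("cycle length must be positive")
    return 1000.0 / cycle_length_ms


def frequency_to_bpm(frequency_hz: float) -> float:
    """Beats per minute for a given frequency in Hz."""
    return 60.0 * frequency_hz


freq_hz = cycle_length_to_frequency_hz(200.0)   # 200 ms cycle length
rate_bpm = frequency_to_bpm(freq_hz)
print(f"{freq_hz:.1f} Hz, {rate_bpm:.0f} bpm")
```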

      (9) Lines 129-131: Why not record optical maps? Voltage dynamics in the transition zone between depolarised and non-depolarised regions might be especially interesting to look at?

      We would like to clarify that optical maps were recorded for every experiment, and all experimental traces of cardiac monolayer activity were derived from these maps. We agree with the reviewer that the voltage dynamics in the transition zone are particularly interesting. However, we selected the data representations that, in our view, best highlight the main mechanisms. When we analysed full voltage profiles, they didn’t add extra insights to this main mechanism. As the other reviewer noted, the manuscript already presents a wide range of regimes, so we decided not to introduce further complexity.

      (10) Lines 156-157: Why was the model not adapted to match the biophysical properties (e.g., kinetics, ion selectivity, light sensitivity) of Cheriff?

      The model was not adapted to the biophysical properties of CheRiff, because this would entail a whole new study involving extensive patch-clamping experiments, fitting, and calibration to model the correct properties of the ion channel. Beyond considerations of time efficiency, incorporating more specific modelling parameters would not change the essence of our findings. While numeric parameter ranges might shift, the core results would remain unchanged. This is a result of our experimental design, in which we applied constant illumination of long duration (6 s or longer), thus making differences in the kinetic properties of an optogenetic tool irrelevant. In addition, we were able to observe qualitatively similar phenomena using many other depolarising optogenetic tools (e.g. ChR2, ReaChR, CatCh and more) in our in vitro experiments. We ended up with CheRiff as our optotool of choice for the practical reasons of good light sensitivity and a non-overlapping spectrum with our fluorescent dyes.

      Therefore, computationally using a more general depolarising ion channel hints at the more general applicability of the observed phenomena, supporting our claim of a universal mechanism (demonstrated experimentally with CheRiff and computationally with ChR2).

      (11) Line 158: 1.7124 mW/mm^2 - While I understand that this is the specific intensity used as input in the model, I am convinced that the model is not as accurate to predict behaviour at this specific intensity (4 digits after the comma), especially given that the model has not been adapted to Cheriff (probably more light sensitive than ChR2). Can this be rephrased?

      We did not aim for quantitative correspondence between the computational model and the biological experiments, but rather for qualitative agreement and mechanistic insight (see line 157). Qualitative comparisons are computationally obtained over a whole range of intensities, as demonstrated in the 3D diagram of Fig. 4C. We wanted to demonstrate that at one fixed light intensity (chosen to be 1.7124 mW/mm^2 for the clearest effect), it was possible for all three states (STA, OSC, TR. OSC.) to coexist depending on the number of pulses and their period. Therefore the specific intensity used in the computational model is correct, and for reproducibility, we have left it unchanged while clarifying that it refers specifically to the in silico model:

      “Simulating at a fixed constant illumination of 1.7124 𝑚𝑊∕𝑚𝑚<sup>2</sup> and a fixed number of 4 pulses, frequency dependency of collective bi-stability was reproduced in Figure 4A.”

      (12) Lines 160, 165, and elsewhere: 'Once again, Once more' - please delete or rephrase.

      We agree that these linking phrases could have been worded better and have reformulated the passage as follows:

      “Similar to the experimental observations, only intermediate electrical pacing frequencies (500-𝑚𝑠 period) caused transitions from collective stationary behaviour to collective oscillatory behaviour and ectopic pacemaker activity had periods (710 𝑚𝑠) that were different from the stimulation train period (500 𝑚𝑠). Figure 4B shows the accumulation of pulses necessary to invoke a transition from the collective stationary state to the collective oscillatory state at a fixed stimulation period (600 𝑚𝑠). Also in the in silico simulations, ectopic pacemaker activity had periods (750 𝑚𝑠) that were different from the stimulation train period (600 𝑚𝑠). Also for the transient oscillatory state, the simulations show frequency selectivity (Appendix 2 Figure 4B).”

      (13) Line 171: 'illumination strength': please refer to 'light intensity'.

      We have revised our formulation to now refer specifically to “light intensity”:

      “We previously identified three important parameters influencing such transitions: light intensity, number of pulses, and frequency of pulses.”

      (14) Lines 187-188: 'the illuminated region settles into this period of sending out pulses' - please rephrase, the meaning is not clear.

      We reformulated our sentence to make its content more clear to the reader:

      “For the conditions that resulted in stable oscillations, the green vertical lines in the middle and right slices represent the natural pacemaker frequency in the oscillatory state. After the transition from the stationary towards the oscillatory state, oscillatory pulses emerging from the illuminated region gradually dampen and stabilize at this period, corresponding to the natural pacemaker frequency.”

      (15) Figure 7: A)- please state in the legend which parameter is plotted on the y-axis (it is included in the main text, but should be provided here as well); C) The numbers provided in brackets are confusing. Why is (4) a high pulse number and (3) a low pulse number? Why not just state the number of pulses and add alpha, beta, gamma, and delta for the panels in brackets? I suggest providing the parameters (e.g., 800 ms cycle length, 2 pulses, etc) for all combinations, but not rate them with low, high, etc. (see also comment above).

      We appreciate the reviewer’s comments and have revised the caption for figure 7, which now reads as follows:

      “Figure 7. Phase plane projections of pulse-dependent collective state transitions. (A) Phase space trajectories (displayed in the Voltage – x<sub>r</sub> plane) of the NRVM computational model show a limit cycle (OSC) that is not lying around a stable fixed point (STA). (B) Parameter space slice showing the relationship between stimulation period and number of pulses for a fixed illumination intensity (1.72 𝑚𝑊∕𝑚𝑚<sup>2</sup>) and size of the illuminated area (67 pixels edge length). Letters correspond to the graphs shown in C. (C) Phase space trajectories for different combinations of stimulus train period and number of pulses (α: 800 ms cycle length + 2 pulses, β: 800 ms cycle length + 4 pulses, γ: 250 ms cycle length + 3 pulses, δ: 250 ms cycle length + 8 pulses). α and δ do not result in a transition from the resting state to ectopic pacemaker activity, as under these circumstances the system moves towards the stationary stable fixed point from outside and inside the stable limit cycle, respectively. However, for β and γ, the stable limit cycle is approached from outside and inside, respectively, and ectopic pacemaker activity is induced.”

      (16) Line 258: 'other dimensions by the electrotonic current' - not clear, please rephrase and explain.

      We realized that our explanation was somewhat convoluted and have therefore changed the text as follows:

      “Rather than producing oscillations, the system returns to the stationary state along dimensions other than those shown in Figure 7C (Voltage and x<sub>r</sub>), as evidenced by the phase space trajectory crossing itself. This return is mediated by the electrotonic current.”

      (17) Line 263: ‘increased too much’ – please rephrase using scientific terminology.

      We rephrased our sentence to:

      “However, this is not a Hopf bifurcation, because in that case the system would not return to the stationary state when the number of pulses exceeds a critical threshold.”

      (18) Line 275: 'stronger diffusion/electrotonic influence from the non-illuminated region' - not sure diffusion is the correct term here. Please explain by taking into account the membrane potential. Please make sure to use proper terminology. The same applies to lines 281-282.

      We appreciate this comment, which prompted us to revisit our text. We realised that some sections could be worded more clearly, and we also identified an error in the legend of Supplementary Figure 7. The corresponding corrections are provided below:

      “However, repolarisation reserve does have an influence, prolonging the transition when it is reduced (Appendix 2 Figure 7). This effect can be observed either by moving further from the boundary of the illuminated region, where the electrotonic influence from the non-illuminated region is weaker, or by introducing ionic changes, such as a reduction in I<sub>Ks</sub> and/or I<sub>to</sub>. For example, because the electrotonic influence is weaker in the center of the illuminated region, the voltage there is not pulled down toward the resting membrane potential as quickly as in cells at the border of the illuminated zone.”

      “To add a multicellular component to our single cell model we introduced a current that replicates the effect of cell coupling and its associated electrotonic influence.”

      “Figure 7. The effect of ionic changes on the termination of pacemaker activity. The mechanism that moves the oscillating illuminated tissue back to the stationary state after high frequency pacing is dependent on the ionic properties of the tissue, i.e. lower repolarisation reserves (20% 𝐼<sub>𝐾𝑠</sub> + 50% 𝐼<sub>𝑡𝑜</sub>) are associated with longer transition times.”

      (19) Line 289: -58 mV (to be corrected), -20 mV, and +50 mV - please justify the selection of parameters chosen. This also applies elsewhere- the selection of parameters seems quite arbitrary, please make sure the selection process is more transparent to the reader.

      Our choice of parameters was guided by the dynamical properties of the illuminated cells as well as by illustrative purposes. The value of –58 mV corresponds to the stimulation threshold of the model. The values of 50 mV and –20 mV match those used for single-cell stimulation (Figure 8C2, right panel), producing excitable and bistable dynamics, respectively. We refer to this point in line 288 with the phrase “building on this result.” To maintain conciseness, we did not elaborate on the underlying reasoning within the manuscript and instead reported only the results.

      We also corrected the previously missed minus sign: -58 mV.

      (20) Figure 8 and corresponding text: I don't understand what stimulation with a voltage means. Is this an externally applied electric field? Or did you inject a current necessary to change the membrane voltage by this value? Please explain.

      Stimulation with a specific voltage is a standard computational technique and can be likened to performing a voltage-clamp experiment on each individual cell. In this approach, the voltage of every cell in the tissue is briefly forced to a defined value.
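
      A minimal sketch of this kind of voltage stimulation is given below. It assumes a toy passive cell model (exponential relaxation towards a resting potential) rather than the NRVM model used in the study; the function name, parameters, and values are all hypothetical. The only point illustrated is that, during a stimulus step, the voltage of every cell is overwritten instead of being integrated.

```python
def simulate(v0, n_steps, dt, stim_steps, v_stim, v_rest=-70.0, tau=10.0):
    """Toy passive cells: dV/dt = -(V - v_rest) / tau, with a voltage stimulus.

    During the steps listed in stim_steps, every cell's voltage is forced
    to v_stim (voltage-clamp-like); otherwise a forward Euler step is taken.
    Returns the list of voltage vectors, including the initial state.
    """
    v = list(v0)
    trace = [list(v)]
    for step in range(n_steps):
        if step in stim_steps:
            v = [v_stim] * len(v)  # state is forced, not integrated
        else:
            v = [vi + dt * (-(vi - v_rest) / tau) for vi in v]
        trace.append(list(v))
    return trace

# Two hypothetical cells, one stimulus step forcing both to +50 mV.
trace = simulate([-70.0, -65.0], n_steps=5, dt=1.0, stim_steps={1}, v_stim=50.0)
```

      After the forced step the cells evolve freely again, so in a full ionic model the subsequent response is determined by the cell dynamics rather than by the stimulus itself.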

      (21) Figure 8C- panel 2: Traces at -20 mV and + 50 mV are identical. Is this correct? Please explain.

      Yes, that is correct. The cell responds similarly to a voltage stimulus of -20 mV or one of 50 mV, because both values are well above the excitation threshold of a cardiomyocyte.

      (22) Line 344 and elsewhere: 'diffusion current' - This is probably not the correct terminology for gap-junction mediated currents. Please rephrase.

      Here, a diffusion current is the mathematical formulation of a gap-junction-mediated current, so, depending on the background of the reader, either term might be used to emphasise different aspects of the results. In a mathematical modelling context, one often refers to a diffusion current because cardiomyocyte monolayers and tissues can be modelled using a reaction-diffusion equation. In the context of fine-grained biological and biophysical details, one uses the term gap-junction-mediated current. Our choice is motivated by the main target audience we have in mind, namely interdisciplinary researchers with a core background in the mathematics/physics/computer science fields.

      However, to not exclude our secondary target audience of biological and medical readers we now clarified the terminology, drawing the parallel between the different fields of study at line 79:

      “These waves resulted from the interplay between the diffusion current (also known in biology/biophysics as the gap junction mediated current) and the bi-stable state that was induced in the illuminated region.”
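
      To make the parallel between the two terms concrete, the sketch below evaluates a discretised diffusion term on a one-dimensional chain of cells. It is an illustration under stated assumptions, not the authors' implementation: the function name, the diffusion coefficient, the grid spacing, the no-flux boundary treatment, and the voltages are all hypothetical.

```python
def coupling_current(v, d_coeff=1.0, dx=1.0):
    """Discrete diffusion term D * d2V/dx2 on a 1D chain with no-flux ends.

    Each entry is the contribution of neighbour coupling to dV/dt of that
    cell: negative values pull the voltage down, positive values push it up.
    """
    n = len(v)
    current = []
    for i in range(n):
        left = v[i - 1] if i > 0 else v[i]        # mirrored ghost cell
        right = v[i + 1] if i < n - 1 else v[i]   # mirrored ghost cell
        current.append(d_coeff * (left - 2.0 * v[i] + right) / dx**2)
    return current

# Two depolarised cells (-20 mV) next to two resting cells (-70 mV).
cable = [-20.0, -20.0, -70.0, -70.0]
i_coup = coupling_current(cable)
```

      In this toy chain only the two cells at the border experience a coupling term: the depolarised border cell is pulled towards rest and its resting neighbour is pushed towards depolarisation, while cells surrounded by equal voltages feel nothing.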

      (23) Lines 357-58: 'Such ectopic sources are typically initiated by high frequency pacing' - While this might be true during clinical testing, how would you explain this when not externally imposed? What could be biological high-frequency triggers?

      Biological high-frequency triggers could include sudden increases in heart rates, such as those induced by physical activity or emotional stress. Another possibility is the occurrence of paroxysmal atrial or ventricular fibrillation, which could then give rise to an ectopic source.

      (24) Lines 419-420: 'large ionic cell currents and small repolarising coupling currents'. Are coupling currents actually small in comparison to cellular currents? Can you provide relative numbers (~ratio)?

      Coupling currents are indeed small compared to cellular currents. This can be inferred from the I-V curve shown in Figure 8C1, which dips below 0 and creates bi-stability only because of the small coupling current. If the coupling current were larger, the system would revert to a monostable regime. To make this more concrete, we have now provided the exact value of the coupling current used in Figure 8C1.

      “Otherwise, if the hills and dips of the N-shaped steady-state IV curve were large (Figure 8C-1), they would have similar magnitudes as the large currents of fast ion channels, preventing the subtle interaction between these strong ionic cell currents and the small repolarising coupling currents (-0.103649 ≈ 0.1 pA).”
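
      The dependence of bi-stability on a small offset current can be illustrated with a toy N-shaped curve. The sketch below assumes a dimensionless cubic, I(V) = V^3 − V + offset, which is not the steady-state I-V curve of the model in Figure 8C1; it only shows how the number of zero crossings (candidate steady states) drops from three to one when the offset becomes large relative to the hills and dips.

```python
def count_steady_states(offset, v_min=-2.0, v_max=2.0, n=4001):
    """Count sign changes of a toy N-shaped curve I(V) = V**3 - V + offset.

    The curve is sampled on a uniform grid; each sign change between
    neighbouring samples is counted as one zero crossing.
    """
    def net_current(v):
        return v**3 - v + offset

    crossings = 0
    prev = net_current(v_min)
    for k in range(1, n):
        v = v_min + (v_max - v_min) * k / (n - 1)
        cur = net_current(v)
        # A sample that is exactly zero is counted once, on the next step.
        if prev == 0.0 or prev * cur < 0.0:
            crossings += 1
        prev = cur
    return crossings

small_offset = count_steady_states(0.1)  # small offset: N-shape still crosses zero three times
large_offset = count_steady_states(1.0)  # large offset: only one crossing remains
```

      With three crossings the outer two correspond to stable states in this toy picture (bi-stability); with a single crossing the system is monostable.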

      (25) Line 426: Please explain how ‘voltage shocks’ were modelled.

      We would like to refer the reviewer to our response to comment (20) regarding how we model voltage shocks. In the context of line 426, a typical voltage shock corresponds to a tissue-wide stimulus of 50 mV. Independent of our computational model, line 426 also cites other publications showing that, in clinical settings, high-voltage shocks are unable to terminate ectopic sustained activity, consistent with our findings.

      (26) Lines 429 ff: 0.2pA/pF would correspond to 20 pA for a small cardiomyocyte of 100 pF, this current should be measurable using patch-clamp recordings.

      In trying to be succinct, we may have caused some confusion. The difference between the dips (-0.07 pA/pF) and hills (≈0.11 pA/pF) is approximately 0.18 pA/pF. For a small cardiomyocyte, this corresponds to deviations from zero of roughly ±10 pA. Considering that typical RMS noise levels in whole-cell patch-clamp recordings range from 2-10 pA, it is understandable that detecting these peaks and dips in an I-V curve (average current after holding a voltage for an extended period) is difficult. Achieving statistical significance would therefore require patching a large number of cells.
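
      The arithmetic behind these numbers can be checked with a short sketch. It assumes the 100 pF cell capacitance from the reviewer's example; the function and variable names are illustrative.

```python
CELL_CAPACITANCE_PF = 100.0  # small cardiomyocyte, as in the reviewer's example

def density_to_current(density_pa_per_pf, capacitance_pf=CELL_CAPACITANCE_PF):
    """Convert a current density (pA/pF) to a whole-cell current (pA)."""
    return density_pa_per_pf * capacitance_pf

dip_pa = density_to_current(-0.07)   # about -7 pA
hill_pa = density_to_current(0.11)   # about +11 pA
peak_to_peak_pa = hill_pa - dip_pa   # about 18 pA, i.e. 0.18 pA/pF x 100 pF
```

      Both deviations from zero are of the same order as the quoted 2-10 pA RMS noise, which is the reason given above for why they would be hard to resolve in single recordings.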

      Given the already extensive scope of our manuscript in terms of techniques and concepts, we decided not to pursue these additional patch-clamp experiments.

      Reviewer #2 (Recommendations for the authors):

      Given the deluge of conditions to consider, there are several areas of improvement possible in communicating the authors' findings. I have the following suggestions to improve the manuscript.

      (1) Please change "pulse train" straight pink bar OR add stimulation marks (such as "*", or individual pulse icons) to provide better visual clarity that the applied stimuli are "short ON, long OFF" electrical pulses. I had significant initial difficulty understanding what the pulse bars represented in Figures 2, 3, 4A-B, etc. This may be partially because stimuli here could be either light (either continuous or pulsed) or electrical (likely pulsed only). To me, a solid & unbroken line intuitively denotes a continuous stimulation. I understand now that the pink bar represents the entire pulse-train duration, but I think readers would be better served with an improvement to this indicator in some fashion. For instance, the "phases" were much clearer in Figures 7C and 8D because of how colour was used on the Vm(t) traces. (How you implement this is up to you, though!)

      We have addressed the reviewer’s concern and updated the figures by marking each external pulse with a small vertical line (see below).

      (2) Please label the electrical stimulation location (akin to the labelled stimulation marker in circle 2 state in Figure 1A) in at least Figures 2 and 4A, and at most throughout the manuscript. It is unclear which "edge" or "pixel" the pulse-train is originating from, although I've assumed it's the left edge of the 2D tissue (both in vitro and silico). This would help readers compare the relative timing of dark blue vs. orange optical signal tracings and to understand how the activation wavefront transverses the tissue.

      We indicated the pacing electrode in the optical voltage recordings with a grey asterisk. For the in silico simulations, the electrode was assumed to be far away, and the excitation was modelled as a parallel wave originating from the top boundary, indicated with a grey zone.

      (3) Given the prevalence of computational experiments in this study, I suggest considering making a straightforward video demonstrating basic examples of STA, OSC, and TR.OSC states. I believe that a video visualizing these states would be visually clarifying to and greatly appreciated by readers. Appendix 2 Figure 3 would be the no-motion visualization of the examples I'm thinking of (i.e., a corresponding stitched video could be generated for this). However, this video-generation comment is a suggestion and not a request.

      We have included a video showing all relevant states, which is now part of the Supplementary Material.

      (4) Please fix several typos that I found in the manuscript:

      (4A) Line 279: a comma is needed after i.e. when used in: "peculiar, i.e. a standard". However, this is possibly stylistic (discard suggestion if you are consistent in the manuscript).

      (4B) Line 382: extra period before "(Figure 3C)".

      (4C) Line 501: two periods at end of sentence "scientific purposes.." .

      We would like to thank the reviewer for pointing out these typos. We have corrected them and conducted an additional check throughout the manuscript for minor errors.

    1. Author Response:

      Reviewer #1 (Public Review):

      [...] The major limitation of the manuscript lies in the framing and interpretation of the results, and therefore the evaluation of novelty. Authors claim for an important and unique role of beliefs-of-other-pain in altruistic behavior and empathy for pain. The problem is that these experiments mainly show that behaviors sometimes associated with empathy-for-pain can be cognitively modulated by changing prior beliefs. To support the notion that effects are indeed relating to pain processing generally or empathy for pain specifically, a similar manipulation, done for instance on beliefs about the happiness of others, before recording behavioural estimation of other people's happiness, should have been performed. If such a belief-about-something-else-than-pain would have led to similar results, in terms of behavioural outcome and in terms of TPJ and MFG recapitulating the pattern of behavioral responses, we would know that the results reflect changes of beliefs more generally. Only if the results are specific to a pain-empathy task, would there be evidence to associate the results to pain specifically. But even then, it would remain unclear whether the effects truly relate to empathy for pain, or whether they may reflect other routes of processing pain.

      We thank Reviewer #1 for these comments/suggestions regarding the specificity of belief effects on brain activity involved in empathy for pain. Our paper reported 6 behavioral/EEG/fMRI experiments that tested effects of beliefs of others’ pain on empathy and monetary donation (an empathy-related altruistic behavior). We showed not only behavioral but also neuroimaging results that consistently support the hypothesis of the functional role of beliefs of others' pain in modulations of empathy (based on both subjective and objective measures as clarified in the revision) and altruistic behavior. We agree with Reviewer #1 that it is important to address whether the belief effect is specific to neural underpinnings of empathy for pain or is general for neural responses to various facial expressions such as happiness. To address this issue, we conducted an additional EEG experiment (which can be done in a limited time in the current situation), as suggested by Reviewer #1. This new EEG experiment tested (1) whether beliefs of authenticity of others’ happiness influence brain responses to perceived happy expressions; (2) whether beliefs of happiness modulate neural responses to happy expressions in the same P2 time window that characterized the effects of beliefs of pain on ERPs.

      Our behavioral results in this experiment (as Supplementary Experiment 1 reported in the revision) showed that the participants reported less feelings of happiness when viewing actors who simulate others' smiling compared to when viewing awardees who smile due to winning awards (see the figure below). Our ERP results in Supplementary Experiment 1 further showed that lack of beliefs of authenticity of others’ happiness (e.g., actors simulate others' happy expressions vs. awardees smile and show happy expressions due to winning an award) reduced the amplitudes of a long-latency positive component (i.e., P570) over the frontal region in response to happy expressions. These findings suggest that (1) there are possibly general belief effects on subjective feelings and brain activities in response to facial expressions; (2) beliefs of others' pain or happiness affect neural responses to facial expressions in different time windows after face onset; (3) modulations of the P2 amplitude by beliefs of pain may not be generalized to belief effects on neural responses to any emotional states of others. We reported the results of this new ERP experiment in the revision as Supplementary Experiment 1 and also discussed the issue of specificity of modulations of empathic neural responses by beliefs of others' pain in the revised Discussion (page 49-50).

      Supplementary Experiment Figure 1. EEG results of Supplementary Experiment 1. (a) Mean rating scores of happy intensity related to happy and neutral expressions of faces with awardee or actor/actress identities. (b) ERPs to faces with awardee or actor/actress identities at the frontal electrodes. The voltage topography shows the scalp distribution of the P570 amplitude with the maximum over the central/parietal region. (c) Mean differential P570 amplitudes to happy versus neutral expressions of faces with awardee or actor/actress identities. The voltage topographies illustrate the scalp distribution of the P570 difference waves to happy (vs. neutral) expressions of faces with awardee or actor/actress identities, respectively. Shown are group means (large dots), standard deviation (bars), measures of each individual participant (small dots), and distribution (violin shape) in (a) and (c).

      In the revised Introduction we cited additional literature to explain the concept of empathy, behavioral and neuroimaging measures of empathy, and how, similar to previous research, we studied empathy for others' pain using subjective (self reports) and objective (brain responses) estimation of empathy (page 6-7). In particular, we mentioned that subjective estimation of empathy for pain depends on collection of self-reports of others' pain and one's own painful feelings when viewing others' suffering. Objective estimation of empathy for pain relies on recording of brain activities (using fMRI, EEG, etc.) that differentially respond to painful or non-painful stimuli applied to others. fMRI studies revealed greater activations in the ACC, AI, and sensorimotor cortices in response to painful than to non-painful stimuli applied to others. EEG studies showed that event-related potentials (ERPs) in response to perceived stimulation of others' body parts differentiated between painful and neutral stimuli over the frontal region as early as 140 ms after stimulus onset (Fan and Han, 2008; see Coll, 2018 for review). Moreover, the mean ERP amplitudes at 140–180 ms predicted subjective reports of others' pain and one's own unpleasantness. Particularly related to the current study, previous research showed that pain compared to neutral expressions increased the amplitude of the frontal P2 component at 128–188 ms after stimulus onset (Sheng and Han, 2012; Sheng et al., 2013; 2016; Han et al., 2016; Li and Han, 2019) and the P2 amplitudes in response to others' pain expressions positively predicted subjective feelings of own unpleasantness induced by others' pain and self-report of one's own empathy traits (e.g., Sheng and Han, 2012). These brain imaging findings indicate that brain responses to others' pain can (1) differentiate others' painful or non-painful emotional states to support understanding of others' pain and (2) predict subjective feelings of others' pain and one's own unpleasantness induced by others' pain to support sharing of others' painful feelings. These findings provide effective subjective and objective measures of empathy that were used in the current study to investigate neural mechanisms underlying modulation of empathy and altruism by beliefs of others’ pain.

      In addition, we took Reviewer #1’s suggestion for VPS analyses which examined specifically how neural activities in the empathy-related regions identified in the previous research (Krishnan et al., 2016, eLife) were modulated by beliefs of others’ pain. The results (page 40) provide further evidence for our hypothesis. We also reported new results of RSA analyses (page 39) that activities in the brain regions supporting affective sharing (e.g., insula), sensorimotor resonance (e.g., post-central gyrus), and emotion regulation (e.g., lateral frontal cortex) provide intermediate mechanisms underlying modulations of subjective feelings of others' pain intensity due to lack of BOP. We believe that, putting all these results together, our paper provides consistent evidence that empathy and altruistic behavior are modulated by BOP.

      Reviewer #2 (Public Review):

      [...] 1. In laying out their hypotheses, the authors write, "The current work tested the hypothesis that BOP provides a fundamental cognitive basis of empathy and altruistic behavior by modulating brain activity in response to others' pain. Specifically, we tested predictions that weakening BOP inhibits altruistic behavior by decreasing empathy and its underlying brain activity whereas enhancing BOP may produce opposite effects on empathy and altruistic behavior." While I'm a little dubious regarding the enhancement effects (see below), a supporting assumption here seems to be that at baseline, we expect that painful expressions reflect real pain experience. To that end, it might be helpful to ground some of the introduction in what we know about the perception of painful expressions (e.g., how rapidly/automatically is pain detected, do we preferentially attend to pain vs. other emotions, etc.).

      Thanks for this suggestion! We included additional details about previous findings related to processes of painful expressions in the revised Introduction (page 7-8). Specifically, we introduced fMRI and ERP studies of pain expressions that revealed structures and temporal procedure of neural responses to others' pain (vs. neutral) expressions. Moreover, neural responses to others' pain (vs. neutral) expressions were associated with self-report of others' feelings, indicating functional roles of pain-expression induced brain activities in empathy for pain.

      2. For me, the key takeaway from this manuscript was that our assessment of and response to painful expressions is contextually-sensitive - specifically, to information reflecting whether or not targets are actually in pain. As the authors state it, "Our behavioral and neuroimaging results revealed critical functional roles of BOP in modulations of the perception-emotion-behavior reactivity by showing how BOP predicted and affected empathy/empathic brain activity and monetary donations. Our findings provide evidence that BOP constitutes a fundamental cognitive basis for empathy and altruistic behavior in humans." In other words, pain might be an incredibly socially salient signal, but it's still easily overridden from the top down provided relevant contextual information - you won't empathize with something that isn't there. While I think this hypothesis is well-supported by the data, it's also backed by a pretty healthy literature on contextual influences on pain judgments (including in clinical contexts) that I think the authors might want to consider referencing (here are just a few that come to mind: Craig et al., 2010; Twigg et al., 2015; Nicolardi et al., 2020; Martel et al., 2008; Riva et al., 2015; Hampton et al., 2018; Prkachin & Rocha, 2010; Cui et al., 2016).

      Thanks for this great suggestion! Accordingly, we included an additional paragraph in the revised Discussion regarding how social contexts influence empathy and cited the studies mentioned here (page 46-47).

      3. I had a few questions regarding the stimuli the authors used across these experiments. First, just to confirm, these targets were posing (e.g., not experiencing) pain, correct? Second, the authors refer to counterbalancing assignment of these stimuli to condition within the various experiments. Was target gender balanced across groups in this counterbalancing scheme? (e.g., in Experiment 1, if 8 targets were revealed to be actors/actresses in Round 2, were 4 female and 4 male?) Third, were these stimuli selected at random from a larger set, or based on specific criteria (e.g., normed ratings of intensity, believability, specificity of expression, etc.?) If so, it would be helpful to provide these details for each experiment.

      We'd be happy to clarify these questions. First, photos of faces with pain or neutral expressions were adopted from the previous work (Sheng and Han, 2012). Photos were taken from models who were posing but not experiencing pain. These photos were taken and selected based on explicit criteria of painful expressions (i.e., brow lowering, orbit tightening, and raising of the upper lip; Prkachin, 1992). In addition, the models' facial expressions were validated in independent samples of participants (see Sheng and Han, 2012). Second, target gender was also balanced across groups in this counterbalancing scheme. We also analyzed empathy rating scores and monetary donations related to male and female target faces and did not find any significant gender effect (see our response to Point 5 below). Third, because the face stimuli were adopted from the previous work and the models' facial expressions were validated in independent samples of participants regarding specificity of expression, pain intensity, etc. (Sheng and Han, 2012), we did not repeat this validation in our participants. Most importantly, we counterbalanced the stimuli in different conditions so that the stimuli in different conditions (e.g., patient vs. actor/actress conditions) were the same across the participants in each experiment. This design excluded any potential confound arising from the stimuli themselves.

      1. The nature of the charitable donation (particularly in Experiment 1) could be clarified. I couldn't tell if the same charity was being referenced in Rounds 1 and 2, and if there were multiple charities in Round 2 (one for the patients and one for the actors).

      Thanks for this comment! Yes, indeed, in both Rounds 1 and 2, the participants were informed that the amount of one of their decisions would be selected randomly and donated to one of the patients through the same charity organization (we clarified this in the revised Method section, page 55-56). We made clear in the revision that, after we finished all the experiments of this study, the total amount of the participants' donations was submitted to a charity organization to help patients who suffer from the same disease.

      1. I'm also having a hard time understanding the authors' prediction that targets revealed to truly be patients in the 2nd round will be associated with enhanced BOP/altruism/etc. (as they state it: "By contrast, reconfirming patient identities enhanced the coupling between perceived pain expressions of faces and the painful emotional states of face owners and thus increased BOP.") They aren't in any additional pain than they were before, and at the outset of the task, there was no reason to believe that they weren't suffering from this painful condition - therefore I don't see why a second mention of their pain status should increase empathy/giving/etc. It seems likely that this is a contrast effect driven by the actor/actress targets. See the Recommendations for the Authors for specific suggestions regarding potential control experiments. (I'll note that the enhancement effect in Experiment 2 seems more sensible - here, the participant learns that treatment was ineffective, which may be painful in and of itself.)

      Thanks for the comments on this important point! Indeed, our results showed that reconfirming patient identities in Experiment 1 or noting the failure of medical treatment related to target faces in Experiment 2 increased rating scores of others' pain and own unpleasantness and prompted more monetary donations to target faces. The increased empathy rating scores and monetary donations might arise because repeatedly confirming patient identity or learning of the failure of medical treatment strengthened the belief in the authenticity of targets' pain and thus enhanced empathy. However, repeatedly confirming patient identity or learning of the failure of medical treatment might also activate other emotional responses to target faces, such as pity or helplessness, which might in turn influence altruistic decisions. We agree with Reviewer #2 that, although our subjective estimation of empathy in Exp. 1 and 2 suggested enhanced empathy in the 2nd_round test, there are alternative interpretations of the results and these should be clarified in future work. We clarified these points in the revised Discussion (page 41-42).

      1. I noted that in the Methods for Experiment 3, the authors stated "We recruited only male participants to exclude potential effects of gender difference in empathic neural responses." This approach continues through the rest of the studies. This raises a few questions. Are there gender differences in the first two studies (which recruited both male and female participants)? Moreover, are the authors not concerned about target gender effects? (Since, as far as I can tell, all studies use both male and female targets, which would mean that in Experiments 3 and on, half the targets are same-gender as the participants and the other half are other-gender.) Other work suggests that there are indeed effects of target gender on the recognition of painful expressions (Riva et al., 2011).

      Thanks for raising this interesting question! To address it, we reanalyzed the data in Exp. 1 by including participants' gender or face gender as an independent variable. The three-way ANOVAs of pain intensity scores and amounts of monetary donations with Face Gender (female vs. male targets) × Test Phase (1st vs. 2nd_round) × Belief Change (patient-identity change vs. patient-identity repetition) did not show any significant three-way interaction (F(1,59) = 0.432 and 0.436, p = 0.514 and 0.512, ηp2 = 0.007 and 0.007, 90% CI = (0, 0.079) and (0, 0.079)), indicating that face gender did not influence the results (see the figure below). Similarly, the three-way ANOVAs with Participant Gender (female vs. male participants) × Test Phase × Belief Change did not show any significant three-way interaction (F(1,58) = 0.121 and 1.586, p = 0.729 and 0.213, ηp2 = 0.002 and 0.027, 90% CI = (0, 0.055) and (0, 0.124)), indicating no reliable difference in empathy and donation between men and women. It seems that the measures of empathy and altruistic behavior in our study were not sensitive to the gender of either the empathy targets or the participants.

      Figure legend: (a) Scores of pain intensity and amount of monetary donations are reported separately for male and female target faces. (b) Scores of pain intensity and amount of monetary donations are reported separately for male and female participants.
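      As a quick consistency check on the effect sizes reported above, partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The sketch below is only illustrative: the function name and the row labels are our own, and the F values and degrees of freedom are simply copied from the response.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, error df) as reported in the response; labels are illustrative only.
reported = [
    ("Face Gender interaction, pain intensity", 0.432, 59),
    ("Face Gender interaction, donations", 0.436, 59),
    ("Participant Gender interaction, pain intensity", 0.121, 58),
    ("Participant Gender interaction, donations", 1.586, 58),
]

for label, f_value, df_error in reported:
    # Each effect has 1 numerator degree of freedom.
    print(label, round(partial_eta_squared(f_value, 1, df_error), 3))
```

      Rounded to three decimals, these values agree with the ηp2 figures quoted in the response.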

      1. I was a little unclear on the motivation for Experiment 4. The authors state "If BOP rather than other processes was necessary for the modulation of empathic neural responses in Experiment 3, the same manipulation procedure to assign different face identities that do not change BOP should change the P2 amplitudes in response to pain expressions." What "other processes" are they referring to? As far as I could tell, the upshot of this study was just to demonstrate that differences in empathy for pain were not a mere consequence of assignment to social groups (e.g., the groups must have some relevance for pain experience). While the data are clear and as predicted, I'm not sure this was an alternate hypothesis that I would have suggested or that needs disconfirming.

      Thanks for this comment! We apologize for not making the research question of Exp. 4 clear. In the revised Results section (page 27-28) we clarified that the learning and EEG recording procedures in Experiment 3 consisted of multiple processes, including learning, memory, identity recognition, assignment to social groups, etc. The results of Experiment 3 left open the question of whether these processes, even without the BOP changes they induced, would be sufficient to modulate the P2 amplitude in response to pain (vs. neutral) expressions of faces with different identities. In Experiment 4 we addressed this issue using the same learning and identity recognition procedures as those in Experiment 3, except that the participants in Experiment 4 had to learn and recognize the identities of faces from two baseball teams, and that there was no prior difference in BOP associated with the faces of the two teams. If the processes involved in the learning and recognition procedures, rather than the difference in BOP, were sufficient for modulation of the P2 amplitude in response to pain (vs. neutral) expressions of faces, we would expect similar P2 modulations in Experiments 4 and 3. Otherwise, if the difference in BOP produced during the learning procedure was necessary for the modulation of empathic neural responses, we would not expect modulations of the P2 amplitude in response to pain (vs. neutral) expressions in Experiment 4. We believe that the goal and rationale of Exp. 4 are clear now.

  2. drive.google.com drive.google.com
    1. Although knowledge, caring, and action are conceptually distinct, in the classroom they are highly interrelated. In my multicultural classes for teacher education students, I use historical and sociological knowledge about the experiences of different ethnic and racial groups to inform as well as to enable the students to examine and clarify their personal attitudes about ethnic diversity.

      I like this model and I think it would do well to implement it in classrooms. Knowing is awareness, caring is the heart, and acting is doing something about that care and conviction. It allows our desire to help and be kind to come to fruition, and it helps rid us of preconceptions or close-mindedness we may have been subjected to. I think these three are very different (as mentioned), but they all complement each other, allowing us to take a step towards cultivating multicultural education.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Authors’ reply (Ono et al.)

      Review Commons Refereed Preprint #RC-2025-03137

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Ono et al addressed how condensin II and cohesin work to define chromosome territories (CT) in human cells. They used FISH to assess the status of CT. They found that condensin II depletion leads to lengthwise elongation of G1 chromosomes, while double depletion of condensin II and cohesin leads to CT overlap and morphological defects. Although the requirement of condensin II in shortening G1 chromosomes was already shown by Hoencamp et al 2021, the cooperation between condensin II and cohesin in CT regulation is a new finding. They also demonstrated that cohesin and condensin II are involved in G2 chromosome regulation on a smaller and larger scale, respectively. Though such roles in cohesin might be predictable from its roles in organizing TADs, it is a new finding that the two work on a different scale on G2 chromosomes. Overall, this is technically solid work, which reports new findings about how condensin II and cohesin cooperate in organizing G1 and G2 chromosomes.

      We greatly appreciate the reviewer’s supportive comments. The reviewer has accurately recognized our new findings concerning the collaborative roles of condensin II and cohesin in establishing and maintaining interphase chromosome territories.

      Major point:

      They propose a functional 'handover' from condensin II to cohesin, for the organization of CTs at the M-to-G1 transition. However, the 'handover', i.e. difference in timing of executing their functions, was not experimentally substantiated. Ideally, they can deplete condensin II and cohesin at different times to prove the 'handover'. However, this would require the use of two different degron tags and go beyond the revision of this manuscript. At least, based on the literature, the authors should discuss why they think condensin II and cohesin should work at different timings in the CT organization.

      We take this comment seriously, especially because Reviewer #2 also expressed the same concern. 

      First of all, we must admit that the basic information underlying the “handover” idea was insufficiently explained in the original manuscript. Let us make it clear below:

      • Condensin II binds to chromosomes and is enriched along their axes from anaphase through telophase (Ono et al., 2004; Hirota et al., 2004; Walther et al., 2018).
      • In early G1, condensin II is diffusely distributed within the nucleus and does not bind tightly to chromatin, as shown by detergent extraction experiments (Ono et al., 2013).
      • Cohesin starts binding to chromatin when the cell nucleus reassembles (i.e., during the cytokinesis stage shown in Fig. 1B), apparently replacing condensins I and II (Brunner et al., 2025).
      • Condensin II progressively rebinds to chromatin from S through G2 phase (Ono et al., 2013).

      The cell cycle-dependent changes in chromosome-bound condensin II and cohesin summarized above are illustrated in Fig. 1A. We now realize that Fig. 1B in the original manuscript was inconsistent with Fig. 1A, creating unnecessary confusion, and we sincerely apologize for this. The fluorescence images shown in the original Fig. 1B were captured without detergent extraction prior to fixation, giving the misleading impression that condensin II remained bound to chromatin from cytokinesis through early G1. This was not our intention. To clarify this, we have repeated the experiment with detergent extraction and replaced the original Fig. 1B with a revised panel. Figs. 1A and 1B are now more consistent with each other. Accordingly, we have modified the corresponding sentences as follows:

      Although condensin II remains nuclear throughout interphase, its chromatin binding is weak in G1 and becomes robust from S phase through G2 (Ono et al., 2013). Cohesin, in contrast, replaces condensin II in early G1 (Fig. 1B) (Abramo et al., 2019; Brunner et al., 2025), and establishes topologically associating domains (TADs) in the G1 nucleus (Schwarzer et al., 2017; Wutz et al., 2017).

      While there is a loose consensus in the field that condensin II is replaced by cohesin during the M-to-G1 transition, it remains controversial whether there is a short window during which neither condensin II nor cohesin binds to chromatin (Abramo et al., 2019), or whether there is a stage in which the two SMC protein complexes “co-occupy” chromatin (Brunner et al., 2025). Our images shown in the revised Fig. 1B cannot clearly distinguish between these two possibilities.

      From a functional point of view, the results of our depletion experiments are more readily explained by the latter possibility. If this is the case, the “interplay” or “cooperation” rather than the “handover” may be a more appropriate term to describe the functional collaboration between condensin II and cohesin during the M-to-G1 transition. For this reason, we have avoided the use of the word “handover” in the revised manuscript. It should be emphasized, however, that given their distinct chromosome-binding kinetics, the cooperation of the two SMC complexes during the M-to-G1 transition is qualitatively different from that observed in G2. Therefore, the central conclusion of the present study remains unchanged.

      For example, a sentence in Abstract has been changed as follows:

      a functional interplay between condensin II and cohesin during the mitosis-to-G1 transition is critical for establishing chromosome territories (CTs) in the newly assembling nucleus.

      While the reviewer suggested one experiment, it is clearly beyond the scope of the current study. It should also be noted that even if such a cell line were available, the proposed application of sequential depletion to cells progressing from mitosis to G1 phase would be technically challenging and unlikely to produce results that could be interpreted with confidence.

      Other points:

      Figure 2E: It seems that the chromosome length without IAA is shorter in Rad21-aid cells than H2-aid cells or H2-aid Rad21-aid cells. How can this be interpreted?

      This comment is well taken. A related comment was made by Reviewer #3 (Major comment #2). Given the substantial genetic manipulations applied to establish the multiple cell lines used in the present study, it is, strictly speaking, not straightforward to compare the -IAA controls between different cell lines. Such variations are most prominently observed in Fig. 2E, although they can also be observed to a lesser extent in other experiments (e.g., Fig. 3E). This issue is inherently associated with all studies using genetically manipulated cell lines and therefore cannot be completely avoided. For this reason, we focus on the differences between -IAA and +IAA within each cell line, rather than comparing the -IAA conditions across different cell lines. In this sense, a sentence in the original manuscript (lines 178-180) was misleading. In the revised manuscript, we have modified the corresponding and subsequent sentences as follows:

      Although cohesin depletion had a marginal effect on the distance between the two site-specific probes (Fig.2, C and E), double depletion did not result in a significant change (Fig.2, D and E), consistent with the partial restoration of centromere dispersion (Fig. 1G).


      In addition, we have added a section entitled “Limitations of the study” at the end of the Discussion to address technical issues that are inevitably associated with the current approach.

      Figure 3: Regarding the CT morphology, could they explain further the difference between 'elongated' and 'cloud-like (expanded)'? Is it possible to quantify the frequency of these morphologies?

      In the original manuscript, we provided data that quantitatively distinguished between the “elongated” and “cloud-like” phenotypes. Specifically, Fig. 2E shows that the distance between two specific loci (Cen 12 and 12q15) is increased in the elongated phenotype but not in the cloud-like phenotype. In addition, the cloud-like morphology clearly deviated from circularity, as indicated by the circularity index (Fig. 3F). However, because circularity can also decrease in rod-shaped chromosomes, these datasets alone may not be sufficiently convincing, as the reviewer pointed out. We have now included an additional parameter, the aspect ratio, defined as the ratio of an object’s major axis to its minor axis (new Fig. 3F). While this intuitive parameter was altered upon condensin II depletion and double depletion, we acknowledge that it too is not sufficient to convincingly distinguish between the elongated and cloud-like phenotypes proposed in the original manuscript. For these reasons, in the revised manuscript, we have toned down our statements regarding the differences in CT morphology between the two conditions. Nonetheless, together with the data from Figs. 1 and 2, it is clear that the Rabl configuration observed upon condensin II depletion is further exacerbated in the absence of cohesin. Accordingly, we have modified the main text and the cartoon (Fig 3H) to more accurately depict the observations summarized above.
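      To make the aspect ratio mentioned above concrete, here is a minimal sketch of one common way to compute it from a segmented binary mask: take the ellipse with the same second moments as the object and divide its major axis by its minor axis. This is an assumption for illustration only. The response does not state how the axes were obtained, and the function name and the synthetic shapes below are hypothetical, not data from the study.

```python
import numpy as np


def aspect_ratio(mask: np.ndarray) -> float:
    """Major/minor axis ratio of the ellipse sharing the mask's second moments."""
    rows, cols = np.nonzero(mask)
    coords = np.vstack([rows, cols]).astype(float)  # shape (2, n_pixels)
    # Eigenvalues of the coordinate covariance are proportional to the squared
    # semi-axes of the moment-matched ellipse; eigvalsh returns them ascending.
    eigenvalues = np.linalg.eigvalsh(np.cov(coords))
    return float(np.sqrt(eigenvalues[1] / eigenvalues[0]))


# Synthetic test shapes (illustrative only).
yy, xx = np.mgrid[:41, :41]
disk = (yy - 20) ** 2 + (xx - 20) ** 2 <= 15 ** 2  # round object
rod = np.zeros((41, 41), dtype=bool)
rod[18:22, 4:36] = True  # 4 x 32 pixel rectangle, i.e. a rod-like object

print("disk:", aspect_ratio(disk))
print("rod: ", aspect_ratio(rod))
```

      In this sketch a round object gives a ratio close to 1 and a rod-like object gives a ratio well above 1, whereas an irregular but roughly isotropic shape would also stay near 1, which is why this parameter alone may not separate every morphology.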

      Figure 5: How did they assign C, P and D3 for two chromosomes? The assignment seems obvious in some cases, but not in other cases (e.g. in the image of H2-AID#2 +IAA, two D3s can be connected to two Ps in the other way). They may have avoided line crossing between two C-P-D3 assignments, but can this be justified when the CT might be disorganized e.g. by condensin II depletion?

      This comment is well taken. As the reviewer suspected, we avoided line crossing between two sets of assignments. Whenever there was ambiguity, such images were excluded from the analysis. Because most chromosome territories derived from two homologous chromosomes are well separated even under the depleted conditions as shown in Fig. 6C, we did not encounter major difficulties in making assignments based on the criteria described above. We therefore remain confident that our conclusion is valid.

      That said, we acknowledge that our assignments of the FISH images may not be entirely objective. We have added this point to the “Limitations of the study” section at the end of the Discussion.

      Figure 6F: The mean is not indicated on the right-hand side graph, in contrast to other similar graphs. Is this an error?

      We apologize for having caused this confusion. First, we would like to clarify that the right panel of Fig. 6F should be interpreted together with the left panel, unlike the seemingly similar plots shown in Figs. 6G and 6H. In the left panel of Fig. 6F, the percentages of CTs that contact the nucleolus are shown in grey, whereas those that do not are shown in white. All CTs classified in the “non-contact” population (white) have a value of zero in the right panel, represented by the bars at 0 (i.e., each bar corresponds to a collection of dots having a zero value). In contrast, each CT in the “contact” population (grey) has a unique contact ratio value in the right panel. Because the right panel consists of two distinct groups, we reasoned that placing mean or median bars would not be appropriate. This is why no mean or median bars were shown in the right panel (the same is true for Fig. S5 A and B).

      That said, for the reviewer’s reference, we have placed median bars in the right panel (see below). In the six cases of H2#2 (-/+IAA), Rad21#2 (-/+IAA), Double#2 (-IAA), and Double#3 (-IAA), the median bars are located at zero (note that in these cases the mean bars [black] completely overlap with the “bars” derived from the data points [blue and magenta]). In the two cases of Double#2 (+IAA) and Double#3 (+IAA), they are placed at values of ~0.15. Statistically significant differences between -IAA and +IAA are observed only in Double#2 and Double#3, as indicated by the P-value shown on the top of the panel. Thus, we are confident in our conclusion that CTs undergo severe deformation in the absence of both condensin II and cohesin.

      Figure S1A: The two FACS profiles for Double-AID #3 Release-2 may be mixed up between -IAA and +IAA.

      The reviewer is right. This inadvertent error has been corrected.

      The method section explains that 'circularity' shows 'how closely the shape of an object approximates a perfect circle (with a value of 1 indicating a perfect circle), calculated from the segmented regions'. It would be helpful to provide further methodological details about it.

      We have added further explanations regarding the circularity in Materials and Methods together with a citation (two added sentences are underlined below):

      To analyze the morphology of nuclei, CTs, and nucleoli, we measured “circularity,” a morphological index that quantifies how closely the shape of an object approximates a perfect circle (value = 1). Circularity was defined as 4π × Area/Perimeter², where both the area and perimeter of each segmented object were obtained using ImageJ. This index ranges from 0 to 1, with values closer to 1 representing more circular objects and lower values corresponding to elongated or irregular shapes (Chen et al, 2017).

      Chen, B., Y. Wang, S. Berretta and O. Ghita. 2017. Poly Aryl Ether Ketones (PAEKs) and carbon-reinforced PAEK powders for laser sintering. J Mater Sci 52:6004-6019.
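      The circularity definition quoted above can be illustrated with a short sketch. It assumes only that an area and a perimeter are already available for each segmented object (for example, exported from ImageJ); the function name and the idealized test shapes are our own and are not taken from the manuscript.

```python
import math


def circularity(area: float, perimeter: float) -> float:
    """Circularity index 4*pi*Area / Perimeter**2 (1.0 for a perfect circle)."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# Idealized shapes for illustration (not measurements from the study).
radius = 5.0
circle = circularity(math.pi * radius ** 2, 2.0 * math.pi * radius)
square = circularity(1.0 * 1.0, 4.0 * 1.0)          # unit square
rod = circularity(1.0 * 10.0, 2.0 * (1.0 + 10.0))   # 1 x 10 rectangle

print(f"circle: {circle:.3f}")
print(f"square: {square:.3f}")
print(f"rod:    {rod:.3f}")
```

      In this sketch the index falls from 1 for the circle to about 0.785 for the square and about 0.26 for the 1 × 10 rod. An irregular outline with a long perimeter would lower it in the same way, so a low circularity on its own does not indicate whether an object is elongated or irregular.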

      Reviewer #1 (Significance (Required)):

      Ono et al addressed how condensin II and cohesin work to define chromosome territories (CT) in human cells. They used FISH to assess the status of CT. They found that condensin II depletion leads to lengthwise elongation of G1 chromosomes, while double depletion of condensin II and cohesin leads to CT overlap and morphological defects. Although the requirement of condensin II in shortening G1 chromosomes was already shown by Hoencamp et al 2021, the cooperation between condensin II and cohesin in CT regulation is a new finding. They also demonstrated that cohesin and condensin II are involved in G2 chromosome regulation on a smaller and larger scale, respectively. Though such roles in cohesin might be predictable from its roles in organizing TADs, it is a new finding that the two work on a different scale on G2 chromosomes. Overall, this is technically solid work, which reports new findings about how condensin II and cohesin cooperate in organizing G1 and G2 chromosomes.

      See our reply above.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Ono et al use a variety of imaging and genetic (AID) depletion approaches to examine the roles of condensin II and cohesin in the reformation of interphase genome architecture in human HCT116 cells. Consistent with previous literature, they find that condensin II is required for CENP-A dispersion in late mitosis/early G1. Using in situ FISH at the centromere/q arm of chromosome 12 they then establish that condensin II removal causes lengthwise elongation of chromosomes that, interestingly, can be suppressed by cohesin removal. To better understand changes in whole-chromosome morphology, they then use whole chromosome painting to examine chromosomes 18 and 19. In the absence of condensin II, cells effectively fail to reorganise their chromosomes from rod-like structures into spherical chromosome territories (which may explain why CENP-A dispersion is suppressed). Cohesin is not required for spherical CT formation, suggesting condensin II is the major initial driver of interphase genome structure. Double depletion results in complete disorganisation of chromatin, leading the authors to conclude that a typical cell cycle requires orderly 'handover' from the mitotic to interphase genome organising machinery. The authors then move on to G2 phase, where they use a variety of different FISH probes to assess alterations in chromosome structure at different scales. They thereby establish that perturbation of cohesin or condensin II influences local and longer range chromosome structure, respectively. The effects of condensin II depletion become apparent at a genomic distance of 20 Mb, but are negligible either below or above. The authors repeat the G1 depletion experiment in G2 and now find that condensin II and cohesin are individually dispensable for CT organisation, but that dual depletion causes CT collapse. This rather implies that there is cooperation rather than handover per se.

      Overall this study is a broadly informative multiscale investigation of the roles of SMC complexes in organising the genome of postmitotic cells, and solidifies a potential relationship between condensin II and cohesin in coordinating interphase genome structure. The deeper investigation of the roles of condensin II in establishing chromosome territories and intermediate range chromosome structure in particular is a valuable and important contribution, especially given our incomplete understanding of what functions this complex performs during interphase.

      We sincerely appreciate the reviewer’s supportive comments. The reviewer has correctly acknowledged both the current gaps in our understanding of the role of condensin II in interphase chromosome organization and our new findings on the collaborative roles of condensin II and cohesin in establishing and maintaining interphase chromosome territories.

      Major comments:

      In general the claims and conclusions of the manuscript are well supported by multiscale FISH labelling. An important absent control is western blotting to confirm protein depletion levels. Currently only fluorescence is used as a readout for the efficiency of the AID depletion, and we know from prior literature that even small residual quantities of SMC complexes are quite effective in organising chromatin. I would consider a western blot a fairly straightforward and important technical control.

      Let us explain why we used immunofluorescence measurements to evaluate the efficiency of depletion. In our current protocol for synchronizing at the M-to-G1 transition, ~60% of control and H2-depleted cells, and ~30% of Rad21-depleted and co-depleted cells, are successfully synchronized in G1 phase. The apparently lower synchronization efficiency in the latter two groups is attributable to the well-documented mitotic delay caused by cohesin depletion. From these synchronized populations, early G1 cells were selected based on their characteristic morphologies (see the legend of Fig. 1C). In this way, we analyzed an early G1 cell population that had completed mitosis without chromosome segregation defects. We acknowledge that this represents a technically challenging aspect of M-to-G1 synchronization in HCT116 cells, whose synchronization efficiency is limited compared with that of HeLa cells. Nevertheless, this approach constitutes the most practical strategy currently available. Hence, immunofluorescence provides the only feasible means to evaluate depletion efficiency under these conditions.

      Although immunoblotting can, in principle, be applied to G2-arrested cell populations, we do not believe that information obtained from such experiments would affect the main conclusions of the current study. Please note that we carefully designed and performed all experiments with appropriate controls: H2 depletion, RAD21 depletion, and double depletion, with outcomes confirmed using independent cell lines (Double-AID#2 and Double-AID#3) whenever deemed necessary.

      We fully acknowledge the technical limitations associated with the AID-mediated depletion techniques, which are now described in the section entitled “Limitations of the study” at the end of the Discussion. Nevertheless, we emphasize that these limitations do not compromise the validity of our findings.

      I find the point on handover as a mechanism for maintaining CT architecture somewhat ambiguous, because the authors find that the dependence simply switches from condensin II to both condensin II and cohesin, between G1 and G2. To me this implies augmented cooperation rather than handover. I have two further suggestions, both of which I would strongly recommend but would consider desirable but 'optional' according to review commons guidelines.

      First of all, we would like to clarify a possible misunderstanding regarding the phrase “handover as a mechanism for maintaining CT architecture somewhat ambiguous”. In the original manuscript, we proposed handover as a mechanism for establishing G1 chromosome territories, not for maintaining CTs.

      That said, we take this comment very seriously, especially because Reviewer #1 also expressed the same concern. Please see our reply to Reviewer #1 (Major point).

      In brief, we agree with the reviewer that the word “handover” may not be appropriate to describe the functional relationship between condensin II and cohesin during the M-to-G1 transition. In the revised manuscript, we have avoided the use of the word “handover”, replacing it with “interplay”. It should be emphasized, however, that given their distinct chromosome-binding kinetics, the cooperation of the two SMC complexes during the M-to-G1 transition is qualitatively different from that observed in G2. Therefore, the central conclusion of the present study remains unchanged.

      For example, a sentence in Abstract has been changed as follows:

      a functional interplay between condensin II and cohesin during the mitosis-to-G1 transition is critical for establishing chromosome territories (CTs) in the newly assembling nucleus.

      Firstly, the depletions are performed at different stages of the cell cycle but have different outcomes. The authors suggest this is because handover is already complete, but an alternative possibility is that the phenotype is masked by other changes in chromosome structure (e.g. duplication/catenation). I would be very curious to see, for example, how the outcome of this experiment would change if the authors were to repeat the depletions in the presence of a topoisomerase II inhibitor.

      The reviewer’s suggestion here is somewhat vague, and it is unclear to us what rationale underlies the proposed experiment or what meaningful outcomes could be anticipated. Does the reviewer suggest that we perform topo II inhibitor experiments both during the M-to-G1 transition and in G2 phase, and then compare the outcomes between the two conditions?

      For the M-to-G1 transition, Hildebrand et al. (2024) have already reported such experiments. They used a topo II inhibitor to provide evidence that mitotic chromatids are self-entangled and that the removal of these mitotic entanglements is required to establish a normal interphase nucleus. Our own preliminary experiments (not presented in the current manuscript) showed that ICRF treatment of cells undergoing the M-to-G1 transition did not affect post-mitotic centromere dispersion. The same treatment also had little effect on the suppression of centromere dispersion observed in condensin II-depleted cells.

      Under G2-arrested conditions, because chromosome territories are largely individualized, we would expect topo II inhibition to affect only the extent of sister catenation, which is not the focus of our current study. We anticipate that inhibiting topo II in G2 would have only a marginal effect, if any, on the maintenance of chromosome territories detectable by our current FISH approaches.

      In any case, we consider the suggested experiment to be beyond the scope of the present manuscript, which focuses on the collaborative roles of condensin II and cohesin as revealed by multi-scale FISH analyses.

      Secondly, if the author's claim of handover is correct then one (not exclusive) possibility is that there is a relationship between condensin II and cohesin loading onto chromatin. There does seem to be a modest co-dependence (e.g. fig S4 and S7), could the authors comment on this?

      First of all, we wish to point out the reviewer’s confusion between the G2 experiments and the M-to-G1 experiments. Figs. S4 and S7 concern experiments using G2-arrested cells, not M-to-G1 cells in which a possible handover mechanism is discussed. Based on Fig. 1, in which the extent of depletion in M-to-G1 cells was tested, no evidence of “co-dependence” between H2 depletion and RAD21 depletion was observed.

      That said, as the reviewer correctly points out, we acknowledge the presence of marginal yet statistically significant reductions in the RAD21 signal upon H2 depletion (and vice versa) in G2-arrested cells (Figs. S4 and S7).

      Another control experiment here would be to treat fully WT cells with IAA and test whether non-AID labelled H2 or RAD21 dip in intensity. If they do not, then perhaps there's a causal relationship between condensin II and cohesin levels?

      Following the reviewer’s suggestion, we tested whether IAA treatment causes an unintended decrease in the H2 or RAD21 signals in G2-arrested cells, and found that it does not (see the attached figure below).

      Thus, these data indicate that there is a modest functional interdependence between condensin II and cohesin in G2-arrested cells. For instance, condensin II depletion may modestly destabilize chromatin-bound cohesin (and vice versa). However, we note that these effects are minor and do not affect the overall conclusions of the study. In the revised manuscript, we have described these potentially interesting observations briefly as a note in the corresponding figure legends (Fig. S4).

      I recognise this is something considered in Brunner et al 2025 (JCB), but in their case they depleted SMC4 (so all condensins are lost or at least dismantled). Might bear further investigation.

      Methods:

      Data and methods are described in reasonable detail, and a decent number of replicates/statistical analyses have been performed. Documentation of the cell lines used could be improved. The actual cell line is not mentioned once in the manuscript. Although it is referenced, I'd recommend including the identity of the cell line (HCT116) in the main text when the cells are introduced and also in the relevant supplementary tables. Will make it easier for readers to contextualise the findings.

      We apologize for the omission of important information regarding the parental cell line used in the current study. The information has been added to Materials and Methods as well as the resource table.

      Minor comments:

      Overall the manuscript is well-written and well presented. In the introduction it is suggested that no experiment has established a causal relationship between human condensin II and chromosome territories, but this is not correct, Hoencamp et al 2021 (cell) observed loss of CTs after condensin II depletion. Although that manuscript did not investigate it in as much detail as the present study, the fundamental relationship was previously established, so I would encourage the authors to revise this statement.

      We are somewhat puzzled by this comment. In the original manuscript, we explicitly cited Hoencamp et al (2021) in support of the following sentences:


      (Lines 78-83 in the original manuscript)

      *Moreover, high-throughput chromosome conformation capture (Hi-C) analysis revealed that, under such conditions, chromosomes retain a parallel arrangement of their arms, reminiscent of the so-called Rabl configuration (Hoencamp et al., 2021). These findings indicate that the loss or impairment of condensin II during mitosis results in defects in post-mitotic chromosome organization.*


      That said, to make the sentences even more precise, we have made the following revision in the manuscript.


      (Lines 78- 82 in the revised manuscript)

      *Moreover, high-throughput chromosome conformation capture (Hi-C) analysis revealed that, under such conditions, chromosomes retain a parallel arrangement of their arms, reminiscent of the so-called Rabl configuration (Hoencamp et al., 2021). These findings, together with cytological analyses of centromere distributions, indicate that the loss or impairment of condensin II during mitosis results in defects in post-mitotic chromosome organization.*


      The following statement was intended to explain our current understanding of the maintenance of chromosome territories. Because Hoencamp et al (2021) did not address the maintenance of CTs, we have kept this sentence unchanged.


      (Lines 100-102 in the original manuscript)

      Despite these findings, there is currently no evidence that either condensin II, cohesin, or their combined action contributes to the maintenance of CT morphology in mammalian interphase cells (Cremer et al., 2020).


      Reviewer #2 (Significance (Required)):

      General assessment:

      Strengths: the multiscale investigation of genome architecture at different stages of interphase allow the authors to present convincing and well-analysed data that provide meaningful insight into local and global chromosome organisation across different scales.

      Limitations:

      As suggested in major comments.

      Advance:

      Although the role of condensin II in generating chromosome territories, and the roles of cohesin in interphase genome architecture are established, the interplay of the complexes and the stage specific roles of condensin II have not been investigated in human cells to the level presented here. This study provides meaningful new insight in particular into the role of condensin II in global genome organisation during interphase, which is much less well understood compared to its participation in mitosis.

      Audience:

      Will contribute meaningfully and be of interest to the general community of researchers investigating genome organisation and function at all stages of the cell cycle. Primary audience will be cell biologists, geneticists and structural biochemists. Importance of genome organisation in cell/organismal biology is such that within this grouping it will probably be of general interest.

      My expertise is in genome organization by SMCs and chromosome segregation.

      We appreciate the reviewer’s supportive comments. As the reviewer fully acknowledges, this study is the first systematic survey of the collaborative role of condensin II and cohesin in establishing and maintaining interphase chromosome territories. In particular, multi-scale FISH analyses have enabled us to clarify how the two SMC protein complexes contribute to the maintenance of G2 chromosome territories through their actions at different genomic scales. As the reviewer notes, we believe that the current study will appeal to a broad readership in cell and chromosome biology. The limitations of the current study mentioned by the reviewer are addressed in our reply above.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary:

      The manuscript “Condensin II collaborates with cohesin to establish and maintain interphase chromosome territories" investigates how condensin II and cohesin contribute to chromosome organization during the M-to-G1 transition and in G2 phase using published auxin-inducible degron (AID) cell lines which render the respective protein complexes nonfunctional after auxin addition. In this study, a novel degron cell line was established that enables the simultaneous depletion of both protein complexes, thereby facilitating the investigation of synergistic effects between the two SMC proteins. The chromosome architecture is studied using fluorescence in situ hybridization (FISH) and light microscopy. The authors reproduce a number of already published data and also show that double depletion causes during the M-to-G1 transition defects on chromosome territories, producing expanded, irregular shapes that obscure condensin II-specific phenotypes. Findings in G2 cells point to a new role of condensin II for chromosome conformation at a scale of ~20Mb. Although individual depletion has minimal effects on large-scale CT morphology in G2, combined loss of both complexes produces marked structural abnormalities, including irregular crescent-shaped CTs displaced toward the nucleolus and increased nucleolus-CT contact. The authors propose that condensin II and cohesin act sequentially and complementarily to ensure proper post-mitotic CT formation and maintain chromosome architecture across genomic scales.

      We greatly appreciate the reviewer’s supportive comments. The reviewer has accurately recognized our new findings concerning the collaborative roles of condensin II and cohesin in the establishment and maintenance of interphase chromosome territories.

      Concerns about statistics:

      • The authors provide the information on how many cells are analyzed but not the number of independent experiments. My concern is that there might be variations in synchronization of the cell population and in the subsequent preparation (FISH) affecting the final result.

      We appreciate the reviewer’s important comment regarding the biological reproducibility of our experiments. As the reviewer correctly points out, variations in cell-cycle synchronization and FISH sample preparation can occur across experiments. To address this concern, we repeated the key experiments supporting our main conclusions (Figs. 3 and 6) two additional times, resulting in three independent biological replicates in total. All replicate experiments reproduced the major observations from the original analyses. These results further substantiated our original conclusion, despite the inevitable variability arising from cell synchronization or sample preparation in this type of experiment. In the revised manuscript, we have now explicitly indicated the number of biological replicates in the corresponding figures.

      The analyses of chromosome-arm conformation shown in Fig. 5 were already performed in three independent rounds of experiments, as noted in the original submission. In addition, similar results were already obtained in other analyses reported in the manuscript. For example, centromere dispersion was quantified using an alternative centromere detection method (related to Fig. 1), and distances between specific chromosomal sites were measured using different locus-specific probes (related to Figs. 2 and 4). In both cases, the results were consistent with those presented in the manuscript.

      • Statistically the authors analyze the effect of cells with induced degron vs. vehicle control (non-induced). However, the biologically relevant question is whether the data differ between cell lines when the degron system is induced. This is not tested here (cf. major concern 2 and 3).

      See our reply to major concerns 2 and 3.

      • Some journals ask for blinded analysis of the data, which might make sense here as manual steps are involved in the data analysis (e.g. line 626 / 627 the convex hull of the signals was manually delineated, line 635 / 636 Chromosome segmentation in FISH images was performed using individual thresholding). However, personally I have no doubts on the correctness of the work.

      We thank the reviewer for pointing out that some steps in our data analysis were performed manually, such as delineating the convex hull of signals and segmenting chromosomes in FISH and IF images using individual thresholds. These manual steps were necessary because signal intensities vary among cells and chromosomes, making fully automated segmentation unreliable. To ensure objectivity, we confirmed that the results were consistent across two independently established double-depletion cell lines, which produced essentially identical findings. In addition, we repeated the key experiments underpinning our main conclusions (Figs. 3 and 6) two additional times, and the results were fully consistent with the original analyses. Therefore, we are confident that our current data analysis approach does not compromise the validity of our conclusions. Finally, we appreciate the reviewer’s kind remark that there is no doubt regarding the correctness of our work.

      Major concerns:

      • Degron induction appears to delay the transition from M to G1 in Rad21-AID#1 and Double-AID#1 cells, as shown in Fig. S1. After auxin treatment, more cells exhibit a G2 phenotype than in an untreated population. What are the implications of this for the interpretation of the experiments?

      In our protocol shown in Fig. 1C, cells were released into mitosis after G2 arrest, and IAA was added 30 min after release. It is well established that cohesin depletion causes a prometaphase delay due to spindle checkpoint activation (e.g., Vass et al, 2003, Curr Biol; Toyoda and Yanagida, 2006, MBoC; Peters et al, 2008, Genes Dev), which explains why cells with 4C DNA content accumulated, as judged by FACS (Fig. S1). The same was true for doubly depleted cells. However, a fraction of cells that escaped this delay progressed through mitosis and entered the G1 phase of the next cell cycle. We selected these early G1 cells and used them for downstream analyses. This experimental procedure was explicitly described in the legends of Fig. 1C and Fig. S1A as follows:

      (Lines 934-937; Legend of Fig. 1C)

      From the synchronized populations, early G1 cells were selected based on their characteristic morphologies (i.e., pairs of small post-mitotic cells) and subjected to downstream analyses. Based on the measured nuclear sizes (Fig. S2 G), we confirmed that early G1 cells were appropriately selected.

      (Lines 1114-1119; Legend of Fig. S1A)

      In this protocol, ~60% of control and H2-depleted cells, and ~30% of Rad21-depleted and co-depleted cells, were successfully synchronized in G1 phase. The apparently lower synchronization efficiency in the latter two groups is attributable to the well documented mitotic delay caused by cohesin depletion (Hauf et al., 2005; Haarhuis et al., 2013; Perea-Resa et al., 2020). From these synchronized populations, early G1 cells were selected based on their characteristic morphologies (see the legend of Fig. 1 C).


      Thus, using this protocol, we analyzed an early G1 cell population that had completed mitosis without chromosome segregation defects. We acknowledge that this represents a technically challenging aspect of synchronizing cell-cycle progression from M to G1 in HCT116 cells, whose synchronization efficiency is limited compared with that of HeLa cells. Nevertheless, this approach constitutes the most practical strategy currently available.

      • Line 178 "In contrast, cohesin depletion had a smaller effect on the distance between the two site-specific probes compared to condensin II depletion (Fig. 2, C and E)." The data in Fig. 2 E show both a significant effect of H2 and a significant effect of RAD21 depletion. Whether the absolute difference in effect size between the two conditions is truly relevant is difficult to determine, as the distribution of the respective control groups also appears to be different.

      This comment is well taken. Reviewer #1 has made a comment on the same issue. See our reply to Reviewer #1 (Other points, Figure 2E).

      In brief, in the current study, we should focus on the differences between -IAA and +IAA within each cell line, rather than comparing the -IAA conditions across different cell lines. In this sense, a sentence in the original manuscript (lines 178-180) was misleading. In the revised manuscript, we have modified the corresponding and subsequent sentence as follows:

      Although cohesin depletion had a marginal effect on the distance between the two site-specific probes (Fig. 2, C and E), double depletion did not result in a significant change (Fig. 2, D and E), consistent with the partial restoration of centromere dispersion (Fig. 1G).

      • In Figures 3, S3 and related text in the manuscript I cannot follow the authors' argumentation, as H2 depletion alone leads to a significant increase in the CT area (Chr. 18, Chr. 19, Chr. 15). Similar to Fig. 2, the authors argue about the different magnitude of the effect (H2 depletion vs double depletion). Here, too, appropriate statistical tests or more suitable parameters describing the effect should be used. I also cannot fully follow the argumentation regarding chromosome elongation, as double depletion in Chr. 18 and Chr. 19 also leads to a significantly reduced circularity. Therefore, the schematic drawing Fig. 3 H (double depletion) seems very suggestive to me.

      This comment is related to the comment above (Major comment #2). See our reply to Reviewer #1 (Other points, Figure 2E).

      It should be noted that, in Figure 3 (unlike in Figure 2), we did not compare the different magnitudes of the effect observed between H2 depletion and double depletion. Thus, the reviewer’s comment that “Similar to Fig. 2, the authors argue about the different magnitude of the effect (H2 depletion vs double depletion)” does not accurately reflect our description.

      Moreover, while the distance between two specific loci (Fig. 2E) and CT circularity (Fig. 3G) are intuitively related, they represent distinct parameters. It is therefore not unexpected that double depletion resulted in apparently different outcomes for the two measurements, and the reviewer’s counter-argument is not strictly applicable here.

      That said, we agree with the reviewer that our descriptions here need to be clarified.

      The differences between H2 depletion and double depletion are two-fold: (1) centromere dispersion is suppressed upon H2 depletion, but not upon double depletion (Fig 1G); (2) the distance between Cen 12 and 12q15 increased upon H2 depletion, but not upon double depletion (Fig 2E).

      We have decided to remove the “homologous pair overlap” panel (formerly Fig. 3E) from the revised manuscript. Accordingly, the corresponding sentence has been deleted from the main text. Instead, we have added a new panel of “aspect ratio”, defined as the ratio of the major to the minor axis (new Fig. 3F). While this intuitive parameter was altered upon condensin II depletion and double depletion, again, we acknowledge that it is not sufficient to convincingly distinguish between the elongated and cloud-like phenotypes proposed in the original manuscript. For these reasons, in the revised manuscript, we have toned down our statements regarding the differences in CT morphology between the two conditions. Nonetheless, together with the data from Figs. 1 and 2, it is clear that the Rabl configuration observed upon condensin II depletion is further exacerbated in the absence of cohesin. Accordingly, we have modified the main text and the cartoon (Fig 3H) to more accurately depict the observations summarized above.
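      For readers who want to see how such a shape parameter can be computed, the following is a minimal sketch and not the analysis pipeline used in the study. It assumes that a CT has already been segmented into a binary mask, and that the major and minor axes are those of the ellipse with the same second moments as the mask, which is one common convention. The function name `aspect_ratio` and the toy masks are hypothetical.

```python
import math


def aspect_ratio(mask):
    """Major/minor axis ratio of the ellipse matching the mask's second moments.

    `mask` is a list of rows; truthy entries are foreground pixels.
    Assumption: each pixel is treated as a unit square, which adds 1/12
    to each diagonal second moment.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    if n == 0:
        raise ValueError("mask contains no foreground pixels")

    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n

    # Central second moments of the pixel coordinates.
    sxx = sum((x - mean_x) ** 2 for x, _ in pts) / n + 1.0 / 12.0
    syy = sum((y - mean_y) ** 2 for _, y in pts) / n + 1.0 / 12.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n

    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    lam_major = half_trace + spread
    lam_minor = half_trace - spread

    # Axis lengths scale with the square root of the eigenvalues.
    return math.sqrt(lam_major / lam_minor)


# Toy example: an 8 x 2 pixel rectangle.
rectangle = [[1] * 8, [1] * 8]
print(round(aspect_ratio(rectangle), 3))  # 4.0
```

      A value of 1 indicates an isotropic shape and larger values indicate elongation. A single ratio of this kind does not capture how irregular the outline is, which is consistent with the caution expressed above about distinguishing elongated from cloud-like territories.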

      • Fig. 5 and accompanying text. I agree with the authors that this is a significant and very interesting effect. However, I believe the sharp bends are in most cases an artifact caused by the maximum intensity projection. I tried to illustrate this effect in two photographs: Reviewer Fig. 1, side view, and Reviewer Fig. 2, same situation top view (https://cloud.bio.lmu.de/index.php/s/77npeEK84towzJZ). As I said, in my opinion, there is a significant and important effect; the authors should simply adjust the description.

      This comment is well taken. We appreciate the reviewer’s effort to help clarify our original observations. We have therefore added a new section entitled “Limitations of the study” to explicitly describe the constraints of our current approach. That said, as the reviewer also acknowledges, our observations remain valid because all experiments were performed with appropriate controls.
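      The geometric point raised by the reviewer can be illustrated with a toy calculation. This is a minimal sketch with made-up coordinates, not data from the study: a fiber that runs mostly along the optical (z) axis and is nearly straight in 3D can look like a hairpin once the z coordinate is discarded, as happens in a maximum intensity projection. The helper names `turn_angle_deg` and `project_xy` are hypothetical.

```python
import math


def turn_angle_deg(p0, p1, p2):
    """Angle in degrees by which a path turns at p1 (0 = straight, 180 = hairpin)."""
    d1 = [b - a for a, b in zip(p0, p1)]
    d2 = [b - a for a, b in zip(p1, p2)]
    n1 = math.sqrt(sum(c * c for c in d1))
    n2 = math.sqrt(sum(c * c for c in d2))
    if n1 == 0 or n2 == 0:
        raise ValueError("consecutive points must differ")
    cos_angle = sum(a * b for a, b in zip(d1, d2)) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def project_xy(point):
    """Drop the z coordinate, mimicking a projection along the optical axis."""
    return point[:2]


# Hypothetical fiber running mostly along z with a slight sideways wobble.
path3d = [(0.0, 0.0, 0.0), (0.1, 0.0, 1.0), (0.0, 0.0, 2.0)]
path2d = [project_xy(p) for p in path3d]

print(round(turn_angle_deg(*path3d), 1))  # about 11.4: a gentle bend in 3D
print(round(turn_angle_deg(*path2d), 1))  # 180.0: an apparent hairpin in 2D
```

      How often this situation arises in flattened nuclei is an empirical question that the sketch does not address; it only shows that projected bend angles are not a reliable proxy for 3D bend angles.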

      Minor concerns:

      • I would like to suggest proactively discussing possible artifacts that may arise from the harsh conditions during FISH sample preparation.

      We fully agree with the reviewer’s concerns. For FISH sample preparation, we used relatively harsh conditions, including (1) fixation under a hypotonic condition (0.3x PBS), (2) HCl treatment, and (3) a denaturation step. We recognize that these procedures inevitably affect the preservation of the original structure; however, they are unavoidable in the standard FISH protocol. We also acknowledge that our analyses were limited to 2D structures based on projected images, rather than full 3D reconstructions. These technical limitations are now explicitly described in a new section entitled “Limitations of the study”, and the technical details are provided in Materials and Methods.

      • It would be helpful if the authors could provide the original data (microscopic image stacks) for download.

      We thank the reviewer for this suggestion and understand that providing the original image stacks could be of interest to readers. We agree that if the nuclei were perfectly spherical, as is the case for example in lymphocytes, 3D image stacks would contain much more information than 2D projections. However, as is typical for adherent cultured cells, including the HCT116-derived cells used in this study, the nuclei are flattened due to cell adhesion to the culture dish, with a thickness of only about one-tenth of the nuclear diameter (10–20 μm). Considering also the inevitable loss of structural preservation during FISH sample preparation, we were concerned that presenting 3D images might confuse rather than clarify. We therefore believe that representing the data as 2D projections, while explicitly acknowledging the technical limitations, provides the clearest and most interpretable presentation of our results. These limitations are now described in a new section of the manuscript.

      • The authors use a blind deconvolution algorithm to improve image quality. It might be helpful to test other methods for this purpose (optional).

      We thank the reviewer for this valuable suggestion and fully agree that it is a valid point. We recognize that alternative image enhancement methods can offer advantages, particularly for smaller structures or when multiple probes are analyzed simultaneously. In our study, however, the focus was on detecting whole chromosome territories (CTs) and specific chromosomal loci, which can be visualized clearly with our current FISH protocol combined with blind deconvolution. We therefore believe that the image quality we obtained is sufficient to support the conclusions of this manuscript.

      Reviewer #3 (Significance (Required)):

      Advance:

      Ono et al. address the important question of how the complex pattern of chromatin is reestablished after mitosis and maintained during interphase. In addition to affinity interactions (1,2), it is known that cohesin plays an important role in the formation and maintenance of chromosome organization in interphase (3). However, current knowledge does not explain all known phenomena. Even with complete loss of cohesin, TAD-like structures can be recognized at the single-cell level (4), and higher structures such as chromosome territories are also retained (5). The function of condensin II during mitosis is another important factor that affects chromosome architecture in the following G1 phase (6). Although condensin II is present in the cell nucleus throughout interphase, very little is known about the role of this protein in this phase of the cell cycle. This is where the present publication comes in, with a new double degron cell line in which essential subunits of cohesin AND condensin can be degraded in a targeted manner. I find the data from the experiments in the G2 phase most interesting, as they suggest a previously unknown involvement of condensin II in the maintenance of larger chromatin structures such as chromosome territories.

      The experiments regarding the M-G1 transition are less interesting to me, as it is known that condensin II deficiency in mitosis leads to elongated chromosomes (Rabl configuration)(6), and therefore the double degradation of condensin II and cohesin describes the effects of cohesin on an artificially disturbed chromosome structure.

      For further clarification, we provide below a table summarizing previous studies relevant to the present work. We wish to emphasize three novel aspects of the present study. First, newly established cell lines designed for double depletion enabled us to address questions that had remained inaccessible in earlier studies. Second, to our knowledge, no study has previously reported condensin II depletion, cohesin depletion and double depletion in G2-arrested cells. Third, the present study represents the first systematic comparison of two different stages of the cell cycle using multiscale FISH under distinct depletion conditions. Although the M-to-G1 part of the present study partially overlaps with previous work, it serves as an important prelude to the subsequent investigations. We are confident that the reviewer will also acknowledge this point.

      | Cell cycle | Condensin II depletion | Cohesin depletion | Double depletion |
      |---|---|---|---|
      | M-to-G1 | Hoencamp et al (2021); Abramo et al (2019); Brunner et al (2025); this study | Schwarzer et al (2017); Wutz et al (2017); this study | this study |
      | G2 | this study | this study | this study |

      Hoencamp et al (2021): Hi-C and imaging (CENP-A distribution)

      Abramo et al (2019): Hi-C and imaging

      Brunner et al (2025): mostly imaging (chromatin tracing)

      Schwarzer et al (2017); Wutz et al (2017): Hi-C

      this study: imaging (multi-scale FISH)

      General limitations:

      (1) Single cell imaging of chromatin structure typically shows only minor effects which are often obscured by the high (biological) variability. This holds also true for the current manuscript (cf. major concern 2 and 3).

      See our reply above.

      (2) A common concern are artefacts introduced by the harsh conditions of conventional FISH protocols (7). The authors use a method in which the cells are completely dehydrated, which probably leads to shrinking artifacts. However, differences between samples stained using the same FISH protocol are most likely due to experimental variation and not an artefact (cf. minor concern 1).

      See our reply above.

      • The anisotropic optical resolution (x-, y- vs. z-) of widefield microscopy (and most other light microscopic techniques) might lead to misinterpretation of the imaged 3D structures. This seems to be the case in the current study (cf. major concern 4).

      See our reply above.

      • In the present study, the cell cycle was synchronized. This requires the use of inhibitors such as the CDK1 inhibitor RO-3306. However, CDK1 has many very different functions (8), so unexpected effects on the experiments cannot be ruled out.

      The current approaches involving FISH inevitably require cell cycle synchronization. We believe that the use of the CDK1 inhibitor RO-3306 to arrest the cell cycle at G2 is a reasonable choice, although we cannot rule out unexpected effects arising from the use of the drug. This issue has now been addressed in the new section entitled “Limitations of the study”.

      Audience:

      The spatial arrangement of genomic elements in the nucleus and their (temporal) dynamics are of high general relevance, as they are important for answering fundamental questions, for example, in epigenetics or tumor biology (9,10). The manuscript from Ono et al. addresses specific questions, so its intended readership is more likely to be specialists in the field.

      We are confident that, given the increasing interest in the 3D genome and its role in regulating diverse biological functions, the current manuscript will attract the broad readership of leading journals in cell biology.

      About the reviewer:

      By training I'm a biologist with strong background in fluorescence microscopy and fluorescence in situ hybridization. In recent years, I have been involved in research on the 3D organization of the cell nucleus, chromatin organization, and promoter-enhancer interactions.

      We greatly appreciate the reviewer’s constructive comments on both the technical strengths and limitations of our fluorescence imaging approaches, which have been very helpful in revising the manuscript. As mentioned above, we have decided to add a special paragraph entitled “Limitations of the study” at the end of the Discussion section to discuss these issues.

      All questions regarding the statistics of angularly distributed data are beyond my expertise. The authors do not correct their statistical analyses for "multiple testing". Whether this is necessary, I cannot judge.

      We thank the reviewer for raising this important point. In our study, the primary comparisons were made between -IAA and +IAA conditions within the same cell line. Accordingly, the figures report P-values for these pairwise comparisons.

      For the distance measurements, statistical evaluations were performed in PRISM using ANOVA (Kruskal–Wallis test), and the P-values shown in the figures are based on these analyses (Fig. 1, G and H; Fig. 2 E; Fig. 3 F and G; Fig. 4 F; Fig. 6 F [right]–H; Fig. S2 B and G; Fig. S3 D and H; Fig. S5 A [right] and B [right]; Fig. S8 B). While the manuscript focuses on pairwise comparisons between -IAA and +IAA conditions within the same cell line, we also considered potential differences across cell lines as part of the same ANOVA framework, thereby ensuring that multiple testing was properly addressed. Because cell line differences are not the focus of the present study, the corresponding results are not shown.
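      As an illustration of the omnibus test named above, the following is a minimal pure-Python sketch of the Kruskal–Wallis H test applied to made-up numbers. It is not the PRISM workflow used for the manuscript, and it does not include any post hoc pairwise procedure, which the reply does not specify. The function names `kruskal_wallis` and `chi2_sf` are hypothetical.

```python
import math
from itertools import chain


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    # Odd df: start from df = 1 (an erfc term) and add the recurrence terms.
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return tail


def kruskal_wallis(*groups):
    """Return (H, p) for two or more non-empty groups, with tie correction."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")

    pooled = sorted(chain.from_iterable(groups))
    n_total = len(pooled)

    # Assign the average rank to each run of tied values.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        ties = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += ties ** 3 - ties
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3.0 * (n_total + 1)

    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction

    return h, chi2_sf(h, len(groups) - 1)


# Made-up distances for three hypothetical conditions.
h_stat, p_value = kruskal_wallis([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h_stat, 3), round(p_value, 4))  # 7.2 0.0273
```

      The chi-square approximation used for the p-value is the usual large-sample one and can be inaccurate for very small groups such as those in the toy example.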

      For the angular distribution analyses, we compared -IAA and +IAA conditions within the same cell line using the Mardia–Watson–Wheeler test; these analyses do not involve multiple testing (circular scatter plots; Fig. 5 C–E and Fig. S6 B, C, and E–H). In addition, to determine whether angular distributions exhibited directional bias under each condition, we applied the Rayleigh test to each dataset individually (Fig. 5 F and Fig. S6 I). As these tests were performed on a single condition, they are also not subject to the problem of multiple testing. Collectively, we consider that the statistical analyses presented in our manuscript appropriately account for potential multiple testing issues, and we remain confident in the robustness of the results.
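      To make the second of these tests concrete, the following is a minimal sketch of the Rayleigh test for directional bias, applied to made-up angles. It is an illustration only and not the authors' analysis. The p-value uses a standard small-sample approximation from the circular statistics literature, which is an assumption about how such a test is typically evaluated rather than a description of the software used in the study. The Mardia–Watson–Wheeler test is not shown. The function name `rayleigh_test` is hypothetical.

```python
import math


def rayleigh_test(angles_deg):
    """Return (mean resultant length, Z statistic, approximate p-value).

    Null hypothesis: the angles are uniformly distributed around the circle.
    """
    n = len(angles_deg)
    if n == 0:
        raise ValueError("need at least one angle")

    mean_cos = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    mean_sin = sum(math.sin(math.radians(a)) for a in angles_deg) / n

    r_bar = math.hypot(mean_cos, mean_sin)  # 0 = no bias, 1 = all identical
    z = n * r_bar ** 2

    # Series approximation to the p-value with small-sample correction terms.
    p = math.exp(-z) * (
        1.0
        + (2.0 * z - z ** 2) / (4.0 * n)
        - (24.0 * z - 132.0 * z ** 2 + 76.0 * z ** 3 - 9.0 * z ** 4)
        / (288.0 * n ** 2)
    )
    return r_bar, z, min(1.0, max(0.0, p))


# Made-up examples: one tightly clustered sample and one evenly spread sample.
clustered = [40 + i for i in range(20)]
spread = [0, 90, 180, 270]

print(rayleigh_test(clustered))
print(rayleigh_test(spread))
```

      A small p-value indicates a preferred direction, whereas evenly spread angles give a mean resultant length near zero and a p-value near 1. The test is sensitive to a single preferred direction and has little power against bimodal or axial patterns.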

      Literature

      1. Falk, M., Feodorova, Y., Naumova, N., Imakaev, M., Lajoie, B.R., Leonhardt, H., Joffe, B., Dekker, J., Fudenberg, G., Solovei, I. et al. (2019) Heterochromatin drives compartmentalization of inverted and conventional nuclei. Nature, 570, 395-399.
      2. Mirny, L.A., Imakaev, M. and Abdennur, N. (2019) Two major mechanisms of chromosome organization. Curr Opin Cell Biol, 58, 142-152.
      3. Rao, S.S.P., Huang, S.C., Glenn St Hilaire, B., Engreitz, J.M., Perez, E.M., Kieffer-Kwon, K.R., Sanborn, A.L., Johnstone, S.E., Bascom, G.D., Bochkov, I.D. et al. (2017) Cohesin Loss Eliminates All Loop Domains. Cell, 171, 305-320 e324.
      4. Bintu, B., Mateo, L.J., Su, J.H., Sinnott-Armstrong, N.A., Parker, M., Kinrot, S., Yamaya, K., Boettiger, A.N. and Zhuang, X. (2018) Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science, 362.
      5. Cremer, M., Brandstetter, K., Maiser, A., Rao, S.S.P., Schmid, V.J., Guirao-Ortiz, M., Mitra, N., Mamberti, S., Klein, K.N., Gilbert, D.M. et al. (2020) Cohesin depleted cells rebuild functional nuclear compartments after endomitosis. Nat Commun, 11, 6146.
      6. Hoencamp, C., Dudchenko, O., Elbatsh, A.M.O., Brahmachari, S., Raaijmakers, J.A., van Schaik, T., Sedeno Cacciatore, A., Contessoto, V.G., van Heesbeen, R., van den Broek, B. et al. (2021) 3D genomics across the tree of life reveals condensin II as a determinant of architecture type. Science, 372, 984-989.
      7. Beckwith, K.S., Ødegård-Fougner, Ø., Morero, N.R., Barton, C., Schueder, F., Tang, W., Alexander, S., Peters, J.-M., Jungmann, R., Birney, E. et al. (2023) Nanoscale 3D DNA tracing in single human cells visualizes loop extrusion directly in situ. BioRxiv, https://doi.org/10.1101/2021.04.12.439407.
      8. Massacci, G., Perfetto, L. and Sacco, F. (2023) The Cyclin-dependent kinase 1: more than a cell cycle regulator. Br J Cancer, 129, 1707-1716.
      9. Bonev, B. and Cavalli, G. (2016) Organization and function of the 3D genome. Nat Rev Genet, 17, 661-678.
      10. Dekker, J., Belmont, A.S., Guttman, M., Leshyk, V.O., Lis, J.T., Lomvardas, S., Mirny, L.A., O'Shea, C.C., Park, P.J., Ren, B. et al. (2017) The 4D nucleome project. Nature, 549, 219-226.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      We will provide the revised manuscript as a PDF with highlighted changes, the Word file with tracked changes linked to reviewer comments, and all updated figures.

      To address the reviewers' suggestions, we have conducted additional experiments that are now incorporated into new figures, or we have added new images to several existing figures where appropriate.

      Please note that all figures have been renumbered to improve clarity and facilitate cross-referencing throughout the text. As recommended by Referee #3, all figure legends have been thoroughly revised to reflect these updates and are now labeled following the standard A-Z panel format, enhancing readability and ensuring easier identification. In addition, all figure legends now include the sample size for each statistical analysis.

      For clarity and ease of reference, we provide below a comprehensive list of all figures included in the revised version. Figures that have undergone modifications are underlined.

      __Figure 1. The first spermatogenesis wave in prepuberal mice.__

      This figure now includes magnified images of representative spermatocytes and a summary schematic illustrating the timeline of spermatogenesis. It also presents the statistical analysis of spermatocyte quantification to support the visual data.

      __Figure 2. Cilia emerge across all stages of prophase I in spermatocytes during the first spermatogenesis wave.__

      The images in this figure remain unchanged from the original submission, but all graphs now present the statistical analysis of spermatocyte quantification.

      __Figure 3. Ultrastructure and markers of prepuberal meiotic cilia.__

      This figure is unchanged from the original submission, except that the ARL3-labelled spermatocyte image (A) has been replaced with one displaying a clearer and more representative signal.

      __Figure 4. Testicular tissue presents spermatocyte cysts in prepuberal mice and adult humans.__

      This figure remains unchanged from the original submission.

      __Figure 5. Cilia and flagella dynamics are correlated during prepuberal meiosis.__

      This figure remains unchanged from the original submission.

      __Figure 6. Comparative proteomics identifies potential regulators of ciliogenesis and flagellogenesis.__

      This figure remains unchanged from the original submission.

      __Figure 7. Deciliation induces persistence of DNA damage in meiosis.__

      This figure has been substantially revised and now includes additional experiments analyzing chloral hydrate treatment, aimed at more accurately assessing DNA damage under both control and treated conditions. Images F-I and graph J are new.

      __Figure 8. Aurora kinase A is a regulator of cilia disassembly in meiosis.__

      This figure has been remodelled because the original version contained a mistake in the previous panel II; the graph in the new Fig. 8 I has been corrected accordingly. In addition, it now includes α-Tubulin staining data for arrested ciliated metaphases I after AURKA inhibition (new panel L1´).

      __Figure 9. Schematic representation of the prepuberal versus adult seminiferous epithelium.__

      This figure remains unchanged from the original submission.

      __Supplementary Figure 1. Meiotic stages during the first meiotic wave.__

      This figure remains unchanged from the original submission.

      __Supplementary Figure 2 (new).__

      This is a new figure that includes additional data requested by the reviewers. It includes additional markers of cilia in spermatocytes (glutamylated Tubulin/GT335) and the control data of cilia markers in non-ciliated spermatocytes. It also now includes separate quantification of ciliated spermatocytes for each stage, as requested by the reviewers, complementing the graphs included in Figure 2.

      Please note that with the inclusion of this new Supplementary Figure 2, the numbering of subsequent supplementary figures has been updated accordingly.

      __Supplementary Figure 3 (previously Suppl. Fig. 2). Ultrastructure of prophase I spermatocytes.__

      The content of this figure is the same as in the original submission, but annotations have been added.

      __Supplementary Figure 4 (previously Suppl. Fig. 3). Meiotic centrosome under the electron microscope.__

      This figure remains unchanged from the original submission, but additional annotations have been included.

      __Supplementary Figure 5 (previously Suppl. Fig. 4). Human testis contains ciliated spermatocytes.__

      This figure has been revised and now includes additional H2AX staining to better determine the stage of ciliated spermatocytes and improve their identification.

      __Supplementary Figure 6 (previously Suppl. Fig. 5). GLI1 and GLI3 readouts of Hedgehog signalling are not visibly affected in prepuberal mouse testes.__

      This figure has been remodeled and now includes the quantification of GLI1 and GLI3 and the corresponding statistical analysis. It also includes the control data for Tubulin instead of GAPDH.

      __Supplementary Figure 7 (previously Suppl. Fig. 6). CH and MLN8237 optimization protocol.__

      This figure has been remodeled to incorporate control experiments using 1-hour organotypic culture treatment.

      __Supplementary Figure 8 (previously Suppl. Fig. 7). Tracking first meiosis wave with EdU pulse injection during prepubertal meiosis.__

      This figure remains unchanged from the original submission.

      __Supplementary Figure 9 (previously Suppl. Fig. 8). PLK1 and AURKA inhibition in cultured spermatocytes.__

      This figure has been remodeled and now includes additional data on spindle detection in control and AURKA-inhibited spermatocytes (both ciliated and non-ciliated).


      __Response to the reviewers__

      We will submit both the PDF version of the revised manuscript and the Word file with tracked changes relative to the original submission. Each modification made in response to reviewers' suggestions is annotated in the Word document within the corresponding section of the text.

      A detailed, point-by-point response to each reviewer's comments is provided in the following section.

      Response to the Referee #1


      In this manuscript by Perez-Moreno et al., titled "The dynamics of ciliogenesis in prepubertal mouse meiosis reveal new clues about testicular maturation during puberty", the authors characterize the development of primary cilia during meiosis in juvenile male mice. The authors catalog a variety of testicular changes that occur as juvenile mice age, such as changes in testis weight and germ cell-type composition. They next show that meiotic prophase cells initially lack cilia, and ciliated meiotic prophase cells are detected after 20 days postpartum, coinciding with the time when post-meiotic spermatids within the developing testes acquire flagella. They describe that germ cells in juvenile mice harbor cilia at all substages of meiotic prophase, in contrast to adults where only zygotene stage meiotic cells harbor cilia. The authors also document that cilia in juvenile mice are longer than those in adults. They characterize cilia composition and structure by immunofluorescence and EM, highlighting that cilia polymerization may initially begin inside the cell, followed by extension beyond the cell membrane. Additionally, they demonstrate ciliated cells can be detected in adult human testes. The authors next perform proteomic analyses of whole testes from juvenile mice at multiple ages, which may not provide direct information about the extremely small numbers of ciliated meiotic cells in the testis, and is lacking follow up experiments, but does serve as a valuable resource for the community. Finally, the authors use a seminiferous tubule culturing system to show that chemical inhibition of Aurora kinase A likely inhibits cilia depolymerization upon meiotic prophase I exit and leads to an accumulation of metaphase-like cells harboring cilia. They also assess meiotic recombination progression using their culturing system, but this is less convincing.

      Author response: We sincerely thank Ref #1 for the thorough and thoughtful evaluation of our manuscript. We are particularly grateful for the reviewer's careful reading and constructive feedback, which have helped us refine several sections of the text and strengthen our discussion. All comments and suggestions have been carefully considered and addressed, as detailed below.


      __Major comments: __

      1. There are a few issues with the experimental set up for assessing the effects of cilia depolymerization on DNA repair (Figure 7-II). First, how were mid pachytene cells identified and differentiated from early pachytene cells (which would have higher levels of gH2AX) in this experiment? I suggest either using H1t staining (to differentiate early/mid vs late pachytene) or the extent of sex chromosome synapsis. This would ensure that the authors are comparing similarly staged cells in control and treated samples. Second, what were the gH2AX levels at the starting point of this experiment? A more convincing set up would be if the authors measure gH2AX immediately after culturing in early and late cells (early would have higher gH2AX, late would have lower gH2AX), and then again after 24hrs in late cells (upon repair disruption the sampled late cells would have high gH2AX). This would allow them to compare the decline in gH2AX (i.e., repair progression) in control vs treated samples. Also, it would be informative to know the starting gH2AX levels in ciliated vs non-ciliated cells as they may vary.

      Response:

      We thank Ref #1 for this valuable comment, which significantly contributed to improving both the design and interpretation of the cilia depolymerization assay.

      Following this suggestion, we repeated the experiment including 1-hour (immediately after culturing), and 24-hour cultures for both control and chloral hydrate (CH)-treated samples (n = 3 biological replicates). To ensure accurate staging, we now employ triple immunolabelling for γH2AX, SYCP3, and H1T, allowing clear distinction of zygotene (H1T−), early pachytene (H1T−), and late pachytene (H1T+) cells. The revised data (Figure 7) now provide a more complete and statistically robust analysis of DNA damage dynamics. These results confirm that CH-induced deciliation leads to persistence of the γH2AX signal at 24 hours, indicating impaired DNA repair progression in pachytene spermatocytes. The new images and graphs are included in the revised Figure 7.

      Regarding the reviewer's final point about comparing γH2AX levels between ciliated and non-ciliated cells, we regret that a direct comparison is not technically feasible. To preserve cilia integrity, all cilia-related imaging is performed using the squash technique, which maintains the three-dimensional structure of the cilia but does not allow reliable quantification of DNA damage markers due to nuclear distortion. Conversely, the nuclear spreading technique, used for DNA damage assessment, provides optimal visualization of repair foci but results in the loss of cilia due to cytoplasmic disruption during the hypotonic step. Given that spermatocytes in juvenile testes form developmentally synchronized cytoplasmic cysts, we consider that analyzing a statistically representative number of spermatocytes offers a valid and biologically meaningful measure of tissue-level effects.

      In conclusion, we believe that the additional experiments and clarifications included in the revised Figure 7 strengthen our conclusion that cilia depolymerization compromises DNA repair during meiosis. Further functional confirmation will be pursued in future work, since we are currently generating a conditional genetic model for a ciliopathy in our laboratory.

      The authors analyze meiotic progression in cells cultured with/without AURKA inhibition in Figure 8-III and conclude that the distribution of prophase I cells does not change upon treatment. Is Figure 8-III A and B the same data? The legend text is incorrect, so it's hard to follow. Figure 8-III A shows a depletion of EdU-labelled pachytene cells upon treatment. Moreover, the conclusion that a higher proportion of ciliated zygotene cells upon treatment (Figure 8-II C) suggests that AURKA inhibition delays cilia depolymerization (page 13 line 444) does not make sense to me.

      Response:

      We thank Ref#1 for identifying this issue and for the careful examination of Figure 8. We discovered that the submitted version of Figure 8 contained a mismatch between the figure legend and the figure panels. The legend text was correct; however, the figure inadvertently included a non-corresponding graph (previously panel II-A), which actually belonged to Supplementary Figure 7 in the original submission. We apologize for this mistake.

      This error has been corrected in the revised version. The updated Figure 8 now accurately presents the distribution of EdU-labelled spermatocytes across prophase I substages in control and AURKA-inhibited cultures (previously Figure 8-II B, now Figure 8-A). The corrected data show no significant differences in the proportions of EdU-labelled spermatocytes among prophase I substages after 24 hours of AURKA inhibition, confirming that meiotic progression is not delayed and that no accumulation of zygotene cells occurs under this treatment. Therefore, the observed increase in ciliated zygotene spermatocytes upon AURKA inhibition (new Figure 8 H-I) is best explained by a delay in cilia disassembly, rather than by an arrest or slowdown in meiotic progression. The figure legend and main text have been revised accordingly.

      How do the authors know that there is a monopolar spindle in Figure 8-IV treated samples? Perhaps the authors can use a different Tubulin antibody (that does not detect only acetylated Tubulin) to show that there is a monopolar spindle.

      Response:

      We thank Ref#1 for this excellent suggestion. In the original submission (lines 446-447), we described that ciliated metaphase I spermatocytes in AURKA-inhibited samples exhibited monopolar spindle phenotypes. This description was based on previous reports showing that AURKA or PLK1 inhibition produces metaphases with monopolar spindles characterized by aberrant yet characteristic SYCP3 patterns, abnormal chromatin compaction, and circular bivalent alignment around non-migrated centrosomes (1). In our study, we observed SYCP3 staining consistent with these characteristic features of monopolar metaphases I.

      However, we agree with Ref #1 that this conclusion would be better supported by data. Following the reviewer's suggestion, we performed additional immunostaining using α-Tubulin, which labels total microtubules rather than only the acetylated fraction. The revised Figure 8 now includes α-Tubulin staining in the same ciliated metaphase I cells shown in the original submission, confirming the presence of defective microtubule polymerization and defective spindle organization. For clarity, we now refer to these ciliated metaphases I as "arrested MI". These new data further support our conclusion that AURKA inhibition disrupts spindle bipolarization and prevents cilia depolymerization, indicating that cilia maintenance and bipolar spindle organization are mechanistically incompatible events during male meiosis. The Abstract, Results, and Discussion have been revised accordingly. In particular, the Discussion now emphasizes that the persistence of cilia may interfere with centrosome separation and microtubule polymerization under AURKA inhibition, in contrast to invertebrate systems, e.g. Drosophila (2) and P. brassicae (3), in which meiotic cilia persist through metaphase I without impairing bipolar spindle assembly.

      1. Alfaro et al. EMBO Rep 22 (2021) DOI: 10.15252/embr.202051030 (PMID: 33615693)
      2. Riparbelli et al. Dev Cell (2012) DOI: 10.1016/j.devcel.2012.05.024 (PMID: 22898783)
      3. Gottardo et al. Cytoskeleton (Hoboken) (2023) DOI: 10.1002/cm.21755 (PMID: 37036073)

      The authors state in the abstract that they provide evidence suggesting that centrosome migration and cilia depolymerization are mutually exclusive events during meiosis. This is not convincing with the data present in the current manuscript. I suggest amending this statement in the abstract.

      Response:

      We thank Ref#1 for this valuable observation, with which we fully agree. To avoid overstatement, the original statement has been removed from the Abstract, Results, and Discussion, and replaced with a more accurate formulation indicating that cilia maintenance and bipolar spindle formation are mutually exclusive events during mouse meiosis.

      This revised statement is now directly supported by the new data presented in Figure 8, which demonstrate that AURKA inhibition prevents both spindle bipolarization and cilia depolymerization. We are grateful to the reviewer for highlighting this important clarification.


      Minor comments:

      The presence of cilia in all stages of meiotic prophase I in juvenile mice is intriguing. Why is the cellular distribution and length of cilia different in prepubertal mice compared to adults (where shorter cilia are present only in zygotene cells)? What is the relevance of these developmental differences? Do cilia serve prophase I functions in juvenile mice (in leptotene, pachytene etc.) that are perhaps absent in adults?

      Related to the above point, what is the relevance of the absence of cilia during the first meiotic wave? If cilia serve a critical function during prophase I (for instance, facilitating DSB repair), does the lack of cilia during the first wave imply differing cilia (and repair) requirements during the first vs latter spermatogenesis waves?

      In my opinion, these would be interesting points to discuss in the discussion section.

      Response:

      We thank the reviewer for these thoughtful observations, which we agree are indeed intriguing.

      We believe that our findings likely reflect a developmental role for primary cilia during testicular maturation. We hypothesize that primary cilia at this stage might act as signaling organelles, receiving cues from Sertoli cells or neighboring spermatocytes and transmitting them through the cytoplasmic cysts shared by spermatocytes. Such intercellular communication could be essential for coordinating tissue maturation and meiotic entry during puberty. Although speculative, this hypothesis aligns with the established role of primary cilia as sensory and signaling hubs for GPCR and RTK pathways regulating cell differentiation and developmental patterning in multiple tissues (e.g., 1, 2). The Discussion section has been expanded to include these considerations.

      1. Goetz et al. Nat Rev Genet (2010) DOI: 10.1038/nrg2774 (PMID: 20395968)
      2. Nachury and Mick, Nat Rev Mol Cell Biol (2019) DOI: 10.1038/s41580-019-0116-4 (PMID: 30948801)

      Our study focuses on the first spermatogenic wave, which represents the transition from the juvenile to the reproductive phase. It is therefore plausible that the transient presence of longer cilia during this period reflects a developmental requirement for external signaling that becomes dispensable in the mature testis. Given that this is only the second study to date examining mammalian meiotic cilia, there remains a vast area of research to explore. We plan to address potential signaling cascades involved in these processes in future studies.

      On the other hand, while we cannot confirm that the cilia observed in zygotene spermatocytes persist until pachytene within the same cell, it is reasonable to speculate that they do, serving as longer-lasting signaling structures that facilitate testicular development during the critical pubertal window. In addition, the observation of ciliated spermatocytes at all prophase I substages at 20 dpp, together with our proteomic data, supports the idea that the emergence of meiotic cilia exerts a significant developmental impact on testicular maturation.

      In summary, although we cannot yet define specific prophase I functions for meiotic cilia in juvenile spermatocytes, our data demonstrate that the first meiotic wave differs from later waves in cilia dynamics, suggesting distinct regulatory requirements between puberty and adulthood. These findings underscore the importance of considering developmental context when using the first meiotic wave as a model for studying spermatogenesis.

      The authors state on page 9 lines 286-288 that the presence of cytoplasmic continuity via intercellular bridges (between developmentally synchronous spermatocytes) hints towards a mechanism that links cilia and flagella formation. Please clarify this statement. While the correlation between the timing of appearance of cilia and flagella in cells that are located within the same segment of the seminiferous tubule may be hinting towards some shared regulation, how would cytoplasmic continuity participate in this regulation? Especially since the cytoplasmic continuity is not between the developmentally distinct cells acquiring the cilia and flagella?

      Response:

      We thank Ref#1 for this excellent question and for the opportunity to clarify our statement.

      The presence of intercellular bridges between spermatocytes is well known and has long been proposed to support germ cell communication and synchronization (1,2) as well as sharing mRNA (3) and organelles (4). A classic example is the Akap gene, located on the X chromosome and essential for the formation of the sperm fibrous sheath; cytoplasmic continuity through intercellular bridges allows Akap-derived products to be shared between X- and Y-bearing spermatids, thereby maintaining phenotypic balance despite transcriptional asymmetry (5). In addition, more recent work has further demonstrated that these bridges are critical for synchronizing meiotic progression and for processes such as synapsis, double-strand break repair, and transposon repression (6).

      In this context, and considering our proteomic data (Figure 6), our statement did not intend to imply direct cytoplasmic exchange between ciliated and flagellated cells. Although our current methods do not allow comprehensive tracing of cytoplasmic continuity from the basal to the luminal compartment of the seminiferous epithelium, we plan to address this limitation using high-resolution 3D and ultrastructural imaging approaches in future studies.

      Based on our current data, we propose that cytoplasmic continuity within developmentally synchronized spermatocyte cysts could facilitate the coordinated regulation of ciliogenesis, and similarly enable the sharing of regulatory factors controlling flagellogenesis within spermatid cysts. This coordination may occur through the diffusion of centrosomal or ciliary proteins, mRNAs, or signaling intermediates involved in the regulation of microtubule dynamics. However, we cannot exclude the possibility that such cytoplasmic continuity extends across all spermatocytes derived from the same spermatogonial clone, potentially providing a larger regulatory network. This mechanism could help explain the temporal correlation we observe between the appearance of meiotic cilia and the onset of flagella formation in adjacent spermatids within the same seminiferous segment.

      We have revised the Discussion to explicitly clarify this interpretation and to note that, although hypothetical, it is consistent with established literature on cytoplasmic continuity and germ cell coordination.

      1. Dym et al. Biol. Reprod. (1971) DOI: 10.1093/biolreprod/4.2.195 (PMID: 4107186)
      2. Braun et al. Nature (1989) DOI: 10.1038/337373a0 (PMID: 2911388)
      3. Greenbaum et al. Proc. Natl. Acad. Sci. USA (2006) DOI: 10.1073/pnas.0505123103 (PMID: 16549803)
      4. Ventelä et al. Mol Biol Cell (2003) DOI: 10.1091/mbc.e02-10-0647 (PMID: 12857863)
      5. Turner et al. Journal of Biological Chemistry (1998) DOI: 10.1074/jbc.273.48.32135 (PMID: 9822690)
      6. Sorkin et al. Nat Commun (2025) DOI: 10.1038/s41467-025-56742-9 (PMID: 39929837)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.*

      Individual germ cells in H&E-stained testis sections in Figure 1-II are difficult to see. I suggest adding zoomed-in images where spermatocytes/round spermatids/elongated spermatids are clearly distinguishable.

      Response:

      Ref#1 is right to suggest this. We have revised Figure 1 to improve the quality of the H&E-stained testis sections and have added zoomed-in panels where spermatocytes, round spermatids, and elongated spermatids are clearly distinguishable. These additions significantly enhance the clarity and interpretability of the figure.

      In Figure 2-II B, the authors document that most ciliated spermatocytes in juvenile mice are pachytene. Is this because most meiotic cells are pachytene? Please clarify. If the data are available (perhaps could be adapted from Figure 1-III), it would be informative to see a graph representing what proportions of each meiotic prophase substages have cilia.

      Response:

      We thank the reviewer for this valuable observation. Indeed, the predominance of ciliated pachytene spermatocytes reflects the fact that most meiotic cells in juvenile testes are at the pachytene stage (Figure 1). We have clarified this point in the text and have added a new supplementary figure (Supplementary Figure 2, new figure) presenting a graph showing the proportion of spermatocytes at each prophase I substage that possess primary cilia. This visualization provides a clearer quantitative overview of ciliation dynamics across meiotic substages.

      I suggest annotating the EM images in Sup Figure 2 and 3 to make it easier to interpret.

      Response:

      We thank the reviewer for this helpful suggestion. We have now added annotations to the EM images in Supplementary Figures 3 and 4 to facilitate their interpretation. These visual guides help readers more easily identify the relevant ultrastructural features described in the text.

      The authors claim that the ratio between GLI3-FL and GLI3-R is stable across their analyzed developmental window in whole testis immunoblots shown in Sup Figure 5. Quantifying the bands and normalizing to the loading control would help strengthen this claim as it hard to interpret the immunoblot in its current form.

      Response:

      We thank the reviewer for this valuable suggestion. Following this recommendation, Supplementary Figure 6 (previously Supplementary Figure 5) has been revised to include quantification of GLI1 and GLI3 protein levels, normalized to the loading control.

      After quantification, we observed statistically significant differences across developmental stages. Specifically, GLI1 expression is slightly higher at 21 dpp compared to 8 dpp. For GLI3, we performed two complementary analyses:

      • Total GLI3 protein (sum of full-length and repressor forms normalized to loading control) shows a progressive decrease during development, with the lowest levels at 60 dpp (Supplementary Figure 5D).
      • GLI3 activation status, assessed as the GLI3-FL/GLI3-R ratio, is highest during the 19-21 dpp window, compared to 8 dpp and 60 dpp.

      Although these results suggest a possible transient activation of GLI3 during testicular maturation, we caution that this cannot automatically be attributed to increased Hedgehog signaling, as GLI3 processing can also be affected by other processes, such as changes in ciliogenesis. Furthermore, because the analysis was performed on whole-testis protein extracts, these changes cannot be specifically assigned to ciliated spermatocytes.
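      The two GLI3 measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline used for the manuscript: the `Lane` class, the function names, and all band intensities are hypothetical, and it assumes that densitometry values have already been extracted for each band.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    """Band intensities for one immunoblot lane (arbitrary densitometry units)."""
    stage: str      # developmental stage label, e.g. "8 dpp"
    gli3_fl: float  # full-length GLI3 band
    gli3_r: float   # repressor GLI3 band
    loading: float  # loading-control band


def total_gli3(lane: Lane) -> float:
    """Sum of both GLI3 forms, normalized to the loading control."""
    return (lane.gli3_fl + lane.gli3_r) / lane.loading


def fl_to_r_ratio(lane: Lane) -> float:
    """GLI3-FL/GLI3-R ratio; the loading control cancels within a lane."""
    return lane.gli3_fl / lane.gli3_r


# Made-up intensities for illustration only (not data from the manuscript).
lanes = [
    Lane("8 dpp", gli3_fl=40.0, gli3_r=80.0, loading=100.0),
    Lane("21 dpp", gli3_fl=60.0, gli3_r=40.0, loading=100.0),
]

for lane in lanes:
    print(lane.stage, round(total_gli3(lane), 2), round(fl_to_r_ratio(lane), 2))
```

      Because both GLI3 bands come from the same lane, the loading control cancels in the FL/R ratio, so only the total needs explicit normalization.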

      We have expanded the Discussion to address these findings and to highlight the potential involvement of the Desert Hedgehog (DHH) pathway, which plays key roles in testicular development, Sertoli-germ cell communication, and spermatogenesis (1, 2, 3). We plan to investigate these pathways further in future studies.

      1. Bitgood et al. Curr Biol. (1996) DOI: 10.1016/s0960-9822(02)00480-3 (PMID: 8805249)
      2. Clark et al. Biol Reprod. (2000) DOI: 10.1095/biolreprod63.6.1825 (PMID: 11090455)
      3. O'Hara et al. BMC Dev Biol. (2011) DOI: 10.1186/1471-213X-11-72 (PMID: 22132805)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.*

      There are a few typos throughout the manuscript. Some examples: page 5 line 172, Figure 3-I legend text, Sup Figure 5-II callouts, Figure 8-III legend, page 15 line 508, page 17 line 580, page 18 line 611.

      Response:

      We thank the reviewer for pointing these out. All typographical errors have been corrected, and figure callouts have been reviewed for consistency.

      __Response to the Referee #2__

      This study focuses on the dynamic changes of ciliogenesis during meiosis in prepubertal mice. It was found that primary cilia are not an intrinsic feature of the first wave of meiosis (initiating at 8 dpp); instead, they begin to polymerize at 20 dpp (after the completion of the first wave of meiosis) and are present in all stages of prophase I. Moreover, prepubertal cilia (with an average length of 21.96 μm) are significantly longer than adult cilia (10 μm). The emergence of cilia coincides temporally with flagellogenesis, suggesting a regulatory association in the formation of axonemes between the two. Functional experiments showed that disruption of cilia by chloral hydrate (CH) delays DNA repair, while the AURKA inhibitor (MLN8237) delays cilia disassembly, and centrosome migration and cilia depolymerization are mutually exclusive events. These findings represent the first detailed description of the spatiotemporal regulation and potential roles of cilia during early testicular maturation in mice. The discovery of this phenomenon is interesting; however, there are certain limitations in functional research.

      We thank Ref#2 for taking the time to evaluate our manuscript and for summarizing its main findings. We regret that the reviewer did not find the study sufficiently compelling, but we respectfully clarify that the strength of our work lies precisely in addressing a largely unexplored aspect of mammalian meiosis for which virtually no prior data exist. Given the extremely limited number of studies addressing cilia in mammalian meiosis (only five to date, including our own previous publication on adult mouse spermatogenesis) (1-5), we consider that the present work provides the first robust and integrative evidence on the emergence, morphology, and potential roles of primary cilia during prepubertal testicular development. The study combines histology, high-resolution microscopy, proteomics, and pharmacological perturbations, supported by quantitative analyses, thereby establishing a solid and much-needed reference framework for future functional studies.

      We emphasize that this manuscript constitutes the first comprehensive characterization of ciliogenesis during prepubertal mouse meiosis, complemented by functional in vitro assays that begin to address potential roles of these cilia. For this reason, we want to underscore the importance of this study in providing a solid framework that will support and guide future research.

      Major points:

      1. The prepubertal cilia in spermatocytes discovered by the authors lack specific genetic ablation to block their formation, making it impossible to evaluate whether such cilia truly have functions. Because neither in the first wave of spermatogenesis nor in adult spermatogenesis does this type of cilium seem to be essential. In addition, the authors also imply that the formation of such cilia appears to be synchronized with the formation of sperm flagella. This suggests that the production of such cilia may merely be transient protein expression noise rather than a functionally meaningful cellular structure.

      Response:

      We agree that a genetic ablation model would represent the ideal approach to directly test cilia function in spermatogenesis. However, given the complete absence of prior data describing the dynamics of ciliogenesis during testis development, our priority in this study was to establish a rigorous structural and temporal characterization of this process in the main mammalian model organism, the mouse. This systematic and rigorous phenotypic characterization is a necessary first step before any functional genetics could be meaningfully interpreted.

      To our knowledge, this study represents the first comprehensive analysis of ciliogenesis during prepubertal mouse meiosis, extending our previous work on adult spermatogenesis (1). Beyond these two contributions, only four additional studies have addressed meiotic cilia: two in zebrafish (2, 3), with Mytlis et al. also providing preliminary observations relevant to prepubertal male meiosis that we discuss in the present work, one in Drosophila (4), and a recent one in butterfly (5). No additional information exists for mammalian gametogenesis to date.

      1. López-Jiménez et al. Cells (2022) DOI: 10.3390/cells12010142 (PMID: 36611937)
      2. Mytlis et al. Science (2022) DOI: 10.1126/science.abh3104 (PMID: 35549308)
      3. Xie et al. J Mol Cell Biol (2022) DOI: 10.1093/jmcb/mjac049 (PMID: 35981808)
      4. Riparbelli et al. Dev Cell (2012) DOI: 10.1016/j.devcel.2012.05.024 (PMID: 22898783)
      5. Gottardo et al. Cytoskeleton (Hoboken) (2023) DOI: 10.1002/cm.21755 (PMID: 37036073)

      We therefore consider this descriptive and analytical foundation to be essential before the development of functional genetic models. Indeed, we are currently generating a conditional genetic model for a ciliopathy in our laboratory. These studies are ongoing and will directly address the type of mechanistic questions raised here, but they extend well beyond the scope and feasible timeframe of the present manuscript.

      We thus maintain that the present work constitutes a necessary and timely contribution, providing a robust reference dataset that will facilitate and guide future functional studies in the field of cilia and meiosis.

      Taking this into account, we would be very pleased to address any additional, concrete suggestions from Ref#2 that could further strengthen the current version of the manuscript.

      The high expression of axoneme assembly regulators such as TRiC complex and IFT proteins identified by proteomic analysis is not particularly significant. This time point is precisely the critical period for spermatids to assemble flagella, and TRiC, as a newly discovered component of flagellar axonemes, is reasonably highly expressed at this time. No intrinsic connection with the argument of this paper is observed. In fact, this testicular proteomics has little significance.

      Response:

      We appreciate this comment but respectfully disagree with the reviewer's interpretation of our proteomic data. To our knowledge, this is the first proteomic study explicitly focused on identifying ciliary regulators during testicular development at the precise window (19-21 dpp) when both meiotic cilia and spermatid flagella first emerge.

      While Piprek et al. (1) analyzed the expression of primary cilia in developing gonads, proteomic data specifically covering the developmental transition at 19-21 dpp were not previously available. Furthermore, a recent cell-sorting study (2) detected expression of cilia proteins in pachytene spermatocytes compared to round spermatids, but did not explore their functional relevance or integrate these data with developmental timing or histological context.

      In contrast, our dataset integrates histological staging, high-resolution microscopy, and quantitative proteomics, revealing a set of candidate regulators (including DCAF7, DYRK1A, TUBB3, TUBB4B, and TRiC) potentially involved in cilia-flagella coordination. We view this as a hypothesis-generating resource that outlines specific proteins and pathways for future mechanistic studies on both ciliogenesis and flagellogenesis in the testis.

      Although we fully agree that proteomics alone cannot establish causal function, we believe that dismissing these data as having little significance overlooks their value as the first molecular map of the testis at the developmental window when axonemal structures arise. Our dataset provides, for the first time, an integrated view of proteins associated with ciliary and flagellar structures at the developmental stage when both axonemal organelles first appear. We thus believe that our proteomic dataset represents an important and novel contribution to the understanding of testicular development and ciliary biology.

      Considering this, we would again welcome any specific suggestions from Ref#2 on additional analyses or clarifications that could make the relevance of this dataset even clearer to readers.

      1. Piprek et al. Int J Dev Biol. (2019) doi: 10.1387/ijdb.190049rp (PMID: 32149371).
      2. Fang et al. Chromosoma. (1981) doi: 10.1007/BF00285768 (PMID: 7227045).

      Response to the Referee #3

      In "The dynamics of ciliogenesis in prepubertal mouse meiosis reveals new clues about testicular development" Pérez-Moreno, et al. explore primary cilia in prepubertal mouse spermatocytes. Using a combination of microscopy, proteomics, and pharmacological perturbations, the authors carefully characterize prepubertal spermatocyte cilia, providing foundational work regarding meiotic cilia in the developing mammalian testis.

      Response: We sincerely thank Ref#3 for their positive assessment of our work and for the thoughtful suggestions that have helped us strengthen the manuscript. We are pleased that the reviewer recognizes both the novelty and the relevance of our study in providing foundational insights into meiotic ciliogenesis during prepubertal testicular development. All specific comments have been carefully considered and addressed as detailed below.


      Major concerns:

      1. The authors provide evidence consistent with cilia not being present in a larger percentage of spermatocytes or in other cells in the testis. The combination of electron microscopy and acetylated tubulin antibody staining establishes the presence of cilia; however, proving a negative is challenging. While acetylated tubulin is certainly a common marker of cilia, it is not in some cilia such as those in neurons. The authors should use at least one additional cilia marker to better support their claim of cilia being absent.

      Response:

      We thank the reviewer for this helpful suggestion. In the revised version, we have strengthened the evidence for cilia identification by including an additional ciliary marker, glutamylated tubulin (GT335), in combination with acetylated tubulin and ARL13B (which were included in the original submission). These data are now presented in the new Supplementary Figure 2, which also includes an example of a non-ciliated spermatocyte showing absence of both ARL13B and AcTub signals.

      Taken together, these markers provide a more comprehensive validation of cilia detection and confirm the absence of ciliary labelling in non-ciliated spermatocytes.

      The conclusion that IFT88 localizes to centrosomes is premature as key controls for the IFT88 antibody staining are lacking. Centrosomes are notoriously "sticky", often showing non-specific antibody staining. The authors must include controls to demonstrate the specificity of the staining they observe such as staining in a genetic mutant or an antigen competition assay.

      Response:

      We appreciate the reviewer's concern and fully agree that antibody specificity is critical when interpreting centrosomal localization. The IFT88 antibody used in our study is commercially available and has been extensively validated in the literature as both a cilia marker (1, 2) and a centrosome marker in somatic cells (3). Labelling of IFT88 in centrosomes has also been previously described using other antibodies (4, 5). In our material, the IFT88 signal consistently appears at one of the duplicated centrosomes and at both spindle poles, patterns identical to those reported in somatic cells. We therefore consider the reported meiotic IFT88 staining as specific and biologically reliable.

      That said, we agree that genetic validation would provide the most definitive confirmation. We are currently generating a conditional genetic model for a ciliopathy in our laboratory that will directly assess both antibody specificity and functional consequences of cilia loss during meiosis. These experiments are in progress and will be reported in a follow-up study.

      1. Wong et al. Science (2015). DOI: 10.1126/science.aaa5111 (PMID: 25931445)
      2. Ocbina et al. Nat Genet (2011). DOI: 10.1038/ng.832 (PMID: 21552265)
      3. Vitre et al. EMBO Rep (2020). DOI: 10.15252/embr.201949234 (PMID: 32270908)
      4. Robert A. et al. J Cell Sci (2007). DOI: 10.1242/jcs.03366 (PMID: 17264151)
      5. Singla et al. Developmental Cell (2010). DOI: 10.1016/j.devcel.2009.12.022 (PMID: 20230748)

      *note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      There are many inconsistent statements throughout the paper regarding the timing of the first wave of spermatogenesis. For example, the authors state that round spermatids can be detected at 21 dpp on line 161, but on line 180, say round spermatids can be detected at 19 dpp. Not only does this lead to confusion, but such discrepancies undermine the validity of the rest of the paper. A summary graphic displaying key events and their timing in the first wave of spermatogenesis would be instrumental for reader comprehension and could be used by the authors to ensure consistent claims throughout the paper.

      Response:

      We thank the reviewer for identifying this inconsistency and apologize for the confusion. We confirm that early round spermatids first appear at 19 dpp, as shown in the quantitative data (Figure 1J). This can be detected in squashed spermatocyte preparations, where individual spermatocytes and spermatids can be accurately quantified. The original text contained an imprecise reference to the histological image of 21 dpp (previous line 161), since certain H&E sections did not clearly show all cell types simultaneously. We have now revised Figure 1, improving the image quality and adding a zoomed-in panel highlighting early round spermatids. The image for 19 dpp mice in Fig 1D shows early, still aflagellated spermatids. The first ciliated spermatocytes and the earliest flagellated spermatids are observed at 20 dpp. This has been clarified in the text.

      In addition, we also thank the reviewer for the suggestion of adding a summary graphic, which we agree greatly facilitates reader comprehension. We have added a new schematic summary (Figure 1K) illustrating the key stages and timing of the first spermatogenic wave.

      In the proteomics experiments, it is unclear why the authors assume that changes in protein expression are predominantly due to changes within the germ cells in the developing testis. The analysis is on whole testes including both the somatic and germ cells, which makes it possible that protein expression changes in somatic cells drive the results. The authors need to justify why and how the conclusions drawn from this analysis warrant such an assumption.

      Response:

      We agree with the reviewer that our proteomic analysis was performed on whole testis samples, which contain both germ and somatic cells. Although isolation of pure spermatocyte populations by FACS would provide higher resolution, obtaining sufficient prepubertal material for such analysis would require an extremely large number of animals. To remain compliant with the 3Rs principle for animal experimentation, we therefore used whole-testis samples from three biological replicates per age.

      We acknowledge that our assumption (that the main differences arise from germ cells) is a simplification. However, germ cells constitute the vast majority of testicular cells during this developmental window and are the population undergoing major compositional changes between 15 dpp and adulthood. It is therefore reasonable to expect that a substantial fraction of the observed proteomic changes reflects alterations in germ cells. We have clarified this point in the revised text and have added a statement noting that changes in somatic cells could also contribute to the proteomic profiles.

      The authors should provide details on how proteins were categorized as being involved in ciliogenesis or flagellogenesis, specifically in the distinction criteria. It is not clear how the categorizations were determined or whether they are valid. Thus, no one can repeat this analysis or perform this analysis on other datasets they might want to compare.

      Response:

      We thank the reviewer for this opportunity to clarify our approach. The categorization of proteins as being involved in ciliogenesis or flagellogenesis was based on their Gene Ontology (GO) cellular component annotations obtained from the PANTHER database (Version 19.0), using the gene IDs of the Differentially Expressed Proteins (DEPs). Specifically, we used the GO terms cilium (GO:0005929) and motile cilium (GO:0031514). Since motile cilium is a subcategory of cilium, proteins annotated only with the general cilium term, but not included under motile cilium, were considered to be associated with primary cilia or with shared structural components common to different types of cilia. These GO terms are represented in the bottom panel of Figure 6.

      This information has been added to the Methods section and referenced in the Results for transparency and reproducibility.
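The categorization rule described above can be sketched as follows. This is a minimal illustration only: it assumes each differentially expressed protein has already been mapped to a set of GO cellular component terms, the gene identifiers and annotations are hypothetical placeholders rather than data from the study, and the function name `classify_protein` is introduced here for illustration.

```python
# GO cellular component terms named in the response.
GO_CILIUM = "GO:0005929"         # cilium
GO_MOTILE_CILIUM = "GO:0031514"  # motile cilium (a subcategory of cilium)


def classify_protein(go_terms):
    """Assign one category from a protein's set of GO cellular component terms."""
    if GO_MOTILE_CILIUM in go_terms:
        return "motile cilium / flagellum"
    if GO_CILIUM in go_terms:
        # Annotated with the general term only, not with motile cilium.
        return "primary cilium or shared ciliary component"
    return "not annotated as ciliary"


# Hypothetical annotations for illustration only (not data from the study).
annotations = {
    "GENE_A": {GO_CILIUM, GO_MOTILE_CILIUM},
    "GENE_B": {GO_CILIUM},
    "GENE_C": {"GO:0005634"},  # nucleus
}

categories = {gene: classify_protein(terms) for gene, terms in annotations.items()}
for gene, category in sorted(categories.items()):
    print(gene, "->", category)
```

Checking the more specific term first matters: a protein carrying both terms would otherwise be counted under the general cilium category.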

      In the pharmacological studies, the authors conclude that the phenotypes they observe (DNA damage and reduced pachytene spermatocytes) are due to loss of or persistence of cilia. This overinterprets the experiment. Chloral hydrate and MLN8237 certainly impact ciliation as claimed, but have additional cellular effects. Thus, it is possible that the observed phenotypes were not a direct result of cilia manipulation. Either additional controls must address this or the conclusions need to be more specific and toned down.

      Response:

      We thank the reviewer for this fair observation and have taken steps to strengthen and refine our interpretation. In the revised version, we now include data from 1-hour and 24-hour cultures for both control and chloral hydrate (CH)-treated samples (n = 3 biological replicates). The triple immunolabelling with γH2AX, SYCP3, and H1T allows accurate staging of zygotene (H1T⁻), early pachytene (H1T⁻), and late pachytene (H1T⁺) spermatocytes.

      The revised Figure 7 now provides a more complete and statistically supported analysis of DNA damage dynamics, confirming that CH-induced deciliation leads to persistent γH2AX signal at 24 hours, indicative of delayed or defective DNA repair progression. We have also toned down our interpretation in the Discussion, acknowledging that CH could affect other cellular pathways.

      As mentioned before, the conditional genetic model that we are currently generating will allow us to evaluate the role of cilia in meiotic DNA repair in a more direct and specific way.

      Assuming the conclusions of the pharmacological studies hold true with the proper controls, the authors still conflate their findings with meiotic defects. Meiosis is not directly assayed, which makes this conclusion an overstatement of the data. The conclusions need to be rephrased to accurately reflect the data.

      Response:

      We agree that this aspect required clarification. As noted above, we have refined both the Results and Discussion sections to make clear that our assays specifically targeted meiotic spermatocytes.

      We now present data for meiotic stages at zygotene, early pachytene and late pachytene. This is demonstrated with the labelling for SYCP3 and H1T, both specific markers for meiosis that are not detectable in non-meiotic cells. We believe that this is indeed a way to assay the meiotic cells; however, we have now specified in the text that we are analysing potential defects in meiosis progression. We are sorry if this was not properly explained in the original manuscript: it is now rephrased in the new version in both the Results and Discussion sections.

      It is not clear why the authors chose not to use widely accepted assays of Hedgehog signaling. Traditionally, pathway activation is measured by transcriptional output, not GLI protein expression because transcription factor expression does not necessarily reflect transcription levels of target genes.

      Response:

      We agree with the reviewer that measuring mRNA levels of Hedgehog pathway target genes, typically GLI1 and PTCH1, is the most common method for measuring pathway activation, and is widely accepted by researchers in the field. However, the methods we use in this manuscript (GLI1 and GLI3 immunoblots) are also quite common and widely accepted:

      Regarding GLI1 immunoblot, many articles have used this method to monitor Hedgehog signaling, since GLI1 protein levels have repeatedly been shown to also go up upon pathway activation, and down upon pathway inhibition, mirroring the behavior of GLI1 mRNA. Here are a few publications that exemplify this point:

      • Banday et al. 2025 Nat Commun. DOI: 10.1038/s41467-025-56632-0 (PMID: 39894896)
      • Shi et al 2022 JCI Insight DOI: 10.1172/jci.insight.149626 (PMID: 35041619)
      • Deng et al. 2019 eLife, DOI: 10.7554/eLife.50208 (PMID: 31482846)
      • Zhu et al. 2019 Nat Commun, DOI: 10.1038/s41467-019-10739-3 (PMID: 31253779)
      • Caparros-Martin et al 2013 Hum Mol Genet, DOI: 10.1093/hmg/dds409 (PMID: 23026747)

      *note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      As for GLI3 immunoblot, Hedgehog pathway activation is well known to inhibit GLI3 proteolytic processing from its full length form (GLI3-FL) to its transcriptional repressor (GLI3-R), and such processing is also commonly used to monitor Hedgehog signal transduction, of which the following are but a few examples:

      • Pedraza et al 2025 eLife, DOI: 10.7554/eLife.100328 (PMID: 40956303)
      • Somatilaka et al 2020 Dev Cell, DOI: 10.1016/j.devcel.2020.06.034 (PMID: 32702291)
      • Infante et al 2018, Nat Commun, DOI: 10.1038/s41467-018-03339-0 (PMID: 29515120)
      • Wang et al 2017 Dev Biol DOI: 10.1016/j.ydbio.2017.08.003 (PMID: 28800946)
      • Singh et al 2015 J Biol Chem DOI: 10.1074/jbc.M115.665810 (PMID: 26451044)

      *note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      In summary, we think that we have used two well established markers to look at Hedgehog signaling (three, if we include the immunofluorescence analysis of SMO, which we could not detect in meiotic cilia).

      These Hh pathway analyses did not provide any convincing evidence that the prepubertal cilia we describe here are actively involved in this pathway, even though Hh signaling is cilia-dependent and is known to be active in the male germline (Sahin et al 2014 Andrology PMID: 24574096; Mäkelä et al 2011 Reproduction PMID: 21893610; Bitgood et al 1996 Curr Biol. PMID: 8805249).

      That said, we fully agree that our current analyses do not allow us to draw definitive conclusions regarding Hedgehog pathway activity in meiotic cilia, and we now state this explicitly in the revised Discussion.

      Also in the Hedgehog pathway experiment, it is confusing that the authors report no detection of SMO yet detect little to no expression of GLIR in their western blot. Undetectable SMO indicates Hedgehog signaling is inactive, which results in high levels of GLIR. The impact of this is that it is not clear what is going on with Hh signaling in this system.

      Response:

      It is true that, when Hh signaling is inactive (and hence SMO not ciliary), the GLI3FL/GLI3R ratio tends to be low.

      Although our data in prepubertal mouse testes show a strong reduction in total GLI3 protein levels (GLI3FL+GLI3R) as these mice grow older, this downregulation of total GLI3 occurs without any major changes in the GLI3FL/GLI3R ratio, which is only modestly affected (Supplementary Figure 6).

      Hence, since it is the ratio that correlates with Hh signaling rather than total levels, we do not think that the GLI3R reduction we see is incompatible with our non-detection of SMO in cilia: it seems more likely that overall GLI3 expression is being downregulated in developing testes via a Hh-independent mechanism.
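The distinction between total GLI3 levels and the GLI3FL/GLI3R ratio can be illustrated with a minimal sketch. The band intensities below are hypothetical values in arbitrary units, not measurements from the blots, and the helper `gli3_summary` is introduced here purely for illustration.

```python
def gli3_summary(gli3_fl, gli3_r):
    """Return (total GLI3, GLI3FL/GLI3R ratio) from two band intensities."""
    return gli3_fl + gli3_r, gli3_fl / gli3_r


# Hypothetical band intensities (arbitrary units), not measured values.
young_total, young_ratio = gli3_summary(gli3_fl=40.0, gli3_r=80.0)
older_total, older_ratio = gli3_summary(gli3_fl=10.0, gli3_r=20.0)

print("younger: total =", young_total, "ratio =", young_ratio)
print("older:   total =", older_total, "ratio =", older_ratio)
```

In this toy case the total falls fourfold while the ratio stays the same, which is the pattern that would point to a change in overall GLI3 expression rather than in GLI3 processing.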

      Also potentially relevant here is the fact that some cell types depend more on GLI2 than on GLI3 for Hh signaling. For instance, in mouse embryos, Hh-mediated neural tube patterning relies more heavily on GLI2 processing into a transcriptional activator than on the inhibition of GLI3 processing into a repressor. In contrast, the opposite is true during Hh-mediated limb bud patterning (Nieuwenhuis and Hui 2005 Clin Genet. PMID: 15691355). We have not looked at GLI2, but it is conceivable that it could play a bigger role than GLI3 in our model.

      Moreover, several forms of GLI-independent non-canonical Hh signaling have been described, and they could potentially play a role in our model, too (Robbins et al 2012 Sci Signal. PMID: 23074268).

      We have revised the discussion to clarify some of these points.

      All in all, we agree that our findings regarding Hh signaling are not conclusive, but we still think they add important pieces to the puzzle that will help guide future studies.

      There are multiple instances where it is not clear whether the authors performed statistical analysis on their data, specifically when comparing the percent composition of a population. The authors need to include appropriate statistical tests to make claims regarding this data. While the authors state some impressive sample sizes, once evaluated in individual categories (eg specific cell type and age) the sample sizes of evaluated cilia are as low as 15, which is likely underpowered. The authors need to state the n for each analysis in the figures or legends.

      We thank the reviewer for highlighting this important issue. We have now included the sample size (n) for every analysis directly in the figure legends. Although this adds length, it improves transparency and reproducibility.

      Regarding the concern of Ref#3 about the differing sample sizes, the number of spermatocytes quantified at each stage reflects how long that stage lasts in meiosis. For example, pachytene lasts for 10 days, so this stage is widely represented in the preparations, whereas metaphase I cells are much more difficult to quantify because the stage itself lasts less than 24 hours. Taking this into account, we ensured that all analyses remain statistically valid and representative, applying the appropriate statistical tests for each dataset. These details are now clearly indicated in the revised figures and legends.

      Minor concerns:

      1. The phrase "lactating male" is used throughout the paper and is not correct. We assume this term to mean male pups that have yet to be weaned from their lactating mother, but "lactating male" suggests a rare disorder requiring medical intervention. Perhaps "pre-weaning males" is what the authors meant.

      Response:

      We thank the reviewer for noticing this terminology error. The expression has been corrected to "pre-weaning males" throughout the manuscript.

      The convention used to label the figures in this paper is confusing and difficult to read as there are multiple panels with the same letter in the same figure (albeit distinct sections). Labeling panels in the standard A-Z format is preferred. "Panel Z" is easier to identify than "panel III-E".

      Response:

      We thank the reviewer for this suggestion. All figures have been relabelled using the standard A-Z panel format, ensuring consistency and easier readability across the manuscript.

    1. Author Response:

      We thank all reviewers for their time and effort to carefully review our paper and for the constructive comments on our manuscript. Below we outline our planned revisions to the public reviews of the three reviewers.

      In our revision, we will include more details regarding our ABR measurements (including temperature, animal metadata) and analysis (including filter settings), and lay out a much more detailed motivation for our ABR signal design. We will also provide a more detailed discussion of the caveats of the technique and of the interpretation of ABR data in general and our data specifically, and we will add more discussion on differences between ABR-based audiograms and behavioural data. The authors have extensive experience with the ABR technique and are well aware of its limitations, but also its strengths for use in animals that cannot be trained on behavioural tasks such as the very young zebra finches in this study. These additions will strengthen our paper. We think our conclusions remain justified by our data.

      Reviewer #1 and #2:

      We thank both reviewers for their positive words and suggested improvements. The planned general improvements listed above will take care of all suggestions and comments in the public review.

      Reviewer #3:

      We thank the reviewer for the detailed critique of our manuscript and many suggestions for improvement. The planned general improvements listed above will take care of many of the suggestions and comments listed in the public review. Here we will highlight a few first responses that we will address in detail in our resubmission.

      The reviewer’s major critiques can be condensed to the following four points.

      (1) ABR cannot be done in such small animals.

      This critique is unfounded. ABR measures the summed activity in the auditory pathway, and with smaller distance from brainstem to electrodes in small animals, the ABR signals are expected to have higher amplitude and consequently better SNR. Thus, smaller animals should lead to higher amplitude ABR signals. We have successfully recorded ABR in animals smaller than 2 DPH zebra finches to support this claim: zebrafish (Jørgensen et al., 2012), 10 mm froglets (Goutte et al., 2017) and 5 mm salamanders (Capshaw et al., 2020). It is more surprising that the technique still provides robust signals even in very large animals such as Minke whales (Houser et al., 2024).

      (2) The ABR methods used do not follow the protocol of other published work in birds. In particular, the 25 ms long tone bursts may have underestimated high-frequency hearing.

      There is no fixed protocol for ABR measurements, and several studies of bird ABR have used as long or even longer durations. Longer-duration signals were chosen deliberately and are necessary to have a sufficient number of cycles and avoid frequency splatter at our lowest frequencies used (see Lauridsen et al., 2021).
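The reasoning about burst duration and number of cycles can be made concrete with a short calculation: the number of cycles in a tone burst is the frequency multiplied by the duration. In the sketch below, only the 25 ms duration comes from the text; the test frequencies and the 5 ms comparison duration are assumed values chosen for illustration, and the function name is hypothetical.

```python
def cycles_in_burst(frequency_hz, duration_ms):
    """Number of cycles of a pure tone contained in a burst of the given duration."""
    return frequency_hz * duration_ms / 1000.0


# Assumed example frequencies (Hz); 25 ms is from the text, 5 ms is an assumed comparison.
for frequency_hz in (250, 1000, 4000):
    short = cycles_in_burst(frequency_hz, 5)
    long_ = cycles_in_burst(frequency_hz, 25)
    print(f"{frequency_hz} Hz: {short} cycles in 5 ms, {long_} cycles in 25 ms")
```

At low frequencies a short burst holds only about one cycle, so its energy spreads across neighbouring frequencies; a longer burst contains several cycles and is spectrally narrower.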

      (3) Sensitivity data should be corrected from ABR to behavioural data.

      We present the results of our measurements on hearing sensitivity using ABR, and ABR-based thresholds are generally less sensitive than thresholds based on behavioural studies (presented in Fig 2c). Correcting these measurements to behavioural thresholds is of course possible, but presenting only the corrected thresholds would be a misrepresentation of our sensitivity data. Even so, such a correction should be done only within species and age group, and such data are currently not available. In our revision, we will include a more elaborate discussion on this topic.

      (4) Results are inconsistent with papers in developing songbirds.

      We agree that our results do not support and even question the claims in earlier work. These papers, however, either 1) do not measure hearing physiology or 2) do so in different species. To the best of our knowledge, there are presently no published data on the development of auditory physiology in songbird embryos. Our data are consistent with what is known about the physiology of auditory development in all birds studied so far. We will provide a detailed discussion on this topic in our revision.

      References

      Capshaw et al. (2020) J Exp Biol 223: jeb236489

      Goutte et al. (2017) Sci Rep 7: 12121. DOI: 10.1038/s41598-017-12145-5

      Houser et al. (2024) Science 386: 902-906. DOI: 10.1126/science.ado7580

      Jørgensen et al. (2012) Adv Exp Med Biol 730: 117-119

      Lauridsen et al (2021) J Exp Biol 224: jeb237313. https://doi.org/10.1242/jeb.237313

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This Reviewer was positive about the study, stating ‘The findings are interesting and important to increase the understanding both of the synaptic transmissions in the main olfactory bulb and the DA neuron diversity.’ They provided a number of helpful suggestions for improving the paper, which we have incorporated as follows:

      (1) It is known that there are two types of DA neurons in the glomerular layer with different diameters and capacitances (Kosaka and Kosaka, 2008; Pignatelli et al., 2005; Angela Pignatelli and Ottorino Belluzzi, 2017). In this manuscript, the authors need to articulate better which layer the imaging and ephys recordings took place, all glomerular layers or with an exception. Meanwhile, they have to report the electrophysiological properties of their recordings, including capacitances, input resistance, etc.

      We thank the Reviewer for this clarification. Indeed, the two dopaminergic cell types we study here correspond directly to the subtypes previously identified based on cell size. Our previous work showed that axon-bearing OB DA neurons have significantly larger somas than their anaxonic neighbours (Galliano et al. 2018), and we replicate this important result in the present study (Figure 3D). In terms of electrophysiological correlates of cell size, we now provide full details of passive membrane properties in the new Supplementary Figure 4, as requested. Axon-bearing DA neurons have significantly lower input resistance and show a non-significant trend towards higher cell capacitance. Both features are entirely consistent with the larger soma size in this subtype. We apologise for the oversight in not fully describing previous categorisations of OB DA neurons, and have now added this information and the appropriate citations to the Introduction (lines 56 to 59 of the revised manuscript). 

      In terms of cell location, all cells in this study were located in the OB glomerular layer. We sampled the entire glomerular layer in all experiments, including the glomerular/EPL border where the majority of axon-bearing neurons are located (Galliano et al. 2018). This is now clarified in the Materials and Methods section (lines 535 to 537 and 614 to 616 of the revised manuscript).

      (2) It is understandable that recording the DA neurons in the glomerular layer is not easy. However, the authors still need to increase their n's and repeat the experiments at least three times to make their conclusion more solid. For example (but not limited to), Fig 3B, n=2 cells from 1 mouse. Fig.4G, the recording only has 3 cells.

      Despite the acknowledged difficulty of these experiments, we have now added substantial extra data to the study as requested. We have increased the number of cells and animals to further support the following findings:

      Fig 3B: we now have n=5 cells from N=3 mice. We have created a new Supplementary Figure 1 to show all the examples.

      Figure 4G: we now have n=6 cells from N=4 mice.

      Figure 5G: we now have n=3 cells from N=3 mice.

      The new data now provide stronger support for our original conclusions. In the case of auto-evoked inhibition after the application of D1 and D2 receptor antagonists, a nonsignificant trend in the data suggests that, while dopamine is clearly not necessary for the response, it may play a small part in its strength. We have now included this consideration in the Results section (lines 256 to 264 of the revised manuscript).

      (3) The statistics also use pseudoreplicates. It might be better to present the biology replicates, too.

      Indeed, in a study focused on the structural and functional properties of individual neurons, we performed all comparisons with cell as the unit of analysis. This did often (though not always) involve obtaining multiple data points from individual mice, but in these low-throughput experiments n was never hugely bigger than N. The potential impact of pseudoreplicates and their associated within-animal correlations was therefore low. We checked this in response to the Reviewer’s comment by running parallel nested analyses for all comparisons that returned significant differences in the original submission. These are the cases in which we would be most concerned about potential false positive results arising from intra-animal correlations, which nested tests specifically take into account (Aarts et al., 2013). In every instance we found that the nested tests also reported significant differences between anaxonic and axon-bearing cell types, thus fully validating our original statistical approach. We now report this in the relevant section of the Materials and Methods (lines 686 to 691 of the revised manuscript).
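
      The argument that pseudoreplication matters little when n is close to N can be illustrated with the standard design-effect approximation, 1 + (m − 1) × ICC, where m is the mean number of cells per animal and ICC is the intraclass correlation. The sketch below is illustrative only and is not the nested analysis reported in the response: the function names are hypothetical, the ICC value in the usage note is assumed, and the formula is exact only when every animal contributes the same number of cells.

```python
def design_effect(n_cells, n_animals, icc):
    """Variance inflation caused by clustering cells within animals.

    Uses the textbook approximation 1 + (m - 1) * ICC, where m is the
    mean number of cells per animal (assumes roughly equal cluster sizes).
    """
    if n_animals < 1 or n_cells < n_animals:
        raise ValueError("need at least one cell per animal")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    mean_cells_per_animal = n_cells / n_animals
    return 1.0 + (mean_cells_per_animal - 1.0) * icc


def effective_sample_size(n_cells, n_animals, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_cells / design_effect(n_cells, n_animals, icc)
```

      For example, with n=6 cells from N=4 mice and an assumed ICC of 0.5, `effective_sample_size(6, 4, 0.5)` gives an effective sample size of about 4.8. The value always lies between N (ICC = 1) and n (ICC = 0), so the possible inflation is small whenever n is only slightly larger than N.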

      (4) In Figure 4D, the authors report the values in the manuscript. It is recommended to make a bar graph to be more intuitive.

      This plot already exists in the original manuscript. We first describe these data to support the observation that an auto-evoked inhibition effect exists in anaxonic neurons (now lines 240 to 245 of the revised manuscript). We then show them visually in their entirety when we compare them to the lack of response in axon-bearing neurons, depicted in Figure 5C. We still believe that this order of presentation is most appropriate for the flow of information in the paper, so have maintained it in our revised submission.

      (5) In Figure 4F and G, although the data with three cells suggest no phenotype, the kinetics looked different. So, the authors might need to explore that aside from increasing the n.

      We thank the Reviewer for this suggestion. To quantify potential changes in the auto-evoked inhibition response kinetics, we fitted single exponential functions and compared changes in the rate constant (k; Methods, lines 650 to 652 of the revised manuscript). Overall, we observed no consistent or significant change in rate constant values after adding DA receptor antagonists. This finding is now reported in the Results section (lines 260 to 263 of the revised manuscript) and shown in a new Supplementary Figure 3.
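
      As a rough illustration of what extracting a rate constant from a single exponential involves, the sketch below fits y = A·exp(−k·t) by ordinary least squares on the log-transformed amplitudes. This is an assumption-laden stand-in: the response does not state which fitting routine was used, the function name is hypothetical, and the log-linear shortcut only applies to positive, baseline-subtracted amplitudes.

```python
import math


def fit_single_exponential(times, amplitudes):
    """Return (A, k) for amplitudes ~ A * exp(-k * t).

    Fits a straight line to log(amplitude) versus time; the slope of that
    line is -k. Amplitudes must be strictly positive.
    """
    if len(times) != len(amplitudes) or len(times) < 2:
        raise ValueError("need at least two paired samples")
    if any(a <= 0 for a in amplitudes):
        raise ValueError("amplitudes must be positive")

    logs = [math.log(a) for a in amplitudes]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n

    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_log) for t, y in zip(times, logs))

    slope = sxy / sxx
    intercept = mean_log - slope * mean_t
    return math.exp(intercept), -slope


# Synthetic traces (hypothetical values) before and after drug application.
t = [0.1 * i for i in range(20)]
before = [2.0 * math.exp(-0.50 * ti) for ti in t]
after = [1.8 * math.exp(-0.45 * ti) for ti in t]
_, k_before = fit_single_exponential(t, before)
_, k_after = fit_single_exponential(t, after)
change_in_k = k_after - k_before  # the per-cell quantity a paired test would compare
```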

      (6) Similarly, for Figure 4I and J, L and M, it is better to present and analyze it like F and G, instead of showing only the after-antagonist effect.

      We agree that the ideal scenario would have been to perform the experiments in Figure 4J and 4M the same way as those in Figure 4G, with a before vs after comparison. Unfortunately, however, this was not practically possible. 

      When attempting to apply carbenoxolone to already-patched cells, we found that this drug severely disrupted the overall health and stability of our recordings immediately after its application. This is consistent with previous reports of similar issues with this compound (e.g. Connors 2012, Epilepsy Currents; Tovar et al., 2009, Journal of Neurophysiology). After many such attempts, the total yield of this experiment was one single cell from one animal. Even so, as shown in the traces below, we were able to show that the auto-evoked inhibition response was not eliminated in this specific case:

      Author response image 1.

      Traces of an AEI response recorded before (magenta) and after (green) the application of carbenoxolone (n=1 cell from N=1 mouse).

      In light of these issues, we instead followed published protocols by applying carbenoxolone directly to the bath for 20 minutes without prior recording (following Samailova et al., 2003, Journal of Neurochemistry), and ran the protocol after that time. Given that our main question was whether gap junctions were strictly necessary for the presence of any auto-evoked inhibition response, our positive findings in these experiments still allowed us to draw clear conclusions.

      In contrast, the issue with the NKCC1 antagonist bumetanide was time. As acknowledged by this Reviewer, obtaining and maintaining high-quality patch recordings from OB DA neurons is technically challenging. Bumetanide is a slow-acting drug when used to modify neuronal chloride concentrations, because in addition to the time it takes to reach the neurons and effectively block NKCC1, the intracellular levels of chloride subsequently change slowly. Studies using this drug in slice physiology experiments typically use an incubation time of at least 20 minutes (e.g. Huberfeld et al., 2007, Journal of Neuroscience), which was incompatible with productive data collection in OB DA neurons. Again, after many unsuccessful efforts, we were forced instead to include bumetanide in the bath without prior recording for 20-30 minutes. As with the carbenoxolone experiment, our goal here was to establish whether auto-evoked inhibition was in any way retained in the presence of this drug, so our positive result again allowed us to draw clear conclusions.

      Reviewer #1 (Recommendations for the authors):

      (1) I suggest the authors reconsider the terminology. For example, they use "strikingly" in their title. The manuscript reported two different transmitter release strategies but not the mechanisms, and the word "strikingly" is not professional, either.

      We appreciate the Reviewer’s attention to clarity and tone in the manuscript title, and have nevertheless decided to retain the original wording. The almost all-or-nothing differences between closely related cell types shown in structural and functional properties here (Figures 3F & 5C) are pronounced, extremely clear and easily spotted – all properties appropriate for the word ‘striking.’ In addition, we note that the use of this term is not at all unprofessional, with a PubMed search for ‘strikingly’ in the title of publications returning over 200 hits.

      (2) Similarly, almost all confocal scopes are 3D because images can be taken at stacks. So "3D confocal" is misleading.

      We agree that this wording is misleading. We have now replaced the sentence ‘Example snapshot of a 3D confocal stack of…’ with ‘Example confocal images of…’ in all the figure legends that apply.

      (3) It is recommended to present the data in bar graphs with data dots instead of showing the numbers in the manuscript directly.

      We agree entirely, and now present data plots for all comparisons reported in the study (Supplementary Figures 2, 4 and 5).

      Reviewer #2 (Recommendations for the authors):

      (1) Several experiments report notably small sample sizes, such as in Figures 3B and 5G, where data from only 2 cells derived from 1-2 mice are presented. Figures 4E-G also report the experimental result only from 3 cells derived from 3 mice. To enhance the statistical robustness and reliability of the findings, these experiments should be replicated with larger sample sizes.

      As per our response to Reviewer 1’s comment #2 above, and to directly address the concern that some evidence was ‘incomplete’, we have now added significant extra data and analysis to this revised submission (Figures 4 and 5; and Supplementary Figure 1). We believe that this has further enhanced the robustness and reliability of our findings, as requested.

      (2) The authors utilize vGAT-Cre for Figures 1-3 and DAT-tdTomato for Figures 4-5, raising concerns about consistency in targeting the same population of dopaminergic neurons. It remains unclear whether all OB DA neurons express vGAT and release GABA. Clarification and additional evidence are needed to confirm whether the same neuronal population was studied across these experiments.

      Although we indeed used different mouse lines to investigate structural and functional aspects of transmitter release, we can be very confident that both approaches allowed us to study the same two distinct DA cell types being compared in this paper. Existing data to support this position are already clear and strong, so in this revision we have focused on the Reviewer’s suggestion to clarify the approaches we chose.

      First, it is well characterised that in mouse and many other species all OB DA neurons are also GABAergic. This has been demonstrated comprehensively at the level of neurochemical identity and in terms of dopamine/GABA co-release, and is true across both small-soma/anaxonic and large-soma/axon-bearing subclasses (Kosaka & Kosaka 2008; 2016; Maher & Westbrook 2008; Borisovska et al., 2013; Vaaga et al., 2016; Liu et al. 2013). To specifically confirm vGAT expression, we have also now provided additional single-cell RNAseq data and immunohistochemical label in a revised Figure 1 (see also Panzanelli et al., 2007, now referenced in the paper, who confirmed endogenous vGAT colocalisation in TH-positive OB neurons). Most importantly, by using vGAT-cre mice here we were able to obtain sufficient numbers of both anaxonic and axon-bearing DA neurons among the vGAT-cre-expressing OB population. We could unambiguously identify these cells as dopaminergic because of their expression of TH protein which, due to the absence of noradrenergic neurons in the OB, is a specific and comprehensive marker for dopaminergic cells in this brain region (Hokfelt et al., 1975; Rosser et al., 1986; Kosaka & Kosaka 2016). Crucially, both axon-bearing and anaxonic OB DA subtypes strongly express TH (Galliano et al., 2018, 2021). We have now added additional text to the relevant Results section (lines 99 to 108 of the revised manuscript) to clarify these reasons for studying vGAT-cre mice here.

      We were also able to clearly identify and sample both subtypes of OB DA neuron using DAT-tdT mice. Our previously published work has thoroughly characterised this exact mouse line at the exact ages studied in the present paper (Galliano et al., 2018; Byrne et al., 2022). We know that DAT-tdT mice provide rather specific label for TH-expressing OB DA neurons (75% co-localisation; Byrne et al., 2022), but most importantly we know which non-DA neurons are labelled in this mouse line and how to avoid them. All non-TH-expressing but tdT-positive cells in juvenile DAT-tdT mice are small, dimly fluorescent and weakly spiking neurons of the calretinin-expressing glomerular subtype (Byrne et al., 2022). These cells are easily detected during physiological recordings, and were excluded from our study here. This information is now provided in the relevant Methods section (lines 616 to 619 of the revised manuscript, also referenced in lines 236 to 240 of the Results section), and we apologise for its previous omission. Finally, we have shown both structurally and functionally that both axon-bearing and anaxonic OB DA subtypes are labelled in DAT-tdT mice (Galliano et al., 2018; Tufo et al., 2025; present study). Overall, these additional clarifications firmly establish that the same neuronal populations were indeed studied across our experiments.

      (3) The low TH+ signal in Figure 1D raises questions regarding the successful targeting of OB DA neurons. Further validation, such as additional staining, is required to ensure that the targeted neurons are accurately identified.

      As noted in our response to the previous comment, TH is a specific marker for dopaminergic neurons in the mouse OB, and is widely used for this purpose. Labelling for TH in our tissue is extremely reliable, and in fact gives such strong signal that we were forced to reduce the primary antibody concentration to 1:50,000 to prevent bleedthrough into other acquisition channels. Even at this concentration it was extremely straightforward to unambiguously identify TH-positive cells based on somatic immunofluorescence. We recognise, however, that the original example image in Figure 1D was not sufficiently clear, and have now provided a new example which illustrates the TH-based identification of these cells much more effectively. 

      (4) Estimating the total number of dopaminergic neurons in the olfactory bulb, along with the relative proportions of anaxonic and axon-bearing neuron subtypes, would provide valuable context for the study. Presenting such data is crucial to underscore the biological significance of the findings.

      This information has already been well characterised in previous studies. Total dopaminergic cell number in the OB is ~90,000 (Maclean & Shipley, 1988; Panzanelli et al., 2007; Parrish-Aungst et al., 2007). In terms of proportions, anaxonic neurons make up the vast majority of these cells, with axon-bearing neurons representing only ~2.5% of all OB dopaminergic neurons at P28 (Galliano et al., 2018). Of course, the relatively low number of the axon-bearing subtype does not preclude its having a potentially large influence on glomerular networks and sensory processing, as demonstrated by multiple studies showing the functional effects of inter-glomerular inhibition (Kosaka & Kosaka, 2008; Liu et al., 2013; Whitesell et al., 2013; Banerjee et al., 2015). This information has now been added to the Introduction (line 47 and lines 59 to 62 of the revised manuscript).

      (5) The authors report that in-utero injection was performed based on the premise that the two subclasses of dopaminergic neurons in the olfactory bulb are generated during embryonic development. However, it remains unclear whether in-utero injection is essential for distinguishing between these two subclasses. While the manuscript references a relevant study, the explanation provided is insufficient. A more detailed justification for employing in-utero injection would enhance the manuscript's clarity and methodological rigor.

      We apologise for the lack of clarity in explaining the approach. In utero injection is not absolutely essential for distinguishing between the two subclasses, but it does have two major advantages. 1) Because infection happens before cells migrate to their final positions, it produces sparse labelling which permits later unambiguous identification of individual cells’ processes; and 2) Because both subclasses are generated embryonically (compared to the postnatal production of only anaxonic DA neurons), it allows effective targeting of both cell types. We have now expanded the relevant section of the Results to explain the rationale for our approach in more detail (lines 109 to 116 of the revised manuscript).

      (6) In Figures 1A and 4A, it appears that data from previously published studies were utilized to illustrate the differential mRNA expression in dopaminergic neurons of the olfactory bulb. However, the Methods section and the manuscript lack a detailed description of how these dopaminergic neurons were classified or analyzed. Given that these figures contribute to the primary dataset, providing additional explanation and context is essential to ensure clarity of the findings.

      We apologise for the lack of clarity. We have now extended the part of the methods referring to the RNAseq data analysis (lines 666 to 678 of the revised manuscript). 

      (7) In Figure 2C, anaxonic dopamine neurons display considerable variability in the number of neurotransmitter release sites, with some neurons exhibiting sparse sites while others exhibit numerous sites. The authors should address the potential biological or methodological reasons for this variability and discuss its significance.

      We thank the Reviewer for highlighting this feature of our data. We have now outlined potential methodological reasons for the variability, whilst also acknowledging that it is consistent with previous reports of presynaptic site distributions in these cells (Kiyokage et al., 2017; Results, lines 169 to 172 of the revised manuscript). We have also added a brief discussion of the potential biological significance (Discussion, lines 446 to 450).

      (8) In the images used to differentiate anaxonic and axon-bearing neurons, the soma, axons, and dendrites are intermixed, making it difficult to distinguish structures specific to each subclass. Employing subclass-specific labeling or sparse labeling techniques could enhance clarity and accuracy in identifying these structures.

      Distinguishing these structures is indeed difficult, and was the main reason we used viral label to produce sparse labelling (see response to comment #5 above). In all cases we were extremely careful, including cells only when we could be absolutely certain of their anaxonic or axon-bearing identity, and could also be certain of the continuity of all processes. Crucially, while the 2D representations we show in our figures may suggest a degree of intermixing, we performed all analyses on 3D image stacks, significantly improving our ability to accurately assign structures to individual cells. We have now added extra descriptions of this approach in the relevant Methods section (lines 546 to 548 of the revised manuscript).

      (9) In Figure 3, the soma area and synaptophysin puncta density are compared between axon-bearing and anaxonic neurons. However, the figure only presents representative images of axon-bearing neurons. To ensure a fair and accurate comparison, representative images of both neuron subtypes should be included.

      The original figures did include example images of puncta density (or lack of puncta) in both cell types (Figure 2B and Figure 3E). For soma area, we have now included representative images of axon-bearing and anaxonic neurons with an indication of soma area measurement in a new Supplementary Figure 2A.

      (10) In Figure 4B, the authors state that gephyrin and synaptophysin puncta are in 'very close proximity.' However, it is unclear whether this proximity is sufficient to suggest the possibility of self-inhibition. Quantifying the distance between gephyrin and synaptophysin puncta would provide critical evidence to support this claim. Additionally, analyzing the distribution and proportion of gephyrin-synaptophysin pairs in close proximity would offer further clarity and strengthen the interpretation of these findings.

      We thank the Reviewer for raising this issue. We entirely agree that the example image previously shown did not constitute sufficient evidence to claim either close proximity of gephyrin and synaptophysin puncta or the possibility of self-inhibition. We are not in a position to perform a full quantitative analysis of these spatial distributions, nor do we think this is necessary given previous direct evidence for auto-evoked inhibition in OB dopaminergic cells (Smith and Jahr, 2002; Murphy et al., 2005; Maher and Westbrook, 2008; Borisovska et al., 2013) and our own demonstration of this phenomenon in anaxonic neurons (Figure 4). We have therefore removed the image and the reference to it in the text.

      (11) In Figures 4J and 4M, the effects of the drugs are presented without a direct comparison to the control group (baseline control?). Including these baseline control data is essential to provide a clear context for interpreting the drug effects and to validate the conclusions drawn from these experiments.

      We appreciate the Reviewer’s attention to this important point. As this concern was also raised by Reviewer 1 (their point #6), we have provided a detailed response fully addressing it in our replies to Reviewer 1 above. 

      (12) In Lines 342-344, the authors claim that VMAT2 staining is notoriously difficult. However, several studies (e.g., Weihe et al., 2006; Cliburn et al., 2017) have successfully utilized VMAT2 staining. Moreover, Zhang et al., 2015 - a reference cited by the authors - demonstrates that a specific VMAT2 antibody effectively detects VMAT2. Providing evidence of VMAT2 expression in OB DA neurons would substantiate the claim that these neurons are GABA-co-releasing DA neurons and strengthen the study's conclusions.

      As noted in response to this Reviewer’s comment #2 above, there is clear published evidence that OB DA neurons are GABA- and dopamine-releasing cells. These cells are also known to express VMAT2 (Cave et al., 2010; Borisovska et al., 2013; Vergaña-Vera et al., 2015). We do not therefore believe that additional evidence of VMAT2 expression is necessary to strengthen our study’s conclusions. We did make every effort to label VMAT2-positive release sites in our neurons, but unfortunately all commercially available antibodies were ineffective. The successful staining highlighted by the Reviewer was either performed in the context of virally driven overexpression (Zhang et al., 2015) or was obtained using custom-produced antibodies (Weihe et al., 2006; Cliburn et al., 2017). We have now modified the Discussion text to provide more clarification of these points (lines 393 to 395 of the revised manuscript).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review)

      Summary:

      This study by Park and colleagues uses longitudinal saliva viral load data from two cohorts (one in the US and one in Japan from a clinical trial) in the pre-vaccine era to subset viral shedding kinetics and then use machine learning to attempt to identify clinical correlates of different shedding patterns. The stratification method identifies three separate shedding patterns discriminated by peak viral load, shedding duration, and clearance slope. The authors also assess micro-RNAs as potential biomarkers of severity but do not identify any clear relationships with viral kinetics.

      Strengths:

      The cohorts are well developed, the mathematical model appears to capture shedding kinetics fairly well, the clustering seems generally appropriate, and the machine learning analysis is a sensible, albeit exploratory approach. The micro-RNA analysis is interesting and novel.

      Weaknesses:

      The conclusions of the paper are somewhat supported by the data but there are certain limitations that are notable and make the study's findings of only limited relevance to current COVID-19 epidemiology and clinical conditions.

      We sincerely appreciate the reviewer’s thoughtful and constructive comments, which have been invaluable in improving the quality of our study. We have carefully revised the manuscript to address all points raised.

      (1) The study only included previously uninfected, unvaccinated individuals without the omicron variant. It has been well documented that vaccination and prior infection both predict shorter duration shedding. Therefore, the study results are no longer relevant to current COVID-19 conditions. This is not at all the authors' fault but rather a difficult reality of much retrospective COVID research.

      Thank you for your comment. We agree with the reviewer that some of our results may not provide insight into current COVID-19 conditions, since most people have now been either infected or vaccinated. We have revised the manuscript to discuss this (page 22, lines 364-368). Nevertheless, we believe our extensive investigation of the relationship between viral shedding patterns in saliva and a wide range of clinical and microRNA data is novel, and that developing a method to do so remains important, particularly for providing insight into early responses to novel emerging viral diseases in the future. Therefore, we still believe that our findings are valuable.

      (2) The target cell model, which appears to fit the data fairly well, has clear mechanistic limitations. Specifically, if such a high proportion of cells were to get infected, then the disease would be extremely severe in all cases. The authors could specify that this model was selected for ease of use and to allow clustering, rather than to provide mechanistic insight. It would be useful to list the AIC scores of this model when compared to the model by Ke.

      Thank you for your feedback and suggestion regarding our mathematical model. As the reviewer pointed out, in this study we adopted a simple model (the target cell-limited model) to focus on reconstructing viral dynamics and stratifying shedding patterns, rather than exploring the mechanism of viral infection in detail. Nevertheless, we believe that the target cell-limited model provides a reasonable reconstruction of viral dynamics, as it has been used in many previous studies. We revised the manuscript to clarify this point (page 10, lines 139-144). We also revised the manuscript to provide a more detailed description of the model comparison, along with information about AIC (page 10, lines 130-135).
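
      For readers unfamiliar with the comparison the reviewer requested, AIC is defined as 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood; lower values indicate a better trade-off between fit and complexity. The sketch below shows the arithmetic only. The model labels, log-likelihoods and parameter counts are hypothetical placeholders, not values from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def delta_aic(aic_by_model):
    """Difference between each model's AIC and the best (lowest) AIC."""
    best = min(aic_by_model.values())
    return {name: value - best for name, value in aic_by_model.items()}


# Hypothetical fits: a simpler model and a richer alternative.
scores = {
    "target_cell_limited": aic(log_likelihood=-520.0, n_params=6),
    "alternative_model": aic(log_likelihood=-518.5, n_params=9),
}
differences = delta_aic(scores)  # 0.0 marks the preferred model
```

      In this made-up case the richer model fits slightly better but pays a larger penalty for its extra parameters, so the simpler model has the lower AIC.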

      (3) Line 104: I don't follow why including both datasets would allow one model to work better than the other. This requires more explanation. I am also not convinced that non-linear mixed effects approaches can really be used to infer early model kinetics in individuals from one cohort by using late viral load kinetics in another (and vice versa). The approach seems better for making population-level estimates when there is such a high amount of missing data.

      Thank you for your feedback. Your comment made us realize that our explanation was insufficient. Our intent was not to compare the performance of the two models; rather, we meant that including both datasets allows the data to be fitted at a comparable level with either model. We revised the manuscript to clarify this point (page 10, lines 135-139).

      Additionally, we agree that nonlinear mixed effects models are a useful approach for making population-level estimates when data are missing. They also have the advantage of yielding reasonable parameter estimates for individuals with few data points, by taking into account the distribution of parameters across other individuals. Given these advantages, we adopted a nonlinear mixed effects model in our study. We also revised the manuscript to clarify this (page 27, lines 472-483).

      (4) Along these lines, the three clusters appear to show uniform expansion slopes whereas the NBA cohort, a much larger cohort that captured early and late viral loads in most individuals, shows substantial variability in viral expansion slopes. In Figure 2D: the upslope seems extraordinarily rapid relative to other cohorts. I calculate a viral doubling time of roughly 1.5 hours. It would be helpful to understand how reliable of an estimate this is and also how much variability was observed among individuals.

      We appreciate your detailed feedback on the estimated up-slope of viral dynamics. As the reviewer noted, the pattern differs from that observed in the NBA cohort, which may be due to their measurement of viral load from upper respiratory tract swabs. In our estimation, the mean and standard deviation of the doubling time (defined as $\ln 2 / (\beta T_0 p c^{-1} - \delta)$) were 1.44 hours and 0.49 hours, respectively. Although direct validation of these values is challenging, several previous studies, including our own, have reported that viral loads in saliva increase more rapidly than in the upper respiratory tract swabs, reaching their peak sooner. Thus, we believe that our findings are consistent with those of previous studies. We revised our manuscript to discuss this point with additional references (page 20, lines 303-311).
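
      To make the doubling-time definition concrete, the sketch below evaluates ln 2 divided by the early exponential growth rate, βT0·p/c − δ, taking the quoted expression at face value. It assumes all rates are expressed per day and converts the result to hours; the function names are hypothetical, and no parameter values from the study are used.

```python
import math


def doubling_time_hours(beta, t0, p, c, delta):
    """Doubling time ln 2 / (beta * T0 * p / c - delta), converted to hours.

    Assumes every rate is given per day, so the growth rate is per day.
    """
    growth_rate_per_day = beta * t0 * p / c - delta
    if growth_rate_per_day <= 0:
        raise ValueError("viral load does not grow for these parameters")
    return 24.0 * math.log(2) / growth_rate_per_day


def growth_rate_per_day_from_doubling(hours):
    """Inverse relation: net exponential growth rate implied by a doubling time."""
    if hours <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2) / (hours / 24.0)
```

      Read in reverse, the reported mean doubling time of 1.44 hours corresponds to a net growth rate of roughly 11.6 per day, which is what `growth_rate_per_day_from_doubling(1.44)` returns.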

      (5) A key issue is that a lack of heterogeneity in the cohort may be driving a lack of differences between the groups. Table 1 shows SpO2 values and lab values that all look normal. All infections were mild. This may make identifying biomarkers quite challenging.

      Thank you for your comment regarding heterogeneity in the cohort. The NFV cohort was designed to enroll COVID-19 patients with mild or asymptomatic infection, which limits its heterogeneity. We have revised the manuscript to discuss this point (page 21, lines 334-337).

      (6) Figure 3A: many of the clinical variables such as basophil count, Cl, and protein have very low pre-test probability of correlating with virologic outcome.

      Thank you for your comment regarding some clinical information we used in our study. We revised our manuscript to discuss this point (page 21, lines 337-338).

      (7) A key omission appears to be microRNA from pre and early-infection time points. It would be helpful to understand whether microRNA levels at least differed between the two collection timepoints and whether certain microRNAs are dynamic during infection.

      Thank you for your comment regarding the collection of micro-RNA data. As suggested by the reviewer, we compared micro-RNA levels between two time points using pairwise t-tests and Mann-Whitney U tests with FDR correction. As a result, no micro-RNA showed a statistically significant difference. This suggests that micro-RNA levels remain relatively stable during the course of infection, at least for mild or asymptomatic infection, and may therefore serve as a biomarker independent of sampling time. We have revised the manuscript to include this information (page 17, lines 259-262).
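
      The response does not say which FDR procedure was applied. Assuming the widely used Benjamini-Hochberg method, the correction step can be sketched as follows: each raw p-value is scaled by (number of tests / rank), and the adjusted values are then forced to be monotone from the largest p-value downwards. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per micro-RNA.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
```

      A micro-RNA would be reported as differing between time points only if its adjusted value falls at or below the chosen FDR threshold (0.05 in this sketch).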

      (8) The discussion could use a more thorough description of how viral kinetics differ in saliva versus nasal swabs and how this work complements other modeling studies in the field.

      We appreciate the reviewer’s thoughtful feedback. As suggested, we have added a discussion comparing our findings with studies that analyzed viral dynamics using nasal swabs, thereby highlighting the differences between viral dynamics in saliva and in the upper respiratory tract. To ensure a fair and rigorous comparison, we referred to studies that employed the same mathematical model (i.e., Eqs.(1-2)). Accordingly, we revised the manuscript and included additional references (page 20, lines 303-311).

      Furthermore, we clarified the significance of our study in two key aspects. First, it provides a detailed analysis of viral dynamics in saliva, reinforcing our previous findings from a single cohort by extending them across multiple cohorts. Second, this study uniquely examines whether viral dynamics in saliva can be directly predicted by exploring diverse clinical data and micro-RNAs. Notably, cohorts that have simultaneously collected and reported both viral load and a broad spectrum of clinical data from the same individuals, as in our study, are exceedingly rare. We revised the manuscript to clarify this point (page 20, lines 302-311).

      (9) The most predictive potential variables of shedding heterogeneity which pertain to the innate and adaptive immune responses (virus-specific antibody and T cell levels) are not measured or modeled.

      Thank you for your comment. We agree that antibody and T cell related markers may serve as the most powerful predictors, as supported by our own study [S. Miyamoto et al., PNAS (2023), ref. 24] as well as previous reports. While this point was already discussed in the manuscript, we have revised the text to make it more explicit (page 21, lines 327-328).

      (10) I am curious whether the models infer different peak viral loads, duration, expansion, and clearance slopes between the 2 cohorts based on fitting to different infection stage data.

      Thank you for your comment. We compared the features between the two cohorts, as the reviewer suggested. A statistically significant difference between the two cohorts (i.e., p-value ≤ 0.05 from the t-test) was observed only in the peak viral load, and overall trends were largely similar. At the peak, the mean value was 7.5 log<sub>10</sub> (copies/mL) in the Japan cohort and 8.1 log<sub>10</sub> (copies/mL) in the Illinois cohort, with variances of 0.88 and 0.87, respectively, indicating comparable variability.

      Reviewer #2 (Public review)

      Summary:

      This study argues that it has stratified viral kinetics for saliva specimens into three groups by the duration of "viral shedding"; the authors could not identify clinical data or microRNAs that correlate with these three groups.

      Strengths:

      The question of whether there is a stratification of viral kinetics is interesting.

      Weaknesses:

      The data underlying this work are not treated rigorously. The work in this manuscript is based on PCR data from two studies, with most of the data coming from a trial of nelfinavir (NFV) that showed no effect on the duration of SARS-CoV-2 PCR positivity. This study had no PCR data before symptom onset, and thus exclusively evaluated viral kinetics at or after peak viral loads. The second study is from the University of Illinois; this data set had sampling prior to infection, so has some ability to report the rate of "upswing." Problems in the analysis here include:

      We are grateful to the reviewer for the constructive feedback, which has greatly enhanced the quality of our study. In response, we have carefully revised the manuscript to address all comments.

      The PCR Ct data from each study is treated as equivalent and referred to as viral load, without any reports of calibration of platforms or across platforms. Can the authors provide calibration data and justify the direct comparison as well as the use of "viral load" rather than "Ct value"? Can the authors also explain on what basis they treat Ct values in the two studies as identical?

      Thank you for your comment regarding the description of the viral load data. The reviewer's comment made us recognize that we had not explained how the viral load data were integrated. For each study, we calculated viral load from the Ct value using the linear regression equation between Ct and viral load specific to that study's measurement method. We revised the manuscript to clarify this point in the "Saliva viral load data" section of Methods.
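
      A minimal sketch of such a per-assay conversion follows. The assay names, intercepts, and slopes are hypothetical placeholders, not the calibration values of either study; the only point is that each measurement method gets its own standard curve before values are pooled.

```python
def ct_to_log10_viral_load(ct, intercept, slope):
    """Convert a Ct value to log10 copies/mL with a linear standard curve."""
    return intercept + slope * ct


# Hypothetical standard-curve coefficients (NOT the studies' actual values).
CALIBRATION = {
    "assay_A": {"intercept": 14.0, "slope": -0.30},
    "assay_B": {"intercept": 12.5, "slope": -0.28},
}


def convert(ct, assay):
    """Apply the standard curve that belongs to the given assay."""
    coef = CALIBRATION[assay]
    return ct_to_log10_viral_load(ct, coef["intercept"], coef["slope"])
```

      Under these placeholder curves the same Ct maps to different viral loads on the two assays, for example `convert(25, "assay_A")` versus `convert(25, "assay_B")`, which is why raw Ct values cannot be treated as interchangeable across platforms.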

      The limit of detection for the NFV PCR data was unclear, so the authors assumed it was the same as the University of Illinois study. This seems a big assumption, as PCR platforms can differ substantially. Could the authors do sensitivity analyses around this assumption?

      Thank you for your comment regarding the detection limit for the viral load data. As the reviewer suggested, we conducted a sensitivity analysis of the assumed detection limit for the NFV dataset. Specifically, we performed data fitting in the same manner for two scenarios, in which the detection limit of the NFV PCR was lower (0 log<sub>10</sub> copies/mL) or higher (2 log<sub>10</sub> copies/mL) than that of the Illinois data (1.08 log<sub>10</sub> copies/mL), and compared the results.

      As a result, we obtained largely comparable viral dynamics in most cases (Supplementary Fig 6). When comparing the AIC values, we observed that the AIC for the same censoring threshold was 6836, whereas it increased to 7403 under the low censoring threshold and decreased to 6353 under the higher censoring threshold. However, this difference may be attributable to the varying number of data points treated as below the detection limit. Specifically, when the threshold is set higher, more data are treated as below the detection limit, which may result in a more favorable error calculation. To discuss this point, we have added a new figure (Supplementary Fig 6) and revised the manuscript accordingly (page 25, lines 415-418).
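
      The dependence of AIC on the censoring threshold can be seen in a small sketch. It assumes a Gaussian error on the log10 scale and the usual left-censored likelihood, in which a point at or below the detection limit contributes the probability of falling below that limit; whether the fitting software used in the study computes it exactly this way is an assumption. The observations, predictions, error SD, and parameter count are hypothetical.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def normal_logcdf(x, mu, sigma):
    """Log cumulative probability of a normal distribution."""
    cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return math.log(max(cdf, 1e-300))  # guard against log(0)


def censored_loglik(observed, predicted, sigma, lod):
    """Gaussian log-likelihood with left-censoring at the detection limit."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        if y <= lod:
            total += normal_logcdf(lod, mu, sigma)  # censored point
        else:
            total += normal_logpdf(y, mu, sigma)    # fully observed point
    return total


def aic(loglik, n_params):
    """Akaike information criterion."""
    return 2.0 * n_params - 2.0 * loglik


# Hypothetical log10 viral loads and model predictions (illustrative only).
observed = [7.5, 5.0, 1.5, 0.5]
predicted = [7.2, 5.3, 1.8, 0.6]
aic_by_lod = {
    lod: aic(censored_loglik(observed, predicted, 1.0, lod), n_params=4)
    for lod in (0.0, 1.08, 2.0)
}
```

      With identical data and predictions, raising the threshold turns more points into censored contributions and lowers the AIC in this toy case, so AIC values obtained under different thresholds are not directly comparable.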

      The authors refer to PCR positivity as viral shedding, but it is viral RNA detection (very different from shedding live/culturable virus, as shown in the Ke et al. paper). I suggest updating the language throughout the manuscript to be precise on this point.

      We appreciate the reviewer’s feedback regarding the terminology used for viral shedding. In response, we have revised all instances of “viral shedding” to “viral RNA detection” throughout the manuscript as suggested.

      Eyeballing extended data in Figure 1, a number of the putative long-duration infections appear to be likely cases of viral RNA rebound (for examples, see S01-16 and S01-27). What happens if all the samples that look like rebound are reanalyzed to exclude the late PCR detectable time points that appear after negative PCRs?

      We sincerely thank the reviewer for the valuable suggestion. In response, we established a criterion to remove data that appeared to exhibit rebound and subsequently performed data fitting (see Author response image 1 below). The criterion was defined as: “any data that increase again after reaching the detection limit in two measurements are considered rebound and removed.” As a result, 15 out of 144 cases were excluded due to insufficient usable data, leaving 129 cases for analysis. Using a single detection limit as the criterion would have excluded too many data points, while defining the criterion solely based on the magnitude of increase made it difficult to establish an appropriate “threshold for increase.”

      The fitting result indicates that the removal of rebound data may influence the fitting results; however, direct comparison of subsequent analyses, such as clustering, is challenging due to the reduced sample size. Moreover, the results can vary substantially depending on the criterion used to define rebound, and establishing a consistent standard remains difficult. Accordingly, we retained the current analysis and have added a discussion of rebound phenomena in the Discussion section as a limitation (page 22, lines 355-359). We once again sincerely appreciate the reviewer’s insightful and constructive suggestion.
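
      One possible reading of the rebound criterion quoted above is sketched below. It assumes that the two measurements at the detection limit must be consecutive and that only later points above the limit are dropped; both are interpretations, and the series are hypothetical.

```python
def remove_rebound(values, lod, n_below=2):
    """Drop points above the detection limit that occur after `n_below`
    consecutive measurements at or below it (assumed reading of the rule)."""
    kept = []
    run = 0          # current streak of measurements at or below the limit
    cleared = False  # True once the streak has reached n_below
    for v in values:
        if cleared and v > lod:
            continue  # treated as rebound and removed
        kept.append(v)
        run = run + 1 if v <= lod else 0
        if run >= n_below:
            cleared = True
    return kept


# Hypothetical log10 viral loads; 1.08 stands for the detection limit.
with_rebound = [6.2, 4.1, 1.08, 1.08, 3.0, 1.08]
single_dip = [6.2, 1.08, 3.5, 1.08, 1.08, 2.0]
cleaned_a = remove_rebound(with_rebound, lod=1.08)
cleaned_b = remove_rebound(single_dip, lod=1.08)
```

      In the second series the value 3.5 is retained because only one measurement at the detection limit precedes it, which mirrors the reason given above for not using a single detection-limit reading as the criterion.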

      Author response image 1.

      Comparison of model fits before and after removing data suspected of rebound. Black dots represent observed measurements, and the black and yellow curves show the fitted viral dynamics for the full dataset and the dataset with rebound data removed, respectively.

      There's no report of uncertainty in the model fits. Given the paucity of data for the upslope, there must be large uncertainty in the up-slope and likely in the peak, too, for the NFV data. This uncertainty is ignored in the subsequent analyses. This calls into question the efforts to stratify by the components of the viral kinetics. Could the authors please include analyses of uncertainty in their model fits and propagate this uncertainty through their analyses?

      We sincerely appreciate the reviewer’s detailed feedback on model uncertainty. To address this point, we revised Extended Fig 1 (now renumbered as Supplementary Fig 1) to include 95% credible intervals computed using a bootstrap approach. In addition, to examine the potential impact of model uncertainty on stratified analyses, we reconstructed the distance matrix underlying stratification by incorporating feature uncertainty. Specifically, for each individual, we sampled viral dynamics within the credible interval, averaged the resulting features, and built the distance matrix from these averages. We then compared this uncertainty-adjusted matrix with the original one using the Mantel test, which showed a strong correlation (r = 0.72, p < 0.001). Given this result, we did not replace the current stratification but revised the manuscript to provide this information in the Results and Methods sections (page 11, lines 159-162 and page 28, lines 512-519). Once again, we are deeply grateful for this insightful comment.
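
      For readers unfamiliar with the Mantel test, a minimal permutation-based sketch is given below. It assumes Pearson correlation on the upper triangles and a one-sided p-value; the two distance matrices are built from hypothetical one-dimensional points and are unrelated to the study data.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel_test(d1, d2, n_perm=999, seed=0):
    """Return (r, one-sided permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permute rows and columns of d2 together to keep it a distance matrix.
        permuted = [[d2[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Hypothetical feature values for six individuals, original and perturbed.
a = [0.0, 1.0, 2.5, 4.0, 6.0, 9.0]
b = [0.1, 0.9, 2.7, 4.2, 5.8, 9.3]
d_original = [[abs(p - q) for q in a] for p in a]
d_adjusted = [[abs(p - q) for q in b] for p in b]
r, p_value = mantel_test(d_original, d_adjusted)
```

      Permuting whole individuals rather than single matrix entries is what distinguishes the Mantel test from an ordinary correlation test, because entries that share an individual are not independent.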

      The clinical data are reported as a mean across the course of an infection; presumably vital signs and blood test results vary substantially, too, over this duration, so taking a mean without considering the timing of the tests or the dynamics of their results is perplexing. I'm not sure what to recommend here, as the timing and variation in the acquisition of these clinical data are not clear, and I do not have a strong understanding of the basis for the hypothesis the authors are testing.

      We appreciate the reviewer's feedback on the clinical data. The comment made us recognize that the manuscript did not describe how the clinical data were handled. In this research, we focused on finding “early predictors” that could provide insight into viral shedding patterns. Thus, we used the clinical data measured at the earliest time point (date of admission) for each patient. Another reason is that the date of admission is nearly the only time point at which complete clinical data without any missing values are available for all participants. We revised our manuscript to clarify this point (page 5, lines 90-95).

      It's unclear why microRNAs matter. It would be helpful if the authors could provide more support for their claims that (1) microRNAs play such a substantial role in determining the kinetics of other viruses and (2) they play such an important role in modulating COVID-19 that it's worth exploring the impact of microRNAs on SARS-CoV-2 kinetics. A link to a single review paper seems insufficient justification. What strong experimental evidence is there to support this line of research?

      We appreciate the reviewer’s comments regarding microRNA. Based on this feedback, we recognized the need to clarify our rationale for selecting microRNAs as the analyte. The primary reason was that our available specimens were saliva, and microRNAs are among the biomarkers that can be reliably measured in saliva. At the same time, previous studies have reported associations between microRNAs and various diseases, which led us to consider the potential relevance of microRNAs to viral dynamics, beyond their role as general health indicators. To better reflect this context, we have added supporting references (page 17, lines 240-243).

      Reviewer #3 (Public review)

      The article presents a comprehensive study on the stratification of viral shedding patterns in saliva among COVID-19 patients. The authors analyze longitudinal viral load data from 144 mildly symptomatic patients using a mathematical model, identifying three distinct groups based on the duration of viral shedding. Despite analyzing a wide range of clinical data and micro-RNA expression levels, the study could not find significant predictors for the stratified shedding patterns, highlighting the complexity of SARS-CoV-2 dynamics in saliva. The research underscores the need for identifying biomarkers to improve public health interventions and acknowledges several limitations, including the lack of consideration of recent variants, the sparsity of information before symptom onset, and the focus on symptomatic infections. 

      The manuscript is well-written, with the potential for enhanced clarity in explaining statistical methodologies. This work could inform public health strategies and diagnostic testing approaches. However, there is a thorough development of new statistical analysis needed, with major revisions to address the following points:

      We sincerely appreciate the thoughtful feedback provided by Reviewer #3, particularly regarding our methodology. In response, we conducted additional analyses and revised the manuscript accordingly. Below, we address the reviewer’s comments point by point.

      (1) Patient characterization & selection: Patient immunological status at inclusion (and if it was accessible at the time of infection) may be the strongest predictor for viral shedding in saliva. The authors state that the patients were not previously infected by SARS-COV-2. Was Anti-N antibody testing performed? Were other humoral measurements performed or did everything rely on declaration? From Figure 1A, I do not understand the rationale for excluding asymptomatic patients. Moreover, the mechanistic model can handle patients with only three observations, why are they not included? Finally, the 54 patients without clinical data can be used for the viral dynamics fitting and then discarded for the descriptive analysis. Excluding them can create a bias. All the discarded patients can help the virus dynamics analysis as it is a population approach. Please clarify. In Table 1 the absence of sex covariate is surprising.

      We appreciate the detailed feedback from the reviewer regarding patient selection. We relied on the patient's self-declaration to determine the patient's history of COVID-19 infection and revised the manuscript to specify this (page 6, lines 83-84).

      In parameter estimation, we used the date of symptom onset for each patient to establish the baseline of the time axis as clearly as possible, as in our previous work. Accordingly, asymptomatic patients, for whom no date of symptom onset is available, were excluded from the analysis. Additionally, in the cohort we analyzed, most patients excluded because of a limited number of observations (i.e., fewer than 3 points) already had a viral load close to the detection limit at the time of the first measurement. This reflects the design of the clinical trial: once a negative result was obtained twice in a row, no further follow-up sampling was performed. These patients were excluded from the analysis because it was hard to obtain reasonable fitting results. Also, we used 54 patients for the viral dynamics fitting and then only used the NFV cohort for clinical data analysis. We acknowledge that our description may have confused readers. We revised our manuscript to clarify these points regarding patient selection for data fitting (page 6, lines 96-102, page 24, lines 406-407, and page 7, lines 410-412). In addition, we realized, thanks to the reviewer’s comment, that gender information was missing in Table 1. We appreciate this observation and have revised the table to include gender (we used gender in our analysis).

      (2) Exact study timeline for explanatory covariates: I understand the idea of finding « early predictors » of long-lasting viral shedding. I believe it is key and a great question. However, some samples (Figure 4A) seem to be taken at the end of the viral shedding. I am not sure it is really easier to micro-RNA saliva samples than a PCR. So I need to be better convinced of the impact of the possible findings. Generally, the timeline of explanatory covariate is not described in a satisfactory manner in the actual manuscript. Also, the evaluation and inclusion of the daily symptoms in the analysis are unclear to me.

      We appreciate the reviewer’s feedback regarding the collection of explanatory variables. As noted, of the two microRNA samples collected from each patient, one was obtained near the end of viral shedding. This was intended to examine potential differences in microRNA levels between the early and late phases of infection. No significant differences were observed between the two time points, and using microRNA from either phase alone or both together did not substantially affect predictive accuracy for stratified groups. Furthermore, microRNA collection was motivated primarily by the expectation that it would be more sensitive to immune responses, rather than by ease of sampling. We have revised the manuscript to clarify these points regarding microRNA (page 17, lines 243-245 and 259-262).

      Furthermore, as suggested by the reviewer, we have also strengthened the explanation regarding the collection schedule of clinical information and the use of daily symptoms in the analysis (page 6, lines 90-95 and page 14, lines 218-220).

      (3) Early Trajectory Differentiation: The model struggles to differentiate between patients' viral load trajectories in the early phase, with overlapping slopes and indistinguishable viral load peaks observed in Figures 2B, 2C, and 2D. The question arises whether this issue stems from the data, the nature of Covid-19, or the model itself. The authors discuss the scarcity of pre-symptom data, primarily relying on Illinois patients who underwent testing before symptom onset. This contrasts earlier statements on pages 5-6 & 23, where they claim the data captures the full infection dynamics, suggesting sufficient early data for pre-symptom kinetics estimation. The authors need to provide detailed information on the number or timing of patient sample collections during each period.

      Thank you for the reviewer’s thoughtful comments. The model used in this study [Eqs.(1-2)] has been employed in numerous prior studies and has successfully identified viral dynamics at the individual level. In this context, we interpret the rapid viral increase observed across participants as attributable to characteristics of SARS-CoV-2 in saliva, an interpretation that has also been reported by multiple previous studies. We have added the relevant references and strengthened the corresponding discussion in the manuscript (page 20, lines 303-311).

      We acknowledge that our explanation of how the complementary relationship between the two cohorts contributes to capturing infection dynamics was not sufficiently clear. As described in the manuscript, the Illinois cohort provides pre-symptomatic data, whereas the NFV cohort offers abundant end-phase data, thereby compensating for each other’s missing phases. By jointly analyzing the two cohorts with a nonlinear mixed-effects model, we estimated viral dynamics at the individual-level. This approach first estimates population-level parameters (fixed effects) using data from all participants and then incorporates random effects to account for individual variability, yielding the most plausible parameter values.

      Thus, even when early-phase data are lacking in the NFV cohort, information from the Illinois cohort allows us to infer the most reasonable dynamics, and the reverse holds true for the end phase. In this context, we argued that combining the two cohorts enables mathematical modeling to capture infection dynamics at the individual level. Recognizing that our earlier description could be misleading, we have carefully reinforced the relevant description (page 27, lines 472-483). In addition, as suggested by the reviewer, we have added information on the number of data samples available for each phase in both cohorts (page 7, lines 106-109).

      (4) Conditioning on the future: Conditioning on the future in statistics refers to the problematic situation where an analysis inadvertently relies on information that would not have been available at the time decisions were made or data were collected. This seems to be the case when the authors create micro-RNA data (Figure 4A). First, the sampling times need to be clarified by the authors (for clinical outcomes as well). Second, proper causal inference relies on the assumption that the cause precedes the effect. This conditioning on the future may result in overestimating the model's accuracy. This happens because the model has been exposed to the outcome it's supposed to predict. This could question the - already weak - relation with mir-1846 level.

      We appreciate the reviewer’s detailed feedback. As noted in our reply to Comment 2, we collected micro-RNA samples at two time points, near the peak of infection dynamics and at the end stage, and found no significant differences between them. This suggests that micro-RNA levels are not substantially affected by sampling time. Indeed, analyses conducted using samples from the peak, the late stage, or both yielded nearly identical results in relation to infection dynamics. To clarify this point, we revised the manuscript by integrating this explanation with our reply to Comment 2 (page 17, lines 259-262). In addition, we revised the manuscript to clarify the sampling times of clinical information and micro-RNA (page 6, lines 90-95).

      (5) Mathematical Model Choice Justification and Performance: The paper lacks mention of the practical identifiability of the model (especially for tau regarding the lack of early data information). Moreover, it is expected that the immune effector model will be more useful at the beginning of the infection (for which data are the more parsimonious). Please provide AIC for comparison, saying that they have "equal performance" is not enough. Can you provide at least in a point-by-point response the VPC & convergence assessments?

      We appreciate the reviewer’s detailed feedback regarding the mathematical model. We acknowledge the potential concern regarding the practical identifiability of tau (incubation period), particularly given the limited early-phase data. In our analysis, however, the nonlinear mixed-effects model yielded a population-level estimate of 4.13 days, which is similar to previously reported incubation periods for COVID-19. This concordance suggests that our estimate of tau is reasonable despite the scarcity of early data.

      For model comparison, we have first added information on the AIC of the two models to the manuscript, as suggested by the reviewer (page 10, lines 130-135). One point we would like to emphasize is that we adopted a simple target cell-limited model in this study because our aim was to reconstruct viral dynamics and stratify shedding patterns rather than to explore the mechanism of viral infection in detail. Nevertheless, we believe that the target cell-limited model provides reasonable reconstructed viral dynamics, as it has been used in many previous studies. We revised the manuscript to clarify this (page 10, lines 135-144).
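
      As an illustration of what a target cell-limited model produces, the sketch below integrates one commonly used two-equation reduced form, df/dt = -βfV and dV/dt = γfV - δV, where f is the fraction of uninfected target cells and V the viral load. This form is an assumption: the manuscript's Eqs. (1-2) are not reproduced in this response and may differ, and all parameter values are hypothetical.

```python
import math


def rhs(f, v, beta, gamma, delta):
    """Right-hand side of the assumed two-equation target cell-limited model."""
    return -beta * f * v, gamma * f * v - delta * v


def simulate(beta, gamma, delta, v0, t_end=30.0, dt=0.01):
    """Integrate with classical RK4; return times and log10 viral loads."""
    f, v = 1.0, v0
    times, log_loads = [0.0], [math.log10(v0)]
    for step in range(1, int(round(t_end / dt)) + 1):
        k1f, k1v = rhs(f, v, beta, gamma, delta)
        k2f, k2v = rhs(f + 0.5 * dt * k1f, v + 0.5 * dt * k1v, beta, gamma, delta)
        k3f, k3v = rhs(f + 0.5 * dt * k2f, v + 0.5 * dt * k2v, beta, gamma, delta)
        k4f, k4v = rhs(f + dt * k3f, v + dt * k3v, beta, gamma, delta)
        f += dt * (k1f + 2 * k2f + 2 * k3f + k4f) / 6.0
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
        times.append(step * dt)
        log_loads.append(math.log10(v))
    return times, log_loads


# Hypothetical parameters (per day; beta in mL/copies/day), illustrative only.
times, log_loads = simulate(beta=5e-8, gamma=6.0, delta=1.5, v0=0.01)
peak_index = max(range(len(log_loads)), key=log_loads.__getitem__)
peak_time, peak_log_load = times[peak_index], log_loads[peak_index]
```

      Features such as peak viral load, time to peak, and clearance slope can be read off the simulated curve, which is the sense in which a simple model suffices for reconstruction even though it says little about immune mechanisms.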

      Furthermore, as suggested, we have added the VPC and convergence assessment results for both models, together with explanatory text, to the manuscript (Supplementary Fig 2, Supplementary Fig 3, and page 10, lines 130-135). In the VPC, the observed 5th, 50th, and 95th percentiles were generally within the corresponding simulated prediction intervals across most time points. Although minor deviations were noted in certain intervals, the overall distribution of the observed data was well captured by the models, supporting their predictive performance (Supplementary Fig 2). In addition, the log-likelihood and SAEM parameter trajectories stabilized after the burn-in phase, confirming appropriate convergence (Supplementary Fig 3).

      (6) Selected features of viral shedding: I wonder to what extent the viral shedding area under the curve (AUC) and normalized AUC should be added as selected features.

      We sincerely appreciate the reviewer’s valuable suggestion regarding the inclusion of additional features. Following this recommendation, we considered AUC (or normalized AUC) as an additional feature when constructing the distance matrix used for stratification. We then evaluated the similarity between the resulting distance matrix and the original one using the Mantel test, which showed a very high correlation (r = 0.92, p < 0.001). This indicates that incorporating AUC as an additional feature does not substantially alter the distance matrix. Accordingly, we have decided to retain the current stratification analysis, and we sincerely thank the reviewer once again for this interesting suggestion.
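
      A minimal sketch of the AUC feature is shown below, using the trapezoidal rule. Defining the normalized AUC as the AUC divided by the observation window is an assumption, since the exact normalization is not specified here, and the time points and values are hypothetical.

```python
def trapezoid_auc(times, values):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def normalized_auc(times, values):
    """AUC divided by the length of the time window (assumed definition)."""
    return trapezoid_auc(times, values) / (times[-1] - times[0])


# Hypothetical days since onset and log10 viral loads (illustrative only).
days = [0.0, 2.0, 4.0, 8.0]
log10_load = [2.0, 8.0, 6.0, 2.0]
auc = trapezoid_auc(days, log10_load)
auc_per_day = normalized_auc(days, log10_load)
```

      Because AUC is largely determined by peak height and duration, it tends to carry little information beyond features that already describe those two quantities, which is consistent with the high Mantel correlation reported above.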

      (7) Two-step nature of the analysis: First you fit a mechanistic model, then you use the predictions of this model to perform clustering and prediction of groups (unsupervised then supervised). Thus you do not propagate the uncertainty intrinsic to your first estimation through the second step, ie. all the viral load selected features actually have a confidence bound which is ignored. Did you consider a one-step analysis in which your covariates of interest play a direct role in the parameters of the mechanistic model as covariates? To pursue this type of analysis SCM (Johnson et al. Pharm. Res. 1998), COSSAC (Ayral et al. 2021 CPT PsP), or SAMBA ( Prague et al. CPT PsP 2021) methods can be used. Did you consider sampling on the posterior distribution rather than using EBE to avoid shrinkage?

      Thank you for the reviewer’s detailed suggestions regarding our analysis. We agree that the current approach does not adequately account for the impact of uncertainty in viral dynamics on the stratified analyses. As a first step, we have revised Extended Data Fig 1 (now renumbered as Supplementary Fig 1) to include 95% credible intervals computed using a bootstrap approach, to present the model-fitting uncertainty more explicitly. Then, to examine the potential impact of model uncertainty on stratified analyses, we reconstructed the distance matrix underlying stratification by incorporating feature uncertainty. Specifically, for each individual, we sampled viral dynamics within the credible interval, averaged the resulting features, and built the distance matrix from these averages. We then compared this uncertainty-adjusted matrix with the original one using the Mantel test, which showed a strong correlation (r = 0.72, p < 0.001). Given this result, we did not replace the current stratification but revised the manuscript to provide this information (page 11, lines 159-162 and page 28, lines 512-519).

      Furthermore, we carefully considered the reviewer’s proposed one-step analysis. However, implementation was constrained by data-fitting limitations. Concretely, clinical information is available only in the NFV cohort. Thus, if these variables are to be entered directly as covariates on the parameters, the Illinois cohort cannot be included in the data-fitting process. Yet the NFV cohort lacks any pre-symptomatic observations, so fitting the model to that cohort alone does not permit a reasonable (well-identified/robust) fitting result. While we were unable to implement the suggestion under the current data constraints, we sincerely appreciate the reviewer’s thoughtful and stimulating proposal.

      (8) Need for advanced statistical methods: The analysis is characterized by a lack of power. This can indeed come from the sample size that is characterized by the number of data available in the study. However, I believe the power could be increased using more advanced statistical methods. At least it is worth a try. First considering the unsupervised clustering, summarizing the viral shedding trajectories with features collapses longitudinal information. I wonder if the R package « LongituRF » (and associated method) could help, see Capitaine et al. 2020 SMMR. Another interesting tool to investigate could be latent class models R package « lcmm » (and associated method), see Proust-Lima et al. 2017 J. Stat. Software. But the latter may be more far-fetched.

      Thank you for the reviewer’s thoughtful suggestions regarding our unsupervised clustering approach. The R package “LongituRF” is designed for supervised analysis, requiring a target outcome to guide the calculation of distances between individuals (i.e., between viral dynamics). In our study, however, the goal was purely unsupervised clustering, without any outcome variable, making direct application of “LongituRF” challenging.

      Our current approach (summarizing each dynamic into several interpretable features and then using Random Forest proximities) allows us to construct a distance matrix in an unsupervised manner. Here, the Random Forest is applied in “proximity mode,” focusing on how often dynamics are grouped together in the trees, independent of any target variable. This provides a practical and principled way to capture overall patterns of dynamics while keeping the analysis fully unsupervised.
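
      The proximity idea can be sketched independently of how the forest is grown. The example assumes that, for each tree, the leaf reached by every individual is already known; the leaf assignments are hypothetical, and converting proximity to distance as one minus proximity is one common convention, not necessarily the one used in the study.

```python
def proximity_distance(leaf_ids):
    """Build a distance matrix from random-forest leaf assignments.

    leaf_ids[t][i] is the leaf that individual i reaches in tree t.
    Proximity is the fraction of trees in which two individuals share a
    leaf; the distance is taken as 1 - proximity (assumed convention).
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            dist[i][j] = dist[j][i] = 1.0 - shared / n_trees
    return dist


# Hypothetical leaf assignments for 4 individuals across 4 trees.
leaf_ids = [
    [0, 0, 1, 1],
    [2, 2, 2, 3],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
dist = proximity_distance(leaf_ids)
```

      The resulting matrix can be passed to any clustering method that accepts precomputed distances, which is what allows the stratification to proceed without an outcome variable.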

      Regarding the suggestion to use latent class mixed models (R package “lcmm”), we also considered this approach. In our dataset, each subject has dense longitudinal measurements, and at many time points, trajectories are very similar across subjects, resulting in minimal inter-individual differences. Consequently, fitting multi-class latent class mixed models (ng ≥ 2) with random effects or mixture terms is numerically unstable, often producing errors such as non-positive definite covariance matrices or failure to generate valid initial values. Although one could consider using only the time points with the largest differences, this effectively reduces the analysis to a feature-based summary of dynamics. Such an approach closely resembles our current method and contradicts the goal of clustering based on full longitudinal information.

      Taken together, although we acknowledge that incorporating more longitudinal information is important, we believe that our current approach provides a practical, stable, and informative solution for capturing heterogeneity in viral dynamics. We would like to once again express our sincere gratitude to the reviewer for this insightful suggestion.

      (9) Study intrinsic limitation: All the results cannot be extended to asymptomatic patients and patients infected with recent VOCs. It definitively limits the impact of results and their applicability to public health. However, for me, the novelty of the data analysis techniques used should also be taken into consideration.

      We appreciate your positive evaluation of our research approach and acknowledge that, as noted in the Discussion section as our first limitation, our analysis may not provide valid insights into recent VOCs or all populations, including asymptomatic individuals. Nonetheless, we believe it is novel that we extensively investigated the relationship between viral shedding patterns in saliva and a wide range of clinical and micro-RNA data. Our findings contribute to a deeper and more quantitative understanding of heterogeneity in viral dynamics, particularly in saliva samples. To discuss this point, we revised our manuscript (page 22, lines 364-368).

      Strengths are:

      Unique data and comprehensive analysis.

      Novel results on viral shedding.

      Weaknesses are:

      Limitation of study design.

      The need for advanced statistical methodology.

      Reviewer #1 (Recommendations For The Authors):

      Line 8: In the abstract, it would be helpful to state how stratification occurred.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 2, lines 8-11).

      Line 31 and discussion: It is important to mention the challenges of using saliva as a specimen type for lab personnel.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 36-41).

      Line 35: change to "upper respiratory tract".

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, line 35).

      Line 37: "Saliva" is not a tissue. Please hazard a guess as to which tissue is responsible for saliva shedding and if it overlaps with oral and nasal swabs.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 42-45).

      Line 42, 68: Please explain how understanding saliva shedding dynamics would impact isolation & screening, diagnostics, and treatments. This is not immediately intuitive to me.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 48-50).

      Line 50: It would be helpful to explain why shedding duration is the best stratification variable.

      We thank the reviewer for the feedback. We acknowledge that our wording was ambiguous. The clear differences in the viral dynamics patterns pertain to findings observed following the stratification, and we have revised the manuscript to make this explicit (page 4, lines 59-61).

      Line 71: Dates should be listed for these studies.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 6, lines 85-86).

      Reviewer #2 (Recommendations For The Authors):

      Please make all code and data available for replication of the analyses.

      We appreciate the suggestion. Due to ethical considerations, it is not possible to make all data and code publicly available. We have stated this clearly in the manuscript (Data availability section in Methods).

      Reviewer #3 (Recommendations For The Authors):

      Here are minor comments / technical details:

      (1) Figure 1B is difficult to understand.

      Thank you for the comment. We updated Fig 1B to incorporate more information to aid interpretation.

      (2) Did you analyse viral load or the log10 of viral load? The latter is more common. You should consider it. Please plot SI Figure 1 in log10 and use a different point shape for censored data. The file quality of this figure should be improved. State in the Materials and Methods whether SE with Monolix are computed by linearization or importance sampling.

      Thank you for the comment. We conducted our analyses using log10-transformed viral load. Also, we revised Supplementary Fig 1 (now renumbered as Supplementary Fig 4) as suggested. We also added Supplementary Fig 3 and clarified in the Methods that standard errors (SE) were obtained in Monolix from the Fisher information matrix using the linearization method (page 28, lines 498-499).

      (3) Table 1 and Figure 3A could be collapsed.

      Thank you for the comment; we carefully considered this suggestion. Table 1 summarizes clinical variables by category, whereas Fig 3A visualizes them ordered by the p-value of the statistical analysis. Collapsing these into a single table would make it difficult to grasp both the categorical summaries and the statistical ranking at a glance, thereby reducing readability. We therefore decided to retain the current layout. We again appreciate the constructive feedback.

      (4) Figure 3 legend could be clarified to understand what is 3B and 3C.

      We thank the reviewer for the feedback and have reinforced the description accordingly.

      (5) Why use AIC instead of BICc?

      Thank you for your comment. We also think BICc is a reasonable alternative. However, because our objective is predictive adequacy (reconstruction of viral dynamics), we judged AIC more appropriate. In NLMEM settings, the effective sample size required by BICc is ambiguous, making the penalty somewhat arbitrary. Moreover, since the two models reconstruct very similar dynamics, our conclusions are not sensitive to the choice of criterion.
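
      The contrast between the criteria can be illustrated with a minimal sketch. This is not the authors' code. The BICc form used here (separate penalties for random-effect and fixed-effect parameters, scaled by the number of subjects and the total number of observations respectively) is an assumption about the definition commonly used for mixed-effects models, and all function and argument names are hypothetical.

```python
from math import log


def aic(neg2ll: float, n_params: int) -> float:
    """AIC: -2*log-likelihood plus 2 per estimated parameter (no sample-size term)."""
    return neg2ll + 2 * n_params


def bic(neg2ll: float, n_params: int, n: int) -> float:
    """BIC: the penalty grows with log(n), so the choice of n matters."""
    return neg2ll + n_params * log(n)


def bicc(neg2ll: float, n_random: int, n_fixed: int,
         n_subjects: int, n_obs: int) -> float:
    """Assumed BICc form for mixed-effects models: parameters tied to random
    effects are penalized by log(number of subjects), the remaining
    parameters by log(total number of observations)."""
    return neg2ll + n_random * log(n_subjects) + n_fixed * log(n_obs)
```

      Because AIC has no sample-size term, it avoids the question of which n to use; the two BIC variants differ only through that choice.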

      (6) Bibliography. Most articles are with et al. (which is not standard) and some are with an extended list of names. Provide DOI for all.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly.

      (7) Extended Table 1&2 - maybe provide a color code to better highlight some lower p-values (if you find any interesting).

      We thank the reviewer for the feedback. Since no clinical information and micro-RNAs other than mir-1846 showed low p-values, we highlighted only mir-1846 with color to make it easier to locate.

      (8) Please make the replication code available.

      We appreciate the suggestion. Due to ethical considerations, it is not possible to make all data and code publicly available. We have stated this clearly in the manuscript (Data availability section in Methods).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this work, van Paassen et al. have studied how CD8 T cell functionality and levels predict HIV DNA decline. The article touches on interesting facets of HIV DNA decay, but ultimately comes across as somewhat hastily done and not convincing due to the major issues. 

      (1) The use of only 2 time points to make many claims about longitudinal dynamics is not convincing. For instance, the fact that raw data do not show decay in intact, but do for defective/total, suggests that the present data is underpowered. The authors speculate that rising intact levels could be due to patients who have reservoirs with many proviruses with survival advantages, but this is not the parsimonious explanation vs the data simply being noisy without sufficient longitudinal follow-up. n=12 is fine, or even reasonably good for HIV reservoir studies, but to mitigate these issues would likely require more time points measured per person. 

      (1b) Relatedly, the timing of the first time point (6 months) could be causing a number of issues because this is in the ballpark for when the HIV DNA decay decelerates, as shown by many papers. This unfortunate study design means some of these participants may already have stabilized HIV DNA levels, so earlier measurements would help to observe early kinetics, but also later measurements would be critical to be confident about stability. 

      The main goal of the present study was to understand the relationship of the HIV-specific CD8 T-cell responses early on ART with the reservoir changes across the subsequent 2.5-year period on suppressive therapy. We have revised the manuscript in order to clarify this.  We chose these time points because the 24 week time point is past the initial steep decline of HIV DNA, which takes place in the first weeks after ART initiation. It is known that HIV DNA continues to decay for years after (Besson, Lalama et al. 2014, Gandhi, McMahon et al. 2017). 

      (2) Statistical analysis is frequently not sufficient for the claims being made, such that overinterpretation of the data is problematic in many places. 

      (2a) First, though plausible that cd8s influence reservoir decay, much more rigorous statistical analysis would be needed to assert this directionality; this is an association, which could just as well be inverted (reservoir disappearance drives CD8 T cell disappearance). 

      To correlate different reservoir measures between themselves and with CD8+ T-cell responses at 24 and 156 weeks, we now performed non-parametric (Spearman) correlation analyses, as they do not require any assumptions about the normal distribution of the independent and dependent variables. Benjamini-Hochberg corrections for multiple comparisons (false discovery rate, 0.25) were included in the analyses and did not change the results. 

      Following this comment we would like to note that the association between the T-cell response at 24 weeks and the subsequent decrease in the reservoir cannot be bi-directional (that can only be the case when both variables are measured at the same time point). Therefore, to model the predictive value of T-cell responses measured at 24 weeks for the decrease in the reservoir between 24 and 156 weeks, we fitted generalized linear models (GLM), in which we included age and ART regimen, in addition to three different measures of HIV-specific CD8+ T-cell responses, as explanatory variables, and changes in total, intact, and total defective HIV DNA between 24 and 156 weeks ART as dependent variables.

      (2b) Words like "strong" for correlations must be justified by correlation coefficients, and these heat maps indicate many comparisons were made, such that p-values must be corrected appropriately. 

      We have now used Spearman correlation analysis, provided correlation coefficients to justify the wording, and adjusted the p-values for multiple comparisons (Fig. 1, Fig 3., Table 2). Benjamini-Hochberg corrections for multiple comparisons (false discovery rate, 0.25) were included in the analyses and did not change the results.  
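
      For readers unfamiliar with these two steps, a minimal pure-Python sketch follows. It is an illustration only, not the authors' analysis code: the function names are hypothetical, and only the false discovery rate of 0.25 is taken from the response above.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def benjamini_hochberg(pvalues: Sequence[float], fdr: float = 0.25) -> List[bool]:
    """Flag the hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank k whose sorted p-value satisfies p_(k) <= k/m * fdr
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff
    return rejected
```

      Working on ranks is what removes the normality assumption, and the step-up rule controls the expected share of false positives among the correlations reported as significant.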

      (3) There is not enough introduction and references to put this work in the context of a large/mature field. The impacts of CD8s in HIV acute infection and HIV reservoirs are both deep fields with a lot of complexity. 

      Following this comment we have revised and expanded the introduction to put our work more in the context of the field (CD8s in acute HIV and HIV reservoirs). 

      Reviewer #2 (Public review): 

      Summary: 

      This study investigated the impact of early HIV specific CD8 T cell responses on the viral reservoir size after 24 weeks and 3 years of follow-up in individuals who started ART during acute infection. Viral reservoir quantification showed that total and defective HIV DNA, but not intact, declined significantly between 24 weeks and 3 years post-ART. The authors also showed that functional HIV-specific CD8⁺ T-cell responses persisted over three years and that early CD8⁺ T-cell proliferative capacity was linked to reservoir decline, supporting early immune intervention in the design of curative strategies. 

      Strengths: 

      The paper is well written, easy to read, and the findings are clearly presented. The study is novel as it demonstrates the effect of HIV specific CD8 T cell responses on different states of the HIV reservoir, that is HIV-DNA (intact and defective), the transcriptionally active and inducible reservoir. Although small, the study cohort was relevant and well-characterized as it included individuals who initiated ART during acute infection, 12 of whom were followed longitudinally for 3 years, providing unique insights into the beneficial effects of early treatment on both immune responses and the viral reservoir. The study uses advanced methodology. I enjoyed reading the paper. 

      Weaknesses: 

      All participants were male (acknowledged by the authors), potentially reducing the generalizability of the findings to broader populations. A control group receiving ART during chronic infection would have been an interesting comparison. 

      We thank the reviewer for their appreciation of our study. Although we had indeed acknowledged the fact that all participants were male, we have clarified why this is a limitation of the study (Discussion, lines 296-298). The reviewer raises the point that it would be useful to compare our data to a control group. Unfortunately, these samples are not yet available, but our study protocol allows for a control group (chronic infection), so we will be able to include one in the future.

      Reviewer #1 (Recommendations for the authors): 

      Minor: 

      On the introduction: 

      (1) One large topic that is mostly missing completely is the emerging evidence of selection on HIV proviruses during ART from the groups of Xu Yu and Matthias Lichterfeld, and Ya Chi Ho, among others. 

      Previously, it was only touched upon in the Discussion. Now we have also included this in the Introduction (lines 77-80).

      (2) References 4 and 5 don't quite match with the statement here about reservoir seeding; we don't completely understand this process, and certainly, the tissue seeding aspect is not known. 

      Line 61-62: references were changed and this paragraph was rewritten to clarify.

      (3) Shelton et al. showed a strong relationship with HIV DNA size and timing of ART initiation across many studies. I believe Ananworanich also has several key papers on this topic. 

      References by Ananworanich are included (lines 91-94).

      (4) "the viral levels decline within weeks of AHI", this is imprecise, there is a peak and a decline, and an equilibrium. 

      We agree and have rewritten the paragraph accordingly.

      (5) The impact of CD8 cells on viral evolution during primary infection is complex and likely not relevant for this paper. 

      We have left viral evolution out of the introduction in order to keep a focus on the current subject.

      (6) The term "reservoir" is somewhat polarizing, so it might be worth mentioning somewhere exactly what you think the reservoir is, I think, as written, your definition is any HIV DNA in a person on ART? 

      Indeed, we refer to the reservoir when we talk about the several aspects of the reservoir that we have quantified with our assays (total HIV DNA, unspliced RNA, intact and defective proviral DNA, and replication-competent virus). In most instances we try to specify which measurement we are referring to. We have added additional reservoir explanation to clarify our definition to the introduction (lines 55-58).

      (7) I think US might be used before it is defined. 

      We thank the reviewer for pointing this out. We have now also defined it in the Results section (line 131).

      (8) In Figure 1 it's also not clear how statistics were done to deal with undetectable values, which can be tricky but important. 

      We have now clarified this in the legend to Figure 2 (former Figure 1). Paired Wilcoxon tests were performed to test the significance of the differences between the time points. Pairs where both values were undetectable were always excluded from the analysis. Pairs where one value was undetectable and its detection limit was higher than the value of the detectable partner, were also excluded from the analysis. Pairs where one value was undetectable and its detection limit was lower than the value of the detectable partner, were retained in the analysis.
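
      The pair-exclusion rules described above can be sketched as a small filter applied before the paired test. This is an illustration under stated assumptions, not the authors' code: the `Measurement` type and the function names are hypothetical, and the handling of a detection limit exactly equal to the detectable partner (excluded here) is an assumption, since that case is not covered by the legend.

```python
from typing import List, NamedTuple, Tuple


class Measurement(NamedTuple):
    value: float    # measured level, or the detection limit if undetectable
    detected: bool  # False when the assay result was below the detection limit


def keep_pair(first: Measurement, second: Measurement) -> bool:
    """Decide whether a pair of time points enters the paired Wilcoxon test."""
    if first.detected and second.detected:
        return True   # both values measured: always informative
    if not first.detected and not second.detected:
        return False  # both undetectable: direction of change is unknown
    censored, observed = (first, second) if not first.detected else (second, first)
    # Retain only if the detection limit lies below the detectable partner,
    # so the direction of change is unambiguous (equality excluded: assumption).
    return censored.value < observed.value


def filter_pairs(pairs: List[Tuple[Measurement, Measurement]]
                 ) -> List[Tuple[Measurement, Measurement]]:
    """Return only the pairs that are retained for the analysis."""
    return [pair for pair in pairs if keep_pair(*pair)]
```

      The retained pairs are exactly those in which the sign of the change between the two time points can be determined despite censoring.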

      In the discussion: 

      (1) "This confirms that the existence of a replication-competent viral reservoir is linked to the presence of intact HIV DNA." I think this statement is indicative of many of the overinterpretations without statistical justification. There are 4 of 12 individuals with QVOA+ detectable proviruses, which means there are 8 without. What are their intact HIV DNA levels? 

      We thank the reviewer for the question that is raised here. We have now compared the intact DNA levels (measured by IPDA) between participants with positive vs. negative QVOA output, and observed a significant difference. We rephrased the wording as follows: “We compared the intact HIV DNA levels at the 24-week timepoint between the six participants, from whom we were able to isolate replicating virus, and the fourteen participants, from whom we could not. Participants with positive QVOA had significantly higher intact HIV DNA levels than those with negative QVOA (p=0.029, Mann-Whitney test; Suppl. Fig. 3). Five of six participants with positive QVOA had intact DNA levels above 100 copies/10⁶ PBMC, while thirteen of fourteen participants with negative QVOA had intact HIV DNA below 100 copies/10⁶ PBMC (p=0.0022, Fisher’s exact test). These findings indicate that recovery of replication-competent virus by QVOA is more likely in individuals with higher levels of intact HIV DNA in IPDA, reaffirming a link between the two measurements.”
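
      The Fisher's exact test quoted above can be checked from the counts alone. The sketch below is a hand-rolled two-sided test using only the standard library; it is an illustration rather than the authors' code, and the function name is hypothetical. The 2x2 table is taken from the text: 5 of 6 QVOA-positive and 1 of 14 QVOA-negative participants had intact HIV DNA above the 100 copies threshold.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Rows: QVOA positive, QVOA negative; columns: above, below the threshold.
p_value = fisher_exact_two_sided(5, 1, 1, 13)
print(round(p_value, 4))  # prints 0.0022
```

      The result agrees with the reported p=0.0022.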

      (2) "To determine whether early HIV-specific CD8+ T-cell responses at 24 weeks were predictive for the change in reservoir size". This is a fundamental miss on correlation vs causation... it could be the inverse. 

      We thank the reviewer for the remark. We have calculated the change in reservoir size (the difference between the reservoir size at 24 weeks and 156 weeks ART) and analyzed whether the HIV-specific CD8+ T-cell responses at 24 weeks ART are predictive for this change. We do not think the relationship can be inverse, as the measurements are chronologically ordered (CD8+ responses at week 24 predict the subsequent change in the reservoir).

      (3) "This may suggest that active viral replication drives the CD8+ T-cell response." I think to be precise, you mean viral transcription drives CD8s, we don't know about the full replication cycle from these data. 

      We agree with the reviewer and have changed “replication” to “transcription” (line 280).

      (4) "Remarkably, we observed that the defective HIV DNA levels declined significantly between 24 weeks and 3 years on ART. This is in contrast to previous observations in chronic HIV infection (30)". I don't find this remarkable or in contrast: many studies have analyzed and/or modeled defective HIV DNA decay, most of which have shown some negative slope to defective HIV DNA, especially within the first year of ART. See White et al., Blankson et al., Golob et al., Besson et al., etc In addition, do you mean in long-term suppressed? 

      The point we would like to make is that we found a significant, prominent decrease in defective DNA (and not intact DNA) over the course of 3 years, whereas other studies usually report a significant decrease in intact DNA and a less prominent decrease in defective DNA. We have rephrased the wording (lines 227-230) as follows:

      “We observed that the defective HIV DNA levels decreased significantly between 24 and 156 weeks of ART. This is different from studies in CHI, where no significant decrease during the first 7 years of ART (Peluso, Bacchetti et al. 2020, Gandhi, Cyktor et al. 2021), or only a significant decrease during the first 8 weeks on ART, but not in the 8 years thereafter, was observed (Nühn, Bosman et al. 2025).”

      Reviewer #2 (Recommendations for the authors): 

      (1) Page 4, paragraph 2 - will be informative to report the statistics here. 

      (2) Page 4, paragraph 4 - "General phenotyping of CD4+ (Suppl. Fig. 3A) and CD8+ (Supplementary Figure 3B) T-cells showed no difference in frequencies of naïve, memory or effector CD8+ T-cells between 24 and 156 weeks." - What did the CD4+ phenotyping show? 

      We thank the reviewer for the remark. Indeed, there were also no differences in frequencies of naïve, memory or effector CD4+ T-cells between 24 and 156 weeks. We have added this to the paragraph (now Suppl. Fig 4), lines 166-168.

      (3) Page 5, paragraph 3 - "Similarly, a broad HIV-specific CD8+ T-cell proliferative response to at least three different viral proteins was observed in the majority of individuals at both time points" - should specify n=? for the majority of individuals. 

      At time point 24 weeks, 6/11 individuals had a response to env, 10/11 to gag, 5/11 to nef, and 4/11 to pol. At 156 weeks, 8/11 to env, 10/11 to gag, 8/11 to nef and 9/11 to pol. We have added this to the text (lines 188-191).

      (4) Seven of 22 participants had non-subtype B infection. Can the authors explain the use of the IPDA designed by Bruner et. al. for subtype B HIV, and how this may have affected the quantification in these participants? 

      Intact HIV DNA was detectable in all 22 participants. We cannot completely exclude an influence of primer/probe-template mismatches on the quantification results; however, such mismatches could also have occurred in subtype B participants, and droplet digital PCR, on which the IPDA is based, is generally much less sensitive to these mismatches than qPCR.

      (5) Page 7, paragraph 2 - the authors report a difference in findings from a previous study ("a decline in CD8 T cell responses over 2 years" - reference 21), but only provide an explanation for this on page 9. The authors should consider moving the explanation to this paragraph for easier understanding. 

      We agree with the reviewer that this causes confusion. Therefore, we have revised and changed the order in the Discussion.

      (6) Page 7, paragraph 2 - Following from above, the previous study (21) reported this contradicting finding "a decline in CD8 T cell responses over 2 years" in a CHI (chronic HIV) treated cohort. The current study was in an acute HIV treated cohort. The authors should explain whether this may also have resulted in the different findings, in addition to the use of different readouts in each study.

      We thank the reviewer for this attentiveness. Indeed, the study by Takata et al. investigates the reservoir and HIV-specific CD8+ T-cell responses in both the RV254/SEARCH010 participants, who initiated ART during AHI, and the RV304/SEARCH013 participants, who initiated ART during CHI. We had not realized that the decline in CD8 T-cell responses was found solely in RV304/SEARCH013 (the CHI cohort). It appears that functional HIV-specific immune responses were measured in the AHI cohort only at 96 weeks, so we have clarified this in the Discussion.

      Besson, G. J., C. M. Lalama, R. J. Bosch, R. T. Gandhi, M. A. Bedison, E. Aga, S. A. Riddler, D. K. McMahon, F. Hong and J. W. Mellors (2014). "HIV-1 DNA decay dynamics in blood during more than a decade of suppressive antiretroviral therapy." Clin Infect Dis 59(9): 1312-1321.

      Gandhi, R. T., J. C. Cyktor, R. J. Bosch, H. Mar, G. M. Laird, A. Martin, A. C. Collier, S. A. Riddler, B. J. Macatangay, C. R. Rinaldo, J. J. Eron, J. D. Siliciano, D. K. McMahon and J. W. Mellors (2021). "Selective Decay of Intact HIV-1 Proviral DNA on Antiretroviral Therapy." J Infect Dis 223(2): 225-233.

      Gandhi, R. T., D. K. McMahon, R. J. Bosch, C. M. Lalama, J. C. Cyktor, B. J. Macatangay, C. R. Rinaldo, S. A. Riddler, E. Hogg, C. Godfrey, A. C. Collier, J. J. Eron and J. W. Mellors (2017). "Levels of HIV-1 persistence on antiretroviral therapy are not associated with markers of inflammation or activation." PLoS Pathog 13(4): e1006285.

      Nühn, M. M., K. Bosman, T. Huisman, W. H. A. Staring, L. Gharu, D. De Jong, T. M. De Kort, N. Buchholtz, K. Tesselaar, A. Pandit, J. Arends, S. A. Otto, E. Lucio De Esesarte, A. I. M. Hoepelman, R. J. De Boer, J. Symons, J. A. M. Borghans, A. M. J. Wensing and M. Nijhuis (2025). "Selective decline of intact HIV reservoirs during the first decade of ART followed by stabilization in memory T cell subsets." AIDS 39(7): 798-811.

      Peluso, M. J., P. Bacchetti, K. D. Ritter, S. Beg, J. Lai, J. N. Martin, P. W. Hunt, T. J. Henrich, J. D. Siliciano, R. F. Siliciano, G. M. Laird and S. G. Deeks (2020). "Differential decay of intact and defective proviral DNA in HIV-1-infected individuals on suppressive antiretroviral therapy." JCI Insight 5(4).

    1. Author response:

      Reviewer #1 (Public Review):

      Summary

      We thank the reviewer for the constructive and thoughtful evaluation of our work. We appreciate the recognition of the novelty and potential implications of our findings regarding UPR activation and proteasome activity in germ cells.

      (1) The microscopy images look saturated, for example, Figure 1a, b, etc. Is this a normal way to present fluorescent microscopy?

      The apparent saturation was not present in the original images, but likely arose from image compression during PDF generation. Although the EMA granule was still apparent, in the revised submission we will provide high-resolution TIFF files to ensure accurate representation of fluorescence intensity and will carefully optimize image display settings to avoid any saturation artifacts.

      (2) The authors should ensure that all claims regarding enrichment/lower vs. lower values have indicated statistical tests.

      We fully agree. In the revised version, we will correct any quantitative comparisons where statistical tests were not already indicated, with a clear statement of the statistical tests used, including p-values in figure legends and text.

      (a) In Figure 2f, the authors should indicate which comparison is made for this test. Is it comparing 2 vs. 6 cyst numbers?

      We acknowledge that the description was not sufficiently detailed. The test did not compare 2- vs. 6-cell cysts. Instead, it compared the observed outcome (6-cell cysts in 13 of 15 examples) with the outcome expected by chance if 8-cell cysts, or the larger cysts studied, fragmented randomly into two pieces. We will expand the legend and main text to clarify that a binomial test was used to determine that the proportion of cysts producing 6-cell fragments differed very significantly from chance.

      Revised text:

      “A binomial test was used to assess whether the observed frequency of 6-cell cyst products differed from random cyst breakage. Production of 6-cell cysts was strongly preferred (13/15 cysts; ****p < 0.0001).”
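
      The structure of such a test can be sketched as follows. This is an illustration only: the null probability of obtaining a 6-cell product under random breakage is not given in this response, so `p_null` is a hypothetical placeholder and the function name is likewise an assumption. Only the counts (13 of 15 cysts) come from the text.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p_null: float) -> float:
    """One-sided binomial test: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 13 of 15 observed cysts produced a 6-cell fragment (from the text).
# p_null is a placeholder; the value implied by random breakage must be
# derived from the possible fragmentation outcomes, which are not listed here.
p_value = binomial_upper_tail(13, 15, p_null=0.5)
```

      The smaller the chance probability of a 6-cell product, the smaller the resulting p-value for the same 13 of 15 observations.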

      (b) Figures 4d and 4e do not have a statistical test indicated.

      We will include the specific statistical test used and report the corresponding p-values directly in the figure legends.

      (3) Because the system is developmentally dynamic, the major conclusions of the work are somewhat unclear. Could the authors be more explicit about these and enumerate them more clearly in the abstract?

      We will revise the abstract to better clarify the findings of this study. We will also replace the term Visham with mouse fusome to reflect its functional and structural analogy to the Drosophila and Xenopus fusomes, making the narrative more coherent and conclusive.

      (4) The references for specific prior literature are mostly missing (lines 184-195, for example).

      We appreciate this observation of a problem that occurred inadvertently when shortening an earlier version.  We will add 3–4 relevant references to appropriately support this section.

      (5) The authors should define all acronyms when they are first used in the text (UPR, EGAD, etc).

      We will ensure that all acronyms are spelled out at first mention (e.g., Unfolded Protein Response (UPR), Endosome and Golgi-Associated Degradation (EGAD)).

      (6)  The jumping between topics (EMA, into microtubule fragmentation, polarization proteins, UPR/ERAD/EGAD, GCNA, ER, balbiani body, etc) makes the narrative of the paper very difficult to follow.

      We are not jumping between topics, but following a narrative relevant to the central question of whether female mouse germ cells develop using a fusome. EMA, microtubule fragmentation, polarization proteins, ER, and the Balbiani body are all topics with a known connection to fusomes. This is explained in the general introduction and in relevant subsections. We appreciate the feedback that further explanation of these connections would be helpful. In the revised manuscript, use of the unified term mouse fusome will also help connect the narrative across sections. UPR/ERAD/EGAD are processes that have been studied in repair and maintenance of somatic cells and in yeast meiosis. We show that the major regulator Xbp1 is found in the fusome, and that the fusome and these rejuvenation pathway genes are expressed and maintained throughout oogenesis, rather than only during limited late stages as suggested in previous literature.

      (7) The heading title "Visham participates in organelle rejuvenation during meiosis" in line 241 is speculative and/or not supported. Drawing upon the extensive, highly rigorous Drosophila literature, it is safe to extrapolate, but the claim about regeneration is not adequately supported.

      We believe this statement is accurate given the broad scope of the term "participates." It is supported by localization of the UPR regulator Xbp1 to the fusome. Xbp1 is the ortholog of Hac1, a key gene mediating UPR-mediated rejuvenation during yeast meiosis. We also showed that rejuvenation pathway genes are expressed throughout most of meiosis (not previously known) and expanded cytological evidence of stage-specific organelle rejuvenation later in meiosis, such as mitochondrial-ER docking, in regions enriched in fusome antigens. However, we recognize the current limitations of this evidence in the mouse, and want to convey this appropriately, without going to what we believe would be the unjustified extreme of saying there is no evidence.

      Reviewer #2 (Public Review):

      We thank the reviewer for the comprehensive summary and for highlighting both the technical achievement and biological relevance of our study. We greatly appreciate the thoughtful suggestions that have helped us refine our presentation and terminology.

      (1) Some titles contain strong terms that do not fully match the conclusions of the corresponding sections.

      (1a) Article title “Mouse germline cysts contain a fusome-like structure that mediates oocyte development”

      We will change the statement to: “Mouse germline cysts contain a fusome that supports germline cyst polarity and rejuvenation.”

      (1b) Result title “Visham overlaps centrosomes and moves on microtubules”

      We acknowledge that “moves” implies dynamics. We will include additional supplementary images showing small vesicular components of the mouse fusome on spindle-derived microtubule tracks.

      (1c) Result title “Visham associates with Golgi genes involved in UPR beginning at the onset of cyst formation”

      We will revise this title to: “The mouse fusome associates with the UPR regulatory protein Xbp1 beginning at the onset of cyst formation” to reflect the specific UPR protein that was immunolocalized. 

      (1d) Result title “Visham participates in organelle rejuvenation during meiosis”

      We will revise this to: “The mouse fusome persists during organelle rejuvenation in meiosis.”

      (2) The authors aim to demonstrate that Visham is a fusome-like structure. I would suggest simply referring to it as a "fusome-like structure" rather than introducing a new term, which may confuse readers and does not necessarily help the authors' goal of showing the conservation of this structure in Drosophila and Xenopus germ cells. Interestingly, in a preprint from the same laboratory describing a similar structure in Xenopus germ cells, the authors refer to it as a "fusome-like structure (FLS)" (Davidian and Spradling, BioRxiv, 2025).

      We appreciate the reviewer’s insightful comment. To maintain conceptual clarity and align with existing literature, we will refer to the structure as the mouse fusome throughout the manuscript, avoiding introduction of a new term.

      Reviewer #3 (Public Review):

      We thank the reviewer for emphasizing the importance of our study and for providing constructive feedback that will help us clarify and strengthen our conclusions.

      (1) Line 86 - the heading for this section is "PGCs contain a Golgi-rich structure known as the EMA granule" 

      We agree that the enrichment of Golgi within the EMA PGCs was not shown until the next section. We will revise this heading to:

      “PGCs contain an asymmetric EMA granule.”

      (2)  Line 105-106, how do we know if what's seen by EM corresponds to the EMA1 granule?

      We will clarify that this identification is based on co-localization with Golgi markers (GM130 and GS28) and response to Brefeldin A treatment, which will be included as supplementary data. These findings support that the mouse fusome is Golgi-derived and can therefore be visualized by EM. The Golgi regions in E13.5 cyst cells move close together and associate with ring canals as visualized by EM (Figure 1E), the same as the mouse fusomes identified by EMA.

      (3) Line 106-107-states "Visham co-stained with the Golgi protein Gm130 and the recycling endosomal protein Rab11a1". This is not convincing as there is only one example of each image, and both appear to be distorted.

      Space is at a premium in these figures, but we have ample data documenting this clear co-localization. We will replace the existing images with high-resolution, non-compressed versions for the final figures to clearly illustrate the co-staining patterns for GM130 and Rab11a1.

      (4) Line 132-133---while visham formation is disrupted when microtubules are disrupted, I am not convinced that visham moves on microtubules as stated in the heading of this section.

      We will include additional supplementary data showing small mouse fusome vesicles aligned along microtubules.

      (5) Line 156 - the heading for this section states that Visham associates with polarity and microtubule genes, including pard3, but only evidence for pard3 is presented.

      We agree and will revise the heading to: “Mouse fusome associates with the polarity protein Pard3.” We are also adding data showing the association of small fusome vesicles with microtubules.

      (6)  Lines 196-210 - it's strange to say that UPR genes depend on DAZ, as they are upregulated in the mutants. I think there are important observations here, but it's unclear what is being concluded.

      UPR genes are not upregulated in DAZ mutants, in the sense that we have never documented them increasing. We show that during this time UPR genes behave like pluripotency genes and normally decline, but in DAZ mutants their decline is slowed. We will rephrase the paragraph to clarify that Dazl mutation partially decouples developmental processes that are normally linked, which alters UPR gene expression relative to cyst development.

      (7) Line 257-259-wave 1 and 2 follicles need to be explained in the introduction, and how these fit with the observations here clarified.

      Follicle waves are too small a focus of the current study to explain in the introduction, but we will refer readers to the relevant cited literature (Yin and Spradling, 2025) for further details.

      We sincerely thank all reviewers for their insightful and constructive feedback. We believe that the planned revisions—particularly the refined terminology, improved image quality, clarified statistics, and restructured abstract—will substantially strengthen the manuscript and enhance clarity for readers.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors conduct both experiments and modeling of human cytomegalovirus (HCMV) infection in vitro to study how the infectivity of the virus (measured by cell infection) scales with the viral concentration in the inoculum. A naïve thought would be that this is linear in the sense that doubling the virus concentration (and thus the total virus) in the inoculum would lead to doubling the fraction of infected cells. However, the authors show convincingly that this is not the case for HCMV, using multiple strains, two different target cells, and repeated experiments. In fact, they find that for some regimens (inoculum concentration), infected cells increase faster than the concentration of the inoculum, which they term "apparent cooperativity". The authors then provided possible explanations for this phenomenon and constructed mathematical models and simulations to implement these explanations. They show that these ideas do help explain the cooperativity, but they can't be conclusive as to what the correct explanation is. In any case, this advances our knowledge of the system, and it is very important when quantitative experiments involving MOI are performed.

      Strengths:

      Careful experiments using state-of-the-art methodologies and advancing multiple competing models to explain the data.

      Weaknesses:

      There are minor weaknesses in explaining the implementation of the model. However, some specific assumptions, which to this reviewer were unclear, could have a substantial impact on the results. For example, whether cell infection is independent or not. This is expanded below.

      Suggestions to clarify the study:

      (1) Mathematically, it is clear what "increase linearly" or "increase faster than linearly" (e.g., line 94) means. However, it may be confusing for some readers to then look at plots such as in Figure 2, which appear linear (but on the log-log scale) and about which the authors also say (line 326) "data best matching the linear relationship on a log-log scale". 

      This is a good point. In our revision, we will include a clarification that a linear relationship on the log-log scale does not imply a linear relationship on the linear-linear scale.
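
      To make the distinction concrete, the short Python sketch below uses a hypothetical power-law relationship y = a·x^n. The names `a` and `n` and their values are illustrative assumptions, not fitted values from the paper. On log-log axes this relationship is a straight line with slope n for any n, but y is proportional to x only when n = 1: doubling x multiplies y by 2^n.

```python
import math

def power_law(x, a=0.01, n=1.5):
    """Hypothetical dose-response y = a * x**n (a and n are illustrative)."""
    return a * x ** n

def loglog_slope(x1, x2, **kw):
    """Slope of log(y) versus log(x) between two doses."""
    dy = math.log(power_law(x2, **kw)) - math.log(power_law(x1, **kw))
    return dy / (math.log(x2) - math.log(x1))

def fold_change_on_doubling(x, **kw):
    """Factor by which y grows when x is doubled (2 means proportionality)."""
    return power_law(2 * x, **kw) / power_law(x, **kw)

if __name__ == "__main__":
    for n in (1.0, 1.5):
        # The log-log slope equals n; the fold change equals 2**n.
        print(n, loglog_slope(0.1, 10.0, n=n), fold_change_on_doubling(1.0, n=n))
```

      A straight line on the log-log plot therefore says only that the relationship is a power law; the slope is what distinguishes proportional (n = 1) from faster-than-proportional (n > 1) growth.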

      (2) One of the main issues that is unclear to me is whether the authors assume that cell infection is independent of other cells. This could be a very important issue affecting their results, both when analyzing the experimental data and running the simulations. One possible outcome of infection could be the generation of innate mediators that could protect (alter the resistance) of nearby cells. I can imagine two opposite results of this: i) one possibility is that resistance would lead to lower infection frequencies and this would result in apparent sub-linear infection (contrary to the observations); or ii) inoculums with more virus lead to faster infection, which doesn't allow enough time for the "resistance" (innate effect) to spread (potentially leading to results similar to the observations, supra-linear infection). 

      In our models we assumed cells to be independent of each other (see also responses to other similar points). Because we measure infection in individual cells, assuming cells are independent is a reasonable first approximation. However, the reviewer makes an excellent point that there may be some between-cell signaling happening in the culture that “alerts” or “conditions” cells to change their “resistance”. It is also possible that at higher genome/cell numbers, exposure of cells to virions or virion debris may change the state of cells in the culture, so that more cells become “susceptible” to infection. This is a good point that we will list in the Limitations subsection of the Discussion; it is a good hypothesis to test in our future experiments.

      (3) Another unclear aspect of cell infection is whether each cell only has one chance to be infected or multiple chances, i.e., do the authors run the simulation once over all the cells or more times? 

      Each cell has only one chance to be infected. Algorithm 1 clearly states that; we will add an extra sentence in “Agent-based simulations” to indicate this point.

      (4) On the other hand, the authors address the complementary issue of the virus acting independently or not, with their clumping model (which includes nice experimental measurements). However, it was unclear to me what the assumption of the simulation is in this case. In the case of infection by a clump of virus or "viral compensation", when infection is successful (the cell becomes infected), how many viruses "disappear" and what happens to the rest? For example, one of the viruses of the clump is removed by infection, but the others are free to participate in another clump, or they also disappear. The only thing I found about this is the caption of Figure S10, and it seems to indicate that only the infected virus is removed. However, a typical assumption, I think, is that viruses aggregate to improve infection, but then the whole aggregate participates in infection of a single cell, and those viruses in the clump can't participate in other infections. Viral cooperativity with higher inocula in this case would be, perhaps, the result of larger numbers of clumps for higher inocula. This seems in agreement with Figure S8, but was a little unclear in the interpretation provided. 

      This is a good point. We did not remove the clump if one of the virions in the clump manages to infect a cell, and indeed, this could be the reason why in some simulations we observe apparent cooperativity when modeling viral clumping. This is something we will explore in our revision.

      (5) In algorithm 1, how does P_i, as defined, relate to equation 1? 

      These are unrelated because eqn. (1) is a phenomenological model that links infection per cell to genomes per cell, whereas P_i in Algorithm 1 is a “physics-inspired” potential barrier.

      (6) In line 228, and several other places (e.g., caption of Table S2), the authors refer to the probability of a single genome infecting a cell p(1)=exp(-lambda), but shouldn't it be p(1)=1-exp(-lambda) according to equation 1?

      Indeed, it was a typo; it should read p(1)=1-exp(-lambda), per eqn. 1. Thank you, it will be corrected in the revised paper.
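
      The correction can be checked numerically. The sketch below assumes, purely for illustration, that eqn. 1 has the form p(g) = 1 - exp(-lambda·g^n), where g is genomes per cell; this functional form, the function name, and the parameter values are assumptions and are not confirmed by the text above. Under that assumption, exp(-lambda) is the probability that a cell exposed to one genome is not infected, so the infection probability is its complement.

```python
import math

def infection_probability(g, lam, n=1.0):
    """Assumed model form: p(g) = 1 - exp(-lam * g**n)."""
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-lam * g ** n)

if __name__ == "__main__":
    lam = 0.05  # hypothetical value
    # At g = 1 the exponent n drops out, so p(1) = 1 - exp(-lam) for any n.
    print(infection_probability(1.0, lam, n=1.0))
    print(infection_probability(1.0, lam, n=1.7))
    # exp(-lam) is close to 1 for small lam, which cannot be an infection
    # probability for a single genome; 1 - exp(-lam) is close to lam.
    print(math.exp(-lam))
```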

      (7) In line 304, the accrued damage hypothesis is defined, but it is stated as a triggering of an antiviral response; one would assume that exposure to a virion should increase the resistance to infection. Otherwise, the authors are saying that evolution has come up with intracellular viral resistance mechanisms that are detrimental to the cell. As I mentioned above, this could also be a mechanism for non-independent cell infection. For example, infected cells signal to neighboring cells to "become resistance" to infection. This would also provide a mechanism for saturation at high levels. 

      We do not know how exposure of a cell to one virion would change its “antiviral state”, i.e., whether it becomes more or less resistant to the next infection. If a cell becomes more resistant, there is no possibility of observing apparent cooperativity in infection of cells, so this hypothesis cannot explain our observations with n>1. Whether this mechanism plays a role in the saturation of the cell infection rate at a value below 1 when genome/cell is large is unclear, but it is a possibility. We will add this point to the Discussion in the revision.

      (8) In Figure 3, and likely other places, t-tests are used for comparisons, but with only an n=5 (experiments). Many would prefer a non-parametric test. 

      We repeated the analyses in Fig 3 with the Mann-Whitney test and the results were the same, so we would like to keep the results from the t-test in the paper.
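
      For readers unfamiliar with the non-parametric alternative, the following is a minimal, dependency-free Python sketch of an exact two-sided Mann-Whitney test for small samples. It is not the analysis code used for Fig 3, and the example values are hypothetical. With n = 5 per group there are only 252 possible group relabelings, so the smallest achievable two-sided exact p-value is 2/252 ≈ 0.0079.

```python
from itertools import combinations

def u_statistic(x, y):
    """Mann-Whitney U for x: number of pairs with x_i > y_j (ties count 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def exact_two_sided_p(x, y):
    """Exact permutation p-value for U, enumerating every group relabeling."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    center = n * m / 2.0                      # mean of U under the null
    observed = abs(u_statistic(x, y) - center)
    hits = total = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        # Count relabelings at least as extreme as the observed one.
        if abs(u_statistic(gx, gy) - center) >= observed - 1e-12:
            hits += 1
    return hits / total

if __name__ == "__main__":
    # Hypothetical measurements for two conditions, n = 5 each.
    group_a = [1.10, 1.30, 1.20, 1.40, 1.25]
    group_b = [1.60, 1.80, 1.70, 1.90, 1.65]
    print(exact_two_sided_p(group_a, group_b))
```

      Full enumeration is practical only for small samples such as these; for larger samples a library routine with a normal approximation would be the usual choice.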

      Reviewer #2 (Public review):

      In their article, Peterson et al. wanted to show to what extent the classical "single hit" model of virion infection, where one virion is required to infect a cell, does not match empirical observations based on human cytomegalovirus in vitro infection model, and how this would have practical impacts in experimental protocols.

      They first used a very simple experimental assay, where they infected cells with serially diluted virions and measured the proportion of infected cells with flow cytometry. From this, they could elegantly show how the proportion of infected cells differed from a "single hit" model, which they simulated using a simple mathematical model ("powerlaw model"), and better fit a model where virions need to cooperate to infect cells. They then explore which mechanism could explain this apparent cooperation:

      (1) Stochasticity alone cannot explain the results, although I am unsure how generalizable the results are, because the mathematical model chosen cannot, by design, explain such observations only by stochasticity. 

      Our null model simulations are not just about stochasticity; they also include variability in virion infectivity and cell resistance to infection. We agree that simulations cannot truly prove that such variability cannot result in apparent cooperativity; however, we also provide a mathematical proof that the increase in the frequency of infected cells should be linear in virion concentration at small genome/cell numbers.
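
      One way to see why independent action gives linearity at low dose is sketched below. This is an illustration under our own simplifying assumptions, not a reproduction of the proof in the paper: the number of virions reaching a cell is taken to be Poisson with mean g, and each encounter succeeds independently with a probability q that may differ between cells. A cell with success probability q is then infected with probability 1 - exp(-g·q), and averaging over a heterogeneous population gives a curve whose log-log slope approaches 1 as g becomes small and never exceeds 1. The `q_values` are hypothetical.

```python
import math

def infected_fraction(g, q_values):
    """Average of 1 - exp(-g*q) over a heterogeneous set of q values."""
    return sum(-math.expm1(-g * q) for q in q_values) / len(q_values)

def local_loglog_slope(g, q_values, factor=2.0):
    """Log-log slope of the infected fraction between doses g and factor*g."""
    p_lo = infected_fraction(g, q_values)
    p_hi = infected_fraction(g * factor, q_values)
    return (math.log(p_hi) - math.log(p_lo)) / math.log(factor)

if __name__ == "__main__":
    # Hypothetical, deliberately broad spread of per-encounter probabilities.
    q_values = [0.001, 0.01, 0.1, 0.5, 1.0]
    for g in (1e-4, 1e-2, 1.0, 100.0):
        print(g, local_loglog_slope(g, q_values))
```

      Because each term 1 - exp(-g·q) is concave in g and zero at g = 0, doubling the dose can at most double the infected fraction in this sketch, so heterogeneity alone gives a slope of at most 1.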

      (2) Virion clumping seemed not to be enough either to generally explain such a pattern. For that, they first use a mathematical model showing that the apparent cooperation would be small. However, I am unsure how extreme the scenario of simulated virion clumping is. They then used dynamic light scattering to measure the distribution of the sizes of clumps. From these estimates, they show that virion clumps cannot reproduce the observed virion cooperation in serial dilution assays. However, the authors remain imprecise on how the uncertainty of these clumps' size distribution would impact the results, as most clumps have a size smaller than a single virion, leaving therefore a limited number of clumps truly containing virions. 

      As we stated in the paper, clumping may explain apparent cooperativity in simulations depending on how stock dilution impacts the distribution of virions/clump. This could be explored further; however, better experimental measurements of virions/clump would be highly informative (we do not have the resources to do these experiments at present). Our point is that the degree of apparent cooperativity depends on the target cell used (n is smaller on epithelial cells than on fibroblasts), which is difficult to explain by clumping, a virion property. Per the comment by reviewer 1, we will do further analyses of the clumping model to investigate the importance of clump removal upon successful infection for the detected degree of apparent cooperativity.

      The two models remain unidentifiable from each other but could explain the apparent virion cooperativity: either due to an increase in susceptibility of the cell each time a virion tries to infect it, or due to viral compensation, where lesser fit viruses are able to infect cells in co-infection with a better fit virion. Unfortunately, the authors here do not attempt to fit their mathematical model to the experimental data but only show that theoretical models and experimental data generate similar patterns regarding virion apparent cooperation. 

      In the revision we will provide examples of simulations that “match” experimental data with a relatively high degree of apparent cooperativity; we have done those before but excluded them from the current version since they are a bit messy. Fitting simulations to data may be overkill.

      Finally, the authors show that this virion cooperation could make the relationship between the estimated multiplicity of infection and viruses/cell deviate from the 1:1 relationship. Consequently, the dilution of a virion stock would lead to an even stronger decrease in infectivity, as more diluted virions can cooperate less for infection.

      Overall, this work is very valuable as it raises the general question of how the estimate of infectivity can be biased if extrapolated from a single virus titer assay. The observation that HCMV virions often cooperate and that this cooperation varies between contexts seems robust. The putative biological explanations would require further exploration.

      This topic is very well known in the case of segmented viruses and the semi-infectious particles, leading to the idea of studying "sociovirology", but to my knowledge, this is the first time that it was explored for a nonsegmented virus, and in the context of MOI estimation. 

      Thank you.

      Reviewer #3 (Public review): 

      Summary:

      The authors dilute fluorescent HCMV stocks in small steps (df ≈ 1.3-1.5) across 23 points, quantify infections by flow cytometry at 3 dpi, and fit a power-law model to estimate a cooperativity parameter n (n > 1 indicates apparent cooperativity). They compare fibroblasts vs epithelial cells and multiple strains/reporters, and explore alternative mechanisms (clumping, accrued damage, viral compensation) via analytical modeling and stochastic simulations. They discuss implications for titer/MOI estimation and suggest a method for detecting "apparent cooperativity," noting that for viruses showing this behavior, MOI estimation may be biased.

      Strengths:

      (1) High-resolution titration & rigor: The small-step dilution design (23 serial dilutions; tailored df) improves dose-response resolution beyond conventional 10× series.

      (2) Clear quantitative signal: Multiple strain-cell pairs show n > 1, with appropriate model fitting and visualization of the linear regime on log-log axes.

      (3) Mechanistic exploration: Side-by-side modeling of clumping vs accrued damage vs compensation frames testable hypotheses for cooperativity. 

      Thank you.

      Weaknesses:

      (1) Secondary infection control: The authors argue that 3 dpi largely avoids progeny-mediated secondary infection; this claim should be strengthened (e.g., entry inhibitors/control infections) or add sensitivity checks showing results are robust to a small secondary-infection contribution. 

      This is an important point. We believe that current knowledge about HCMV virion production time (it takes 3-4 days to make virions according to multiple papers; see Fig 7 in Vonka and Benyesh-Melnick JB 1966; Fig 3B in Stanton et al JCI 2010; and Fig 1A in Li et al. PNAS 2015) is sufficient to justify our experimental design, but we agree that an additional control to block new infections would be useful. We had previously performed experiments with an HCMV TB-gL-KO that cannot make infectious virions (but the stock virions can be made from complemented target cells). We will investigate whether our titration experiments with this virus strain have sufficient resolution to detect apparent cooperativity. However, at present we do not have the resources to perform new experiments.

      (2) Discriminating mechanisms: At present, simulations cannot distinguish between accrued damage and viral compensation. The authors should propose or add a decisive experiment (e.g., dual-color coinfection to quantify true coinfection rates versus "priming" without coinfection; timed sequential inocula) and outline expected signatures for each mechanism. 

      Excellent suggestion. Because infection of a cell is a result of the joint viral infectivity and cell resistance, it may be hard to discriminate between these alternatives unless we specify them as particular molecular mechanisms. But we will try our best and list potential future experiments in the revised version of the paper.

      (3) Decline at high genomes/cell: Several datasets show a downturn at high input. Hypotheses should be provided (cytotoxicity, receptor depletion, and measurement ceiling) and any supportive controls. 

      Another good point. We do not have a good explanation, but we do not believe this is because of saturation of available target cells.  It seemed to only happen (or was most pronounced) with the ME stocks, which are typically lower in titer and so the higher MOI were nearly undiluted stock. It may be the effect of the conditioned medium.  Or perhaps there are non-infectious particles like dense bodies (enveloped particles that lack a capsid and genome) and non-infectious, enveloped particles (NIEPs) that compete for receptors or otherwise damage cells and these don’t get diluted out at the higher doses.  We plan to include these points in Discussion of the revised version of the paper.

      (4) Include experimental data: In Figure 6, please include the experimentally measured titers (IU/mL), if available. 

      This is a model-simulated scenario, and as such, there are no measured titers.

      (5) MOI guidance: The practical guidance is important; please add a short "best-practice box" (how to determine titer at multiple genomes/cell and cell densities; when single-hit assumptions fail) for end-users. 

      Good suggestion. In the revised version of the paper, we will include a best-practice box using guidelines developed in the Ryckman lab over the years.

      Overall note to all reviews: We have deposited our code and data on GitHub; yet, none of the reviewers commented on it.

    1. Two Formulas for Paragraph Structure

      We have looked at the basic parts of your essay, and now we have a sample formula to help you expand your ideas about your evidence. Between the Introduction (and thesis) and the Conclusion (and reflection on the thesis) comes the body of the essay. For your essay’s body to be solid and focused, it needs to have clear, well-developed paragraphs. Even paragraphs need to have a beginning, middle, and end. To help you think about paragraph organization, think about TEAR:

      T = Topic Sentence. This is like a little thesis for your paragraph. It tells the reader what that paragraph is all about. If your reader were only to read the topic sentences in your essay, he or she should have a general idea of what you’re talking about. Of course, he/she can’t get a complete picture unless you provide…

      E = Evidence. This is the “how do you know?” part of your paragraph. Evidence comes from the real world. You may present your evidence in the form of statistics, direct quotes, summaries, or paraphrases from a source, or your own observations. Evidence is available to us all. What your reader needs is for you to make sense of that evidence so that s/he understands what all this has to do with your thesis or claim. That is why you provide…

      A = Analysis. This is the ‘so what?’ part of your paragraph. You say what is important and why. This isn’t just personal taste or opinion. You have to provide good reasons to support your conclusions. And just to make sure you’re still on track, you…

      R = Reflection. This sentence concludes the paragraph and relates to the topic sentence and the thesis. Ideally, it should also prepare us for the next paragraph.

      Note: Transitions are like the mortar between the bricks. Transitions hold our ideas together and move us gracefully from point to point. Some common transition words or phrases may include although, therefore, because, in fact, for example, on the other hand, while, in addition, in contrast, then again, furthermore, but back to our main point…

      To help you think about TEAR, imagine your snarky little brother looking over your shoulder as you compose, asking you:

      T = “What’s all this about?”

      E = “How do you know?”

      A = “Why should I care?”

      R = “What does this have to do with anything?”

      You may be thinking, I’ve heard this before, but it wasn’t called TEAR. It was called…. PIE

      What does PIE stand for?

      P = Point. This is the point of the paragraph, or the topic sentence.

      I = Illustration. This is where you illustrate your point with evidence.

      E = Explanation. This is where you explain how that evidence supports your point. This is your analysis.

      Why give you two ways to think of this? Because you may find that to fully develop your paragraph, you’ll need to add a little more evidence and analysis. And it looks a little funny to write TEAEAR. So, you can think of PIE-IE-IE will always love you.

      TEAR PIE

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This study builds on previous work demonstrating that several beta connexins (Cx26, Cx30, and Cx32) have a carbamylation motif which renders them sensitive to CO<sub>2</sub>. In response to CO<sub>2</sub>, hemichannels composed of these connexins open, enabling diffusion of small molecules (such as ATP) between the cytosol and extracellular environment. Here, the authors have identified that an alpha connexin, Cx43, also contains a carbamylation motif, and they demonstrate that CO<sub>2</sub> opens Cx43 hemichannels. Most of the study involves using transfected cells expressing wildtype and mutant Cx43 to define amino acids required for CO<sub>2</sub> sensitivity. Hippocampal tissue slices in culture were used to show that CO<sub>2</sub>-induced synaptic transmission was affected by Cx43 hemichannels, providing a physiological context. The authors point out that the Cx43 gene significantly diverges from the beta connexins that are CO<sub>2</sub> sensitive, suggesting that the conserved carbamylation motif was present before the alpha and beta connexin genes diverged. 

      Strengths: 

      (1) The molecular analysis defining the amino acids that contribute to the CO<sub>2</sub> sensitivity of Cx43 is a major strength of the study. The rigor of analysis was strengthened by using three independent assays for hemichannel opening: dye uptake, patch clamp channel measurements, and ATP secretion. The resulting analysis identified key lysines in Cx43 that were required for CO<sub>2</sub>-mediated hemichannel opening. A double K to E Cx43 mutant produced a construct that produced hemichannels that were constitutively open, which further strengthened the analysis. 

      (2) Using hippocampal tissue sections to demonstrate that CO<sub>2</sub> can influence field excitatory postsynaptic potentials (fEPSPs) provides a native context for CO<sub>2</sub> regulation of Cx43 hemichannels. Cx43 mutations associated with Oculodentodigital Dysplasia (ODDD) inhibited CO<sub>2</sub>-induced hemichannel opening, although the mechanism by which this occurs was not elucidated. 

      Weaknesses: 

      (1) Cx43 channels are sensitive to cytosolic pH, which will be affected by CO<sub>2</sub>. Cytosolic pH was not measured, and how this affects CO<sub>2</sub>-induced Cx43 hemichannel activity was not addressed. 

      We have now addressed this with intracellular pH measurements and removal of the C-terminal pH sensor from Cx43; the hemichannel remains CO<sub>2</sub> sensitive.

      (2) Cultured cells are typically grown in incubators containing 5% CO<sub>2</sub>, which is ~40 mmHg. It is unclear how cells would be viable if Cx43 hemichannels are open at this PCO<sub>2</sub>. 

      The cells look completely healthy with normal morphology and no sign of excessive cell death in the cultures. Presumably they have ways of compensating for the effects of partially open Cx43 hemichannels.
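
      The “~40 mmHg” figure quoted by the reviewer follows from simple arithmetic on partial pressures, sketched here in Python. The sketch assumes sea-level barometric pressure (760 mmHg) and, for a humidified incubator at 37 °C, a water vapor pressure of about 47 mmHg; both values and the function name are our assumptions for illustration, not measurements from the study.

```python
def pco2_mmhg(co2_fraction, total_pressure=760.0, water_vapor=0.0):
    """Partial pressure of CO2 (mmHg) for a given gas fraction.

    water_vapor is subtracted when the fraction refers to the dry gas
    in a humidified atmosphere.
    """
    return co2_fraction * (total_pressure - water_vapor)

if __name__ == "__main__":
    # 5% of total pressure, ignoring humidity.
    print(pco2_mmhg(0.05))
    # 5% of the dry gas in a humidified incubator at 37 degrees C.
    print(pco2_mmhg(0.05, water_vapor=47.0))
```

      Either convention gives a value in the mid-to-upper 30s of mmHg, consistent with the reviewer's approximation of ~40 mmHg.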

      (3) Experiments using Gap26 to inhibit Cx43 hemichannels in fEPSP measurements used a scrambled peptide as a control. Analysis should also include Gap peptides specifically targeting Cx26, Cx30, and Cx32 as additional controls. 

      We don’t feel this is necessary given the extensive prior literature in hippocampus showing the effect of ATP release via open Cx43 hemichannels on fEPSP amplitude, which used astrocyte-specific knockout of Cx43 and Gap26 (doi: 10.1523/jneurosci.0015-14.2014).

      (4) The mechanism by which ODDD mutations impair CO<sub>2</sub>-mediated hemichannel opening was not addressed. Also, the potential roles for inhibiting Cx43 hemichannels in the pathology of ODDD are unclear. 

      These pathological mutations that alter CO<sub>2</sub> sensitivity are similar to pathological mutations in Cx26 and Cx32, which also remove CO<sub>2</sub> sensitivity. Our cryo-EM studies on Cx26 give clues as to why these mutations have this effect: they alter conformational mobility of the channel (Brotherton et al 2022 doi: 10.1016/j.str.2022.02.010 and Brotherton et al 2024 doi: 10.7554/eLife.93686). We assume that similar considerations apply to Cx43, but this requires improved cryo-EM structures of Cx43 hemichannels at differing levels of PCO<sub>2</sub>.

      We agree that the link between loss of CO<sub>2</sub> sensitivity of Cx43 and ODDD is not established and have revised the text to make this clear.

      (5) CO<sub>2</sub> has no effect on Cx43-mediated gap junctional communication as opposed to Cx26 gap junctions, which are inhibited by CO<sub>2</sub>. The molecular basis for this difference was not determined. 

      Cx26 gap junction channels are so far unique amongst CO<sub>2</sub> sensitive connexins in being closed by CO<sub>2</sub>. We have addressed the mechanism by which this occurs in Nijjar et al 2025 (DOI: 10.1113/JP285885): carbamylation of K108 in Cx26 (in addition to K125) is required for GJC closure.

      (6) Whether there are other non-beta connexins that have a putative carbamylation motif was not addressed. Additional discussion/analysis of how the evolutionary trajectory for Cx43 maintaining a carbamylation motif is unique for non-beta connexins would strengthen the study. 

      We have performed a molecular phylogenetic survey to show that the carbamylation motif occurs across the alpha connexin clade and have shown that Cx50 is indeed CO<sub>2</sub> sensitive (doi: 10.1101/2025.01.23.634273). This is now in Fig 12.

      Reviewer #2 (Public review): 

      Summary: 

      This paper examines the CO<sub>2</sub> sensitivity of Cx43 hemichannels and gap junctional channels in transiently transfected HeLa cells using several different assays, including ethidium dye uptake, ATP release, whole cell patch clamp recordings, and an imaging assay of gap junctional dye transfer. The results show that raising pCO<sub>2</sub> from 20 to 70 mmHg (at a constant pH of 7.3) causes an increase in opening of Cx43 hemichannels but does not block Cx43 gap junctions. This study also showed that raising pCO<sub>2</sub> from 20 to 35 mmHg resulted in an increase in synaptic strength in hippocampal rat brain slices, presumably due to downstream ATP release, suggesting that the CO<sub>2</sub> sensitivity of Cx43 may be physiologically relevant. As a further test of the physiological relevance of the CO<sub>2</sub> sensitivity of Cx43, it was shown that two pathological mutations of Cx43 that are associated with ODDD caused loss of Cx43 CO<sub>2</sub>-sensitivity. Cx43 has a potential carbamylation motif that is homologous to the motif in Cx26. To understand the structural changes involved in CO<sub>2</sub> sensitivity, a number of mutations were made in Cx43 sites thought to be the equivalent of those known to be involved in the CO<sub>2</sub> sensitivity of Cx26, and the CO<sub>2</sub> sensitivity of these mutants was investigated. 

      Strengths: 

      This study shows that the apparent lack of functional Cx43 hemichannels observed in a number of previous in vitro function studies may be due to the use of HEPES to buffer the external pH. When Cx43 hemichannels were studied in external solutions in which CO<sub>2</sub>/bicarbonate was used to buffer pH instead of HEPES, Cx43 hemichannels showed significantly higher levels of dye uptake, ATP release, and ionic conductance. These findings may have major physiological implications since Cx43 hemichannels are found in many organs throughout the body, including the brain, heart, and immune system. 

      Weaknesses: 

      (1) Interpretation of the site-directed mutation studies is complicated. Although Cx43 has a potential carbamylation motif that is homologous to the motif in Cx26, the results of site-directed mutation studies were inconsistent with a simple model in which K144 and K105 interact following carbamylation to cause the opening of Cx43 hemichannels. 

      The mechanism of opening of Cx43 is more complex than that of Cx26, Cx32 and Cx50 and involves more Lys residues. The 4 Lys residues in Cx43 that are involved in opening the hemichannel have their equivalents in Cx26, but in Cx26 these additional residues seem to be involved in the closing of the GJC rather than opening of the hemichannel (see above). Cx50 is simpler and involves only two Lys residues (doi: 10.1101/2025.01.23.634273), which are equivalent to those in Cx26.

      (2) Secondly, although it is shown that two Cx43 ODDD-associated mutations show a loss of CO<sub>2</sub> sensitivity, there is no evidence that the absence of CO<sub>2</sub> sensitivity is involved in the pathology of ODDD.

      We agree, but this is probably because this has not been directly tested by experiment, as the CO<sub>2</sub> sensitivity of Cx43 was not previously known. As mentioned above, we have revised the text to ensure that this is clear.

      Reviewer #3 (Public review): 

      In this paper, the authors aimed to investigate carbamylation effects on the function of Cx43-based hemichannels. Such effects have previously been characterized for other connexins, e.g., for Cx26, which display increased hemichannel (HC) opening and closure of gap junction channels upon exposure to increased CO<sub>2</sub> partial pressure (accompanied by increased bicarbonate to keep pH constant). 

      The authors used HeLa cells transiently transfected with Cx43 to investigate CO<sub>2</sub>-dependent carbamylation effects on Cx43 HC function. In contrast to Cx43-based gap junction channels that are reported here to be insensitive to PCO<sub>2</sub> alterations, they provide evidence that Cx43 HC opening is highly dependent on the PCO<sub>2</sub> in the bath solution, over a range of 20 up to 70 mmHg encompassing the physiologically normal resting level of around 40 mmHg. They furthermore identified several Cx43 residues involved in Cx43 HC sensitivity to PCO<sub>2</sub>: K105, K109, K144 & K234; mutation of 2 or more of these AAs is necessary to abolish CO<sub>2</sub> sensitivity. The subject is interesting and the results indicate that a fraction of HCs is open at a physiological 40 mmHg PCO<sub>2</sub>, which differs from the situation under HEPES buffered solutions where HCs are mostly closed under resting conditions. The mechanism of HC opening with CO<sub>2</sub> gassing is linked to carbamylation, and the authors pinpointed several Lys residues involved in this process. 

      Overall, the work is interesting as it shows that Cx43 HCs have a significant open probability under resting conditions of physiological levels of CO<sub>2</sub> gassing, probably applicable to the brain, heart, and other Cx43 expressing organs. The paper gives a detailed account of various experiments performed (dye uptake, electrophysiology, ATP release to assess HC function) and results concluded from those. They further consider many candidate carbamylation sites by mutating them to negatively charged Glu residues. The paper ends with hippocampal slice work showing evidence for connexin-dependent increases of the EPSP amplitude that could be inhibited by HC inhibition with Gap26 (Figure 10). Another line of evidence comes from the Cx43-linked ODDD genetic disease, whereby L90V as well as the A44V mutations of Cx43 prevented the CO<sub>2</sub>-induced hemichannel opening response (Figure 11). Although the paper is interesting, in its present state, it suffers from (i) a problematic Figure 3, precluding interpretation of the data shown, and (ii) the poor use of hemichannel inhibitors that are necessary to strengthen the evidence in the crucial experiment of Figure 2 and others. 

      The panels in Figure 3 were mislabelled in the accompanying legend possibly leading to some confusion. This has now been corrected.

      We disagree that hemichannel blockers are needed to strengthen the evidence in Figure 2 and other figures. Our controls show that the CO<sub>2</sub>-sensitive responses absolutely require expression of Cx43 and are modified by mutations of Cx43. It is hard to see how this evidence would be strengthened by the use of peptide inhibitors or other blockers of hemichannels that may not be completely selective.

      Reviewing Editor Comments:

      (1) Improve electrophysiological evidence, addressing concerns about the initial experiment and including peptide inhibitor data where applicable. 

      We think the concerns about the electrophysiological evidence arise from a misunderstanding because we gave insufficient information about how we conducted the experiments. We have now provided a much more complete legend, added explanations in the text and given more detail in the Methods. We further respond to the reviewer below.

      We do not agree on the necessity of the peptide inhibitor to demonstrate dependence on Cx43. We have shown that parental HeLa cells do not release ATP in response to changes in PCO<sub>2</sub> or voltage (Fig 2D; Butler & Dale 2023, 10.3389/fncel.2023.1330983; Lovatt et al 2025, 10.1101/2025.03.12.642803, 10.1101/2025.01.23.634273). Our previous papers have shown many times that parental HeLa cells do not load with dye in response to CO<sub>2</sub> or zero Ca<sup>2+</sup> (e.g. Huckstepp et al 2010, 10.1113/jphysiol.2010.192096; Meigh et al 2013, 10.7554/eLife.01213; Meigh et al 2014, 10.7554/eLife.04249), and we have shown that parental HeLa cells do not exhibit the same CO<sub>2</sub>-dependent change in whole cell conductance that the Cx43-expressing cells do (Fig 2B). In addition, we have shown that mutating key residues in Cx43 alters both CO<sub>2</sub>-sensitive release of ATP and the CO<sub>2</sub>-dependent dye loading without affecting the respective positive control. To bolster this, we have included data for the K144R mutation as a supplement to Fig 3. Given the expense of Gap26, it is impractical to include this as a standard control, and it is unnecessary given the comprehensive controls outlined.

      Collectively, these data show that the responses to CO<sub>2</sub> require expression of Cx43 and can be modified by mutation of Cx43.

      (2) Strengthen the manuscript by measuring the effects of CO<sub>2</sub> on cytosolic pH and Cx43 hemichannel opening. Consider using tail truncation mutants to assess the role of the C-terminal pH sensor in CO<sub>2</sub>-mediated channel opening.

      We agree and have performed the suggested experiments to address this issue.

      (3) Investigate the effect of expressing the K105E/K109E Cx43 double mutant on cell viability.

      In our experiments the cells look completely healthy based on their morphology in brightfield microscopy and growth rates. 

      (4) Discuss and analyze the uniqueness of Cx43 among alpha connexins in maintaining the carbamylation motif.

      We now discuss this: Cx43 is not unique. We have added a molecular phylogenetic survey of the alpha connexin clade in Fig 12. Apart from Cx37, the carbamylation motif appears in all the other members of the clade (but not necessarily in the human orthologue). In a different manuscript, currently posted on bioRxiv, we have documented the CO<sub>2</sub> sensitivity of Cx50 and its dependence on the motif.

      (5) Consider omitting data on ODDD-associated mutations unless there is evidence linking CO<sub>2</sub> sensitivity to disease pathology.

      This experiment is observational, and we are not making claims that there is a direct causal link. Removing the ODDD mutant findings would lose potentially useful information for anyone studying how these mutations alter channel function. We have reworded the text to ensure that we say that the link between loss of CO<sub>2</sub> sensitivity and ODDD remains unproven.

      (6) Justify the choice of high K<sup>+</sup> and low external calcium as a positive control in ATP release experiments.

      These two manipulations can open the hemichannel independently of the CO<sub>2</sub> stimulus. Extracellular Ca<sup>2+</sup> is well known to block all connexin hemichannels, and Cx43 is known to be voltage sensitive. The depolarisation from high K<sup>+</sup> is effective at opening the hemichannel and we preferred this as a more physiological way of opening the Cx43 hemichannel. We have added some explanatory text.

      (7) Clarify whether Cx43A44V or Cx43L90V mutations block gap junctional coupling.

      This is an interesting point. Since Cx43 GJCs are not CO<sub>2</sub> sensitive we feel this is beyond the scope of our paper. 

      (8) Discuss the potential implications of pCO₂ changes on myocardial function through alterations in intracellular pH.

      We have modified the discussion to consider this point.

      Reviewer #1 (Recommendations for the authors):

      (1) Measurements of the effects of CO<sub>2</sub> on cytosolic pH/Cx43 hemichannel opening would strengthen the manuscript. Since the pH sensor of Cx43 is on the C terminus, the authors could consider making tail truncation mutants to see how this affects CO<sub>2</sub>-mediated Cx43 channel opening.

      We have done this (truncating after residue 256); the channel remains highly CO<sub>2</sub> and voltage sensitive. We have also documented the effect of the hypercapnic solutions on intracellular pH measured with BCECF. These new data are now included as figure supplements to Figure 2.

      (2) What is the impact of expressing the K105E / K109E Cx43 double mutant on cell viability?

      There was no obvious impact: cell density was as expected (no evidence of increased cell death), and brightfield and fluorescence visualisation indicated normal healthy cells. We have added a movie (Fig 9, movie supplement 1) to show the effect of La<sup>3+</sup> on the GRAB<sub>ATP</sub> signal in cells expressing Cx43<sup>K105E, K109E</sup> so readers can appreciate the morphology and its stability during the recording.

      (3) A quick look at other alpha connexins suggested that Cx43 was unique among alpha connexins in maintaining the carbamylation motif. This merits additional discussion/ analysis.

      This is an interesting point. Cx43 is not unique in the alpha clade in having the carbamylation motif, as a number of other human alpha connexins also possess it (Cx50, Cx59 and Cx62), and non-human alpha connexins (Cx40, Cx59, Cx46) also possess the motif. We have shown that Cx50 is CO<sub>2</sub> sensitive. We have performed a brief molecular phylogenetic analysis of the alpha connexin clade to highlight the occurrence of the carbamylation motif. This is now presented as Fig 12 to go with the accompanying discussion.

      (4) There were some minor writing issues that should be addressed. For instance, fEPSP is not defined. Also, insets showing positive controls in some experiments were not described in the figure legends.

      We have corrected these issues.

      Reviewer #2 (Recommendations for the authors):

      (1) I would omit the data on the ODDD-associated mutations since there is no evidence that loss of CO<sub>2</sub> sensitivity plays an important role in the underlying disease pathology.

      We are not making the claim CO<sub>2</sub> loss leads to the underlying pathology and have reviewed the text to ensure that we clearly express that this is a correlation not a cause. We think this is worth retaining as many pathological mutations in other CO<sub>2</sub> sensitive connexins (Cx26, Cx32 and Cx50) cause loss of CO<sub>2</sub> sensitivity, and this information may be helpful to other researchers.

      (2) Why is high K+ rather than low external calcium used as a positive control in ATP release experiments?

      We used high K<sup>+</sup> and depolarisation as a positive control as we regard this as a more physiological stimulus than low external Ca<sup>2+</sup>.

      (3) Does Cx43A44V or Cx43L90V block gap junctional coupling?

      An interesting question but we have not examined this.

      (4) Provide references for biophysical recordings of Cx43 hemichannels performed in HEPES-buffered salines, which document Cx43 hemichannels as being shut.

      We have added the original and some later references which examine Cx43 hemichannel gating in HEPES buffer and show the need for substantial depolarisation to induce channel opening.

      (5) In the heart muscle, changes in PCO<sub>2</sub> have long been hypothesized to cause changes in myocardial function by changing pHi.

      This is true and we now add some discussion of this point. Now that we know that Cx43 is directly sensitive to CO<sub>2</sub> a direct action of CO<sub>2</sub> cannot be ruled out and careful experimentation is required to test this possibility. 

      Reviewer #3 (Recommendations for the authors):

      (1) Page 3: "... homologs of K125 and R104 ... ": the context is linked to Cx26, so Cx26 needs to be added here.

      Done

      (2) Page 4 text and related Figure 2:

      (a) Figure 2A&B: PCO2-dependent Cx43 HC opening is clearly present in the carboxy-fluorescein dye uptake experiments (Figure 2A) as well as in the electrophysiological experiments (Figure 2B). The curves look quite different between these two distinct readouts: dye uptake doubles from 20 to 70 mmHg in Figure 2A while the electrophysiological data double from 45 to 70 mmHg in Figure 2B. These responses look quite distinct and may be linked to a non-linearity of the dye uptake assay or a problem in the electrophysiological measurements of Figure 2B discussed in the next point.

      Different molecules/ions may have different permeabilities through the channel, which could explain the observed difference. Also, there is some contamination of the whole cell conductance change with another conductance (evident in recordings from parental HeLa cells). This is evident particularly at 70 mmHg. If this contaminating conductance were subtracted from the total conductance in the Cx43 expressing cells, then the dose response relations would be more similar. However, we are reluctant to add this additional data processing step to the paper.

      (b) The traces in Figure 2B show that the HC current is inward at 20 mmHg PCO2, while it switches to an outward current at 55mmHg PCO2. HCs are non-selective channels, so their current should switch direction around 0 mV but not at -50 mV. As such, the -50 mV switching point indicates involvement of another channel distinct from non-selective Cx43 hemichannels.

      We think that our incomplete description in the legend led to this misunderstanding. We used a baseline of 35 mmHg (where the channels will be slightly open) and changed to 20 mmHg to close them (or to higher PCO<sub>2</sub> to open them from this baseline), hence a decrease in conductance and loss of outward current for 20 mmHg. The holding potential for the recordings and voltage steps were the same in all recordings. We have now edited the legend and added more information into the methods to clarify this and how we constructed the dose response curve.

      We agree that Cx43 hemichannels are relatively nonselective and would normally be expected to have a reversal potential around 0 mV, but we are using K-Gluconate and the lowered reversal potential (~-65 mV) is likely due to poor permeation of this anion via Cx43.

      (c) A Hill slope of 6 is reported for this curve, which is extremely steep. The paper does not provide any further consideration, making this an isolated statement without any theoretical framework to understand the present finding in such context (i.e., in relation to the PCO2 dependency of Cx channels).

      Yes, we agree. It seems to be the case with all CO<sub>2</sub> sensitive connexins that we have looked at that the Hill coefficient versus CO<sub>2</sub> is >4. Hemichannels are of course hexameric, so there is potential for 6 CO<sub>2</sub> molecules to be bound and for extensive cooperativity. We have modified the text to give greater context.
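
      As a rough illustration of what a Hill coefficient of this size implies, the following minimal sketch evaluates a standard Hill equation and the fold-change in stimulus needed to move from 10% to 90% of the maximal response. The function names and the 40 mmHg midpoint are hypothetical choices for illustration only; they are not fitted values from the manuscript.

```python
def hill(pco2, ec50, n):
    """Fractional response of a Hill equation at a given PCO2 (mmHg).

    ec50 is the PCO2 giving a half-maximal response and n is the Hill
    coefficient. Both are illustrative inputs, not fitted values.
    """
    return pco2 ** n / (ec50 ** n + pco2 ** n)


def fold_range_10_to_90(n):
    """Ratio of the stimulus giving 90% response to that giving 10%.

    Solving the Hill equation for 0.9 and 0.1 gives a ratio of 81**(1/n),
    independent of ec50.
    """
    return 81.0 ** (1.0 / n)


if __name__ == "__main__":
    # Hypothetical midpoint of 40 mmHg, chosen only for illustration.
    for n in (1, 4, 6, 10):
        print(n, round(hill(70, 40, n), 3), round(fold_range_10_to_90(n), 2))
```

      With n = 1 the response needs an 81-fold change in stimulus to go from 10% to 90%, whereas with n = 6 roughly a two-fold change suffices, which is the sense in which such a curve is steep.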

      (d) A further remark to Figure 2 is that it does not contain any experiment showing the effect of Cx43 hemichannel inhibition with a reliable HC inhibitor such as Gap26, which is only used in the penultimate illustration of Figure 10. Gap26 should be used in Figure 2 and most of the other figures to show evidence of HC contribution. The lanthanum ions used in Figure 9 are a very non-specific hemichannel blocker and should be replaced by experiments with Gap26.

      We have addressed the first part of this comment above.

      We agree that La<sup>3+</sup> blocks all hemichannels, but in the context of our experiments and the controls we have performed it is entirely adequate and supports our conclusions. Our controls (mentioned above and below) show that the expression of Cx43 is absolutely required for CO<sub>2</sub>-dependent ATP release (and dye loading). In Figure 9 our use of La<sup>3+</sup> was to show the presence of a constitutively open Cx43 mutant hemichannel. Gap26 would add little to this. Our further controls show that with expression of Cx43<sup>WT</sup>, La<sup>3+</sup> did nothing to the ATP signal under baseline conditions (20 mmHg), supporting our conclusion that the mutant channels are constitutively open.

      (e) As the experiments of Figure 2 form the basis of what is to follow, the above remarks cast doubt on the robustness of the experiments and the data produced.

      We disagree. Our results are extremely robust: 1) we have used three independent assays to confirm the presence of the response; 2) parental HeLa cells do not release ATP, dye load or show large conductance changes in response to CO<sub>2</sub>, showing the absolute requirement for expression of Cx43; 3) mutations of Cx43 (in the carbamylation motif) alter the CO<sub>2</sub>-evoked ATP release and dye loading, giving further confirmation of Cx43 as the conduit for ATP release and dye loading; and 4) we use standard positive controls (0 Ca<sup>2+</sup>, high K<sup>+</sup>) to confirm that cells still have functional channels for those mutations that modified CO<sub>2</sub> sensitivity.

      (f) The sentence "Cells transfected with GRAB-ATP only, showed ... " should be modified to "In contrast, cells not expressing Cx43 showed no responses to any applied CO2 concentration as concluded from GRAB-ATP experiments"

      We have modified the text.

      (3) Page 5 and Figures 3 & 4:

      (a) Figure 3 illustrates results obtained with mutations of 4 distinct Lys residues. However, the corresponding legend indicates mutations that are different from the ones shown in the corresponding illustrations, making it impossible to reliably understand and interpret the results shown in panels A-E.

      Thanks for pointing this out. Our apologies, we modified the figure so that the order of the images matched the order of the graph (and the legend) but then forgot to put the new version of the figure in the text. We have now corrected this so that Figure and legend match.

      (b) Figure 4 lacks control WT traces!

      The controls for this (showing that parental HeLa cells do not release ATP in response to CO<sub>2</sub> or depolarisation) are shown in Figure 2.

      (c) Figure 4, Supplement 1: High Hill coefficients of 10 are shown here, but they are not discussed anywhere, as is also the case for the remark on p.4. A Hill steepness of 10 is huge and points to many processes potentially involved. As reported above, these data are floating around in the manuscript without any connection.

      Yes, we agree this is very high and surprising. It may reflect as mentioned above the hexameric nature of the channel and that 4 Lys residues seem to be involved. We have used this equation to give some quantitative understanding of the effect of the mutations on CO<sub>2</sub> sensitivity and still think this is useful. We have no further evidence to interpret these values one way or the other.

      (4) Page 6: Carbamate bridges are proposed to be formed between K105 and K144, and between K109 and K234. The first three of these Lysine residues are located in the 55aa long cytoplasmic loop of Cx43, while K234 is in the juxta membrane region involved in tubulin interactions. Both K144 and K234 are involved in Cx43 HC inhibition: K144 is the last aa of the L2 peptide (D119-K144 sequence) that inhibits Cx43 hemichannels while K234 is the first aa of the TM2 peptide that reduces hemichannel presence in the membrane (sequence just after TM4, at the start of the C-tail). This context should be added to increase insight and understanding of the CO2 carbamylation effects on Cx43 hemichannel opening.

      Thanks for suggesting this. We have added some discussion of CT to CL interactions in the context of regulation by pH and [Ca<sup>2+</sup>].

      (5) Page 7: The Cx43 ODDD A44V and L90V mutations lead to loss of pCO2 sensitivity in dye loading and ATP assays. However, A44V located in EL1 is reportedly associated with Cx43 HC activation, while L90V in TM2 is associated with HC inhibition. Remarkably, these mutations are focused on non-Lys residues, which brings up the question of how to link this to the paper's main thread.

      This follows the pattern that we have seen for other mutations such as A40V, A88V in Cx26 and several CMTX mutations of Cx32. Our cryoEM structures of Cx26 suggest that these mutations alter the flexibility of the molecule and hence abolish CO<sub>2</sub> sensitivity. We have reworded the text to avoid giving the impression that there is a demonstrated link between loss of CO<sub>2</sub> sensitivity of Cx43 and pathology.

      (6) Page 8: HCs constitutively open - 'constitutively' perhaps does not have the best connotation as it is not related to HC constitution but CO2 partial pressure.

      Yes, we agree and have reworded this.

      (7) Page 9: "in all subtypes" -> not clear what is meant - do you mean "in all cell types"?

      We agree this is unclear; it refers to all astrocytic subtypes. We have amended the text.

      (8) Page 10: Composition of hypocapnic recording solution: bubbling description is incomplete "95%O2/5%" and should be "95%O2/5%CO2".

      Changed.

      (9) Page 11: Composition of zero Ca<sup>2+</sup> hypocapnic recording solution: perhaps better to call this "nominally Ca<sup>2+</sup>-free hypocapnic recording solution" as no Ca<sup>2+</sup> buffer is included in this solution.

      Thanks for pointing this out. We did in fact add 1 mM EGTA to the solutions but omitted this from the recipe. This has now been corrected.

      (10) Page 11: in M&M I found that the NaHCO3- is lowered to 10 mM in the zero Ca<sup>2+</sup> condition, while the control experimental condition has 26 mM NaHCO3-. The zero Ca condition should be kept at a physiologically normal 26 mM NaHCO3- concentration, so why was this done? Lowering NaHCO3- during hemichannel stimulation may result in smaller responses and introduce non-linearities.

      For the dye loading we used 20 mmHg as the baseline condition and increased PCO<sub>2</sub> from this. Hence for the zero Ca<sup>2+</sup> positive control we modified the 20 mmHg hypocapnic solution by substituting Mg<sup>2+</sup> for Ca<sup>2+</sup> and adding EGTA. We have modified the text in the Methods to clarify this.

      Further remarks on the figures:

      (1) Figure 2A: Add 20 & 70 mmHg to the images, to improve the readability of this illustration.

      Done

      (2) Figure 3: WT responses are shown in panel F, but experimental data (images and curves) are lacking and should be included in a revised version.

      The wild type data is shown in Fig 2A. We have some sympathy for the comment, but we felt that Fig 2 should document CO<sub>2</sub> sensitivity, and then the subsequent Figs should analyse its basis. Hence the separation of Cx43<sup>WT</sup> data from the mutant data. In panel F, we state that we have recalculated the WT data from Fig 2A to allow the comparison.

      (3) Figures 4, 6, 8: Color codes for mmHg CO<sub>2</sub> pressure make reading these figures difficult; perhaps better to add mmHg values directly in relation to the traces.

      We have considered this suggestion but feel that the figures would become very cluttered with the additional labelling.

      (4) I wouldn't use colored lines when not necessary, e.g., Figure 9 100 µM La3+; Figure 10 (add 20->35 mmHg PCO2 switch; add scrGap26 above blue bars); Figure 11C & D.

      We agree and can see that in Figs 9 and 10 this muddles our colour scheme in other figures so have modified these figures. There was not space to put the suggested labels.

      (5) The mechanism of increased HC opening is not clear.

      We agree and have discussed various options and the analogy with what we know about Cx26. Ultimately new cryo-EM data is required.

      (6) Figure 10: 35G/35S are weird abbreviations for 35 mmHg Gap26 and scrambled Gap26.

      Yes, but we used these to fit into the available space.

      (7) Figure 11, legend: '20 mmHg PCO2 for each transfection for 70 mmHg PCO2'. It is not clear what is meant here.

      Thanks for pointing this out, we have reworded this to ensure clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The manuscript by Choi and colleagues investigates the impact of variation in cortical geometry and growth on cortical surface morphology. Specifically, the study uses physical gel models and computational models to evaluate the impact of varying specific features/parameters of the cortical surface. The study makes use of this approach to address the topic of malformations of cortical development and finds that cortical thickness and cortical expansion rate are the drivers of differences in morphogenesis.

      The study is composed of two main sections. First, the authors validate numerical simulation and gel model approaches against real cortical postnatal development in the ferret. Next, the study turns to modelling malformations in cortical development using modified tangential growth rate and cortical thickness parameters in numerical simulations. The findings investigate three genetically linked cortical malformations observed in the human brain to demonstrate the impact of the two physical parameters on folding in the ferret brain.

      This is a tightly presented study that demonstrates a key insight into cortical morphogenesis and the impact of deviations from normal development. The dual physical and computational modeling approach offers the potential for unique insights into mechanisms driving malformations. This study establishes a strong foundation for further work directly probing the development of cortical folding in the ferret brain. One weakness of the current study is that the interpretation of the results in the context of human cortical development is at present indirect, as the modelling results are solely derived from the ferret. However, these modelling approaches demonstrate proof of concept for investigating related alterations more directly in future work through similar approaches to models of the human cerebral cortex.

      We thank the reviewer for the very positive comments. While the current gel and organismal experiments focus on the ferret only, we want to emphasize that our analysis does consider previous observations of human brains and morphologies therein (Tallinen et al., Proc. Natl. Acad. Sci. 2014; Tallinen et al., Nat. Phys. 2016), which we compare and explain. This allows us to analyze the implications of our study broadly to understand the explanations of cortical malformations in humans using the ferret to motivate our study. Further analysis of normal human brain growth using computational and physical gel models can be found in our companion paper (Yin et al., 2025), now also published to eLife: S. Yin, C. Liu, G. P. T. Choi, Y. Jung, K. Heuer, R. Toro, L. Mahadevan, Morphogenesis and morphometry of brain folding patterns across species. eLife, 14, RP107138, 2025. doi:10.7554/eLife.107138

      In future work, we plan to obtain malformed human cortical surface data, which would allow us to further investigate related alterations more directly. We have added a remark on this in the revised manuscript (please see page 8–9).

      Reviewer #2 (Public review):

      Summary:

      Based on MRI data of the ferret (a gyrencephalic non-primate animal, in whom folding happens postnatally), the authors create in vitro physical gel models and in silico numerical simulations of typical cortical gyrification. They then use genetic manipulations of animal models to demonstrate that cortical thickness and expansion rate are primary drivers of atypical morphogenesis. These observations are then used to explain cortical malformations in humans.

      Strengths:

      The paper is very interesting and original, and combines physical gel experiments, numerical simulations, as well as observations in MCD. The figures are informative, and the results appear to have good overall face validity.

      We thank the reviewer for the very positive comments.

      Weaknesses:

      On the other hand, I perceived some lack of quantitative analyses in the different experiments, and currently, there seems to be rather a visual/qualitative interpretation of the different processes and their similarities/differences. Ideally, the authors also quantify local/pointwise surface expansion in the physical and simulation experiments, to more directly compare these processes. Time courses of eg, cortical curvature changes, could also be plotted and compared for those experiments. I had a similar impression about the comparisons between simulation results and human MRI data. Again, face validity appears high, but the comparison appeared mainly qualitative.

      We thank the reviewer for the comments. Besides the visual and qualitative comparisons between the models, we would like to point out that we have included the quantification of the shape difference between the real and simulated ferret brain models via spherical parameterization and the curvature-based shape index as detailed in main text Fig. 4 and SI Section 3. We have also utilized spherical harmonics representations for the comparison between the real and simulated ferret brains at different maximum order N. In our revision, we have included more calculations for the comparison between the real and simulated ferret brains at more time points in the SI (please see SI page 6). As for the comparison between the malformation simulation results and human MRI data in the current work, since the human MRI data are two-dimensional while our computational models are three-dimensional, we focus on the qualitative comparison between them. In future work, we plan to obtain malformed human cortical surface data, from which we can then perform the parameterization-based and curvature-based shape analysis for a more quantitative assessment.

      I felt that MCDs could have been better contextualized in the introduction.

      We thank the reviewer for the comment. In our revision, we have revised the description of MCDs in the introduction (please see page 2).

      Reviewer #1 (Recommendations for the authors):

      The study is beautifully presented and offers an excellent complement to the work presented by Yin et al. In its current form, the malformation portion of the study appears predominantly reliant on the numerical simulations rather than the gel model. It might be helpful, therefore, to further incorporate the results presented in Figure S5 into the main text, as this seems to be a clear application of the physical gel model to modelling malformations. Any additional use of the gel models in the malformation portion of the study would help to further justify the necessity and complementarity of the dual methodological approaches.

      We thank the reviewer for the suggestion. We have moved Fig. S5 and the associated description to the main text in the revised manuscript (please see the newly added Figure 5 on page 6 and the description on page 5–7). In particular, we have included a new section on the physical gel and computational models for ferret cortical malformations right before the section on the neurology of ferret and human cortical malformations.

      One additional consideration is that the analyses in the current study focus entirely on the ferret cortex. Given the emphasis in the title on the human brain, it may be worthwhile to either consider adding additional modelling of the human cortex or to consider modifying the title to more accurately align with the focus of the methods/results.

      We thank the reviewer for the suggestion. While the current gel and organismal experiments focus on the ferret only, we want to emphasize that our analysis does consider previous observations of human brains and morphologies therein (Tallinen et al., Proc. Natl. Acad. Sci. 2014; Tallinen et al., Nat. Phys. 2016), which we compare and explain. This allows us to analyze the implications of our study broadly to understand the explanations of cortical malformations in humans using the ferret to motivate our study. Therefore, we think that the title of the paper seems reasonable. To further highlight the connection between the ferret brain simulations and human brain growth, we have included an additional comparison between human brain surface reconstructions adapted from a prior study and the ferret simulation results in the SI (please see SI Section S4 and SI Fig. S5 on page 9–10).

      Two additional minor points:

      Table S1 seems sufficiently critical to the motivation for the study and organization of the results section to justify inclusion in the main text. Of course, I would leave any such minor changes to the discretion of the authors.

      We thank the reviewer for the suggestion. We have moved Table S1 and the associated description to the main text in the revised manuscript (please see Table 1 on page 7).

      Page 7, Column 1: “macacques” → “macaques”.

      We thank the reviewer for pointing out the typo. We have fixed it in the revised manuscript (please see page 8).

      Reviewer #2 (Recommendations for the authors):

      The methods lack details on the human MRI data and patients.

      We thank the reviewer for the comment. Note that the human MRI data and patients were from prior works (Smith et al., Neuron 2018; Johnson et al., Nature 2018; Akula et al., Proc. Natl. Acad. Sci. 2023) and were used for the discussion on cortical malformations in Fig. 6. In the revision, we have included a new subsection in the Methods section and provided more details and references of the MRI data and patients (please see page 9–10).

    1. Some people, including school professionals, root their beliefs about gender norms or the inappropriateness of homosexuality in their cultural background or religious tradition. Cultural beliefs and religious texts often are interpreted to mean that LGBTQ people are aberrant, sinful, or at the very least unacceptable

      I think it is important to remember that religion and culture are not fixed. They have changed across history and will continue to change as society develops. Many ideas that were once seen as absolute were later reinterpreted or replaced. So when some people use tradition to justify strict beliefs about gender or sexuality, they may be holding on to only one version of that tradition. If we look at the past, we can see that many cultures and even some religious communities once accepted more diverse gender roles.

    2. Transgender students themselves also may feel pressured to conform to the gender binary, hiding their birth gender or deciding to be as gender normative in their chosen gender as possible so as not to raise any suspicions

      I am curious about how norms will change in the future. For a long time, society has created fixed expectations for men and women, and these ideas became so common that people often forget they are learned. As transgender people become more visible and more accepted, I wonder if new expectations will slowly form around them too. It is possible that society will start creating its own image of what a transgender person should look like, act like, or live like, even though the whole point of acceptance is to allow people to live freely. I think this shows how important it is to stay aware of how norms form, so we do not turn one kind of freedom into another kind of pressure.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Response to referee comments: RC-2025-03008


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary In this article, the authors used synthetic TALE DNA-binding proteins, tagged with YFP, which were designed to target five specific repeat elements in the Trypanosoma brucei genome, including centromere- and telomere-associated repeats and those of a transposon element. This was done in order to detect and identify, using YFP pulldown, specific proteins that bind to these repetitive sequences in T. brucei chromatin. Validation of the approach was done using a TALE protein designed to target the telomere repeat (TelR-TALE), which detected many of the proteins that were previously implicated in telomeric functions. A TALE protein designed to target the 70 bp repeats that reside adjacent to the VSG genes (70R-TALE) detected proteins that function in DNA repair, and the protein designed to target the 177 bp repeat arrays (177R-TALE) identified kinetochore proteins associated with T. brucei megabase chromosomes, as well as with intermediate and mini-chromosomes, which implies that kinetochore assembly and segregation mechanisms are similar in all T. brucei chromosomes.

      Major comments: Are the key conclusions convincing? The authors reported that they have successfully used TALE-based affinity selection of proteins associated with repetitive sequences in the T. brucei genome. They claimed that this study has provided new information regarding the relevance of the repetitive region in the genome to chromosome integrity, telomere biology, chromosomal segregation and immune evasion strategies. These conclusions are based on high-quality research, and it basically merits publication, provided that some major concerns, raised below, will be addressed before acceptance for publication. 1. The authors used the TALE-YFP approach to examine the proteome associated with five different repetitive regions of the T. brucei genome and confirmed the binding of TALE-YFP with ChIP-seq analyses. Ultimately, they got the list of proteins that bound to synthetic proteins, by affinity purification and LC-MS analysis, and concluded that these proteins bind to different repetitive regions of the genome. There are two control proteins, one is TRF-YFP and the other KKT2-YFP, used to confirm the interactions. However, there is no experiment that confirms that the analysis gives some insight into the role of any putative or new protein in telomere biology, VSG gene regulation or chromosomal segregation. The proteins, which have already been reported by other studies, are mentioned. Although the authors discovered many proteins in these repetitive regions, their role is yet unknown. It is recommended to take one or more of the new putative proteins from the repetitive elements and show whether or not they (1) bind directly to the specific repetitive sequence (e.g., by EMSA); (2) it is recommended that the authors knock down one or a small sample of the newly discovered proteins, which may shed light on their function at the repetitive region, as a proof of concept.

      Response

      The main request from Referee 1 is for individual evaluation of protein-DNA interaction for a few candidates identified in our TALE-YFP affinity purifications, particularly using EMSA to identify binding to the DNA repeats used for the TALE selection. In our opinion, such an approach would not actually provide the validation anticipated by the reviewer. The power of TALE-YFP affinity selection is that it enriches for protein complexes that associate with the chromatin that coats the target DNA repetitive elements rather than only identifying individual proteins or components of a complex that directly bind to DNA assembled in chromatin.

      The referee suggests we express recombinant proteins and perform EMSA for selected candidates, but many of the identified proteins are unlikely to directly bind to DNA - they are more likely to associate with a combination of features present in DNA and/or chromatin (e.g. specific histone variants or histone post-translational modifications). Of course, a positive result would provide some validation but only IF the tested protein can bind DNA in isolation - thus, a negative result would be uninformative.

      In fact, our finding that KKT proteins are enriched using the 177R-TALE (minichromosome repeat sequence) identifies components of the trypanosome kinetochore known (KKT2) or predicted (KKT3) to directly bind DNA (Marciano et al., 2021; PMID: 34081090), and likewise the TelR-TALE identifies the TRF component that is known to directly associate with telomeric (TTAGGG)n repeats (Reis et al 2018; PMID: 29385523). This provides reassurance on the specificity of the selection, as does the lack of cross selectivity between different TALEs used (see later point 3 below). The enrichment of the respective DNA repeats quantitated in Figure 2B (originally Figure S1) also provides strong evidence for TALE selectivity.

      It is very likely that most of the components enriched on the repetitive elements targeted by our TALE-YFP proteins do not bind repetitive DNA directly. The TRF telomere binding protein is an exception - but it is the only obvious DNA binding protein amongst the many proteins identified as being enriched in our TelR-TALE-YFP and TRF-YFP affinity selections.

      The referee also suggests that follow up experiments using knockdown of the identified proteins found to be enriched on repetitive DNA elements would be informative. In our opinion, this manuscript presents the development of a new methodology previously not applied to trypanosomes, and referee 2 highlights the value of this methodological development which will be relevant for a large community of kinetoplastid researchers. In-depth follow-up analyses would be beyond the scope of this current study but of course will be pursued in future. To be meaningful such knockdown analyses would need to be comprehensive in terms of their phenotypic characterisation (e.g. quantitative effects on chromosome biology and cell cycle progression, rates and mechanism of recombination underlying antigenic variation, etc) - simple RNAi knockdowns would provide information on fitness but little more. This information is already publicly available from genome-wide RNAi screens (www.tritrypDB.org), with further information on protein location available from the genome-wide protein localisation resource (Tryptag.org). Hence basic information is available on all targets selected by the TALEs after RNAi knock down but in-depth follow-up functional analysis of several proteins would require specific targeted assays beyond the scope of this study.

      NonR-TALE-YFP does not have a binding site in the genome, but YFP protein should still be expressed by T. brucei clones with NLS. The authors have to explain why there is no signal detected in the nucleus, while a prominent signal was detected near kDNA (see Fig.2). Why is the expression of YFP in NonR-TALE almost not shown compared to other TALE clones?

      Response

      The NonR-TALE-YFP immunolocalisation signal indeed is apparently located close to the kDNA and away from the nucleus. We are not sure why this is so, but the construct is sequence validated and correct. However, we note that artefactual localisation of proteins fused to a globular eGFP tag, compared to a short linear epitope V5 tag, near to the kinetoplast has been previously reported (Pyrih et al, 2023; PMID: 37669165).

      The expression of NonR-TALE-YFP is shown in Supplementary Fig. S2 in comparison to other TALE proteins. Although it is evident that NonR-TALE-YFP is expressed at lower levels than other TALEs (the different TALEs have different expression levels), it is likely that in each case the TALE proteins would be in relative excess.

      It is possible that the absence of a target sequence for the NonR-TALE-YFP in the nucleus affects its stability and cellular location. Understanding these differences is tangential to the aim of this study.

      However, importantly, NonR-TALE-YFP is not the only control used for specificity in our affinity purifications. Instead, the lack of cross-selection of the same proteins by different TALEs (e.g. TelR-TALE-YFP, 177R-TALE-YFP) and the lack of enrichment of any proteins of interest by the well expressed ingiR-TALE-YFP or 147R-TALE-YFP proteins each provide strong evidence for the specificity of the selection using TALEs, as does the enrichment of similar protein sets following affinity purification of the TelR-TALE-YFP and TRF-YFP proteins, which both bind telomeric (TTAGGG)n repeats. Moreover, control affinity purifications to assess background were performed using cells that completely lack an expressed YFP protein, which further supports specificity (Figure 6).

      We have added text to highlight these important points in the revised manuscript:

      Page 8:

      "However, the expression level of NonR-TALE-YFP was lower than other TALE-YFP proteins; this may relate to the lack of DNA binding sites for NonR-TALE-YFP in the nucleus."

      Page 8:

      "NonR-TALE-YFP displayed a diffuse nuclear and cytoplasmic signal; unexpectedly the cytoplasmic signal appeared to be in the vicinity of the kDNA of the kinetoplast (mitochondrion). We note that artefactual localisation of some proteins fused to an eGFP tag has previously been observed in T. brucei (Pyrih et al, 2023)."

      Page 10:

      Moreover, a similar set of enriched proteins was identified in TelR-TALE-YFP affinity purifications whether compared with cells expressing no YFP fusion protein (No-YFP), the NonR-TALE-YFP or the ingiR-TALE-YFP as controls (Fig. S7B, S8A; Tables S3, S4). Thus, the most enriched proteins are specific to TelR-TALE-YFP-associated chromatin rather than to the TALE-YFP synthetic protein module or other chromatin.

      As a proof of concept, the author showed that the TALE method determined the same interacting partners enrichment in TelR-TALE as compared to TRF-YFP. And they show the same interacting partners for other TALE proteins, whether compared with WT cells or with the NonR-TALE parasites. It may be because NonR-TALE parasites have almost no (or very little) YFP expression (see Fig. S3) as compared to other TALE clones and the TRF-YFP clone. To address this concern, there should be a control included, with proper YFP expression.

      Response

      See response to point 2, but we reiterate that the ingiR-TALE-YFP and 147R-TALE-YFP proteins are well expressed (western, original Fig. S3, now Fig. S2), yet few proteins are detected as being enriched, and few correspond to those enriched in TelR-TALE-YFP or TRF-YFP affinity purifications (see Fig. S9). Therefore, the ingiR-TALE-YFP and 147R-TALE-YFP proteins provide good additional negative controls for specificity as requested. To further reassure the referee we have also included additional volcano plots which compare TelR-TALE-YFP, 70R-TALE-YFP or 177R-TALE-YFP to the ingiR-TALE-YFP affinity selection (new Figure S8). As with No-YFP or NonR-TALE-YFP controls, the use of ingiR-TALE-YFP as a negative control demonstrates that known telomere-associated proteins are enriched in TelR-TALE-YFP affinity purification, RPA subunits are enriched with 70R-TALE-YFP and kinetochore KKT proteins are enriched with 177R-TALE-YFP. These analyses demonstrate specificity in the proteins enriched following affinity purification of our different TALE-YFPs and provide support to strengthen our original findings.

      We now refer to use of No-YFP, NonR-TALE-YFP, and ingiR-TALE-YFP as controls for comparison to TelR-TALE-YFP, 70R-TALE-YFP or 177R-TALE-YFP in several places:

      Page 10:

      "Moreover, a similar set of enriched proteins was identified in TelR-TALE-YFP affinity purifications whether compared with cells expressing no YFP fusion protein (No-YFP), the NonR-TALE-YFP or the ingiR-TALE-YFP as controls (Fig. S7B, S8A; Tables S3, S4)."

      Page 11:

      "Thus, the nuclear ingiR-TALE-YFP provides an additional chromatin-associated negative control for affinity purifications with the TelR-TALE-YFP, 70R-TALE-YFP and 177R-TALE-YFP proteins (Fig. S8)."

      "Proteins identified as being enriched with 70R-TALE-YFP (Figure 6D) were similar in comparisons with either the No-YFP, NonR-TALE-YFP or ingiR-TALE-YFP as negative controls."

      Top Page 12:

      "The same kinetochore proteins were enriched regardless of whether the 177R-TALE proteomics data was compared with No-YFP, NonR-TALE or ingiR-TALE-YFP controls."

      Discussion Page 13:

      "Regardless, the 147R-TALE and ingiR-TALE proteins were well expressed in T. brucei cells, but their affinity selection did not significantly enrich for any relevant proteins. Thus, 147R-TALE and ingiR-TALE provide reassurance for the overall specificity for proteins enriched in TelR-TALE, 70R-TALE and 177R-TALE affinity purifications."

      After the artificial expression of repetitive sequence binding five-TALE proteins, the question is if there is any competition for the TALE proteins with the corresponding endogenous proteins? Is there any effect on parasite survival or health, compared to the control after the expression of these five TALEs YFP protein? It is recommended to add parasite growth curves, for all the TALE-proteins expressing cultures.

      Response

      Growth curves for cells expressing TelR-TALE-YFP, 177R-TALE-YFP and ingiR-TALE-YFP are now included (new Fig. S3A). No deficit in growth was evident while passaging the 70R-TALE-YFP, 147R-TALE-YFP and NonR-TALE-YFP cell lines (indeed they grew slightly better than controls).

      The following text has been added page 8:

      "Cell lines expressing representative TALE-YFP proteins displayed no fitness deficit (Fig. S3A)."

      Since the experiments were performed using whole-cell extracts without prior nuclear fractionation, the authors should consider the possibility that some identified proteins may have originated from compartments other than the nucleus. Specifically, the detection of certain binding proteins might reflect sequence homology (or partial homology) between mitochondrial DNA (maxicircles and minicircles) and repetitive regions in the nuclear genome. Additionally, the lack of subcellular separation raises the concern that cytoplasmic proteins could have been co-purified due to whole cell lysis, making it challenging to discern whether the observed proteome truly represents the nuclear interactome.

      Response

      In our experimental design, we confirmed bioinformatically that the repeat sequences targeted were not represented elsewhere in the nuclear or mitochondrial genome (kDNA). The absence of subcellular fractionation could result in some cytoplasmic protein selection, but this is unlikely: each TALE targets a specific DNA sequence but is otherwise identical, so cross-selection of the same contaminating protein set would be anticipated if there was significant non-specific binding. We have previously successfully affinity selected 15 chromatin modifiers and identified associated proteins without major issues concerning cytoplasmic protein contamination (Staneva et al 2021 and 2022; PMID: 34407985 and 36169304). Of course, the possibility that some proteins are contaminants will need to be borne in mind in any future follow-up analysis of proteins of interest that we identified as being enriched on specific types of repetitive element in T. brucei. Proteins that are also detected in negative control affinity selections such as No-YFP, NonR-TALE-YFP, ingiR-TALE or 147R-TALE must be disregarded.

      '6'. Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether? As mentioned earlier, the author claimed that this study has provided new information concerning telomere biology, chromosomal segregation mechanisms, and immune evasion strategies. But there are no experiments that provides a role for any unknown or known protein in these processes. Thus, it is suggested to select one or two proteins of choice from the list and validate their direct binding to repetitive region(s), and their role in that region of interaction.

      Response

      As highlighted in response to point 1 the suggested validation and follow up experiments may well not be informative and are beyond the scope of the methodological development presented in this manuscript. Referee 2 describes the study in its current form as "a significant conceptual and technical advancement" and "This approach enhances our understanding of chromatin organization in these regions and provides a foundation for investigating the functional roles of associated proteins in parasite biology."

      The Referee's phrase 'validate their direct binding to repetitive region(s)' here may also mean to test if any of the additional proteins that we identified as being enriched with a specific TALE protein actually display enrichment over the repeat regions when examined by an orthogonal method. A key unexpected finding was that kinetochore proteins including KKT2 are enriched in our affinity purifications of the 177R-TALE-YFP that targets 177bp repeats (Figure 6F). By conducting ChIP-seq for the kinetochore-specific protein KKT2 using YFP-KKT2 we confirmed that KKT2 is indeed enriched on 177bp repeat DNA but not flanking DNA (Figure 7). Moreover, several known telomere-associated proteins are detected in our affinity selections of TelR-TALE-YFP (Figure 6B, Fig. S6; see also Reis et al, 2018 Nuc. Acids Res. PMID: 29385523; Weisert et al, 2024 Sci. Reports PMID: 39681615).

      Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation. The answer to this question depends on what the authors want to present as the achievements of the present study. If the achievement of the paper is the creation of a new tool for discovering new proteins associated with the repeat regions, I recommend that they add proof of direct interactions between a sample of the newly discovered proteins and the relevant repeats, as the proof of concept discussed above. However, if the authors would like to claim that the study achieved new functional insights for these interactions, they will have to expand the study, as mentioned above, to support the proof of concept.

      Response

      See our response to point 1 and the point we labelled '6' above.

      Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments. I think that they are realistic. If the authors decide to check the capacity of a small sample of proteins (not previously known as repetitive-region binding proteins) to interact directly with the repeated sequence, it will add substantially to the study (e.g., by EMSA; estimated time: 1 month). If the authors decide to also check the function of at least one such newly detected protein (e.g., by KD), I estimate this will take 3-6 months.

      Response

      As highlighted previously, the proposed EMSA experiment may well be uninformative for protein complex components identified in our study or for isolated proteins that directly bind DNA in the context of a complex and chromatin. RNAi knockdown data and cell location data (as well as developmental expression and orthology data) are already available through tritrypDB.org and tryptag.org.

      Are the data and the methods presented in such a way that they can be reproduced? Yes

      Are the experiments adequately replicated, and statistical analysis adequate? The authors did not mention replicates. There is no statistical analysis mentioned.

      Response

      The figure legends indicate that all volcano plots of TALE affinity selections were derived from three biological replicates. Cutoffs used for significance: P… For ChIP-seq, two biological replicates were analysed for each cell line expressing the specific YFP-tagged protein of interest (TALE or KKT2). This is now stated in the relevant figure legends; apologies for this oversight. The resulting data are available for scrutiny at GEO: GSE295698.

      Minor comments: -Specific experimental issues that are easily addressable. The following suggestions can be incorporated: 1. Page 18, in the material method section author mentioned four drugs: Blasticidine, Phleomycin and G418, and hygromycin. It is recommended to mention the purpose of using these selective drugs for the parasite. If clonal selection has been done, then it should also be mentioned.

      Response

      We erroneously added information on several drugs used for selection in our laboratory. In fact, all TALE-YFP constructs carry the bleomycin resistance gene, which we select for using Phleomycin. Also, clones were derived by limiting dilution immediately after transfection.

      We have amended the text accordingly:

      Page 17/18:

      "Cell cultures were maintained below 3 x 10⁶ cells/ml. Phleomycin 2.5 mg/ml was used to select transformants containing the TALE construct BleoR gene."

      "Electroporated bloodstream cells were added to 30 ml HMI-9 medium and two 10-fold serial dilutions were performed in order to isolate clonal Phleomycin-resistant populations from the transfection. 1 ml of transfected cells were plated per well on 24-well plates (1 plate per serial dilution) and incubated at 37 °C and 5% CO₂ for a minimum of 6 h before adding 1 ml media containing 2X concentration Phleomycin (5 mg/ml) per well."

      In the method section the authors mentioned that there is only one site for binding of NonR-TALE in the parasite genome. But in Fig. 1C, the authors showed zero binding site. So, there is one binding site for NonR-TALE-YFP in the genome or zero?

      Response

      We thank the reviewer for pointing out this discrepancy. We have checked the latest Tb427v12 genome assembly for predicted NonR-TALE binding sites and there are no exact matches. We have corrected the text accordingly.

      Page 7:

      "A control NonR-TALE protein was also designed which was predicted to have no target sequence in the T. brucei genome."

      Page 17:

      "A control NonR-TALE predicted to have no recognised target in the T. brucei genome was designed as follows: BLAST searches were used to identify exact matches in the TREU927 reference genome. Candidate sequences with one or more match were discarded."

      The authors used two different anti-GFP antibodies, one from Roche and the other from Thermo Fisher. Why were two different antibodies used for the same protein?

      Response

      We have found that only some anti-GFP antibodies are effective for affinity selection of associated proteins, whereas others are better suited for immunolocalisation. The respective suppliers' antibodies were optimised for each application.

      Page 6: in the introduction, the authors give the number of total VSG genes as 2,634. Is it known how many of them are pseudogenes?

      Response

      This value corresponds to the number reported by Cosentino et al. 2021 (PMID: 34541528) for subtelomeric VSGs, which is similar to the value reported by Muller et al 2018 (PMID: 30333624) (2486), both in the same strain of trypanosomes as used by us. Based on the earlier analysis by Cross et al (PMID: 24992042), 80% of the identified VSGs in their study (2584) are pseudogenes. This approximates to the estimation by Cosentino of 346/2634 (13%) being fully functional VSG genes at subtelomeres, or 17% when considering VSGs at all genomic locations (433/2872).

      I found several typos throughout the manuscript.

      Response

      Thank you for raising this. We have read through the manuscript several times and hopefully corrected all outstanding typos.

      Fig. 1C: Table: below TOTAL 2nd line: the number should be 1838 (rather than 1828)

      Corrected, thank you.

      • Are prior studies referenced appropriately? Yes

      • Are the text and figures clear and accurate? Yes

      • Do you have suggestions that would help the authors improve the presentation of their data and conclusions? Suggested above

      Reviewer #1 (Significance (Required)):

      Describe the nature and significance of the advance (e.g., conceptual, technical, clinical) for the field: This study represents a significant conceptual and technical advancement by employing a synthetic TALE DNA-binding protein tagged with YFP to selectively identify proteins associated with five distinct repetitive regions of T. brucei chromatin. To the best of my knowledge, it is the first report to utilize TALE-YFP for affinity-based isolation of protein complexes bound to repetitive genomic sequences in T. brucei. This approach enhances our understanding of chromatin organization in these regions and provides a foundation for investigating the functional roles of associated proteins in parasite biology. Importantly, any essential or unique interacting partners identified could serve as potential targets for therapeutic intervention.

      • Place the work in the context of the existing literature (provide references, where appropriate). I agree with the information that has already described in the submitted manuscript, regarding its potential addition of the data resulted and the technology established to the study of VSGs expression, kinetochore mechanism and telomere biology.

      • State what audience might be interested in and influenced by the reported findings. These findings will be of particular interest to researchers studying the molecular biology of kinetoplastid parasites and other unicellular organisms, as well as scientists investigating chromatin structure and the functional roles of repetitive genomic elements in higher eukaryotes.

      • (1) Define your field of expertise with a few keywords to help the authors contextualize your point of view. (2) Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate. (1) Protein-DNA interactions/ chromatin/ DNA replication/ Trypanosomes (2) None

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary

      Carloni et al. comprehensively analyze which proteins bind repetitive genomic elements in Trypanosoma brucei. For this, they perform mass spectrometry on custom-designed, tagged programmable DNA-binding proteins. After extensively verifying their programmable DNA-binding proteins (using bioinformatic analysis to infer target sites, microscopy to measure localization, ChIP-seq to identify binding sites), they present, among others, two major findings: 1) 14 of the 25 known T. brucei kinetochore proteins are enriched at 177bp repeats. As T. brucei's 177bp repeat-containing intermediate-sized and mini-chromosomes lack centromere repeats but are stable over mitosis, Carloni et al. use their data to hypothesize that a 'rudimentary' kinetochore assembles at the 177bp repeats of these chromosomes to segregate them. 2) 70bp repeats are enriched with the Replication Protein A complex, which, notably, is required for homologous recombination. Homologous recombination is the pathway used for recombination-based antigenic variation of the 70bp-repeat-adjacent variant surface glycoproteins.

      Major Comments

      None. The experiments are well-controlled, claims well-supported, and methods clearly described. Conclusions are convincing.

      Response Thank you for these positive comments.

      Minor Comments

      1) Fig. 2 - I couldn't find an uncropped version showing multiple cells. If it exists, it should be linked in the legend or main text; Otherwise, this should be added to the supplement.

      Response

      The images presented represent reproducible analyses, and were independently verified by two of the authors. Although wider field of view images do not provide the resolution to be informative on cell location, as requested we have provided uncropped images in new Fig. S4 for all the cell lines shown in Figure 2A.

      In addition, we have included as supplementary images (Fig. S3B) additional images of TelR-TALE-YFP, 177R-TALE-YFP and ingiR-TALE-YFP localisation to provide additional support for their observed locations presented in Figure 1. The sets of cells and images presented in Figure 2A and in Fig. S3B were prepared and obtained by different authors, independently and reproducibly validating the location of the tagged protein.

      2) I think Suppl. Fig. 1 is very valuable, as it is a quantification and summary of the ChIP-seq data. I think the authors could consider making this a panel of a main figure. For the main figure, I think the plot could be trimmed down to only show the background and the relevant repeat for each TALE protein, leaving out the non-target repeats. (This relates to minor comment 6.) Also, I believe, it was not explained how background enrichment was calculated.

      Response

      We are grateful for the reviewer's positive view of original Fig. S1 and appreciate the suggestion. We have now moved this analysis to part B of main Figure 2 in the revised manuscript (now Figure 2B). We have also provided additional details in the Methods section on the approaches used to assess background enrichment.

      Page 19:

      Background enrichment calculation

      The genome was divided into 50 bp sliding windows, and each window was annotated based on overlapping genomic features, including CIR147, 177 bp repeats, 70 bp repeats, and telomeric (TTAGGG)n repeats. Windows that did not overlap with any of these annotated repeat elements were defined as "background" regions and used to establish the baseline ChIP-seq signal. Enrichment for each window was calculated using bamCompare, as log₂(IP/Input). To adjust for background signal amongst all samples, enrichment values for each sample were further normalized against the corresponding No-YFP ChIP-seq dataset.
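      The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' script: the function names, the non-overlapping 50 bp bins, the pseudocount of 1, and the choice to normalise by subtracting the No-YFP log₂ values are all assumptions made for the example.

```python
import math

def label_windows(n_windows, window_size, repeats):
    """Label each fixed-size window with the first repeat it overlaps.

    `repeats` is a list of (start, end, name) half-open intervals
    (hypothetical annotation format). Windows that overlap no repeat
    are labelled "background".
    """
    labels = []
    for i in range(n_windows):
        w_start, w_end = i * window_size, (i + 1) * window_size
        label = "background"
        for start, end, name in repeats:
            if start < w_end and end > w_start:  # interval overlap test
                label = name
                break
        labels.append(label)
    return labels

def window_log2_ratio(ip_counts, input_counts, pseudocount=1.0):
    """Per-window log2(IP/Input); the pseudocount (assumed) avoids log of zero."""
    return [
        math.log2((ip + pseudocount) / (inp + pseudocount))
        for ip, inp in zip(ip_counts, input_counts)
    ]

def normalise_to_control(sample_log2, control_log2):
    """Subtract the No-YFP log2 enrichment window by window (assumed scheme)."""
    return [s - c for s, c in zip(sample_log2, control_log2)]

def mean_by_label(labels, values):
    """Average enrichment per window class (each repeat type and background)."""
    sums, counts = {}, {}
    for label, value in zip(labels, values):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

      Comparing the mean for a repeat class against the mean for "background" windows then gives a per-sample estimate of enrichment over baseline.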

      Note: While revising the manuscript we also noticed that the script had a normalization error. We have therefore included a corrected version of these analyses as Figure 2B (old Fig. S1).

      3) Generally, I would plot enrichment on a log2 axis. This concerns several figures with ChIP-seq data.

      Response

      Our ChIP-seq enrichment is calculated by bamCompare. The resulting enrichment values are indeed log2 (IP/Input). We have made this clear in the updated figures/legends.

      4) Fig. 4C - The violin plots are very hard to interpret, as the plots are very narrow compared to the line thickness, making it hard to judge the actual volume. For example, in Centromere 5, YFP-KKT2 is less enriched than 147R-TALE over most of the centromere with some peaks of much higher enrichment (as visible in panel B), however, in panel C, it is very hard to see this same information. I'm sure there is some way to present this better, either using a different type of plot or by improving the spacing of the existing plot.

      Response

      We thank the reviewer for this suggestion; we have elected to provide a Split-Violin plot instead. This improves the presentation of the data for each centromere. The original violin plot in Figure 4C has been replaced with this Split-Violin plot (still Figure 4C).

      5) Fig. 6 - The panels are missing an x-axis label (although it is obvious from the plot what is displayed). Maybe the "WT NO-YFP vs" part that is repeated in all the plot titles could be removed from the title and only be part of the x-axis label?

      Response

      In fact, to save space the x-axis was labelled inside each volcano plot, but we neglected to indicate that the values are on a log2 scale indicating enrichment. This has been rectified - see Figure 6, and Fig. S7, S8 and S9.

      6) Fig. 7 - I would like to have a quantification for the examples shown here. In fact, such a quantification already exists in Suppl. Figure 1. I think the relevant plots of that quantification (YFP-KKT2 over 177bp-repeats and centromere-repeats) with some control could be included in Fig. 7 as panel C. This opportunity could be used to show enrichment separated out for intermediate-sized, mini-, and megabase-chromosomes. (relates to minor comment 2 & 8)

      Response

      The CIR147 sequence is found exclusively on megabase-sized chromosomes, while the 177 bp repeats are located on intermediate- and mini-sized chromosomes. Due to limitations in the current genome assembly, it is not possible to reliably classify all chromosomes into intermediate- or mini-sized categories based on their length. Therefore, original Supplementary Fig. S1 presented the YFP-KKT2 enrichment over CIR147 and 177 bp repeats as a representative comparison between megabase chromosomes and the remaining chromosomes (corrected version now presented as main Figure 2B). Additionally, to allow direct comparison of YFP-KKT2 enrichment on CIR147 and 177 bp repeats, we have included a new plot in Figure 7C which shows the relative enrichment of YFP-KKT2 on these two repeat types.

      We have added the following text, page 12:

      "Taking into account the relative number of CIR147 and 177 bp repeats in the current T. brucei genome (Cosentino et al., 2021; Rabuffo et al., 2024), comparative analyses demonstrated that YFP-KKT2 is enriched on both CIR147 and 177 bp repeats (Figure 7C)."

      7) Suppl. Fig. 8 A - I believe there is a mistake here: KKT5 occurs twice in the plot, the one in the overlap region should be KKT1-4 instead, correct?

      Response

      Thanks for spotting this. It has been corrected.

      8) The way that the authors mapped ChIP-seq data is potentially problematic when analyzing the same repeat type in different regions of the genome. The authors assigned reads that had multiple equally good mapping positions to one of these mapping positions, randomly. This is perfectly fine when analysing repeats by their type, independent of their position on the genome, which is what the authors did for the main conclusions of the work. However, several figures show the same type of repeat at different positions in the genome. Here, the authors risk that enrichment in one region of the genome 'spills' over to all other regions with the same sequence. Particularly, where they show YFP-KKT2 enrichment over intermediate- and mini-chromosomes (Fig. 7) due to the spillover, one cannot be sure to have found KKT2 in both regions. Instead, the authors could analyze only uniquely mapping reads / read-pairs where at least one mate is uniquely mapping. I realize that with this strict filtering, data will be much more sparse. Hence, I would suggest keeping the original plots and adding one more quantification where the enrichment over the whole region (e.g., all 177bp repeats on intermediate-/mini-chromosomes) is plotted using the unique reads (this could even be supplementary). This also applies to Fig. 4 B & C.

      Response

      We thank the reviewer for their thoughtful comments. Repetitive sequences are indeed challenging to analyze accurately, particularly in the context of short-read ChIP-seq data. In our study, we aimed to address YFP-KKT2 enrichment not only over CIR147 repeats but also on 177 bp repeats, using both ChIP-seq and proteomics with synthetic TALE proteins targeted to the different repeat types. We appreciate the referee's suggestion to consider uniquely mapped reads; however, in the updated genome assembly, the 177 bp repeats are frequently immediately followed by long stretches of 70 bp repeats which can span several kilobases. The size and repetitive nature of these regions exceed the resolution limits of ChIP-seq. It is therefore difficult to precisely quantify enrichment across all chromosomes.

      Additionally, the repeat sequences are highly similar, and relying solely on uniquely mapped reads would result in the exclusion of most reads originating from these regions, significantly underestimating the relative signals. To address this, we used Bowtie2 with settings that allow multi-mapping, assigning reads randomly among equivalent mapping positions, but ensuring each read is counted only once. This approach is designed to evenly distribute signal across all repetitive regions and preserve a meaningful average.
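
      The random-assignment scheme described above can be illustrated with a toy simulation. This is not Bowtie2 itself; the read names, locus names and function name are hypothetical, and the sketch only shows the counting logic (one randomly chosen position per read, each read counted once).

```python
import random
from collections import Counter

def assign_multimappers(read_candidates, seed=0):
    """Assign each read to exactly one of its equally good mapping positions.

    read_candidates maps a read name to the list of loci where it aligns equally well.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _read_name, loci in read_candidates.items():
        counts[rng.choice(loci)] += 1  # one count per read, never more
    return counts

# Hypothetical example: 3000 reads that each match three identical repeat copies.
copies = ["repeat_copy_1", "repeat_copy_2", "repeat_copy_3"]
reads = {f"read_{i}": copies for i in range(3000)}
counts = assign_multimappers(reads)
print(sum(counts.values()))  # 3000: every read contributes exactly one count
print(counts)                # roughly one third of the reads per copy
```

      With identical copies, the expected share per copy is equal regardless of where each read truly originated. The total signal for the repeat type is therefore preserved, while differences between individual identical copies cannot be resolved by this scheme.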

      Single molecule methods such as DiMeLo (Altemose et al. 2022; PMID: 35396487) will need to be developed for T. brucei to allow more accurate and chromosome specific mapping of kinetochore or telomere protein occupancy at repeat-unique sequence boundaries on individual chromosomes.

      Reviewer #2 (Significance (Required)):

      This work is of high significance for chromosome/centromere biology, parasitology, and the study of antigenic variation. For chromosome/centromere biology, the conceptual advancement of different types of kinetochores for different chromosomes is a novelty, as far as I know. It would certainly be interesting to apply this study as a technical blueprint for other organisms with mini-chromosomes or chromosomes without known centromeric repeats. I can imagine a broad range of labs studying other organisms with comparable chromosomes to take note of and build on this study. For parasitology and the study of antigenic variation, it is crucial to know how intermediate- and mini-chromosomes are stable through cell division, as these chromosomes harbor a large portion of the antigenic repertoire. Moreover, this study also found a novel link between the homologous repair pathway and variant surface glycoproteins, via the 70bp repeats. How and at which stages during the process, 70bp repeats are involved in antigenic variation is an unresolved, and very actively studied, question in the field. Of course, apart from the basic biological research audience, insights into antigenic variation always have the potential for clinical implications, as T. brucei causes sleeping sickness in humans and nagana in cattle. Due to antigenic variation, T. brucei infections can be chronic.

      Response

      Thank you for supporting the novelty and broad interest of our manuscript.

      My field of expertise / Point of view:

      I'm a computer scientist by training and am now a postdoctoral bioinformatician in a molecular parasitology laboratory. The laboratory is working on antigenic variation in T. brucei. The focus of my work is on analyzing sequencing data (such as ChIP-seq data) and algorithmically improving bioinformatic tools.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The authors only report the quality of the classification considering the number of videos used for training, but not considering the number of mice represented or the mouse strain. Therefore, it is unclear if the classification model works equally well in data from all the mouse strains tested, and how many mice are represented in the classifier dataset and validation.

      We agree that strain-level performance is critical for assessing generalizability. In the revision we now report per-strain accuracy and F1 for the grooming classifier, which was trained on videos spanning 60 genetically diverse strains (n = 1100 videos) and evaluated on the test set videos spanning 51 genetically diverse strains (n=153 videos). Performance is uniform across most strains (median F1 = 0.94, IQR = 0.899–0.956), with only modest declines in albino lines that lack contrast under infrared illumination; this limitation and potential remedies are discussed in the text. The new per-strain metrics are presented in the Supplementary figure (corresponding to Figure 4).
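
      A minimal sketch of how per-strain F1 could be computed from frame-level labels is given below. The record layout, strain names and function names are hypothetical and are not taken from the JABS code base; the F1 formula is the standard one.

```python
from collections import defaultdict
from statistics import median

def f1_score(tp, fp, fn):
    """Standard F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def per_strain_f1(records):
    """records: iterable of (strain, true_label, predicted_label), labels 0/1 per frame."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for strain, truth, pred in records:
        c = counts[strain]  # ensures every strain appears in the output
        if truth and pred:
            c["tp"] += 1
        elif pred:
            c["fp"] += 1
        elif truth:
            c["fn"] += 1
    return {strain: f1_score(**c) for strain, c in counts.items()}

# Hypothetical frame-level records for two strains.
records = [
    ("strain_A", 1, 1), ("strain_A", 1, 1), ("strain_A", 0, 1), ("strain_A", 1, 0),
    ("strain_B", 1, 1), ("strain_B", 0, 0),
]
scores = per_strain_f1(records)
print(scores)                   # strain_A is 2/3, strain_B is 1.0
print(median(scores.values()))  # median across strains
```

      Summarising the resulting per-strain values with a median and interquartile range gives the kind of statistic quoted above.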

      (2) The GUI requires pose tracking for classification, but the software provided in JABS does not do pose tracking, so users must do pose tracking using a separate tool. Currently, there is no guidance on the pose tracking recommendations and requirements for usage in JABS. The pose tracking quality directly impacts the classification quality, given that it is used for the feature calculation; therefore, this aspect of the data processing should be more carefully considered and described.

      We have added a section to the Methods describing how to use the pose estimation models used in JABS. The reviewer is correct that pose tracking quality will impact classification quality. We recommend that classifiers only be re-used on pose files generated by the same pose models used in the behavior classifier training dataset. We hope that the combination of sharing classifier training data and making a more unified framework for developing and comparing classifiers will get us closer to having foundational behavior classification models that work in many environments. We would also like to emphasize that deviating from our pose model will likely hinder re-use of our shared large datasets in JABS-AI (JABS1200, JABS600, JABS-BxD).

      (3) Many statistical and methodological details are not described in the manuscript, limiting the interpretability of the data presented in Figures 4,7-8. There is no clear methods section describing many of the methods used and equations for the metrics used. As an example, there are no details of the CNN used to benchmark the JABS classifier in Figure 4, and no details of the methods used for the metrics reported in Figure 8.

      We thank the reviewer for bringing this to our attention. We have added a Methods section to the manuscript to address this concern. Specifically, we now provide: (1) clearer citation of the source of the CNN experiments so that the reader can locate the architecture information; (2) mathematical formulations for all performance metrics (precision, recall, F1, …) with explicit equations; (3) detailed statistical procedures, including permutation testing methods, power analysis and multiple testing corrections used throughout Figures 7-8. These additions facilitate reproducibility and proper interpretation of all quantitative results presented in the manuscript.
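
      For reference, the standard frame-level definitions of these metrics, in terms of true positives (TP), false positives (FP) and false negatives (FN), are given below. These are the textbook forms; it is assumed, but not confirmed by this response, that the equations in the revised Methods match them.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}.
\end{align}
```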

      Reviewer #2 (Public review):

      (1) The manuscript as written lacks much-needed context in multiple areas: what are the commercially available solutions, and how do they compare to JABS (at least in terms of features offered, not necessarily performance)? What are other open-source options?

      JABS adds to a list of commercial and open source animal tracking platforms. There are several reviews and resources that cover these technologies. JABS covers hardware, behavior prediction, a shared resource for classifiers, and genetic association studies. We’re not aware of another system that encompasses all these components. Commercial packages such as EthoVision XT and HomeCage Scan give users a ready-made camera-plus-software solution that automatically tracks each mouse and reports simple measures such as distance travelled or time spent in preset zones, but they do not provide open hardware designs, editable behavior classifiers, or any genetics workflow. At the open-source end, the >100 projects catalogued on OpenBehavior and summarised in recent reviews (Luxem et al., 2023; Işık & Ünal 2023) usually cover only one link in the chain—DIY rigs, pose-tracking libraries (e.g., DeepLabCut, SLEAP) or supervised and unsupervised behaviour-classifier pipelines (e.g., SimBA, MARS, JAABA, B-SOiD, DeepEthogram). JABS provides an open source ecosystem that integrates all four: (i) top-down arena hardware with parts list and assembly guide; (ii) an active-learning GUI that produces shareable classifiers; (iii) a public web service that enables sharing of the trained classifier and applies any uploaded classifier to a large and diverse strain survey; and (iv) built-in heritability, genetic-correlation and GWAS reporting. We have added a concise paragraph in the Discussion that cites these resources and makes this end-to-end distinction explicit.

      (2) How does the supervised behavioral classification approach relate to the burgeoning field of unsupervised behavioral clustering (e.g., Keypoint-MoSeq, VAME, B-SOiD)? 

      The reviewer raises an important point about the rapidly evolving landscape of automated behavioral analysis, where both supervised and unsupervised approaches offer complementary strengths for different experimental contexts. Unsupervised methods like Keypoint-MoSeq, VAME, and B-SOiD prioritize motif discovery from unlabeled data but may yield less precise alignments with expert annotations, as evidenced by lower F1 scores in comparative evaluations. Supervised approaches (like ours), by contrast, employ fully supervised classifiers to deliver frame-accurate, behavior-specific scores that align directly with experimental hypotheses. Ultimately, a pragmatic hybrid strategy, starting with unsupervised pilots to identify motifs and transitioning to supervised fine-tuning with minimal labels, can minimize annotation burdens and enhance both discovery and precision in ethological studies. This has been added to the Discussion section of the manuscript.

      (3) What kind of studies will this combination of open field + pose estimation + supervised classifier be suitable for? What kind of studies is it unsuited for? These are all relevant questions that potential users of this platform will be interested in.

      This approach is suitable for a wide array of neuroscience, genetics, pharmacology, preclinical, and ethology studies. We have published in the domains of action detection for complex behaviors such as grooming, gait and posture, frailty, nociception, and sleep. We feel these tools are indispensable for modern behavior analysis. 

      (4) Throughout the manuscript, I often find it unclear what is supported by the software/GUI and what is not. For example, does the GUI support uploading videos and running pose estimation, or does this need to be done separately? How many of the analyses in Figures 4-6 are accessible within the GUI?

      We have now clarified these points. The JABS framework comprises two distinct GUI applications with complementary functionalities. The JABS-AL (active learning) desktop application handles video upload, behavioral annotation, classifier training, and inference; it does not perform pose estimation, which must be completed separately using our pose tracking pipeline (https://github.com/KumarLabJax/mouse-tracking-runtime). For users who do not want to use our pose tracking pipeline, we provide utilities that convert SLEAP output to our JABS pose format. The web-based GUI enables classifier sharing and cloud-based inference on our curated datasets (JABS600, JABS1200), as well as downstream behavioral statistics and genetic analyses (Figures 4-6). The JABS-AL application also supports CLI (command line interface) operation for batch processing. We have clarified these distinctions and provided a comprehensive workflow diagram in the revised Methods section.

      (5) While the manuscript does a good job of laying out best practices, there is an opportunity to further improve reproducibility for users of the platform. The software seems likely to perform well with perfect setups that adhere to the JABS criteria, but it is very likely that there will be users with suboptimal setups - poorly constructed rigs, insufficient camera quality, etc. It is important, in these cases, to give users feedback at each stage of the pipeline so they can understand if they have succeeded or not. Quality control (QC) metrics should be computed for raw video data (is the video too dark/bright? are there the expected number of frames? etc.), pose estimation outputs (do the tracked points maintain a reasonable skeleton structure; do they actually move around the arena?), and classifier outputs (what is the incidence rate of 1-3 frame behaviors? a high value could indicate issues). In cases where QC metrics are difficult to define (they are basically always difficult to define), diagnostic figures showing snippets of raw data or simple summary statistics (heatmaps of mouse location in the open field) could be utilized to allow users to catch glaring errors before proceeding to the next stage of the pipeline, or to remove data from their analyses if they observe critical issues.

      These are excellent suggestions that align with our vision for improving user experience and data quality assessment. We recognize the critical importance of providing users with comprehensive feedback at each stage of the pipeline to ensure optimal performance across diverse experimental setups. Currently, we provide end-users with tools and recommendations to inspect their own data quality. In our released datasets (Strain Survey OFA and BXD OFA), we provide video-level quality summaries for coverage of our pose estimation models. 

      For behavior classification quality control, we employ two primary strategies to ensure proper operation: (a) outlier manual validation and (b) leveraging known characteristics about behaviors. For each behavior that we predict on datasets, we manually inspect the highest and lowest expressions of this behavior to ensure that the new dataset we applied it to maintains sufficient similarity. For specific behavior classifiers, we utilize known behavioral characteristics to identify potentially compromised predictions. As the reviewer suggested, high incidence rates of 1-3 frame bouts for behaviors that typically last multiple seconds would indicate performance issues.
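
      The short-bout check mentioned above can be sketched as follows. This is an illustrative example rather than the in-house script; the function names and the default threshold of 3 frames are assumptions chosen to mirror the 1-3 frame criterion.

```python
from itertools import groupby

def bout_lengths(frames):
    """Lengths of consecutive runs of positive (1) frame-level predictions."""
    return [sum(1 for _ in run) for value, run in groupby(frames) if value]

def short_bout_fraction(frames, max_len=3):
    """Fraction of predicted bouts lasting at most max_len frames."""
    lengths = bout_lengths(frames)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical per-frame predictions for one video.
frames = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
print(bout_lengths(frames))         # [1, 5, 2]
print(short_bout_fraction(frames))  # 2 of 3 bouts are 3 frames or shorter
```

      For a behavior that normally lasts multiple seconds, a high short-bout fraction on a new dataset would be a reason to inspect the predictions manually before using them.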

      We currently maintain in-house post-processing scripts that handle quality control according to our specific use cases. Future releases of JABS will incorporate generalized versions of these scripts, integrating comprehensive QC capabilities directly into the platform. This will provide users with automated feedback on video quality, pose estimation accuracy, and classifier performance, along with diagnostic visualizations such as movement heatmaps and behavioral summary statistics.

      Reviewer #1 (Recommendations for the authors):

      (1) A weakness of this tool is that it requires pose tracking, but the manuscript does not detail how pose tracking should be done and whether users should expect that the data deposited will help their pose tracking models. There is no specification on how to generate pose tracking that will be compatible with JABS. The classification quality is directly linked to the quality of the pose tracking. The authors should provide more details of the requirements of the pose tracking (skeleton used) and what pose tracking tools are compatible with JABS. In the user website link, I found no such information. Ideally, JABS would be integrated with the pose tracking tool into a single pipeline. If that is not possible, then the utility of this tool relies on more clarity on which pose tracking tools are compatible with JABS.

      The JABS ecosystem was deliberately designed with modularity in mind, separating the pose estimation pipeline from the active learning and classification app (JABS-AL) to offer greater flexibility and scalability for users working across diverse experimental setups. Our pose estimation pipeline is documented in detail within the new Methods subsection, outlining the steps to obtain JABS-compatible keypoints with our recommended runtime (https://github.com/KumarLabJax/mouse-tracking-runtime) and frozen inference models (https://github.com/KumarLabJax/deep-hrnet-mouse). This pipeline is an independent component within the broader JABS workflow, generating skeletonized keypoint data that are then fed into the JABS-AL application for behavior annotation and classifier training.

      By maintaining this separation, users have the option to use their preferred pose tracking tools, such as SLEAP, while ensuring compatibility through provided conversion utilities to the JABS skeleton format. These details, including usage instructions and compatibility guidance, are now thoroughly explained in the newly added pose estimation subsection of our Methods section. This modular design approach ensures that users benefit from best-in-class tracking while retaining the full power and reproducibility of our active learning pipeline.

      (2) The authors should justify why JAABA was chosen to benchmark their classifier. This tool was published in 2013, and there have been other classification tools (e.g., SIMBA) published since then.  

      We appreciate the reviewer’s suggestion regarding SIMBA. However, our comparisons to JAABA and a CNN are based on results from prior work (Geuther, Brian Q., et al. "Action detection using a neural network elucidates the genetics of mouse grooming behavior." Elife 10 (2021): e63207.), where both were used to benchmark performance on our publicly released dataset. In this study, we introduce JABS as a new approach and compare it against those established baselines. While SIMBA may indeed offer competitive performance, we believe the responsibility to demonstrate this lies with SIMBA’s authors, especially given the availability of our dataset for benchmarking.

      (3) I had a lot of trouble understanding the elements of the data calculated in JABS vs outside of JABS. This should be clarified in the manuscript.

      (a) For example, it was not intuitive that pose tracking was required and had to be done separately from the JABS pipeline. The diagrams and figures should more clearly indicate that.

      (b) In section 2.5, are any of those metrics calculated by JABS? Another software tool, GEMMA, is mentioned, but no citation is provided for it. This created ambiguity regarding whether this is an analysis that is separate from JABS or integrated into the pipeline.

      We acknowledge the confusion regarding the delineation between JABS components and external tools, and we have comprehensively addressed this throughout the manuscript. The JABS ecosystem consists of three integrated modules: JABS-DA (data acquisition), JABS-AL (active learning for behavior annotation and classifier training), and JABS-AI (analysis and integration via web application). Pose estimation, while developed by our laboratory, operates as a preprocessing pipeline that generates the keypoint coordinates required for subsequent JABS classifier training and annotation workflows. We have now added a dedicated Methods subsection that explicitly maps each analytical step to its corresponding software component, clearly distinguishing between core JABS modules and external tools (such as GEMMA for genetic analysis). Additionally, we have provided proper citations and code repositories for all external pipelines to ensure complete transparency regarding the computational workflow and enable full reproducibility of our analyses.

      (4) There needs to be clearer explanations of all metrics, methods, and transformations of the data reported.

      (a) There is very little information about the architecture of the classification model that JABS uses.

      (b) There are no details on the CNN used for comparing and benchmarking the classifier in JABS.

      (c) Unclear how the z-scoring of the behavioral data in Figure 7 was implemented.

      (d) There is currently no information on how the metrics in Figure 8 are calculated.

      We have added a comprehensive Methods section that not only addresses the specific concerns raised above but provides complete methodological transparency throughout our study. This expanded section includes detailed descriptions of all computational architectures (including the JABS classifier and grooming benchmark models and metrics), statistical procedures and data transformations (including the z-scoring methodology for Figure 7), downstream genetic analysis (including all measures presented in Figure 8), and preprocessing pipelines. 

      (5) The authors talk about their datasets having visual diversity, but without seeing examples, it is hard to know what they mean by this visual diversity. Ideally, the manuscript would have a supplementary figure with a representation of the variety of setups and visual diversity represented in the datasets used to train the model. This is important so that readers can quickly assess from reading the manuscript if the pre-trained classifier models could be used with the experimental data they have collected.

      The visual diversity of our training datasets has been comprehensively documented in our previous tracking work (https://www.nature.com/articles/s42003-019-0362-1), which systematically demonstrates tracking performance across mice with diverse coat colors (black, agouti, albino, gray, brown, nude, piebald), body sizes including obese mice, and challenging recording conditions with dynamic lighting and complex environments. Notably, Figure 3B in that publication specifically illustrates the robustness across coat colors and body shapes that characterize the visual diversity in our current classifier training data. To address the reviewer's concern and enable readers to quickly assess the applicability of our pre-trained models to their experimental data, we have now added this reference to the manuscript to ground our claims of visual diversity in published evidence.

      (6) All figures have a lot of acronyms used that are not defined in the figure legend. This makes the figures really hard to follow. The figure legends for Figures 1,2, 7, and 9 did not have sufficient information for me to comprehend the figure shown.

      We have fixed this in the manuscript. 

      (7) In the introduction, the authors talk about compression artifacts that can be introduced in camera software defaults. This is very vague without specific examples.

      This is a complex topic that balances the size and quality of video data and is beyond the scope of this paper. We have carefully optimized this parameter and given the user a balanced solution. A more detailed blog post on compression artifacts can be found at our lab’s webpage (https://www.kumarlab.org/2018/11/06/brians-video-compression-tests/). We have also added a comment about keyframes shifting temporal features in the main manuscript. 

      (8) More visuals of the inside of the apparatus should be included as supplementary figures. For example, to see the IR LEDs surrounding the camera.

      We have shared data from JABS as part of several papers, including the tracking paper (Geuther et al., 2019) and papers on grooming, gait and posture, and mouse mass. We have also released entire datasets as part of this paper (JABS1800, JABS-BXD). We also have a step-by-step assembly guide that shows the location of the lights, cameras, and other parts (see Methods, the JABS workflow guide, and this PowerPoint file in the GitHub repository: https://github.com/KumarLabJax/JABS-datapipeline/blob/main/Multi-day%20setup%20PowerPoint%20V3.pptx).

      (9) Figure 2 suggests that you could have multiple data acquisition systems simultaneously. Do each require a separate computer? And then these are not synchronized data across all boxes?

      Each JABS-DA unit has its own edge device (Nvidia Jetson). Each system (which we define as multiple JABS-DA areas associated with one lab/group) can have multiple recording devices (arenas). The system requires only 1 control portal (RPi computer) and can handle as many recording devices as needed (Nvidia computer w/ camera associated with each JABS-DA arena). To collect data, 1 additional computer is needed to visit the web control portal and initiate a recording session. Since this is a web portal, users can use any computer or a tablet. The recording devices are not strictly synchronized but can be controlled in a unified manner.

      (10) The list of parts on GitHub seems incomplete; many part names are not there.

      We thank the referee for bringing this to our attention. We have updated the GitHub repository (and its README), which now links out to the design files.

      (11) The authors should consider adding guidance on how tethers and headstages are expected to impact the use of JABS, as many labs would be doing behavioral experiments combined with brain measurements.

      While our pose estimation model was not specifically trained on tethered animals, published research demonstrates that keypoint detection models maintain robust performance despite the presence of headstages and recording equipment. Once accurate pose coordinates are extracted, the downstream behavior classification pipeline operates independently of the pose estimation method and would remain fully functional. We recommend users validate pose estimation accuracy in their specific experimental setup, as the behavior classification component itself is agnostic to the source of pose coordinates.

      Reviewer #2 (Recommendations for the authors):

      (1) "Using software-defaults will introduce compression artifacts into the video and will affect algorithm performance." Can this be quantified? I imagine most of the performance hit comes from a decrease in pose estimation quality. How does a decrease in pose estimation quality translate to action segmentation? Providing guidelines to potential users (e.g., showing plots of video compression vs classifier performance) would provide valuable information for anyone looking to use this system (and could save many labs countless hours replicating this experiment themselves). A relevant reference for the effect of compression on pose estimation is Mathis, Warren 2018 (bioRxiv): On the inference speed and video-compression robustness of DeepLabCut.

      Since our behavior classification approach depends on features derived from keypoints, changes in keypoint accuracy will affect behavior segmentation accuracy. We agree that it is important to try to understand this further, particularly in light of the shared bioRxiv paper investigating the effect of compression on pose estimation accuracy. Measuring the effect of compression on keypoint and behavior classification is a complex task to evaluate concisely, given the number of potential variables to inspect. A few variables that should be investigated are: discrete cosine transform quality (Mathis, Warren experiment), frame size (Mathis, Warren experiment), keyframe interval (new, unique to video data), inter-frame settings (new, unique to video data), behavior of interest, pose models with compression augmentation used in training (https://arxiv.org/pdf/1506.08316?), and type of CNN used (under active development). The simplest recommendation that we can make at this time is that compression will affect behavior predictions and that users should be cautious about using our shared classifiers on compressed video data. To show that we are dedicated to sharing these results as we run these experiments, we have already begun to inspect how changing some factors affects behavior segmentation performance in a related work (accepted at the CV4Animals conference, https://www.cv4animals.com/; it can be downloaded here: https://drive.google.com/file/d/1UNQIgCUOqXQh3vcJbM4QuQrq02HudBLD/view). In that work, we investigate the robustness of behavior classification across multiple behaviors using different keypoint subsets. Our finding is that classifiers are relatively stable across different keypoint subsets. We are actively working on a follow-up effort to investigate the effect of keypoint noise, CNN model architecture, and the other factors listed above on behavior segmentation tasks.

      (2) The analysis of inter-annotator variability is very interesting. I'm curious how these differences compare to two other types of variability:

      (a) intra-annotator variability; I think this is actually hard to quantify with the presented annotation workflow. If a given annotator re-annotated a set of videos, but using different sparse subsets of the data, it is not possible to disentangle annotator variability versus the effect of training models on different subsets of data. This can only be rigorously quantified if all frames are labeled in each video.

      We propose an alternative approach to behavior classifier development in the text associated with Figure 3C. We do not advocate for high inter-annotator agreement, since individual behavior experts have differing labeling styles (an intuitive understanding of the behavior). Rather, we allow multiple classifiers for the same behavior and allow the end user to prioritize classifiers based on the heritability of the behavior as measured by each classifier.

      (b) In lieu of this, I'd be curious to see the variability in model outputs trained on data from a single annotator, but using different random seeds or train/val splits of the data. This analysis would provide useful null distributions for each annotator and allow for more rigorous statistical arguments about inter-annotator variability. 

      JABS allows the user to use multiple classifiers (random forest, XGBoost). We do not expect the user to carry out hyperparameter tuning or other forms of optimization. We find that the major increase in performance comes from optimizing the size of the window features and folds of cross validation. However, future versions of JABS-AL could enable a complete hyper-parameter scan across seeds and data splits to obtain a null distribution for each annotator. 

      (c) I appreciate the open-sourcing of the video/pose datasets. The authors might also consider publicly releasing their pose estimation and classifier training datasets (i.e., data plus annotations) for use by method developers.

      We thank the referee for acknowledging our commitment to open data sharing practices. Building upon our previously released strain survey dataset, we have now also made our complete classifier training resources publicly available, including the experimental videos, extracted pose coordinates, and behavioral annotations. The repository link has been added to the manuscript to ensure full reproducibility and facilitate community adoption of our methods.  

      (3) More thorough discussion on the limitations of the top-down vs bottom-up camera viewpoint; are there particular scientific questions that are much better suited to bottom-up videos (e.g., questions about paw tremors, etc.).

      Top-down, bottom-up, and multi-view imaging each have a variety of pros and cons. Generally speaking, multi-view imaging will provide the most accurate pose models but requires increased resources for both the hardware setup and the processing of data. Top-down imaging provides the advantage of flexibility in materials, since the floor doesn’t need to be transparent. Additionally, lighting and potential reflections are concerns with the bottom-up perspective. Since the paws are not occluded from the bottom-up perspective, models should have improved paw keypoint precision, allowing the model to observe more subtle behaviors. However, the appearance of the arena floor will change over time as the mice defecate and urinate. Care must be taken to clean the arena between recordings to ensure transparency is maintained. This doesn’t impact top-down imaging much but will occlude or distort the view from the bottom-up perspective. Additionally, the inclusion of bedding for longer recordings, which is required by IACUC, will essentially render bottom-up imaging useless because the bedding will completely obscure the mouse. Overall, while bottom-up imaging may provide a precision benefit that will greatly enhance the detection of subtle motion, top-down imaging is more robust for obtaining consistent imaging across large experiments for longer periods of time.

      (4) More thorough discussion on what kind of experiments would warrant higher spatial or temporal resolution (e.g., investigating slight tremors in a mouse model of neurodegenerative disease might require this greater resolution).

      This is an important topic that deserves its own perspective guide. We try to capture some of this in the paper on specifications. However, we only scratch the surface. Overall, there are tradeoffs between frame rate, resolution, color/monochrome, and compression. Labs have collected data at hundreds of frames per second to capture the kinetics of reflexive behavior for pain (Abdus-Saboor lab) or whisking behavior. Labs have also collected data at a low 2.5 frames per second for tracking activity or centroid tracking (see Kumar et al. PNAS). The data collection specifications are largely dependent on the behaviors being captured. Our rule of thumb is the Nyquist limit, which states that the data capture rate needs to be at least twice the frequency of the event. For example, certain syntaxes of grooming occur at 7 Hz, so we need at least 14 FPS to capture this data. JABS collects data at 30 FPS, which is a good compromise between data load and behavior rate. We use 800x800 pixel resolution, which is a good compromise to capture animal body parts while limiting data size. Thank you for providing the feedback that the field needs guidance on this topic. We will work on creating such guidance documents for video data acquisition parameters to capture animal behavior data for the community as a separate publication.
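
      The Nyquist rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names `min_frame_rate` and `max_resolvable_hz` are hypothetical and are not part of JABS, and the factor of 2 is the theoretical minimum rather than a recommended acquisition setting.

```python
import math


def min_frame_rate(event_hz: float, samples_per_cycle: float = 2.0) -> int:
    """Smallest whole frame rate (FPS) that samples a periodic event at least
    `samples_per_cycle` times per cycle (2.0 is the Nyquist minimum).

    Hypothetical helper for illustration only.
    """
    if event_hz <= 0:
        raise ValueError("event_hz must be positive")
    if samples_per_cycle < 2.0:
        raise ValueError("fewer than 2 samples per cycle violates the Nyquist limit")
    return math.ceil(event_hz * samples_per_cycle)


def max_resolvable_hz(fps: float) -> float:
    """Highest event frequency (Hz) that a given frame rate can resolve."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return fps / 2.0


if __name__ == "__main__":
    print(min_frame_rate(7.0))      # grooming syntax at 7 Hz -> 14
    print(max_resolvable_hz(30.0))  # 30 FPS acquisition -> 15.0
```

      Under these assumptions, a 30 FPS recording resolves periodic events up to 15 Hz, which leaves headroom above the 7 Hz grooming example.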

      (5) References 

      (a) Should add the following ref when JAABA/MARS are referenced: Goodwin et al. 2024, Nat Neuro (SimBA)

      (b) Could also add Bohnslav et al. 2021, eLife (DeepEthogram).

      (c) The SuperAnimal DLC paper (Ye et al. 2024, Nature Comms) is relevant to the introduction/discussion as well.

      We thank the referee for the suggestions. We have added these references.  

      (6) Section 2.2:

      While I appreciate the thoroughness with which the authors investigated environmental differences in the JABS arena vs standard wean cage, this section is quite long and eventually distracted me from the overall flow of the exposition; might be worth considering putting some of the more technical details in the methods/appendix.

      These are important data for adopters of JABS to gain IACUC approval in their home institution. These committees require evidence that any new animal housing environment has been shown to be safe for the animals. In the development of JABS, we spent a significant amount of time addressing the JAX veterinary and IACUC concerns. Therefore, we propose that these data deserve to be in the main text. 

      (7) Section 2.3.1:

      (a) Should again add the DeepEthogram reference here

      (b) Should reference some pose estimation papers: DeepLabCut, SLEAP, Lightning Pose. 

      We thank the referee for the suggestions. We have added these references.  

      (c) "Pose based approach offers the flexibility to use the identified poses for training classifiers for multiple behaviors" - I'm not sure I understand why this wouldn't be possible with the pixel-based approach. Is the concern about the speed of model training? If so, please make this clearer.

      The advantage lies not just in training speed, but in the transferability and generalization of the learned representations. Pose-based approaches create structured, low-dimensional latent embeddings that capture behaviorally relevant features which can be readily repurposed across different behavioral classification tasks, whereas pixel-based methods require retraining the entire feature extraction pipeline for each new behavior. Recent work demonstrates that pose-based models achieve greater data efficiency when fine-tuned for new tasks compared to pixel-based transfer learning approaches [1], and latent behavioral representations can be partitioned into interpretable subspaces that generalize across different experimental contexts [2]. While pixel-based approaches can achieve higher accuracy on specific tasks, they suffer from the "curse of dimensionality" (requiring thousands of pixels vs. 12 pose coordinates per frame) and lack the semantic structure that makes pose-based features inherently reusable for downstream behavioral analysis.

      (1) Ye, Shaokai, et al. "SuperAnimal pretrained pose estimation models for behavioral analysis." Nature communications 15.1 (2024): 5165.

      (2) Whiteway, Matthew R., et al. "Partitioning variability in animal behavioral videos using semi-supervised variational autoencoders." PLoS computational biology 17.9 (2021): e1009439.  

      (d) The pose estimation portion of the pipeline needs more detail. Do users use a pretrained network, or do they need to label their own frames and train their own pose estimator? If the former, does that pre-trained network ship with the software? Is it easy to run inference on new videos from a GUI or scripts? How accurate is it in compliant setups built outside of JAX? How long does it take to process videos?

      We have added the guidance on pose estimation in the manuscript (section “2.3.1 Behavior annotation and classifier training” and in the methods section titled “Pose tracking pipeline”)

      (e) The final paragraph describing how to arrive at an optimal classifier is a bit confusing - is this the process that is facilitated by the app, or is this merely a recommendation for best practices? If this is the process the app requires, is it indeed true that multiple annotators are required? While obviously good practice, I imagine there will be many labs that just want a single person to annotate, at least in the beginning prototyping stages. Will the app allow training a model with just a single annotator?

      We have clarified this in the text. 

      (8) Section 2.5:

      (a) This section contained a lot of technical details that I found confusing/opaque, and didn't add much to my overall understanding of the system; sec 2.6 did a good job of clarifying why 2.5 is important. It might be worth motivating 2.5 by including the content of 2.6 first, and moving some of the details of 2.5 to the method/appendix.

      We moved some of the technical details in section 2.5 to the methods section titled “Genetic analysis”. Furthermore, we have added a few statements to motivate the need for genetic analysis and to explain how the webapp can facilitate this (which is introduced in section 2.6).

      (9) Minor corrections:

      (a) Bottom of first page, "always been behavior quantification task" missing "a".

      (b) "Type" column in Table S2 is undocumented and unused (i.e., all values are the same); consider removing.

      (c) Figure 4B, x-axis: add units.

      (d) Page 8/9: all panel references to Figure S1 are off by one

      We have fixed them in the updated manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Circannual timing is a phylogenetically widespread phenomenon in long-lived organisms and is central to the seasonal regulation of reproduction, hibernation, migration, fur color changes, body weight, and fat deposition in response to photoperiodic changes. Photoperiodic control of thyroid hormone T3 levels in the hypothalamus dictates this timing. However, the mechanisms that regulate these changes are not fully understood. The study by Stewart et al. reports that hypothalamic iodothyronine deiodinase 3 (Dio3), the major inactivator of the biologically active thyroid hormone T3, plays a critical role in circannual timing in the Djungarian hamster. Overall, the study yields important results for the field and is well-conducted, with the exception of the CRISPR/Cas9 manipulation.

      We appreciate the positive and supportive comment from the Reviewer. We have clarified the oversight in the CRISPR/Cas9 data representation below. Our correction should alleviate the concerns raised.

      Figure 1 lays the foundation for examining circannual timing by establishing the timing of induction, maintenance, and recovery phases of the circannual timer upon exposure of hamsters to short photoperiod (SP) by monitoring morphological and physiological markers. Measures of pelage color, torpor, body mass, plasma glucose, etc, established that the initiation phase occurred by weeks 4-8 in SP, the maintenance by weeks 12-20, and the recovery after week 20, where all morphological and physiological changes started to reverse back to long photoperiod phenotypes.

      The statistical analyses look fine, and the results are unambiguous.

      We thank the Reviewer for recognizing our attempts to highlight the phenomenon of circannual interval timing.

      Their representation could, however, be improved. In Figures 1d and 1e, two different measures are plotted on each graph and differentiated by dots and upward or downward arrowheads. The plots are so small, though, that distinguishing between the direction of the arrows is difficult. Some color coding would make it more reader-friendly. The same comment applies to Figure S4. 

      We have increased the panel size for Figure 1d and 1e. We have also changed the colour of the graphs in Figure 1d and 1e to facilitate the differentiation of the two dependent variables. For the circos plots, we attempted different ways to represent the data. We have opted to keep the figures in their current state. The overall aim is to provide a ‘gestalt’ view of the timing of changes in transcript expression and to highlight only a few key genes. The whole dataset is provided in the supplementary materials for Reviewer/Reader interrogation.

      The authors went on to profile the transcriptome of the mediobasal and dorsomedial hypothalamus, paraventricular nucleus, and pituitary gland (all known to be involved in seasonal timing) every 4 weeks over the different phases of the circannual interval timer. A number of transcripts displaying seasonal rhythms in expression levels in each of the investigated structures were identified, including transcripts whose expression peaks during each phase. This included two genes of particular interest due to their known modulation of expression in response to photoperiod, Dio3 and Sst, found among the transcripts upregulated during the induction and maintenance phases, respectively. The experiments are technically sound and properly analyzed, revealing interesting candidates. Again, my main issues lie with the representation in the figure. In particular, the authors should clarify what the heatmaps on the right of Figures 1f and 1g represent. I suspect they are simply heatmaps of averaged expression of all genes within a defined category, but a description is missing in the legend, as well as a scale for color coding near the figure.

      We have clarified the heatmap and density maps in the Figure legend. We apologise for the lack of information to describe the figure panels. (see lines 644-648)

      Figure 2 reveals that SP-programmed body mass loss is correlated to increased Dio3-dependent somatostatin (Sst) expression. First, to distinguish whether the body mass loss was controlled by rheostatic mechanisms and not just acute homeostatic changes in energy balance, hamsters fed ad lib or experiencing an acute food restriction in both LP and SP were tested. Unlike plasma insulin, food restriction had no additional effect on SP-driven epididymal fat mass loss (Figure S7). This clearly establishes a rheostatic control of body mass loss across weeks in SP conditions. Importantly, Sst expression in the mediobasal hypothalamus increased in both ad lib fed or restriction fed SP hamsters and this increase in expression could be reduced by a single subcutaneous injection of active T3, clearly suggesting that increase in Sst expression in SP is due to a decrease of active T3 likely via Dio3 increase in expression in the hypothalamus. The results are unambiguous.

      We thank the Reviewer for the supportive and affirmative feedback.

      Figure 3 provides a functional test of Dio3's role in the circannual timer. Mediobasal hypothalamic injections of CRISPR-Cas9 lentiviral vectors expressing two guide RNAs targeting the hamster Dio3 led to a significant reduction in the interval between induction and recovery phases seen in SP as measured by body mass, and diminished the extent of pelage color change by weeks 15-20. In addition, hamsters that failed to respond to SP exposure by decreasing their body mass also had undetectable Dio3 expression in the mediobasal hypothalamus. Together, these data provide strong evidence that Dio3 functions in the circannual timer. I noted, however, a few problems in the way the CRISPR modification of Dio3 in the mediobasal hypothalamus was reported in Figure S8. One is in Figure S8b, where the PAM sites are reported to be 9bp and 11bp downstream of sgRNA1 and sgRNA2, respectively. Is this really the case? If so, I would have expected the experiment to fail to show any effect as PAM sites need to immediately follow the target genomic sequence recognized by the sgRNA for Cas9 to induce a DNA double-stranded break. It seems that each guide contains a 3' NGG sequence that is currently underlined as part of sgRNAs in both Fig S8b and in the method section. If this is not a mistake in reporting the experimental design, I believe that the design is less than optimal and the efficiencies of sgRNAs are rather low, if at all functional.

      We apologize for the oversight; the reporting in Figure S8b was indeed a mistake. The PAM site previously indicated was the ‘secondary PAM site’ (which, as the Reviewer notes, would likely have low efficiency). The primary PAM site is described within the gRNA in the figure. We use Adobe Illustrator to generate figures, and during the editing process, the layer for the PAM text was accidentally moved ‘back’ to a lower level. The oversight was not rectified before submission. We apologise for this unreservedly. The PAM site text has been moved forward to highlight the location of the primary site (i.e., immediately following the gRNA), and we have labelled the gRNA and PAM site in the ‘Target region’. The secondary PAM site text was removed to eliminate any confusion.
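
      The adjacency requirement at issue (an NGG PAM immediately 3' of the protospacer) can be checked programmatically. The sketch below is illustrative only: the sequences are made up and are not the Dio3 guides or target region from the study, the function name `pam_follows_protospacer` is hypothetical, and only the forward strand is searched for simplicity.

```python
import re


def pam_follows_protospacer(target: str, protospacer: str) -> bool:
    """Return True if `protospacer` occurs in `target` (forward strand only)
    and is immediately followed by an NGG PAM.

    Hypothetical helper for illustration; it does not score guide efficiency.
    """
    target = target.upper()
    protospacer = protospacer.upper()
    start = target.find(protospacer)
    while start != -1:
        end = start + len(protospacer)
        pam = target[end:end + 3]
        if re.fullmatch(r"[ACGT]GG", pam):
            return True
        start = target.find(protospacer, start + 1)
    return False


if __name__ == "__main__":
    # Made-up 20-nt guide, NOT one of the sgRNAs used in the manuscript.
    guide = "GATTACAGATTACAGATTAC"
    adjacent = "TTT" + guide + "AGG" + "CCC"            # PAM directly after guide
    displaced = "TTT" + guide + "ACTGACTGA" + "AGG"     # PAM 9 bp downstream
    print(pam_follows_protospacer(adjacent, guide))
    print(pam_follows_protospacer(displaced, guide))
```

      In this toy example only the first target satisfies the check, which mirrors the Reviewer's point that a PAM displaced by 9 bp or 11 bp would not be expected to support cleavage.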

      The authors report efficiencies around 60% (line 325), but how these were obtained is not specified. 

      The efficiencies provided were based on bioinformatic analyses and not in vivo assays. To reduce any confusion, we have removed the text. The gRNAs were clearly effective at inducing mutations, based on the sequencing analyses.

      Another unclear point is the degree to which the mediobasal hypothalamus was actually mutated. Only one mutated (truncated) sequence in Figure S8c is reported, but I would have expected a range of mutations in different cells of the tissue of interest.

      The tissue punch would include multiple different cells (e.g., neuronal, glial, etc). We agree with the Reviewer that genomic samples from different cells would be included in the sequencing analyses. Given the large mutation in the target region, the gRNA was effective. We have only shown one representative sequence. If the Reviewer would like to see all mutations, we can easily show the other samples.

      Although the authors clearly find a phenotypic effect with their CRISPR manipulation, I suspect that they may have uncovered greater effects with better sgRNA design. These points need some clarification. I would also argue that repeating this experiment with properly designed sgRNAs would provide much stronger support for causally linking Dio3 in circannual timing.

      The gRNA was designed using the gold-standard approach, CHOPCHOP [Labun et al., 2019]. If the Reviewer’s concern regarding the design stems from the comment above regarding the PAM site, that issue has been clarified and there are no concerns for the gRNA design. The major challenge is that the Dio3 gene (single exon) has a very short sequence length (approx. 412 bp), so there is limited scope within this sequence to generate gRNAs.

      A proposed schematic model for mechanisms of circannual interval timing is presented in Figure S9. I think this represents a nice summary of the findings put in a broader context and should be presented as a main figure in the manuscript itself rather than being relayed in supplementary materials.

      We agree with the Reviewer’s position and have moved the figure to the main manuscript. The figure is now Figure 4.

      Reviewer #2 (Public review):

      Several animals and plants adjust their physiology and behavior to seasons. These changes are timed to precede the seasonal transitions, maximizing chances of survival and reproduction. The molecular mechanisms used for this process are still unclear. Studies in mammals and birds have shown that the expression of deiodinase type-1, 2, and 3 (Dio1, 2, 3) in the hypothalamus spikes right before the transition to winter phenotypes. Yet, whether this change is required or an unrelated product of the seasonal changes has not been shown, particularly because of the genetic intractability of the animal models used to study seasonality. Here, the authors show for the first time a direct link between Dio3 expression and the modulation of circannual rhythms.

      We appreciate the clear synthesis and support for the manuscript.

      Strengths:

      The work is concise and presents the data in a clear manner. The data is, for the most part, solid and supports the author's main claims. The use of CRISPR is a clear advancement in the field. This is, to my knowledge, the first study showing a clear (i.e., causal) role of Dio3 in the circannual rhythms in mammals. Having established a clear component of the circannual timing and a clean approach to address causality, this study could serve as a blueprint to decipher other components of the timing mechanism. It could also help to enlighten the elusive nature of the upstream regulators, in particular, on how the integration of day length takes place, maybe within the components in the Pars tuberalis, and the regulation of tanycytes.

      We thank the Reviewer for this positive summary.

      Weaknesses:

      Due to the nature of the CRISPR manipulation, the low N number is a clear weakness. This is compensated by the fact that the phenotypes shown here are strong enough. Also, this is the only causal evidence of Dio3's role; thus, additional evidence would have significantly strengthened the author's claims. The use of the non-responsive population of hamsters also helps, but it falls within the realm of correlations.

      We would also like to remind the Reviewer that one CRISPR-Cas9 Dio3<sup>cc</sup> treated hamster did not show any mutation in the genome. This hamster was observed to have a change in body mass and pelage colour like that of controls. This animal provides an additional control.

      We also conducted a statistical power analysis to examine whether n=3 is sufficient for the Dio3<sup>cc</sup> treatment group. Using the appropriate expected differences in means and standard deviations, and an alpha of 0.05, we regularly observed power (1 − β) > 0.8 across the dependent variables.

      Additionally, the consequences of the mutations generated by CRISPR are not detailed; it is not clear if the mutations affect the expression of Dio3 or generate a truncation or deletion, resulting in a shorter protein.

      We agree with the Reviewer that transcript and protein assays would strengthen the genome mutation data. Due to the small brain region under investigation, we are limited in the amount of biological material to extract. Dio3 is an intronless gene and very short – approximately 412 base pairs in length. We opted to maximize resources into sequencing the gene as the confirmation of genetic mutation is paramount. Given the large size of the mutation in the treated hamsters, there would be no amplification of transcript or protein translated.

      Reviewer #3 (Public review):

      The authors investigated SP-induced physiological and molecular changes in Djungarian hamsters and the endogenous recovery from it after circa half a year. The study aimed to elucidate the intrinsic mechanism and included nice experiments to distinguish between rheostatic effects on energy state and homeostatic cues driven by an interval timer. It also aimed to elucidate the role of Dio3 by introducing a targeted mutation in the MBH by ICV. The experiments and analyses are sound, and the amount of work is impressive. The impact of this study on the field of seasonal chronobiology is probably high.

      We thank the Reviewer for their positive comments and support for our work.

      Even though the general conclusions are well-founded, I have fundamental criticism concerning 3 points, which I recommend revising:

      (1) The authors talk about a circannual interval timer, but this is no circannual timer. This is a circasemiannual timer. It is important that the authors use precise wording throughout the manuscript.

      We agree with the Reviewer that the change in physiology and behaviour does not approximate a full year (i.e., annual) but only half of the year. We opted to use circannual timer as this term is established in the field (see doi: 10.1177/0748730404266626; doi: 10.1098/rstb.2007.2143). We cannot identify any publication that has used the term ‘semiannual timer’. We do not feel this manuscript is the appropriate place to introduce a new term to the field; we will endeavour to push the field to consider the use of ‘semiannual timer’. A Review or Opinion paper is the best place for this discussion. We hope the Reviewer will understand our position.

      (2) The authors put their results in the context of clocks. For example, line 180/181 seasonal clock. But they have described and investigated an interval timer. A clock must be able to complete a full cycle endogenously (and ideally repeatedly) and not only half of it. In contrast, a timer steers a duration. Thus, it is well possible that a circannual clock mechanism and this circa-semiannual timer of photoperiodic species are 2 completely different mechanisms. The argumentation should be changed accordingly.

      We agree with the Reviewer’s definitions of circannual ‘clock’ and ‘timer’. We were careful to distinguish between the two concepts early in the manuscript (lines 41-46). We have added italics to emphasise the different terms. The use of seasonal clock on line 180/181 was imprecise; we appreciate the Reviewer highlighting our oversight, and the text was revised. We have also revised the Abstract accordingly.

      (3) The authors chose as animal model the Djungarian hamster, which is a predominantly photoperiodic species and not a circannual species. A photoperiodic species has no circannual clock. That is another reason why it is difficult to draw conclusions from the experiment for circannual clocks. However, the Djungarian hamster is kind of "indifferent" concerning its seasonal timing, since a small fraction of them are indeed able to cycle (Anchordoquy HC, Lynch GR (2000), Evidence of an annual rhythm in a small proportion of Siberian hamsters exposed to chronic short days. J Biol Rhythms 15:122-125.). Nevertheless, the proportion is too small to suggest that the findings in the current study might reflect part of the circannual timing. Therefore, the authors should make a clear distinction between timers and clocks, as well as between circa-annual and circa-semiannual durations/periods.

      This comment is not clear to us. The Reviewer states that the hamsters are not a circannual species, but then highlights one study that shows circannual rhythmicity. We agree that circannual rhythmicity in Djungarian hamsters is dependent on the physiological process under investigation (e.g. body mass versus reproduction) and that photoperiodic response systems either dampen or mask robust cycles. We have corrected the text oversight highlighted above, and the manuscript is focused on interval timers. We have kept the term circannual over semicircannual due to the prior use in the scientific literature.

      Reviewing Editor Comments:

      The detailed suggestions of the reviewers are outlined below (or above in the case of Reviewer 1). In light of the criticism, we ask the authors to pay particular attention to the comments on the CRISPR/Cas9 experiment raised by Reviewers 1 and 2. As currently described, there are serious questions about the design of the sgRNAs, and critical methodological details are missing. If the latter are diligently taken care of, they may resolve the questions on the sgRNA design. Please also reconsider the wording along the lines suggested by Reviewer 3.

      We appreciate the Editor’s time and support for the manuscript. We have clarified and corrected our oversight for the PAM site. This correction confirms the strength of the CRISPR-Cas9 gRNA used in the study. The correction should remove all concerns. We have also considered using semicircannual in the text. As there is existing scientific literature using circannual interval timer, and there is no publication to our knowledge using ‘semicircannual’, we have opted to keep the current approach and use circannual. We feel a subsequent Opinion paper is more suitable to introduce a new term.

      Reviewer #2 (Recommendations for the authors):

      First, I want to commend the authors for their work. It is a clear advancement for our field. Below are a couple of comments and suggestions I have:

      We thank the Reviewer for the positive comment and support. We have endeavoured to incorporate their suggested improvements to the manuscript.

      (1) Looking at the results of Figure 1A and Figure S8, the control in S8 showed a lower pelage color score as compared to the hamsters in 1A. Is this a byproduct of the ICV injection?

      The difference between Figure 1 and 3 is likely due to the smaller sample sizes. The controls in Figure 1 had a higher proportion of hamsters showing complete white fur (score = 3) at 16-18 weeks compared to controls in Figure 3. It is possible, although unlikely, that the ICV injection would reduce the development of the winter phenotype. There was no substance in the ICV injection that would impact the prolactin signalling pathway. Our perspective is that the difference between the two figures is due to the different sampling populations. Overall, the timing of the change in pelage colour is the same between the figures and suggests that the mechanisms of the interval timer were unaffected.

      (2) Is there a particular reason why the pelage color for the CRISPR mutants is relegated to the supplemental information? In my opinion, this is also important, even though the results might be difficult to explain. Additionally, did the authors check for food intake and adipose mass in these animals?

      We agree with the Reviewer that the pelage change is very interesting. We decided to have Figure 3 focus on body mass. The rationale was the robust nature of the data collection from the CRISPR-Cas9 study (Fig.3b), in addition to the non-responsive hamsters (Fig.3e). We disagree that the data patterns are hard to explain, as the pelage change was similar to the photoperiod-induced change in body mass. No differences were observed for food intake or adipose tissue. We have added this information in the text (see lines 162-163).

      (3) I might have missed it, but did the authors check for the expression of Dio3 on the CRISPR mutants? Does the deletion cause reduced expression or any other mRNA effect, such as those resulting in the truncation of a protein?

      Due to the limited biological material extracted from the anatomical punches, we decided to focus on genomic mutations. Dio3 has a very short sequence length and the size of the mutations identified indicate that no RNA could be transcribed.

      (4) Could the authors clarify which reference genome or partial CDS (i.e., accession numbers) they used to align the gRNA? Did they use the SSSS strain or the Psun_Stras_1 isolate?

      The gRNAs were designed using the online tool CHOPCHOP, using the Mus musculus Dio3 gene. The generated gRNAs were subsequently aligned via BLAST with the Phodopus sungorus Dio3 partial cds (GenBank: MF662622.1), to ensure alignment with the species. We are confident that the gRNAs designed align 100% in hamsters. Furthermore, we conducted BLAST to ensure there were no off-targets. The only gene identified in the BLAST was the rodent (i.e. hamster, mouse) Dio3 sequence.

      (5) Figure 3b. I do agree with the authors in pointing out that the decrease in body mass is occurring earlier in Dio3wt hamsters; however, the shape of the body mass dynamic is also different. Do the authors have any comments on the possible role of Dio3 in the process of exit from overwintering?

      This is a very interesting question. We do not have the data to evaluate the role of Dio3 for overwintering. We argue that disruption in Dio3 reduced the circannual interval period. For this interpretation, yes, Dio3 is necessary for overwintering. However, we would need to show the sufficiency of Dio3 to induce the winter phenotype in hamsters housed in long photoperiod. At this time, we do not have the technical ability to conduct this experiment.

      (6) In Figure 3d, the Dio3wt group does not show any dispersion. Is this correct? If that's true, and no dispersion is observed, no normality can be assumed, and a t-test can't be performed (Line 692). The Mann-Whitney test might be better suited.

      We conducted a Welch’s t-test to compare the difference in body mass period. We used the Welch’s test as the variances were not equal; the Mann-Whitney test is best for skewed distributions. To clarify the test used, we have added ‘Welch’s test’ to the Figure legend.
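For readers unfamiliar with the test named in this response, the following is a minimal sketch of Welch's unequal-variance t statistic and its Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the two sample lists are hypothetical illustrations, not data or code from the study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error contributed by each group (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical values chosen only to exercise the function
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The statistic remains defined when one group has zero variance, as long as the other group varies; it is undefined only when both groups show no dispersion.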

      (9) Figure 1 h. It might be convenient to add the words "Induction", "maintenance", and "recovery" over each respective line on the polar graph for easier reading.

      We have added the text as suggested by the Reviewer.

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 1: Please enlarge all partial graphics at least to the size of Figure 2. In the print version, labels are barely readable

      We have increased the panels in Figures 1 and 3 by 20% to accommodate the Reviewer's suggestion.

      (2) Legend Figure 2: Add that the food restriction was 16h.

      We have added 16h to the text.

      (3) Figure 3b: enlarge font size. In the legend: Dio3cc hamsters delayed.... The delay might have been a week or so, but not more (and even that is unclear since the rise in body mass in that week seems to be rather a disturbance of the curve). Thus 'delay' might not be the most appropriate wording. Instead, the initial decline is slower, but both started at nearly the same week (=> no delay). Minimum body mass is reached at the identical week as in wt (=> no delay). Also, the increase started at the same week but was much faster in Dio3cc than in wt. Figure 3c: How can there be a period when there is no repeated cycle (rhythm)? This is rather a duration. Moreover, according to the displayed data, I am wondering which start point and which endpoint is used. The first and last values are the highest of the graph, but have they been the maximum? Especially for Dio3wt, it can be assumed that animals haven't reached the maximum at the end of the graph.

      We have increased the font size in Figure 3b. We have changed ‘delayed’ to ‘slower’ in the text. Period analyses, such as the Lomb-Scargle periodogram, measure the duration of a cycle (and multiple cycles). The start point and end point used in the analyses were the initial data collection date (week 0) and the final data collection date (week 32). The Lomb-Scargle analysis determines the duration of the period that occurs within these phases of the cycle. We believe the period analysis provided by the Lomb-Scargle method is the most suitable for the scientific question.
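To make the period analysis concrete, here is a minimal sketch of the classical Lomb-Scargle periodogram applied to a synthetic weekly trace. Everything in it is an assumption for illustration: the trace is a noiseless cosine with a 16-week period sampled from week 0 to week 32, and the names `lomb_scargle_power`, `mass`, and `trial_periods` are hypothetical. It is not the authors' analysis pipeline.

```python
import math


def lomb_scargle_power(t, y, period):
    """Classical Lomb-Scargle power at one trial period (same units as t)."""
    w = 2.0 * math.pi / period
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]  # mean-subtracted signal
    # Time offset tau that makes the sine and cosine terms orthogonal
    tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                     sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
    c = [math.cos(w * (ti - tau)) for ti in t]
    s = [math.sin(w * (ti - tau)) for ti in t]
    yc_c = sum(a * b for a, b in zip(yc, c))
    yc_s = sum(a * b for a, b in zip(yc, s))
    return 0.5 * (yc_c ** 2 / sum(v * v for v in c)
                  + yc_s ** 2 / sum(v * v for v in s))


# Synthetic body-mass trace (hypothetical): weekly samples, weeks 0 to 32
weeks = list(range(33))
TRUE_PERIOD = 16.0
mass = [40.0 + 5.0 * math.cos(2 * math.pi * wk / TRUE_PERIOD) for wk in weeks]

# Scan trial periods from 8 to 32 weeks in half-week steps
trial_periods = [8.0 + 0.5 * k for k in range(49)]
best_period = max(trial_periods,
                  key=lambda p: lomb_scargle_power(weeks, mass, p))
print("period with maximum power (weeks):", best_period)
```

The periodogram scores how well a sinusoid of each trial period fits the whole record, which is why it reports a period even when the start and end points are not themselves maxima of the trace.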

      (4) Figure S9: This is a very nice graph and summarises your main results. It should appear in the main manuscript and not in the supplements.

      We appreciate the positive comment and suggestion. We agree with the Reviewer and have moved the graph to the main figures. The revised manuscript indicates the graph as Figure 4.

    1. Reviewer #2 (Public review):

      This study aims to disentangle the contribution of sensory and motor processes (mapped onto the inverse and forward components of speech motor control models like DIVA) to production changes as a result of altered auditory feedback. After five experiments, the authors conclude that it is the motor compensation on the previous trial, and not the sensory error, that drives compensatory responses in subsequent trials.

      Assessment:

      The goal of this paper is great, and the question is timely. Quite a bit of work has gone into the study, and the technical aspects are sound. That said, I just don't understand how the current design can accomplish what the authors have set as their goal. This may, of course, be a misunderstanding on my part, so I'll try to explain my confusion below. If it is indeed my mistake, then I encourage the authors to dedicate some space to unpacking the logic in the Introduction, which is currently barely over a page long. They should take some time to lay out the logic of the experimental design and the dependent and independent variables, and how this design disentangles sensory and motor influences. Then clearly discuss the opposing predictions supporting sensory-driven vs. motor-driven changes. Given that I currently don't understand the logic and, consequently, the claims, I will focus my review on major points for now.

      Main issues

      (1) Measuring sensory change. As acknowledged by the authors, making a motor correction as a function of altered auditory feedback is an interactive process between sensory and motor systems. However, one could still ask whether it is primarily a change to perception vs. a change to production that is driving the motor correction. But to do this, one has to have two sets of measurements: (a) perceptual change, and (b) motor change. As far as I understand, the study has the latter (i.e., C), but not the former. Instead, the magnitude of perceptual change is estimated through the proxy of the magnitude of perturbation (P), but the two are not the same; P is a physical manipulation; perceptual change is a psychological response to that physical manipulation. It is theoretically possible that a physical change does not cause a psychological change, or that the magnitude of the two does not match. So my first confusion centers on the absence of any measure of sensory change in this study.

      To give an explicit example of what I mean, consider a study like Murphy, Nozari, and Holt (2024; Psychonomic Bulletin & Review). This work is about changes to production as a function of exposure to other talkers' acoustic properties - rather than your own altered feedback - but the idea is that the same sensory-motor loop is involved in both. When changing the acoustic properties of the input, the authors obtain two separate measures: (a) how listeners' perception changes as a function of this physical change in the acoustics of the auditory signal, and (b) how their production changes. This allows the authors to identify motor changes above and beyond perceptual changes. Perhaps making a direct comparison with this study would help the reader understand the parallels better.

      (2) A more fundamental issue for me is a theoretical one: Isn't a compensatory motor change ALWAYS a consequence of a perceptual change? I think it makes sense to ask, "Does a motor compensation hinge on a previous motor action or is sensory change enough to drive motor compensation?" This question has been asked for changed acoustics for self-produced speech (e.g., Hantzsch, Parrell, & Niziolek, 2022) and other-produced speech (Murphy, Holt, & Nozari, 2025), and in both cases, the answer has been that sensory changes alone are, in fact, sufficient to drive motor changes. A similar finding has been reported for the role of cerebellum in limb movements (Tseng et al., 2007), with a similar answer (note that in that study, the authors explicitly talk about "the addition" of motor corrections to sensory error, not one vs. the other as two independent factors). So I don't understand a sentence like "We found that motor compensation, rather than sensory errors, predicted the compensatory responses in the subsequent trials", which views motor compensations and sensory errors as orthogonal variables affecting future motor adjustments.

      In other words, there is a certain degree of seriality to the compensation process, with sensory changes preceding motor corrections. If the authors disagree with this, they should explain how an alternative is possible. If they mean something else, a comparison with the above studies and explaining the differences in positions would greatly help.

      (3) Clash with previous findings. I used the examples in point 2 to bring up a theoretical issue, but those examples are also important in that all three of them reach a conclusion compatible with one another and different from the current study. The authors do discuss Tseng et al.'s findings, which oppose their own, but dismiss the opposition based on limb vs. articulator differences. I don't find the authors' reasoning theoretically convincing here, but more importantly, the current claims also oppose findings from speech motor studies (see citations in point 2), to which the authors' arguments simply don't apply. Strangely, Hantzsch et al.'s study has been cited a few times, but never in its most important capacity, which is to show that speech motor adaptation can take place after a single exposure to auditory error. Murphy et al. report a similar finding in the context of exposure to other talkers' speech.

      If the authors can convincingly justify their theoretical position in 2, the next step would be to present a thorough comparison with the results of the three studies above. If indeed there is no discrepancy, this comparison would help clarify it.

      References

      Hantzsch, L., Parrell, B., & Niziolek, C. A. (2022). A single exposure to altered auditory feedback causes observable sensorimotor adaptation in speech. eLife, 11, e73694.

      Murphy, T. K., Nozari, N., & Holt, L. L. (2024). Transfer of statistical learning from passive speech perception to speech production. Psychonomic Bulletin & Review, 31(3), 1193-1205.

      Murphy, T. K., Holt, L. L. & Nozari, N. (2025). Exposure to an Accent Transfers to Speech Production in a Single Shot. Preprint available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5196109.

      Tseng, Y. W., Diedrichsen, J., Krakauer, J. W., Shadmehr, R., & Bastian, A. J. (2007). Sensory prediction errors drive cerebellum-dependent adaptation of reaching. Journal of neurophysiology, 98(1), 54-62.

    1. Author response:

      Reviewer 1 (Public review):

      (1) Figure 1B shows the PREDICTED force-extension curve for DNA based on a worm-like chain model. Where is the experimental evidence for this curve? This issue is crucial because the F-E curve will decide how and when a catch-bond is induced (if at all it is) as the motor moves against the tensiometer. Unless this is actually measured by some other means, I find it hard to accept all the results based on Figure 1B.

      The Worm-Like-Chain model for the elasticity of DNA was established by early work from the Bustamante lab (Smith et al., 1992) and Marko and Siggia (Marko and Siggia, 1995), and was further validated and refined by the Block lab (Bouchiat et al., 1999; Wang et al., 1997). The 50 nm persistence length is the consensus value, and was shown to be independent of force and extension in Figure 3 of Bouchiat et al. (Bouchiat et al., 1999). However, we would like to stress that for our conclusions, the precise details of the Force-Extension relationship of our dsDNA are immaterial. The key point is that the motor stretches the DNA and stalls when it reaches its stall force. Our claim of the catch-bond character of kinesin is based on the longer duration at stall compared to the run duration in the absence of load. Provided that the motor is indeed stalling because it has stretched out the DNA (which is strongly supported by the repeated stalling around the predicted extension corresponding to ~6 pN of force), then the stall duration depends on neither the precise value for the extension nor the precise value of the force at stall.
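As an illustration of the kind of predicted curve under discussion, the sketch below evaluates the widely used Marko-Siggia interpolation formula for a worm-like chain, F = (kT/Lp) [1/(4(1 - x/L)^2) - 1/4 + x/L], with the 50 nm persistence length mentioned above. The contour length of 1000 nm, the thermal energy of 4.11 pN·nm (roughly room temperature), and the function names are assumptions chosen for illustration; the manuscript's actual DNA length and fitting procedure are not reproduced here.

```python
KT_PN_NM = 4.11  # thermal energy near room temperature, in pN*nm (assumed)


def wlc_force(extension_nm, contour_nm, persistence_nm=50.0):
    """Marko-Siggia interpolation: force (pN) at a given end-to-end extension."""
    x = extension_nm / contour_nm
    if not 0.0 <= x < 1.0:
        raise ValueError("extension must lie in [0, contour length)")
    return (KT_PN_NM / persistence_nm) * (0.25 / (1.0 - x) ** 2 - 0.25 + x)


def extension_at_force(force_pn, contour_nm, persistence_nm=50.0):
    """Invert wlc_force by bisection (force increases monotonically with extension)."""
    lo, hi = 0.0, contour_nm * (1.0 - 1e-9)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if wlc_force(mid, contour_nm, persistence_nm) < force_pn:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


CONTOUR_NM = 1000.0  # hypothetical contour length, not the value used in the study
stall_extension = extension_at_force(6.0, CONTOUR_NM)
print(f"extension at 6 pN: {stall_extension:.0f} nm "
      f"({100 * stall_extension / CONTOUR_NM:.0f}% of contour length)")
```

Under these assumptions a 6 pN load corresponds to an extension of about 94% of the contour length, in the steep part of the curve where small changes in extension produce large changes in force.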

      (2) The authors can correct me on this, but I believe that all the catch-bond studies using optical traps have exerted a load force that exceeds the actual force generated by the motor. For example, see Figure 2 in reference 42 (Kunwar et al). It is in this regime (load force > force from motor) that the dissociation rate is reduced (catch-bond is activated). Such a regime is never reached in the DNA tensiometer study because of the very construction of the experiment. I am very surprised that this point is overlooked in this manuscript. I am therefore not even sure that the present experiments even induce a catch-bond (in the sense reported for earlier papers).

      It is true that Kunwar et al measured binding durations at super-stall loads and used that to conclude that dynein does act as a catch-bond (but kinesin does not) (Kunwar et al., 2011). However, we would like to correct the reviewer on this one. This approach of exerting super-stall forces and measuring binding durations is in fact less common than the approach of allowing the motor to walk up to stall and measuring the binding duration. This ‘fixed trap’ approach has been used to show catch-bond behavior of dynein (Leidel et al., 2012; Rai et al., 2013) and kinesin (Kuo et al., 2022; Pyrpassopoulos et al., 2020). For the non-processive motor Myosin I, a dynamic force clamp was used to keep the actin filament in place while the myosin generated a single step (Laakso et al., 2008). Because the motor generates the force, these are not superstall forces either.

      (3) I appreciate the concerns about the Vertical force from the optical trap. But that leads to the following questions that have not at all been addressed in this paper:

      (i) Why is the Vertical force only a problem for Kinesins, and not a problem for the dynein studies?

      Actually, we do not claim that vertical force is not a problem for dynein; our data do not speak to this question. There is debate in the literature as to whether dynein has catch bond behavior in the traditional single-bead optical trap geometry - while some studies have measured dynein catch bond behavior (Kunwar et al., 2011; Leidel et al., 2012; Rai et al., 2013), others have found that dynein has slip-bond or ideal-bond behavior (Ezber et al., 2020; Nicholas et al., 2015; Rao et al., 2019). This discrepancy may relate to vertical forces, but not in an obvious way.

      (ii) The authors state that "With this geometry, a kinesin motor pulls against the elastic force of a stretched DNA solely in a direction parallel to the microtubule". Is this really true? What matters is not just how the kinesin pulls the DNA, but also how the DNA pulls on the kinesin. In Figure 1A, what is the guarantee that the DNA is oriented only in the plane of the paper? In fact, the DNA could even be bending transiently in a manner that it pulls the kinesin motor UPWARDS (Vertical force). How are the authors sure that the reaction force between DNA and kinesin is oriented SOLELY along the microtubule?

      We acknowledge that “solely” is an absolute term that is too strong to describe our geometry. We will soften this term in our revision to “nearly parallel to the microtubule”. In the Geometry Calculations section of Supplementary Methods, we calculate that if the motor and streptavidin are on the same protofilament, the vertical force will be <1% of the horizontal force. We also note that if the motor is on a different protofilament, there will be lateral forces and forces perpendicular to the microtubule surface, except they are oriented toward rather than away from the microtubule. The DNA can surely bend due to thermal forces, but because inertia plays a negligible role at the nanoscale (Howard, 2001; Purcell, 1977), any resulting upward forces will only be thermal forces, which the motor is already subjected to at all times.

      (4) For this study to be really impactful and for some of the above concerns to be addressed, the data should also have included DNA tensiometer experiments with Dynein. I wonder why this was not done?

      As much as we would love to fully characterize dynein here, this paper is about kinesin and it took a substantial effort. The dynein work merits a stand-alone paper.

      While I do like several aspects of the paper, I do not believe that the conclusions are supported by the data presented in this paper for the reasons stated above.

      The three key points the reviewer makes are the validity of the worm-like-chain model, the question of superstall loads, and the role of DNA bending in generating vertical forces. We hope that we have fully addressed these concerns in our responses above.

      Reviewer #2 (Public review):

      Major comments:

      (1) The use of the term "catch bond" is misleading, as the authors do not really mean consistently a catch bond in the classical sense (i.e., a protein-protein interaction having a dissociation rate that decreases with load). Instead, what they mean is that after motor detachment (i.e., after a motor protein dissociating from a tubulin protein), there is a slip state during which the reattachment rate is higher as compared to a motor diffusing in solution. While this may indeed influence the dynamics of bidirectional cargo transport (e.g., during tug-of-war events), the used terms (detachment (with or without slip?), dissociation, rescue, ...) need to be better defined and the results discussed in the context of these definitions. It is very unsatisfactory at the moment, for example, that kinesin-3 is at first not classified as a catch bond, but later on (after tweaking the definitions) it is. In essence, the typical slip/catch bond nomenclature used for protein-protein interaction is not readily applicable for motors with slippage.

      We appreciate the reviewer’s point and we will work to streamline and define terms in our revision.

      (2) The authors define the stall duration as the time at full load, terminated by >60 nm slips/detachments. Isn't that a problem? Smaller slips are not detected/considered... but are also indicative of a motor dissociation event, i.e., the end of a stall. What is the distribution of the slip distances? If the slip distances follow an exponential decay, a large number of short slips are expected, and the presented data (neglecting those short slips) would be highly distorted.

      The reviewer brings up a good point that there may be undetected slips. To address this question, we plotted the distribution of slip distances for kinesin-3, which by far had the most slip events. As the reviewer suggested, it is indeed an exponential distribution. Our preliminary analysis suggests that roughly 20% of events are missed due to this 60 nm cutoff. This will change our unloaded duration numbers slightly, but this will not alter our conclusions.
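The relationship between a detection cutoff and the fraction of missed events can be written down directly if slip distances follow a single exponential. The sketch below assumes exactly that; the function names are hypothetical, and the implied mean slip distance is an arithmetic consequence of the quoted 60 nm cutoff and roughly 20% missed fraction, not a value reported by the authors.

```python
import math


def fraction_below_cutoff(cutoff_nm, mean_slip_nm):
    """Fraction of exponentially distributed slips shorter than the cutoff."""
    return 1.0 - math.exp(-cutoff_nm / mean_slip_nm)


def mean_slip_from_missed_fraction(cutoff_nm, missed_fraction):
    """Mean slip distance implied by a given missed fraction at a given cutoff."""
    return -cutoff_nm / math.log(1.0 - missed_fraction)


CUTOFF_NM = 60.0
implied_mean = mean_slip_from_missed_fraction(CUTOFF_NM, 0.20)
print(f"implied mean slip distance: {implied_mean:.0f} nm")
```

Under the single-exponential assumption, a 20% missed fraction at 60 nm corresponds to a mean slip distance of roughly 270 nm.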

      (3) Along the same line: Why do the authors compare the stall duration (without including the time it took the motor to reach stall) to the unloaded single motor run durations? Shouldn't the times of the runs be included?

      The elastic force of the DNA spring is variable as the motor steps up to stall, and so if we included the entire run duration then it would be difficult to specify what force we were comparing to unloaded. More importantly, if we assume that any stepping and detachment behavior is history independent, then it is mathematically proper to take any arbitrary starting point (such as when the motor reaches stall), start the clock there, and measure the distribution of detachment durations relative to that starting point.

      More importantly, what we do in Fig. 3 is to separate out the ramps from the stalls and, using a statistical model, we compute a separate duration parameter (which is the inverse of the off-rate) for the ramp and the stall. What we find is that the relationship between ramp, stall, and unloaded durations is different for the three motors, which is interesting in itself.

      (4) At many places, it appears too simple that for the biologically relevant processes, mainly/only the load-dependent off-rates of the motors matter. The stall forces and the kind of motor-cargo linkage (e.g., rigid vs. diffusive) do likely also matter. For example: "In the context of pulling a large cargo through the viscous cytoplasm or competing against dynein in a tug-of-war, these slip events enable the motor to maintain force generation and, hence, are distinct from true detachment events." I disagree. The kinesin force at reattachment (after slippage) is much smaller than at stall. What helps, however, is that due to the geometry of being held close to the microtubule (either by the DNA in the present case or by the cargo in vivo) the attachment rate is much higher. Note also that upon DNA relaxation, the motor is likely kept close to the microtubule surface, while, for example, when bound to a vesicle, the motor may diffuse away from the microtubule quickly (e.g., reference 20).

      We appreciate the reviewer’s detailed thinking here, and we offer our perspective. As to the first point, we agree that the stall force is relevant and that the rigidity of the motor-cargo linkage will play a role. The goal of the sentence on pulling cargo that the reviewer highlights is to set up our analysis of slips, which we define as rearward displacements that don’t return to the baseline before force generation resumes. We agree that force after slippage is much smaller than at stall, and we plan to clarify that section of text. However, as shown in the model diagram in Fig. 5, we differentiate between the slip state (and recovery from this slip state) and the detached state (and reattachment from this detached state). This delineation is important because, as the reviewer points out, if we are measuring detachment and reattachment with our DNA tensiometer, then the geometry of a vesicle in a cell will be different and diffusion away from the microtubule or elastic recoil perpendicular to the microtubule will suppress this reattachment.

      Our evidence for a slip state in which the motor maintains association with the microtubule comes from optical trapping work by Toleikis et al. (Toleikis et al., 2020) and Sudhakar et al. (Sudhakar et al., 2021). In particular, Sudhakar used small, high index Germanium microspheres that had a low drag coefficient. They showed that during ‘slip’ events, the relaxation time constant of the bead back to the center of the trap was nearly 10-fold slower than the trap response time, consistent with the motor exerting drag on the microtubule. (With larger beads, the drag of the bead swamps the motor-microtubule friction.) Another piece of support for the motor maintaining association during a slip is work by Ramaiya et al. who used birefringent microspheres to exert and measure rotational torque during kinesin stepping (Ramaiya et al., 2017). In most traces, when the motor returned to baseline following a stall, the torque was dissipated as well, consistent with a ‘detached’ state. However, a slip event is shown in S18a where the motor slips backward while maintaining torque. This is best explained by the motor slipping backward in a state where the heads are associated with the microtubule (at least sufficiently to resist rotational forces). Thus, we term the resumption after slip to be a rescue from the slip state rather than a reattachment from the detached state.

      To finish the point, with the complex geometry of a vesicle, during slip events the motor remains associated with the microtubule and hence primed for recovery. This recovery rate is expected to be the same as for the DNA tensiometer. Following a detachment, however, we agree that there will likely be a higher probability of reattachment in the DNA tensiometer due to proximity effects, whereas with a vesicle any elastic recoil or ‘rolling’ will pull the detached motor away from the microtubule, suppressing reattachment. We plan to clarify these points in the text of the revision.

      (5) Why were all motors linked to the neck-coil domain of kinesin-1? Couldn't it be that for normal function, the different coils matter? Autoinhibition can also be circumvented by consistently shortening the constructs.

      We chose this dimerization approach to focus on how the mechanochemical properties of kinesins vary between the three dominant transport families. We agree that in cells, autoinhibition of both kinesins and dynein likely plays a role in regulating bidirectional transport, as will the activity of other regulatory proteins. The native coiled-coils may act as ‘shock absorbers’ due to their compliance, or they might slow the motor reattachment rate due to the relatively large search volumes created by their long lengths (10s of nm). These are topics for future work. By using the neck-coil domain of kinesin-1 for all three motors, we eliminate any differences in autoinhibition or other regulation between the three kinesin families and focus solely on differences in the mechanochemistry of their motor domains.

      (6) I am worried about the neutravidin on the microtubules, which may act as roadblocks (e.g. DOI: 10.1039/b803585g), slip termination sites (maybe without the neutravidin, the rescue rate would be much lower?), and potentially also DNA-interaction sites? At 8 nM neutravidin and the given level of biotinylation, what density of neutravidin do the authors expect on their microtubules? Can the authors rule out that the observed stall events are predominantly the result of a kinesin motor being stopped after a short slippage event at a neutravidin molecule?

      We will address these points in our revision.

      (7) Also, the unloaded runs should be performed on the same microtubules as in the DNA experiments, i.e., with neutravidin. Otherwise, I do not see how the values can be compared.

      We will address this point in our revision.

      (8) If, as stated, "a portion of kinesin-3 unloaded run durations were limited by the length of the microtubules, meaning the unloaded duration is a lower limit." corrections (such as Kaplan-Meier) should be applied, DOI: 10.1016/j.bpj.2017.09.024.

      (9) Shouldn't Kaplan-Meier also be applied to the ramp durations ... as a ramp may also artificially end upon stall? Also, doesn't the comparison between ramp and stall duration have a problem, as each stall is preceded by a ramp ...and the (maximum) ramp times will depend on the speed of the motor? Kinesin-3 is the fastest motor and will reach stall much faster than kinesin-1. Isn't it obvious that the stall durations are longer than the ramp duration (as seen for all three motors in Figure 3)?

      The reviewer rightly notes the many challenges in estimating the motor off-rates during ramps. To estimate ramp off-rates and as an independent approach to calculating the unloaded and stall durations, we developed a Markov model coupled with Bayesian inference methods to estimate a duration parameter (equivalent to the inverse of the off-rate) for the unloaded, ramp, and stall duration distributions. With the ramps, we have left censoring due to the difficulty in detecting the start of the ramps in the fluctuating baseline, and we have right censoring due to reaching stall (with different censoring of the ramp duration for the three motors due to their different speeds). The Markov model assumes a constant detachment probability and history independence, and thus is robust even in the face of left and right censoring (details in the Supplementary section). This approach is preferred over Kaplan-Meier because, although these non-parametric methods make no assumptions for the distribution, they require the user to know exactly where the start time is.

      Regarding the potential underestimate of the kinesin-3 unloaded run duration due to finite microtubule lengths, we make two points. First, the unloaded duration data in Fig. 2C are quite linear up to 6 s and are well fit by a single exponential (the points above 6 s don’t affect the fit very much). Second, when we used our Markov model (which is robust against right censoring) to estimate the unloaded and stall durations, the results agreed with the single-exponential fits very well (Table S2). For instance, the single-exponential fit for the kinesin-3 unloaded duration was 2.74 s (2.33 – 3.17 s 95% CI) and the estimate from the Markov model was 2.76 s (2.28 – 3.34 s 95% CI). Thus, we chose not to make any corrections due to finite microtubule lengths.
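To show why a constant-detachment-rate model tolerates right censoring, the sketch below uses the standard maximum-likelihood estimator for an exponential duration with right-censored observations: total observed time divided by the number of uncensored events. This is a deliberately simpler stand-in, not the authors' Markov model with Bayesian inference, and it does not handle left censoring. The simulated data, the fixed 6 s censoring time, the seed, and all names are assumptions for illustration; only the 2.76 s duration echoes a value quoted above.

```python
import random


def censored_exponential_mle(durations, observed):
    """MLE of the mean duration when some observations are right censored.

    durations: time each run was followed (to detachment or to censoring)
    observed:  True if the run ended in a detachment, False if censored
    """
    n_events = sum(observed)
    if n_events == 0:
        raise ValueError("need at least one uncensored event")
    return sum(durations) / n_events


random.seed(0)
TRUE_TAU_S = 2.76   # mean duration used to simulate the data
CENSOR_S = 6.0      # hypothetical: every run is cut off at 6 s

raw = [random.expovariate(1.0 / TRUE_TAU_S) for _ in range(20000)]
durations = [min(x, CENSOR_S) for x in raw]
observed = [x < CENSOR_S for x in raw]

tau_hat = censored_exponential_mle(durations, observed)
naive_mean = sum(durations) / len(durations)  # ignores censoring, biased low
print(f"censoring-aware estimate: {tau_hat:.2f} s, naive mean: {naive_mean:.2f} s")
```

Treating censored runs as if they had ended in detachment pulls the naive mean below the true value, whereas the censoring-aware estimator recovers it because a memoryless process contributes the same information per unit of observed time regardless of where the clock starts or stops.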

      (10) It is not clear what is seen in Figure S6A: It looks like only single motors (green, w/o a DNA molecule) are walking ... Note: the influence of the attached DNA onto the stepping duration of a motor may depend on the DNA conformation (stretched and near to the microtubule (with neutravidin!) in the tethered case and spherically coiled in the untethered case).

      In the Figure S6A kymograph, the green traces are GFP-labeled kinesin-1 without DNA attached (which are in excess) and the red diagonal trace is a motor with DNA attached. There are also two faint horizontal red traces, which are labeled DNA diffusing by (smearing over a large area during a single frame). Panel S6B shows run durations of motors with DNA attached. We agree that the DNA conformation will differ if it is attached and stretched (more linear) versus simply being transported (random coil), but by its nature this control experiment is only addressing random coil DNA.

      (11) Along this line: While the run time of kinesin-1 with DNA (1.4 s) is significantly shorter than the stall time (3.0 s), it is still larger than the unloaded run time (1.0 s). What do the authors think is the origin of this increase?

      Our interpretation of the unloaded kinesin-DNA result is that the much slower diffusion constant of the DNA relative to the motor alone enables motors to transiently detach and rebind before the DNA cargo has diffused away, thus extending the run duration. In contrast, such detachment events for motors alone normally result in the motor diffusing away from the microtubule, terminating the run. This argument has been used to reconcile the longer single-motor run lengths in the gliding assay versus the bead assay (Block et al., 1990). Notably, this slower diffusion constant should not play a role in the DNA tensiometer geometry because if the motor transiently detaches, then it will be pulled backward by the elastic forces of the DNA and detected as a slip or detachment event. We will address this point in the revision.

      (12) "The simplest prediction is that against the low loads experienced during ramps, the detachment rate should match the unloaded detachment rate." I disagree. I would already expect a slight increase.

      Agreed. We will change this text to: “The prediction for a slip bond is that against the low loads experienced during ramps, the detachment rate should be equal to or faster than the unloaded detachment rate.”

      (13) Isn't the model over-defined by fitting the values for the load-dependence of the strong-to-weak transition and fitting the load dependence into the transition to the slip state?

      Yes, it is essentially overdefined, but that is by design, and the model is still very useful. Our goal here was to make as simple a model as possible that could account for the data and use it to compare model parameters for the different motor families. Ignoring the complexity of the slip and detached states, a model with a strong and weak state in the stepping cycle and a single transition out of the stepping cycle is the simplest formulation possible. Moreover, having rate constants (k<sub>S-W</sub> and k<sub>slip</sub> in our case) that vary exponentially with load makes thermodynamic sense for modeling mechanochemistry (Howard, 2001). Thus, we were pleasantly surprised that this bare-bones model could recapitulate the unloaded and stall durations for all three motors (Fig. 5C-E).
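
      As an illustration of what "vary exponentially with load" means, the following is a minimal sketch of a Bell-type rate expression. The function name, the thermal energy of 4.11 pN·nm, and the numerical parameters are assumptions chosen for illustration; they are not the fitted values from the manuscript.

```python
import math

KT_PN_NM = 4.11  # thermal energy near room temperature, in pN*nm (assumed)


def load_dependent_rate(k0: float, force_pn: float, delta_nm: float) -> float:
    """Rate constant that varies exponentially with load (Bell-type form).

    k0       -- rate at zero load (1/s)
    force_pn -- hindering load (pN)
    delta_nm -- distance parameter (nm); a positive value speeds the
                transition up under load, a negative value slows it down
    """
    return k0 * math.exp(force_pn * delta_nm / KT_PN_NM)


# Hypothetical parameters, used only to show the functional form.
k_unloaded = 1.0  # 1/s
delta = 1.0       # nm

for force in (0.0, 2.0, 4.0, 6.0):
    rate = load_dependent_rate(k_unloaded, force, delta)
    print(f"F = {force:.0f} pN -> k = {rate:.2f} 1/s")
```

      In this form, the sign of the distance parameter determines whether load accelerates or slows a given transition, so a single parameter per transition sets its load sensitivity.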

      (14) "When kinesin-1 was tethered to a glass coverslip via a DNA linker and hydrodynamic forces were imposed on an associated microtubule, kinesin-1 dissociation rates were relatively insensitive to loads up to ~3 pN, inconsistent with slip-bond characteristics (37)." This statement appears not to be true. In reference 37, very similar to the geometry reported here, the microtubules were fixed on the surface, and the stepping of single kinesin motors attached to large beads (to which defined forces were applied by hydrodynamics) via long DNA linkers was studied. In fact, quite a number of statements made in the present manuscript have been made already in ref. 37 (see in particular sections 2.6 and 2.7), and the authors may consider putting their results better into this context in the Introduction and Discussion. It is also noteworthy to discuss that the (admittedly limited) data in ref. 37 does not indicate a "catch-bond" behavior but rather an insensitivity to force over a defined range of forces.

      The reviewer misquoted our sentence. The actual wording of the sentence was: “When kinesin-1 was connected to micron-scale beads through a DNA linker and hydrodynamic forces parallel to the microtubule imposed, dissociation rates were relatively insensitive to loads up to ~3 pN, inconsistent with slip-bond characteristics (Urbanska et al., 2021).” The sentence the reviewer quoted was in a previous version that is available on BioRxiv and perhaps they were reading that version. Nonetheless, in the revision we will note in the Discussion that this behavior was indicative of an ideal bond (not a catch-bond), and we will also add a sentence in the Introduction highlighting this work.

      Reviewer #3 (Public review):

      The authors attribute the differences in the behaviour of kinesins when pulling against a DNA tether compared to an optical trap to the differences in the perpendicular forces. However, the compliance is also much different in these two experiments. The optical trap acts like a ~ linear spring with stiffness ~ 0.05 pN/nm. The dsDNA tether is an entropic spring, with negligible stiffness at low extensions and very high compliance once the tether is extended to its contour length (Fig. 1B). The effect of the compliance on the results should be addressed in the manuscript.

      This is an interesting point. To address it, we calculated the predicted stiffness of the dsDNA by taking the slope of the theoretical force-extension curve in Fig. 1B. Below 650 nm extension, the stiffness is <0.001 pN/nm; it reaches 0.01 pN/nm at 855 nm, and at 960 nm, where the force is 6 pN, the stiffness is roughly 0.2 pN/nm. That value is higher than the quoted 0.05 pN/nm trap stiffness, but for reference, at this stiffness, an 8 nm step leads to a 1.6 pN jump in force, which is reasonable. Importantly, the stiffness of kinesin motors has been estimated to be in the range of 0.3 pN/nm (Coppin et al., 1996; Coppin et al., 1997). Granted, this stiffness is also nonlinear, but what this means is that even at stall, our dsDNA tether has a similar predicted compliance to the motor that is pulling on it. We will address this point in our revision.
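
      The stiffness calculation described above can be sketched as follows, using the worm-like chain interpolation formula of Marko and Siggia (1995). This is a sketch under stated assumptions, not necessarily the exact curve in Fig. 1B: the contour length (1020 nm), persistence length (50 nm), and thermal energy (4.11 pN·nm) are assumed values that happen to give about 6 pN at 960 nm extension.

```python
KT_PN_NM = 4.11        # thermal energy near room temperature (pN*nm), assumed
PERSISTENCE_NM = 50.0  # dsDNA persistence length (nm), assumed
CONTOUR_NM = 1020.0    # dsDNA contour length (nm), assumed


def wlc_force(extension_nm: float) -> float:
    """Marko-Siggia interpolation for the force (pN) at a given extension."""
    z = extension_nm / CONTOUR_NM
    return (KT_PN_NM / PERSISTENCE_NM) * (0.25 / (1.0 - z) ** 2 - 0.25 + z)


def wlc_stiffness(extension_nm: float) -> float:
    """Analytical slope dF/dx (pN/nm) of the interpolation formula."""
    z = extension_nm / CONTOUR_NM
    prefactor = KT_PN_NM / (PERSISTENCE_NM * CONTOUR_NM)
    return prefactor * (0.5 / (1.0 - z) ** 3 + 1.0)


for x in (650.0, 855.0, 960.0):
    print(f"x = {x:.0f} nm: F = {wlc_force(x):.2f} pN, "
          f"stiffness = {wlc_stiffness(x):.4f} pN/nm")
```

      Under these assumed parameters, the computed slopes fall close to the values quoted above, and multiplying the stiffness at 960 nm by an 8 nm step gives the approximate force jump per step.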

      Compared to an optical trapping assay, the motors are also tethered closer to the microtubule in this geometry. In an optical trap assay, the bead could rotate when the kinesin is not bound. The authors should discuss how this tethering is expected to affect the kinesin reattachment and slipping. While likely outside the scope of this study, it would be interesting to compare the static tether used here with a dynamic tether like MAP7 or the CAP-GLY domain of p150glued.

      Please see our response to Reviewer #2 Major Comment #4 above, which asks this same question in the context of intracellular cargo. We plan to address this in our revision. Regarding a dynamic tether, we agree that’s interesting – there are kinesins that have a second, non-canonical binding site that achieves this tethering (ncd and Cin8); p150glued likely does this naturally for dynein-dynactin-activator complexes; and we speculated in a review some years ago (Hancock, 2014) that during bidirectional transport kinesin and dynein may act as dynamic tethers for one another when not engaged, enhancing the activity of the opposing motor.

      In the single-molecule extension traces (Figure 1F-H; S3), the kinesin-2 traces often show jumps in position at the beginning of runs (e.g., the four runs from ~4-13 s in Fig. 1G). These jumps are not apparent in the kinesin-1 and -3 traces. What is the explanation? Is kinesin-2 binding accelerated by resisting loads more strongly than kinesin-1 and -3?

      Due to the compliance of the dsDNA, the 95% limits for the initial attachment position are +/- 290 nm (Fig. S2). Thus, some apparent ‘jumps’ from the detached state are expected. We will take a closer look at why there are jumps for kinesin-2 that aren’t apparent for kinesin-1 or -3.

      When comparing the durations of unloaded and stall events (Fig. 2), there is a potential for bias in the measurement, where very long unloaded runs cannot be observed due to the limited length of the microtubule (Thompson, Hoeprich, and Berger, 2013), while the duration of tethered runs is only limited by photobleaching. Was the possible censoring of the results addressed in the analysis?

      Yes. Please see response to Reviewer #2 points (8) and (9) above.

      The mathematical model is helpful in interpreting the data. To assess how the "slip" state contributes to the association kinetics, it would be helpful to compare the proposed model with a similar model with no slip state. Could the slips be explained by fast reattachments from the detached state?

      In the model, the slip state and the detached states are conceptually similar; they only differ in the sequence (slip to detached) and the transition rates into and out of them. The simple answer is: yes, the slips could be explained by fast reattachments from the detached state. In that case, the slip state and recovery could be called a “detached state with fast reattachment kinetics”. However, the key data for defining the kinetics of the slip and detached states are the distributions of Recovery times shown in Fig. 4D-F, which required a triple exponential to account for all of the data. If we simplified the model by eliminating the slip state and incorporating fast reattachment from a single detached state, then the distribution of Recovery times would be a single exponential with a time constant equivalent to t<sub>1</sub>, which would be a poor fit to the experimental distributions in Fig. 4D-F.
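
      To illustrate why a single exponential describes multi-exponential dwell times poorly, the sketch below draws synthetic recovery times from a three-component exponential mixture and compares log-likelihoods. All weights and time constants are hypothetical and are not the fitted values from Fig. 4D-F; the mixture likelihood is evaluated at the generating parameters rather than refitted.

```python
import math
import random


def sample_recovery_times(n, weights, taus, rng):
    """Draw n dwell times from a mixture of exponentials."""
    times = []
    for _ in range(n):
        tau = rng.choices(taus, weights=weights)[0]  # pick a component
        times.append(rng.expovariate(1.0 / tau))     # draw from it
    return times


def loglik_single(times):
    """Maximised log-likelihood of one exponential (MLE: tau = mean)."""
    tau = sum(times) / len(times)
    return sum(-math.log(tau) - t / tau for t in times)


def loglik_mixture(times, weights, taus):
    """Log-likelihood of the mixture at the given (not refitted) parameters."""
    total = 0.0
    for t in times:
        density = sum(w / tau * math.exp(-t / tau)
                      for w, tau in zip(weights, taus))
        total += math.log(density)
    return total


# Hypothetical mixture: fast, intermediate, and slow recovery (seconds).
weights = (0.6, 0.3, 0.1)
taus = (0.05, 0.5, 5.0)

rng = random.Random(0)
times = sample_recovery_times(5000, weights, taus, rng)
ll_single = loglik_single(times)
ll_mixture = loglik_mixture(times, weights, taus)
print(f"single exponential log-likelihood: {ll_single:.1f}")
print(f"three-exponential log-likelihood:  {ll_mixture:.1f}")
```

      When the time constants are well separated, as assumed here, the single-exponential description is penalised heavily, which mirrors the argument that one detached state with a single reattachment rate cannot reproduce a triple-exponential recovery distribution.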

      We appreciate the efforts and helpful suggestions of all three reviewers and the Editor.

      References:

      Block, S.M., L.S. Goldstein, and B.J. Schnapp. 1990. Bead movement by single kinesin molecules studied with optical tweezers. Nature. 348:348-352.

      Bouchiat, C., M.D. Wang, J. Allemand, T. Strick, S.M. Block, and V. Croquette. 1999. Estimating the persistence length of a worm-like chain molecule from force-extension measurements. Biophys J. 76:409-413.

      Coppin, C.M., J.T. Finer, J.A. Spudich, and R.D. Vale. 1996. Detection of sub-8-nm movements of kinesin by high-resolution optical-trap microscopy. Proc Natl Acad Sci U S A. 93:1913-1917.

      Coppin, C.M., D.W. Pierce, L. Hsu, and R.D. Vale. 1997. The load dependence of kinesin's mechanical cycle. Proc Natl Acad Sci U S A. 94:8539-8544.

      Ezber, Y., V. Belyy, S. Can, and A. Yildiz. 2020. Dynein Harnesses Active Fluctuations of Microtubules for Faster Movement. Nat Phys. 16:312-316.

      Hancock, W.O. 2014. Bidirectional cargo transport: moving beyond tug of war. Nat Rev Mol Cell Biol. 15:615-628.

      Howard, J. 2001. Mechanics of Motor Proteins and the Cytoskeleton. Sinauer Associates, Inc., Sunderland, MA. 367 pp.

      Kunwar, A., S.K. Tripathy, J. Xu, M.K. Mattson, P. Anand, R. Sigua, M. Vershinin, R.J. McKenney, C.C. Yu, A. Mogilner, and S.P. Gross. 2011. Mechanical stochastic tug-of-war models cannot explain bidirectional lipid-droplet transport. Proc Natl Acad Sci U S A. 108:18960-18965.

      Kuo, Y.W., M. Mahamdeh, Y. Tuna, and J. Howard. 2022. The force required to remove tubulin from the microtubule lattice by pulling on its alpha-tubulin C-terminal tail. Nature communications. 13:3651.

      Laakso, J.M., J.H. Lewis, H. Shuman, and E.M. Ostap. 2008. Myosin I can act as a molecular force sensor. Science. 321:133-136.

      Leidel, C., R.A. Longoria, F.M. Gutierrez, and G.T. Shubeita. 2012. Measuring molecular motor forces in vivo: implications for tug-of-war models of bidirectional transport. Biophys J. 103:492-500.

      Marko, J.F., and E.D. Siggia. 1995. Stretching DNA. Macromolecules. 28:8759-8770.

      Nicholas, M.P., F. Berger, L. Rao, S. Brenner, C. Cho, and A. Gennerich. 2015. Cytoplasmic dynein regulates its attachment to microtubules via nucleotide state-switched mechanosensing at multiple AAA domains. Proc Natl Acad Sci U S A. 112:6371-6376.

      Purcell, E.M. 1977. Life at low Reynolds Number. Amer J. Phys. 45:3-11.

      Pyrpassopoulos, S., H. Shuman, and E.M. Ostap. 2020. Modulation of Kinesin's Load-Bearing Capacity by Force Geometry and the Microtubule Track. Biophys J. 118:243-253.

      Rai, A.K., A. Rai, A.J. Ramaiya, R. Jha, and R. Mallik. 2013. Molecular adaptations allow dynein to generate large collective forces inside cells. Cell. 152:172-182.

      Ramaiya, A., B. Roy, M. Bugiel, and E. Schaffer. 2017. Kinesin rotates unidirectionally and generates torque while walking on microtubules. Proc Natl Acad Sci U S A. 114:10894-10899.

      Rao, L., F. Berger, M.P. Nicholas, and A. Gennerich. 2019. Molecular mechanism of cytoplasmic dynein tension sensing. Nature communications. 10:3332.

      Smith, S.B., L. Finzi, and C. Bustamante. 1992. Direct mechanical measurements of the elasticity of single DNA molecules by using magnetic beads. Science. 258:1122-1126.

      Sudhakar, S., M.K. Abdosamadi, T.J. Jachowski, M. Bugiel, A. Jannasch, and E. Schaffer. 2021. Germanium nanospheres for ultraresolution picotensiometry of kinesin motors. Science. 371.

      Toleikis, A., N.J. Carter, and R.A. Cross. 2020. Backstepping Mechanism of Kinesin-1. Biophys J. 119:1984-1994.

      Urbanska, M., A. Ludecke, W.J. Walter, A.M. van Oijen, K.E. Duderstadt, and S. Diez. 2021. Highly-Parallel Microfluidics-Based Force Spectroscopy on Single Cytoskeletal Motors. Small. 17:e2007388.

      Wang, M.D., H. Yin, R. Landick, J. Gelles, and S.M. Block. 1997. Stretching DNA with optical tweezers. Biophys J. 72:1335-1346.

    1. Author response:

      Reviewer #1 (Public review):

      Fombellida-Lopez and colleagues describe the results of an ART intensification trial in people with HIV infection (PWH) on suppressive ART to determine the effect of increasing the dose of one ART drug, dolutegravir, on viral reservoirs, immune activation, exhaustion, and circulating inflammatory markers. The authors hypothesize that ART intensification will provide clues about the degree to which low-level viral replication is occurring in circulation and in tissues despite ongoing ART, which could be identified if reservoirs decrease and/or if immune biomarkers change. The trial design is straightforward and well-described, and the intervention appears to have been well tolerated. The investigators observed an increase in dolutegravir concentrations in circulation, and to a lesser degree in tissues, in the intervention group, indicating that the intervention has functioned as expected (ART has been intensified in vivo). Several outcome measures changed during the trial period in the intervention group, leading the investigators to conclude that their results provide strong evidence of ongoing replication on standard ART. The results of this small trial are intriguing, and a few observations in particular are hypothesis-generating and potentially justify further clinical trials to explore them in depth. However, I am concerned about over-interpretation of results that do not fully justify the authors' conclusions.

      We thank Reviewer #1 for their thoughtful and constructive comments, which helped us clarify and improve the manuscript. Below, we address each of the reviewer’s points and describe the changes that we implemented in the revised version. We acknowledge the reviewer’s concern regarding potential overinterpretation of certain findings, and in the revised version we took particular care to ensure that all conclusions are supported by the data and framed within the exploratory nature of the study.

      (1) Trial objectives: What was the primary objective of the trial? This is not clearly stated. The authors describe changes in some reservoir parameters and no changes in others. Which of these was the primary outcome? No a priori hypothesis / primary objective is stated, nor is there explicit justification (power calculations, prior in vivo evidence) for the small n, unblinded design, and lack of placebo control. In the abstract (line 36, "significant decreases in total HIV DNA") and conclusion (lines 244-246), the authors state that total proviral DNA decreased as a result of ART intensification. However, in Figures 2A and 2E (and in line 251), the authors indicate that total proviral DNA did not change. These statements are confusing and appear to be contradictory. Regarding the decrease in total proviral DNA, I believe the authors may mean that they observed transient decrease in total proviral DNA during the intensification period (day 28 in particular, Figure 2A), however this level increases at Day 56 and then returns to baseline at Day 84, which is the source of the negative observation. Stating that total proviral DNA decreased as a result of the intervention when it ultimately did not is misleading, unless the investigators intended the day 28 timepoint as a primary endpoint for reservoir reduction - if so, this is never stated, and it is unclear why the intervention would then be continued until day 84? If, instead, reservoir reduction at the end of the intervention was the primary endpoint (again, unstated by the authors), then it is not appropriate to state that the total proviral reservoir decreased significantly when it did not.

      We agree with the reviewer that the primary objective of the study was not explicitly stated in the submitted manuscript. We clarified this in the revised manuscript (lines 361-364). As registered on ClinicalTrials.gov (NCT05351684), the primary outcome was defined as “To evaluate the impact of treatment intensification at the level of total and replication-competent reservoir (RCR) in blood and in tissues”, with a time frame of 3 months. Accordingly, our aim was to explore whether any measurable reduction in the HIV reservoir (total or replication-competent) occurred during the intensification period, including at day 28, 56, or 84. The protocol did not prespecify a single time point for this effect to occur, and the exploratory design allowed for detection of transient or sustained changes within the intensification window.

      We recognize that this scope was not clearly articulated in the original text and may have led to confusion in interpreting the transient drop in total HIV DNA observed at day 28. While total DNA ultimately returned to baseline by the end of intensification, the presence of a transient reduction during this 3-month window still fits within the framework of the study’s registered objective. Moreover, although the change in total HIV DNA was transient, it aligns with the consistent direction of changes observed across the multiple independent measures, including CA HIV RNA, RNA/DNA ratio and intact HIV DNA, collectively supporting a biological effect of intensification.

      We would also like to stress that this is the first clinical trial in which ART intensification was performed not by adding an extra drug but by increasing the dosage of an existing drug. Therefore, we were more interested in the overall, cumulative effect of intensification throughout the entire trial period than in differences between groups at individual time points. We clarified in the revised manuscript that this was a proof-of-concept phase 2 study, designed to reveal biological effects of ART intensification rather than confirm efficacy in a powered comparison. The absence of a prespecified statistical endpoint or sample size calculation reflects the exploratory nature of the trial.

      (2) Intervention safety and tolerability: The results section lacks a specific heading for participant safety and tolerability of the intervention. I was wondering about clinically detectable viremia in the study. Were there any viral blips? Was the increased DTG well tolerated? This drug is known to cause myositis, headache, CPK elevation, and hepatotoxicity. Were any of these observed? What is the authors' interpretation of the CD4:8 ratio change (line 198)? Is this a significant safety concern for a longer duration of intensification? Was there also a change in CD4% or only in absolute counts? Was there relative CD4 depletion observed in the rectal biopsy samples between days 0 and 84? Interestingly, T cells dropped at the same timepoints that reservoirs declined... how do the authors rule out that reservoir decline reflects transient T cell decline that is non-specific (not due to additional blockade of replication)?

      We improved the Methods section to clarify how safety and tolerability were assessed during the study (lines 389-396). Safety evaluations were conducted on day 28 and day 84 and included a clinical examination and routine laboratory testing (liver function tests, kidney function, and complete blood count). Medication adherence was also monitored through pill counts performed by the study nurses.

      No virological blips above 50 copies/mL were observed and no adverse events were reported by participants during the 3-month intensification period. Although CPK levels were not included in the routine biological monitoring, no participant reported muscle pain or other symptoms suggestive of muscle toxicity.

      The CD4:CD8 ratio decrease noted during intensification was not associated with significant changes in absolute CD4 or CD8 counts, as shown in Figure 5. We interpret this ratio change as a transient redistribution rather than an immunological risk; therefore, we do not consider it to represent a safety concern.

      We would like to clarify that CD4⁺ T-cell counts did not significantly decrease in any of the treatment groups, as shown in Figure 5. The apparent decline observed concerns the CD4/CD8 ratio, which transiently dropped, but not the absolute number of CD4⁺ T cells. Moreover, although the dynamics of total HIV DNA is indeed similar to that of CD4/CD8 ratio (both declined transiently and then returned to baseline by day 84), the dynamics of unspliced RNA and unspliced RNA/total DNA ratio are clearly different, as these markers demonstrated a sustained decrease that was maintained throughout the trial period, even when the CD4/CD8 ratio already returned to baseline. Also, we observed a significant decrease in intact HIV DNA at day 84 compared to day 0. These effects cannot be easily explained by a transient decline in CD4+ cells.

      (3) The investigators describe a decrease in intact proviral DNA after 84 days of ART intensification in circulating cells (Figure 2D), but no changes to total proviral DNA in blood or tissue (Figures 2A and 2E; IPDA does not appear to have been done on tissue samples). It is not clear why ART intensification would result in a selective decrease in intact proviruses and not in total proviruses if the source of these reservoir cells is due to ongoing replication. These reservoir results have multiple interpretations, including (but not limited to) the investigators' contention that this provides strong evidence of ongoing replication. However, ongoing replication results in the production of both intact and mutated/defective proviruses that both contribute to reservoir size (with defective proviruses vastly outnumbering intact proviruses). The small sample size and well-described heterogeneity of the HIV reservoir (with regard to overall size and composition) raise the possibility that the study was underpowered to detect differences over the 84-day intervention period. No power calculations or prior studies were described to justify the trial size or the duration of the intervention. Readers would benefit from a more nuanced discussion of reservoir changes observed here.

      We sincerely thank the reviewer for this insightful comment. We fully agree that the reservoir dynamics observed in our study might raise several possible interpretations, and that its complexity, resulting from continuous cycles of expansion and contraction, reflects the heterogeneity of the latent reservoir. 

      Total HIV DNA in PBMCs showed a transient decline during intensification (notably at day 28), ultimately returning to baseline by day 84. This biphasic pattern likely reflects the combined effects of suppression of ongoing low-level replication by an increased DTG dosage, followed by the expansion of infected cell clones (mostly harbouring defective proviruses). In other words, the transient decrease in total (intact + defective) DNA at day 28 may be due to an initial decrease in newly infected cells upon ART intensification; at the subsequent time points, however, this effect was masked by proliferation (clonal expansion) of infected cells with defective proviruses. Recent studies suggest that intact and defective proviruses are subjected to different selection pressures by the immune system on ART (PMID: 38337034) and that their decay on therapy differs (intact proviruses are cleared much more rapidly than defective ones). In addition, defective proviruses can be preferentially expanded as they can reprogram the host cell proliferation machinery (https://doi.org/10.1101/2025.09.22.676989). This explains why, in our study, the intact proviruses decreased but the total proviruses did not change between days 0 and 84 in the intensification group. Interestingly, in the control group, we observed a significant increase in total DNA at day 84 compared to day 0, with no difference for the intact DNA, which is also in line with the clonal expansion of defective proviruses.

      Importantly, we observed a significant decrease in intact proviral DNA between day 0 and day 84 in the intensification group (Figure 2D). This result directly addresses the study’s primary objective: assessing the impact of intensification on the replication-competent reservoir. In comparison, as the reviewer rightly points out, total HIV DNA includes over 90% defective genomes, which limits its interpretability as a biomarker of biologically relevant reservoir changes. In addition, other reservoir markers, such as cell-associated unspliced RNA and RNA/DNA ratios, also showed consistent trends supporting a biologically relevant effect of intensification. Even in the absence of sustained changes in total HIV DNA, the coherence across the different independent measures of the reservoir (intact DNA, unspliced RNA), suggests an effect indicative of ongoing replication pre-intensification.

      Regarding tissue reservoirs, the lack of substantial change in total HIV DNA between days 0 and 84 is also in line with the predominance of defective sequences in these compartments. Moreover, the limited increase in rectal tissue dolutegravir levels during intensification (from 16.7% to 20% of plasma concentrations) may have limited the efficacy of the intervention in this site.

      As for the IPDA on rectal biopsies, we attempted the assay using two independent DNA extraction methods (Promega Reliaprep and Qiagen Puregene), but both yielded high DNA shearing index values, and intact proviral detection was successful in only 3 of 40 samples. Given the poor DNA integrity, these results were not interpretable.

      That said, we fully acknowledge the limitations of our study, especially the small sample size, and we agree with the reviewer that caution is needed when interpreting these findings. In the revised manuscript, we adopted a more measured tone in the discussion (lines 340-346), stating that these observations are exploratory and hypothesis-generating, and require confirmation in larger, better-powered studies. Nonetheless, we believe that the convergence of multiple reservoir markers pointing in the same direction constitutes a meaningful biological effect that deserves further investigation.

      (4) While a few statistically significant changes occurred in immune activation markers, it is not clear that these are biologically significant. Lines 175-186 and Figure 3: The change in CD4 cells + for TIGIT looks as though it declined by only 1-2%, and at day 84, the confidence interval appears to widen significantly at this timepoint, spanning an interquartile range of 4%. The only other immune activation/exhaustion marker change that reached statistical significance appears to be CD8 cells + for CD38 and HLA-DR, however, the decline appears to be a fraction of a percent, with the control group trending in the same direction. Despite marginal statistical significance, it is not clear there is any biological significance to these findings; Figure S6 supports the contention that there is no significant change in these parameters over time or between groups. With most markers showing no change and these two showing very small changes (and the latter moving in the same direction as the control group), these results do not justify the statement that intensifying DTG decreases immune activation and exhaustion (lines 38-40 in the abstract and elsewhere).

      We agree with the reviewer that the observed changes in immune activation and exhaustion markers were modest. We revised the abstract and the manuscript text (including a section header) to reflect this more accurately (lines 39, 175, 185, 253). We noted that these differences, while statistically significant (e.g., in TIGIT+ CD4+ T cells and CD38+HLA-DR+ CD8+ T cells), were limited in magnitude. We explicitly acknowledged these limitations and interpreted the findings with appropriate caution.

      (5) There are several limitations of the study design that deserve consideration beyond those discussed at line 327. The study was open-label and not placebo-controlled, which may have led to some medication adherence changes that confound results (authors describe one observation that may be evidence of this; lines 146-148). A randomized/blinded/cross-over design would be more robust and help determine signal from noise, given relatively small changes observed in the intervention arm. There does not seem to be a measurement of key outcome variables after treatment intensification ceased - evidence of an effect on replication through ART intensification would be enhanced by observing changes once intensification was stopped. Why was intensification maintained for 84 days? More information about the study duration would be helpful. Table 1 indicates that participants were 95% male. Sex is known to be a biological variable, particularly with regard to HIV reservoir size and chronic immune activation in PWH. Worldwide, 50% of PWH are women. Research into improving management/understanding of disease should reflect this, and equal participation should be sought in trials. Table 1 shows differing baseline reservoir sizes between the control and intervention groups. This may have important implications, particularly for outcomes where reservoir size is used as the denominator.

      We expanded the limitations section to address several key aspects raised by the reviewer: the absence of blinding and placebo control, the predominantly male study population, and the lack of post-intervention follow-up. While we acknowledge that open-label designs can introduce behavioural biases, including potential changes in adherence, we now explicitly state that placebo-controlled, blinded trials would provide a more robust assessment and are warranted in future research (lines 340-346).

      The 84-day duration of intensification was chosen based on previous studies and provided sufficient time for observing potential changes in viral transcription and reservoir dynamics. However, we agree that including post-intervention follow-up would have strengthened the conclusions, and we highlighted this limitation and future direction in the revised manuscript (lines 340-346). 

      The sex imbalance is now clearly acknowledged as a limitation in the revised manuscript, and we fully support ongoing efforts to promote equitable recruitment in HIV research. We would like to add that, in our study, rectal biopsies were coupled with anal cancer screening through HPV testing. This screening is specifically recommended for younger men who have sex with men (MSM), as outlined in the current EACS guidelines (see: https://eacs.sanfordguide.com/eacs part2/cancer/cancerscreening-methods). As a result, MSM participants had both a clinical incentive and medical interest to undergo this procedure, which likely contributed to the higher proportion of male participants in the study.

      Lastly, although baseline total HIV DNA was higher in the intensified group, our statistical approach is based on a within-subject (repeated-measures) design, in which the main outcome was the longitudinal change of a parameter within the same participant during the study. In other words, we are not comparing absolute values of any marker between the groups; we are looking at changes of parameters from baseline within participants, and these are not expected to be affected by baseline imbalances.

      (6) Figure 1: the increase in DTG levels is interesting - it is not uniform across participants. Several participants had lower levels of DTG at the end of the intervention. Though unlikely to be statistically significant, it would be interesting to evaluate if there is a correlation between change in DTG concentrations and virologic / reservoir / inflammatory parameters. A positive relationship between increasing DTG concentration and decreased cell-associated RNA, for example, would help support the hypothesis that ongoing replication is occurring.

      We agree with the reviewer that assessing correlations between DTG concentrations and virological, immunological, or inflammatory markers would be highly informative. In fact, we initially explored this question in a preliminary way by examining whether individuals who showed a marked increase in DTG levels after intensification also demonstrated stronger changes in the viral reservoir. While this exploratory analysis did not reveal any clear associations, we would like to emphasize that correlating biological effects with DTG concentrations measured at a single timepoint may have limited interpretability. A more comprehensive understanding of the relationship between drug exposure and reservoir dynamics would ideally require multiple pharmacokinetic measurements over time, including pre-intensification baselines. This is particularly important given that DTG concentrations vary across individuals and over time, depending on adherence, metabolism, and other individual factors.

      (7) Figure 2: IPDA in tissue- was this done? scRNA in blood (single copy assay) - would this be expected to correlate with usCaRNA? The most unambiguous result is the decrease in cell-associated RNA - accompanying results using single-copy assay in plasma would be helpful to bolster this result.

      As mentioned in our response to point 3, we attempted IPDA on tissue samples, but technical limitations prevented reliable detection of intact proviruses. Regarding residual viremia, we did perform ultra-sensitive plasma HIV RNA quantification, but a technical issue (an inadvertent PBMC contamination during plasma separation) affected the reliability of the results, and we therefore felt uncomfortable including these data in the manuscript.

      The use of the US RNA / Total DNA ratio is not helpful/difficult to interpret since the control and intervention arms were unmatched for total DNA reservoir size at study entry.

      We respectfully disagree with this comment. The US RNA/total DNA ratio is commonly used to assess the relative transcriptional activity of the viral reservoir, rather than its absolute size. While we acknowledge that the total HIV-1 DNA levels differed at baseline between the two groups, the US RNA/total DNA ratio specifically reflects the relationship between transcriptional activity and reservoir size within each individual, and is therefore not directly confounded by baseline differences in total DNA alone.

      Moreover, our analyses focus on within-subject longitudinal changes from baseline, not on direct between-group comparisons of absolute marker values. As such, the observed changes in the US RNA/total DNA ratio over time are interpreted relative to each participant's baseline, mitigating concerns related to baseline imbalances between groups.

      Reviewer #2 (Public review):

      Summary:

      An intensification study with a double dose of a second-generation integrase inhibitor, on a background of nucleoside analog inhibitors of the HIV reverse transcriptase, in 20 individuals randomized with controls, with an impact on the levels of viral reservoirs and inflammation markers. Viral reservoirs in HIV are the main impediment to an HIV cure, and inflammation is associated with co-morbidities.

      Strengths:

      The intervention that leads to a decrease of viral reservoirs and inflammation is quite straightforward, as a doubling of the INSTI is used in some individuals with INSTI resistance, with good tolerability.

      This is a very well documented study, both in blood and tissues, which is a great achievement due to the difficulty of body sampling in well-controlled individuals on antiretroviral therapy. The laboratory assays are performed by specialists in the field with state-of-the-art quantification assays. Both the introduction and the discussion are remarkably well presented and documented.

      The findings also have a potential impact on the management of chronic HIV infection.

      Weaknesses:

      I do not think that the size of the study can be considered a weakness, nor the fact that it is open-label either.

      We thank Reviewer #2 for their constructive and supportive comments. We appreciate their positive assessment of the study design, the translational relevance of the intervention, and the technical quality of the assays. We also take note of their perspective regarding sample size and study design, which supports our positioning of this trial as an exploratory, hypothesis-generating phase 2 study.

      Reviewer #3 (Public review):

      The introduction does a very good job of discussing the issue around whether there is ongoing replication in people with HIV on antiretroviral therapy. Sporadic, non-sustained replication likely occurs in many PWH on ART, related to adherence, drug-drug interactions, and possibly penetration of antivirals into sanctuary areas of replication. As the authors point out, proving it does not occur is likely not possible, and proving it does occur is likely very dependent on the population studied and the design of the intervention. Whether the consequences of this replication, in the absence of evolution toward resistance, have clinical significance is a challenging question to address.

      It is important to note that INSTI-based therapy may have a different impact on HIV replication events, resulting in differences in virus release for specific cell types (those responsible for "second phase" decay), by blocking integration in cells that have completed reverse transcription prior to ART initiation but have yet to be fully activated. In a PI or NNRTI-based regimen, those cells will release virus, whereas with an INSTI-based regimen, they will not.

      Given the very small sample size, there is a substantial risk of imbalance between the groups in important baseline measures. Unfortunately, with the small sample size, a non-significant P value is not helpful when comparing baseline measures between groups. One suggestion would be to provide the full range as opposed to the inter-quartile range (essentially only 5 or 6 values). The authors could also report the proportion of participants with baseline HIV RNA target not detected in the two groups.

      We thank Reviewer #3 for their thoughtful and balanced review. We are grateful for the recognition of the strength of the Introduction, the complexity of evaluating residual replication, and the technical execution of the assays. We also appreciate the insightful suggestions for improving the clarity and transparency of our results and discussion.

      We revised the manuscript to address several of the reviewer’s key concerns. We agree that the small sample size increases the risk of baseline imbalances. We acknowledged these limitations in the manuscript (lines 327-330). For transparency, we now provide both the full range and the IQR for all parameters in Table 1. However, we would like to stress that our statistical approach is based on a within-subject (repeated-measures) design, in which the main outcome was the longitudinal change of a parameter within the same participant during the study. In other words, we are not comparing absolute values of any marker between the groups; we are looking at changes from baseline within participants, and these are not expected to be affected by baseline imbalances.

      A suggestion that there is a critical imbalance between groups is that the control group has significantly lower total HIV DNA in PBMC, despite the small sample size. The control group also has numerically longer time of continuous suppression, lower unspliced RNA, and lower intact proviral DNA. These differences may have biased the ability to see changes in DNA and US RNA in the control group.

      We acknowledge the significant baseline difference in total HIV DNA between groups, which we have clearly reported. However, the other variables mentioned, such as duration of continuous viral suppression, unspliced RNA levels, and intact proviral DNA, did not differ significantly between groups at baseline, despite differences in the median values (that are always present). These numerical differences do not necessarily indicate a critical imbalance.

      Notably, there was no significant difference in the change in US RNA/DNA between groups (Figure 2C).

      The nonsignificant difference in the change in US RNA/total DNA between groups is not unexpected, given the significant between-group differences for both US RNA and total DNA changes. Since the ratio combines both markers, it is likely to show attenuated between-group differences compared to the individual components. However, while the difference did not reach statistical significance (p = 0.09), we still observed a trend towards a greater reduction in the US RNA/total DNA ratio in the intervention group.
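      The attenuation argument above follows from simple arithmetic: the fold reduction in a ratio equals the fold reduction of its numerator divided by that of its denominator. The sketch below illustrates this with hypothetical values that are not taken from the study; the function name and its parameters are introduced here only for illustration.

```python
def ratio_fold_reduction(rna_fold: float, dna_fold: float) -> float:
    """Fold reduction in US RNA/total DNA implied by the fold reductions of its parts.

    Each fold reduction is defined as baseline / follow-up, so
    (RNA_0 / DNA_0) / (RNA_t / DNA_t) == (RNA_0 / RNA_t) / (DNA_0 / DNA_t).
    """
    return rna_fold / dna_fold


# Hypothetical participant: US RNA falls 5-fold while total DNA falls 1.25-fold.
# The ratio then falls less than US RNA alone does.
example = ratio_fold_reduction(rna_fold=5.0, dna_fold=1.25)
print(example)
```

      Whenever both components decline, the ratio moves less than US RNA alone, so a between-group difference in the ratio is expected to be smaller than the difference in US RNA.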

      The fact that the median relative change appears very similar in Figure 2C, yet there is a substantial difference in P values, is also a comment on the limits of the current sample size. 

      Although we surely agree that in general, the limited sample size impacts statistical power, we would like to point out that in Figure 2C, while the medians may appear similar, the ranges do differ between groups. At days 56 and 84, the median fold changes from baseline are indeed close but the full interquartile range in the DTG group stays below 1, while in the control group, the interquartile range is wider and covers approximately equal distance above and below 1. This explains the difference in p values between the groups.
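      As a minimal sketch of this point, the example below applies an exact two-sided Wilcoxon signed-rank test to two sets of fold changes with the same median. It assumes a paired signed-rank style test, which this response does not confirm as the method used, and all values are hypothetical rather than study data. The function `signed_rank_p` is written for illustration and does not handle tied ranks.

```python
from itertools import product
from math import log2
from statistics import median


def signed_rank_p(fold_changes):
    """Exact two-sided Wilcoxon signed-rank p-value for fold changes versus 1."""
    logs = [log2(x) for x in fold_changes if x != 1]  # drop zero differences
    n = len(logs)
    order = sorted(range(n), key=lambda i: abs(logs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, v in zip(ranks, logs) if v > 0)
    observed = min(w_plus, total - w_plus)
    # Enumerate every possible assignment of signs under the null hypothesis.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical fold changes from baseline; both groups have a median of 0.6.
intensified = [0.45, 0.5, 0.55, 0.6, 0.62, 0.7, 0.8]  # all values below 1
control = [0.2, 0.4, 0.55, 0.6, 1.5, 2.2, 3.0]        # values spread around 1

p_intensified = signed_rank_p(intensified)
p_control = signed_rank_p(control)
print(median(intensified), median(control))
print(p_intensified, p_control)
```

      With identical medians, the group whose values all lie on one side of 1 gives a small p value, while the group whose values straddle 1 does not.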

      The text should report the median change in US RNA and US RNA/DNA when describing Figures 2A-2C.

      These data are already reported in the Results section (lines 164–166): "By day 84, US RNA and US RNA/total DNA ratio had decreased from day 0 by medians (IQRs) of 5.1 (3.3–6.4) and 4.6 (3.1–5.3) fold, respectively (p = 0.016 for both markers)."

      This statistical comparison of changes in IPDA results between groups should be reported. The presentation of the absolute values of all the comparisons in the supplemental figures is a strength of the manuscript.

      In the assessment of ART intensification on immune activation and exhaustion, the fact that none of the comparisons between randomized groups were significant should be noted and discussed.

      We would like to point out that a statistically significant difference between the randomized groups was observed for the frequency of CD4⁺ T cells expressing TIGIT, as shown in Figure 3A and reported in the Results section (p = 0.048).

      The changes in CD4:CD8 ratio and sCD14 levels appear counterintuitive to the hypothesis and are commented on in the discussion.

      Overall, the discussion highlights the significant changes in the intensified group, which are suggestive. There is limited discussion of the comparisons between groups where the results are less convincing.

      We observed statistically significant differences between the randomized groups for total DNA (p<0.001) and US RNA (p=0.01), as well as for the frequency of CD4⁺ T cells expressing TIGIT (p=0.048). We would like to stress that US RNA is a key marker of residual replication as it is very sensitive to de novo infection events. As discussed in the manuscript (lines 291-294), a newly infected CD4+ T lymphocyte can contain hundreds to thousands of US HIV RNA copies at the peak of infection. Therefore, a change in the US RNA level upon ART intensification is a very sensitive indicator of new infections. The fact that for US RNA we observed both a significant reduction in the intensified group and a significant difference between the groups is a strong indicator that some new infections had been occurring prior to intensification.

      The limitations of the study should be more clearly discussed. The small sample size raises the possibility of imbalance at baseline. The supplemental figures (S3-S5) are helpful in showing the differences between groups at baseline, and the variability of measurements is more apparent. The lack of blinding is also a weakness, though the PK assessments do help (note 3TC levels rise substantially in both groups for most of the time on study (Figure S2)).

      The many assays and comparisons are listed as a strength. The many comparisons raise the possibility of finding significance by chance. In addition, if there is an imbalance at baseline, outcomes measuring related parameters will move in the same direction.

      We agree that the multiple comparisons raise the possibility of chance findings but would like to stress that in an exploratory study like this it is very important to avoid a type II error. In addition, the consistent directionality of the most relevant outcomes (US RNA and intact DNA) lends biological plausibility to the observed effects.

      The limited impact on activation and inflammation should be addressed in the discussion, as they are highlighted as a potentially important consequence of intermittent, not sustained replication in the introduction.

      The study is provocative and well executed, with the limitations listed above. Pharmacokinetic analyses help mitigate the lack of blinding. The major impact of this work is if it leads to a much larger randomized, controlled, blinded study of a longer duration, as the authors point out.

      Finally, we fully endorse the reviewer’s suggestion that the primary contribution of this study lies in its value as a proof-of-concept and foundation for future randomized, blinded trials of greater scale and duration. We highlighted this more clearly in the revised Discussion (lines 340-346).

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 84-87: How would chronic immune activation/inflammation be expected to differ if viral antigen is being released from stable reservoirs rather than low-level replication?

      This is a very insightful question. Although release of viral antigens from stable reservoirs could certainly also trigger immune activation/inflammation, the reservoir cells in PWH on long-term ART are constantly being negatively selected by the immune system (PMID: 38337034; PMID: 36596305) so that after a number of years on therapy, most proviruses are either transcriptionally silent or express only a low amount of viral RNA/antigen. Recent evidence suggests that these selected cells possess specific biological properties that include mechanisms that limit proviral gene expression (PMID: 36599977; PMID: 36599978). In comparison, low-level replication would result in de novo infection of unselected, activated CD4+ cells that are expected to produce much more viral antigen than preselected reservoir cells.

      (2) Lines 249-253: There are multiple ways to explain this observation - alternatively, the total proviral DNA declined due to transient CD4 depletion.

      As discussed above, CD4⁺ T-cell counts did not significantly decrease in any of the treatment groups, as shown in Figure 5. The apparent decline observed concerns the CD4/CD8 ratio, which transiently dropped, but not the absolute number of CD4⁺ T cells. Moreover, although the dynamics of total HIV DNA is indeed similar to that of CD4/CD8 ratio (both declined transiently and then returned to baseline by day 84), the dynamics of unspliced RNA and unspliced RNA/total DNA ratio is clearly different, as these markers demonstrated a sustained decrease that was maintained throughout the trial period. Also, we observed a significant decrease in intact HIV DNA at day 84 compared to day 0. These effects cannot be easily explained by a transient decline in CD4+ cells.

      (3) Lines 301-305: This is a confusing explanation for not seeing an effect in tissue. Overall, there was no change in total proviral DNA in blood between days 0 and 84 either - yet the explanation for this observation is different (249-253). Was IPDA not performed on the tissue? Wouldn't this be the preferred test for reservoir depletion?

      We thank the reviewer for bringing this point to our attention. We modified the Discussion to prevent confusion (lines 303-305). As for the IPDA on tissue, we attempted this assay on the tissue samples using two independent DNA extraction methods (Promega Reliaprep and Qiagen Puregene), but both yielded high DNA shearing index values, and intact proviral detection was successful in only 3 of 40 samples. Given the poor DNA integrity, these results were not interpretable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weaknesses:

      Only 1 gene (katG) gave a strong and 1 (Mab_1456c) exhibited a minor defect. Two of the clones did not show any persistence phenotype (blaR and recR) and one (pafA) showed a minor phenotype.

      We have now carried out more detailed validation studies on the Tn-Seq, with analysis of time-dependent killing over 14 d. This more comprehensive analysis shows that 4 of 5 genes analyzed do indeed have antibiotic tolerance defects under the conditions that Tn-Seq predicted a survival defect (Revised Figure 3). In addition, we found that even before actual cell death, several mutants had delayed resumption of growth after antibiotic removal (Figure 3 Supplemental).

      Fig 3 - Why is there such a huge difference in the extent of killing of the control strain in media, when exposed to TIG/LZD, when compared to Fig. 1C and Fig. 4. In Fig. 1C, M. abs grown in media decreases by >1 log by Day 3 and >4 log by Day 6, whereas in Fig. 3, the bacterial load decreases by <1 log by Day 3 and <2 log by Day 6. This needs to be clarified, if the experimental conditions were different, because if comparing to Fig. 1C data then the katG mutant strain phenotype is not very different.

      We agree with the reviewer that there is variability in the timing and extent of cell death from experiment to experiment. As noted by the reviewer, in Figure 1C the largest decrement in survival is between day 1 - day 3 (also seen in Figure 6A). As they noted in Figure 4 the largest decrement is between day 3 – day 6 (also seen in Figure 3A, Figure 5F). In each experiment with katG mutants we carefully compare the mutant vs. the control strain within that experiment, which is more accurate than comparing the behavior of mutant in one experiment to a control in another experiment.

      Reviewer #2 (Public review):

      Weaknesses:

      First, word-choice decisions could better conform to the published literature. Alternatively, novel definitions could be included. In particular, the data support the concept of phenotypic tolerance, not persistence.

      We appreciate the reviewer’s comments; text modified.

      Second, two of the novel observations could be explored more extensively to provide mechanistic explanations for the phenomena.

      We have added several additional experiments; these are detailed below in response to specific comments.

      Reviewer #3 (Public review):

      Weaknesses:

      The findings could not be validated in clinical strains.

      We understand the reviewer’s concern that the katG phenotype was only observed in one of the two clinical strains we studied. We feel that our findings are relevant beyond the ATCC 19977 strain for two reasons:

      (1) We have performed additional analyses of the two clinical isolates and indeed find significant accumulation of ROS following antibiotic exposure in both of these strains (revised Figure 6A).

      (2) We do in fact see a role for katG in starvation-induced antibiotic tolerance in Mabs clinical strain-2. It is not surprising that different strains from a particular species may have some different responses to stresses – for example, there is wide strain-specific variability in susceptibility to different phages within a species based on which particular phage defense modules a given strain carries (for example PMID: 37160116). We speculate that different Mabs strains may express varying levels of other antioxidant factors and note that the genes encoding several such factors were identified by our Tn-Seq screen including the peroxidases ahpC, ahpD, and ahpE. Our analysis of the genetic interactions between katG and these other factors is ongoing. 

      Comments/Suggestions

      (1) In Fig1E, the authors show no difference in killing Mtb with or without adaptation in PBS. These data are contrary to the data presented in Figure 1B. These also do not align with the data of M. smegmatis and M. abscessus. Please discuss these observations in light of the Duncan model of persistence (Mol Microbiol. 2002 Feb;43(3):717-31.).

      The above referenced Duncan laboratory study found tolerance after prolonged starvation but did not actually examine tolerance at early time points. While some of the transcriptional and metabolic changes seen by Duncan and others are slow, other groups have described starvation responses in Mtb that are quite rapid. For example, the stringent response mediator ppGpp accumulates within a few hours after onset of starvation in Mtb (PMID: 30906866). We suspect that a rapid signaling response such as this underlies the phenotype we observe. Regarding the difference between Mtb and other mycobacterial species we also find it surprising that Mtb had a much more rapid starvation response. This is a clear species-specific difference that may reflect an adaptation of Mtb to the nutrient-limited physiologic niche within host macrophages.

      (2) Line 151, the authors state that they have used an M. abscessus Tn mutant library of ~ 55,000 mutant strains. The manuscript will benefit from a description of the proportion of total TA sites covered by the mutants.

      Text modified to add this detail. There are 91,559 TA sites in the M. abscessus genome. Thus, our Tn density is ~60%.

      (3) Line 155: Please explain how long the cells were kept in an antibiotic medium.
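      The density figure can be reproduced from the two numbers quoted above. The short sketch below assumes that each of the ~55,000 mutants carries an insertion at a distinct TA site, so it gives an upper bound; the variable names are introduced here for illustration, and the actual coverage would be determined from the sequencing data.

```python
TA_SITES = 91_559  # TA dinucleotide sites in the genome, as stated in the response
MUTANTS = 55_000   # approximate library size, as stated by the reviewer

# Assumption: every mutant hits a different TA site (upper bound on coverage).
max_density = MUTANTS / TA_SITES
print(f"Upper-bound TA-site coverage: {max_density:.1%}")
```

      Any duplicate insertions at the same TA site would lower the true coverage below this bound.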

      This technical detail was noted above on line 153 in the original text: “…and then exposed them to TIG/LZD for 6 days”. To clarify the overall conditions, we have also revised the text of the manuscript and added the detail of how long cells were passaged after removal of antibiotics.

      (4) Line 201: data not shown. Delayed resumption of growth after removal of antibiotic would be helpful in indicating drug resilience. This data could enhance the manuscript.

      Data now provided in Figure 3 Supplemental

      (5) Figures 4C and 4F represent the kill curve. It will be good to show the data with CFU against the drug concentration in place of OD600. CFU rather than OD600 best reflects growth inhibition.

      Figures 4C and 4F are measuring the minimum inhibitory concentration (MIC) to stop the overall growth of the bacterial population. While we agree that CFU could be analyzed, this would be measuring a different outcome – cell death and the minimum bactericidal concentration (MBC). In these experiments we sought to specifically examine the MIC so as to separate growth inhibition from cell death. For this we used the standard method employed by clinical microbiology laboratories for MIC, which is optical density of the culture (PMID: 10325306).

      (6) Figure 5C. The authors should show the effect of TIG/LZD on M. abscessus ROS production without the PBS adaptation. It is important to conclude that TIG/LZD induces ROS in cells. Authors should utilize ROS scavengers such as Thiourea, DFO, etc., to conclude ROS's contribution to bacterial killing following inhibition of transcription and translation.

      New data added (revised Figure 5 and Figure 5 Supplemental)  

      (7) Line 303. Remove "note".

      Text revised. We thank the reviewer for identifying this typographical error.  

      (8) The introduction and Discussion are very similar, and several lines are repeated.

      Text revised with overlapping content removed.

      Reviewer #1 (Recommendations for the authors):

      It appears that the same datasets for PBS adapted cultures were plotted in A-C and D-F. Either this should be specifically mentioned in the legend or it might be better to integrate the non-adapted plots into A-C which would also allow easier comparison.

      Appreciate the reviewer’s suggestion; text modified with added clarification to figure legend.

      This manuscript is focused on M. abs and the antibiotics TIG/LZD, so the Mtb data, and the data using the antibiotics INH/RIF/EMB, serve more as a distraction and can be removed.

      We appreciate the reviewer’s perspective. However, we wish to include these data to show the similarities (and differences) in starvation-induced tolerance between the three organisms.

      Fig 3 -As mentioned for Fig. 1, it appears that the same dataset was used for the control in all the figures A-E. This should be explicitly stated in the Figure legend.

      Appreciate the reviewer’s suggestion; text modified with added clarification to figure legend.

      The divergent results from the clinical strains are extremely interesting. It would be helpful to determine the oxidative stress levels (similar to the cellROX data shown in 5E), to tease out if the difference in katG role is because of lack of ROS induction in these strains or due to expression of alternate anti-oxidative stress defense mechanisms.

      We have performed additional cellROX analysis as suggested by the reviewer and found that ROS induction is indeed present across all three Mabs strains, but that katG is only required in one of the two clinical strains (Strain #2). These data are now included in the revised Figure 6.

      Reviewer #2 (Recommendations for the authors):

      GENERAL COMMENTS

      This is a nice piece of work that uses the pathogen Mabs as a test subject.

      The work has findings that likely apply generally to antibiotics and mycobacteria: 1) phenotypic tolerance is associated with suppression of ROS, 2) lethal protein synthesis inhibitors act via accumulation of ROS, and 3) levofloxacin behaves in an unexpected way. Each is a new observation. However, I believe that each topic requires more work to be firmly established to be suitable for eLife.

      Phenotypic tolerance: Association with suppression of ROS is important but expected. I would solidify the conclusion by performing several additional experiments. For example, confirm the lethal effect of ROS by reducing it with an iron chelator and a radical scavenger. There is a large literature on effects of iron uptake, levels, etc. on antibiotic lethality that could be applied to this question. In 2013 Imlay argued against the validity of fluorescent probes. Perhaps getting the same results with another probe would strengthen the conclusion.

      We have carried out additional experiments with both an iron chelator and small-molecule ROS scavengers to further test this idea, but note that these experiments have several inherent limitations: 1) These compounds have highly pleiotropic effects. For example, while N-acetyl cysteine (NAC) is an antioxidant, it also increases mycobacterial respiration and was shown to paradoxically decrease antibiotic tolerance in M. tuberculosis (PMID: 28396391). 2) It has been shown by the Imlay group that small-molecule antioxidants are often ineffective in quenching ROS in bacteria (PMID: 388893820), making negative results difficult to interpret. Nonetheless, we present new experimental data showing that iron chelation does indeed improve the survival of antibiotic-treated Mabs (revised Figure 5). However, small-molecule antioxidants such as thiourea do not restore antibiotic tolerance and actually increased bacterial cell death, suggesting that they may be affecting respiration in Mabs in a manner similar to that seen for NAC in Mtb. We also note that our genetic analysis, which identified numerous other genes encoding proteins with antioxidant function (Figure 2), is a strong additional argument in support of the importance of ROS in antibiotic-mediated lethality.

      Regarding the concern raised by Imlay about the validity of oxidation-sensitive dyes: this relates to concern about bacterial autofluorescence induced by antibiotics, which can confound analyses in some species. We have ruled this out in our analyses by using bacteria unstained by cellROX as controls to confirm that there is negligible autofluorescence in Mabs (<0.1%, Figure 5E, Figure 6A).

      Protein synthesis inhibitors: At present, this is simply an observation. More work is needed to suggest a mechanism. For example, with E. coli the aminoglycosides are protein synthesis inhibitors that also cause membrane damage. Membrane damage is known to stimulate ROS-mediated killing. Your observation needs to be extended because chloramphenicol, another protein synthesis inhibitor, blocks ROS production. The lethality may be a property of mycobacteria: does it occur with E. coli (note that rifampicin is bacteriostatic with E. coli but lethal to Mtb)?

      We agree with the reviewer that the mechanism underlying ROS accumulation following transcription or translational inhibition in Mabs is of significant interest. It is likely to be a mechanism different from E. coli, because in E. coli tetracyclines and rifamycins are both bacteriostatic, whereas in Mabs they are both bactericidal. Determining the mechanism by which translation inhibitors cause ROS accumulation in Mabs is an ongoing effort in our laboratory using proteomics and metabolomics, but is outside the scope of this manuscript.

      Levofloxacin: This is also at the observational stage but is unexpected. In other studies, ROS is involved in quinolone-mediated killing of bacteria. Why is this not the case with Mabs? The observation should be solidified by showing the contrast with moxifloxacin, since this compound has been studied with mycobacteria (Shee 2022 AAC). With E. coli, quinolone structure can affect the relative contribution of ROS to killing (Malik 2007 AAC), as is also seen with Mtb (Malik 2006 AAC). What is happening in the present work with levofloxacin, an important anti-tuberculosis drug? Is there a structure explanation (compare with ofloxacin)?

      While these are interesting questions, a detailed exploration of the structure-function relationships between different fluoroquinolone antibiotics and their varying activities on Mtb and Mabs is outside the scope of this manuscript.  

      The writing is generally easy to follow. However, the concept of persistence should be changed to phenotypic tolerance with text changes throughout. I base this suggestion on the definitions of tolerance and persistence as stated in the consensus review (Balaban 2019 Nat Micro Rev). Experimentally, tolerance is seen as a gradual decline in survival following antibiotic addition; the decline is slower than seen with wild-type cells. The data presented in this paper fit that definition. In contrast, persistence refers to a rapid drop in survival followed by a distinct plateau (Balaban 2019 Nat Micro Rev; for example, see Wu Lewis AAC 2012 ). Moreover, to claim persistence, it would be necessary to demonstrate subpopulation status, which is not done. The Balaban review is an attempt to bring order to the field with respect to persistence and tolerance, since the two are commonly used without regard for a consistent definition.

      We appreciate the reviewer’s suggestion; text modified in multiple places to clarify.

      Another issue requiring clarification is the relationship between resistance and tolerance. Killing by antibiotics is a two-step process, as most clearly seen with quinolones. First a reversible bacteriostatic event occurs. Resistance blocks that bacteriostatic damage. Then a lethal metabolic response to that damage occurs. Tolerance selectively blocks the second, killing event, a distinct process that often involves the accumulation of ROS. Direct antibiotic-mediated damage is an additional mode of killing that also stems from the reversible, bacteriostatic damage created by antibiotics. The authors recognize the distinction but could make it clearer. Take a look at Zheng (JJ Collins) 2020, 2022.

      Text modified to clarify this point

      Many readers would also like to see a bit more background on Mabs. For example, does it grow rapidly? Are there features that make it a good model for studying mycobacteria or bacteria in general? The more general, the better.

      Text modified, background added

      Below I have listed specific comments that I hope are useful in bringing the work to publication and making it highly cited.

      SPECIFIC COMMENTS

      Line 30 unexpectedly. I would delete this word because the result is expected from the ROS work of Shee et al 2022 with mycobacteria. Moreover, Zeng et al 2022 PNAS showed that ROS participates in antimicrobial tolerance, and persistence is a form of tolerance (Balaban et al, 2019, Nat Micro Rev).

      Text modified as per review suggestion

      Line 39 key goal: this is probably untrue in the general sense stated, since bacteriostatic antibiotics are sufficient to clear infection (Wald-Dickler 2019 Clin Infect Dis). However, it is likely to be the goal for Mtb infections.

      We agree with the reviewer that bacteriostatic antibiotics are effective in treating most types of infections and do not claim otherwise in the manuscript. However, from a clinical standpoint, eradication of the pathogen causing the infection is indeed the goal of antibiotic therapy in virtually all circumstances (with the exception of specific scenarios such as cystic fibrosis where it is recognized that the infecting organism cannot be fully eliminated). In most cases, the combination of bacteriostatic antibiotics and the host immune response is sufficient to achieve eradication. We have modified the manuscript text to reflect this nuance noted by the reviewer.

      Line 62 several: you list three, but hipAB works via ppGpp, so the sentence needs fixing

      Text modified  

      Line 70 uncertain: this uncertainty is unreferenced. Since everything is uncertain, this vague phrase does not add to the story.

      The reviewer makes an interesting philosophical argument. However, we would submit that some aspects of biology, for example the regulation of glycolysis, are understood in great detail. However, other mechanisms, such as the precise mechanisms of lethality for diverse antibiotics in different bacterial species, are far more uncertain and remain a subject of debate (for example PMID: 39910302). Text not modified.

      Line 72 somewhat controversial: I would delete this, because the points in the Science papers by Lewis and Imlay have been clarified and in some cases refuted by prior and subsequent work.

      Text modified

      Line 72 presumed: this suggests that it is wrong and perhaps a different idea has replaced it. Another, and more likely view is that there is an additional mode of killing. I suggest rephrasing to be more in line with the literature.

      Text modified for clarity. In this sentence “presume” refers to the historical concept that direct target inhibition was solely responsible for antibiotic lethality. As the reviewer notes, there is now significant literature that ROS (and perhaps other secondary effects) also contribute to bacterial killing.  

      Line 73 However and the following might also: this phrasing, plus the presumed, misleads the reader from your intent. I suggest rephrasing.

      See above re: line 72

      Line 75 citations: these are inappropriate and should be changed to fit the statement. I suggest the initial paper by Collins (Kohanski 2007 Cell), a recent paper by Zhao (Zeng PNAS 2022), and a review (Drlica Expert Rev Anti-infect Therapy 2021). The present citations are fine if you want to narrow the statement to mycobacteria, but the history is that the E. coli work came first and was then generalized to mycobacteria. A mycobacterial paper for ROS is Shee 2022 AAC.

      We thank the reviewer for noticing that we inadvertently omitted several important E. coli-related references. These have been added.

      Line 75 and 76: Conversely ... unresolved. Compelling arguments have been made that show major flaws in the two papers cited, and a large body of evidence has now accumulated showing the validity of the idea promoted by the Collins lab, beginning with Kohanski 2007. In addition to many papers by Collins, see Hong 2019 PNAS and Zeng 2022 PNAS. It is fine if you want to counter the arguments against the Lewis and Imlay papers (summarized in Drlica & Zhao 2021 Expert Rev Anti-infect Therapy), but making a blanket statement suggests that the authors are unfamiliar with the literature.

      We agree with the reviewer that the weight of the evidence supports a role for antibiotic-induced ROS as an important mechanism for antibiotic lethality under many (though not all) conditions. We have revised the text to better reflect this nuance.

      Line 78. Advantages over what?

      Text modified

      Line 80 exposure: to finish the logic you need to show that E. coli and S. aureus persisters fail to do this.

      We thank the reviewer for their suggestion but studying these other organisms is outside the scope of this study. 

      Line 82 whereas: this misdirects the reader. It would seem that a simple "and" is better

      Text modified

      Line 89 I think this paragraph is about the need to study Mabs, the subject of the present report. This paragraph could use a more appropriate topic sentence to guide the reader so that no guessing is involved. I suggest rephrasing this paragraph to make the case for studying more compelling.

      Text modified

      Line 96. I suggest citing several references after subinhibitory concentration of antibiotic.

      The references are in the following sentence alongside the key observations.

      Line 99. Genetic analysis: how does this phrase fit with the idea of persister cells arising stochastically?

      There are two issues: 1) We would argue that persister formation is not completely stochastic, but rather a probability that can be modified both genetically and by environment (for example hipA PMID: 6348026). 2) Even if persister formation were totally stochastic, the survival of these cells may depend on specific genes – as we indeed find in our Tn-Seq analysis of Mabs.  

      Line 106. In this paragraph you need to define persister. The consensus definition (Balaban 2019 Nat Micro Rev) is a subpopulation of tolerant cells. Tolerance is defined as the slowing or absence of killing while an antibiotic retains its ability to block growth. See Zeng 2022 PNAS for example with rapidly growing cells. Phenotypic tolerance is the absence of killing due to environmental perturbations, most notably nutrient starvation, dormancy, and growth to stationary phase. By extension, phenotypic persistence would be subpopulation status of phenotypically tolerant cells. If you have a different definition, it is important to state it and emphasize that you disagree with the consensus statement.

      Text modified  

      Line 109 unexpectedly. I would delete this word, because the literature leads the reader to expect this result unless you make a clear case for Mabs being fundamentally different from other bacteria with respect to how antibiotics kill bacteria (this is unlikely, see Shee 2022 AAC). Indeed, lines 111-113 state extensions of E. coli work, although suppression of ROS in phenotypic tolerance and genetic persistence have not been demonstrated.

      Text modified

      Line 124 you might add, in parentheses and with references, that a property of persisters is cross-persistence to multiple antibiotic classes. This is also true for tolerance, both genetic and phenotypic. An addition will support your approach.

      Text modified

      Line 128 minimal

      Text not modified. We appreciate the reviewer’s preference but “minimal” and “minimum” are both widely accepted terms. Indeed, the Balaban et al 2019 consensus statement on definitions cited by the reviewer above also uses “minimum” (PMID: 30980069), as do IDSA clinical guidelines (PMID: 39108079).

      Line 130 is MIC somehow connected to killing or did you also measure killing? Note that blocking growth and killing cells are mechanistically distinct phenomena, although they are related. By being upstream from killing, blockage of growth will also interfere with killing.

      Text modified

      Line 133 PBS is undefined

      Text modified

      Line 134 increase in persisters ... you need to establish that these are not phenotypically tolerant cells. Do they constitute the entire population (tolerance)? Your data would be more indicative of persisters if you saw a distinct plateau with the PBS samples, as such data are often used to document persistence (retardation of killing is a property of tolerance, Balaban 2019). Fig. 1B is clearly phenotypic tolerance, as the entire population grows. Your data suggest that you are not measuring persistence as defined in the literature (Balaban 2019). Line 139 persister should be tolerance

      Text modified

      Lines 142, 143, 144, 159, 163, 171, 181, 211, 226, 238, 246, 277, 279, 289: persistent should be tolerant

      Text modified

      Line 146 fig 1E Mtb does not show the adaptation phenomenon and it is clearly tolerant, not persistent. This should be pointed out. As stated, you may be misleading the reader.

      Text modified  

      Line 169. Please make it clear whether these genes are affecting antibiotic susceptibility (MIC will affect killing because blocking growth is upstream) or if you are dealing with tolerance (no change in MIC). These measurements are essential and should be included as a table. By antibiotic response, do you mean that antibiotics change expression levels?

      Regarding MICs, the data for MICs in control and katG mutant are presented in Figure 4C and 4F. Regarding ‘response’ we have clarified the text of this sentence.

      Line 174 Interestingly should be as expected

      Text not modified; tetracyclines do not induce ROS in E. coli and oxazolidinones have not been studied in this regard.

      Line 183 you need to include citations. You can cite the ability of chloramphenicol to block ROS-mediated killing of E. coli. That allows you to use the word unexpected

      Text modified

      Line 199. All of the data in Fig. 3 shows tolerance, not persistence, requiring word changes in this paragraph.

      Text modified

      Line 226. The MIC experiment is important. You can add that this result solidifies the idea that blocking growth and killing cells are distinct phenomena. You can cite Shee 2022 AAC for a mycobacterial paper

      Text modified

      Line 241. The result with levofloxacin is unexpected, because the fluoroquinolones are widely reported to induce ROS, even with mycobacteria (see Shee 2022 AAC). You need to point this out and perhaps redo the experiment to make sure it is correct.

      We appreciate the reviewer’s interest in this question. All experiments in this paper were repeated multiple times. This particular experiment was repeated 3 times and in all replicates the katG mutant was sensitized to translation inhibitors but not levofloxacin. Shee et al examined Mtb treated with moxifloxacin and found ROS generation, but did not assess whether a Mtb katG mutant had impaired survival. Thus, in addition to differences in: i) the species studied and ii) the particular fluoroquinolone used, the two sets of experiments were designed to address different questions (ROS accumulation vs protection by katG). A cell might accumulate ROS without a katG mutant having impaired survival if genetic redundancy exists, a result we indeed see in our clinical Mabs strains under some conditions (new data included in revised Figure 6A).

      Line 269 Additional controls would bolster the conclusion: use of an antioxidant such as thiourea and an iron chelator (dipyridyl) both should reduce ROS effects.

      New experiments performed, revised Figure 5.

      Line 276 the word no is singular

      Text modified

      Line 284 this suggested ... in fact previous work suggested. This summary paragraph might go better as the first paragraph of the Discussion

      Text modified to specify that this is in reference to the work in this manuscript

      Lines 294-299 Most of this is redundant and should be deleted.

      Text modified

      Line 299 this species is vague

      Text modified

      Line 310 Do you want to discuss spoT?

      Text not modified

      Line 313 paragraph is largely redundant

      Text modified

      Line 314 controversial. As above, I would delete this, especially since it is not referenced and is unlikely to be true. If you believe it, you have the obligation to show why the ROS-lethality idea is untrue. If you are referring to Lewis and Imlay, there were almost a dozen supporting papers before 2013 and many after. This statement does not make the present work more important, so deletion costs you nothing.

      Text modified

      Line 314 direct disruption of targets. This is clearly not a general principle, because the quinolones rapidly kill while inhibition of gyrase by temperature-sensitive mutations does not (Kreuzer 1979 J.Bact; Steck 1985). Indeed, formation of drug-gyrase-DNA complexes is reversible: death is not.

      Text modified

      Line 318 as pointed out above, you have not brought this story up to date. The two papers mainly focused on Kohanski 2007, ignoring other available evidence.

      Text modified

      Line 326 you need to cite Shee 2022 AAC

      Text modified

      Line 342 the idea of mutants being protective is not novel, as several have been reported with E. coli studies. Thus, there is a general principle involved.

      We agree that this suggests a potential general principle

      Line 344. It depends on the inhibitor. For example, aminoglycosides are translation inhibitors and they also cause the accumulation of ROS.

      We agree that ROS generation depends on the inhibitor, and indeed upon other variables including drug concentration, growth conditions, and bacterial species as well.  

      Line 347. You need to point out the considerable data showing that the absence of catalase increases killing

      Text modified

      Line 363 look at Shee 2022 AAC and Jacobs 2021 AAC

      Text modified, reference added.

      Line 585 I suggest having a colleague provide critical comments on the manuscript and acknowledge that person.

      Text not modified

    1. earlier

      One issue: Our onset detection method is based on statistical significance, i.e., the onset is the earliest time point of a significant increase in the cohort (versus unrelated) smooth. One of our reviewers (McMurray) thinks this is not appropriate, because this means that more noisy data and/or data based on smaller samples would lead to later onsets, thus reducing comparability between experiments.

      We think of the use of significance as a feature, not a bug: For one, it reduces researcher degrees of freedom because the criterion is automatically determined. Also, this criterion is very broadly applicable (even to other data types, models, tasks). Finally, we show in our simulation study that sample size and noise play little role in the coverage properties of our method (whereas they affect the bootstrap-based method of Stone et al. much more dramatically).

      Nevertheless, ... McMurray is still correct that our method conflates the two things, noise and early/late. In response, I have implemented an option in the package that allows you to specify a "magnitude threshold" for onset detection, which is not based on significance. It's called 'onset_criterion', and by default, it detects an onset when a magnitude of 0.075 logits is reached relative to the baseline (can be changed with 'onset_threshold').

      What does this mean for the RR? It seems to me that what is meant by "earlier" in your hypotheses is already connected to the influence of noise? i.e., data from lower-quality webcams can be much more noisy so it'll be harder to detect a significant difference in that condition. In other words, you need a larger effect in terms of proportions for it to be detected and this may only emerge later? If that is true, the default operation of the method (which uses significance) will indeed align well with your hypotheses.

      Still, this is something to keep in mind: (1) you might want to make the distinction between noise and early/late more clear in the RR hypothesis. And/or (2) you might want to preregister secondary analysis with a magnitude criterion rather than a significance-based one, in an attempt to separate noise from a magnitude-based increase in proportion of looks.
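      The difference between the two onset criteria discussed above can be illustrated with a minimal sketch. This does not reproduce the package or its smooths: the function names, the synthetic difference curve, the constant standard errors, the 1.96 multiplier, and the five-point baseline window are all assumptions made for illustration. Only the option names 'onset_criterion' and 'onset_threshold' and the 0.075 logit default come from the note above.

```python
import numpy as np


def onset_by_significance(times, diff, se, z=1.96):
    """Earliest time at which the lower confidence bound of the
    cohort-minus-unrelated difference (in logits) rises above zero."""
    lower = diff - z * se
    idx = np.flatnonzero(lower > 0)
    return times[idx[0]] if idx.size else None


def onset_by_magnitude(times, diff, threshold=0.075, n_baseline=5):
    """Earliest time at which the difference exceeds the baseline mean
    by at least `threshold` logits. The baseline window is hypothetical."""
    baseline = diff[:n_baseline].mean()
    idx = np.flatnonzero(diff - baseline >= threshold)
    return times[idx[0]] if idx.size else None


# Synthetic difference curve: a smooth rise centred at 400 ms.
times = np.arange(0, 1000, 50)
diff = 0.5 / (1.0 + np.exp(-(times - 400) / 50.0))

# Same underlying effect, two noise levels (assumed constant standard errors).
se_clean = np.full_like(diff, 0.02)
se_noisy = np.full_like(diff, 0.10)

print("significance, clean:", onset_by_significance(times, diff, se_clean))
print("significance, noisy:", onset_by_significance(times, diff, se_noisy))
print("magnitude:", onset_by_magnitude(times, diff))
```

      In this toy setting the significance-based onset moves later as the standard error grows even though the underlying curve is unchanged, whereas the magnitude-based onset depends only on the estimated curve. This is the conflation of noise with early/late that the reviewer raised, and it is why a magnitude criterion may be useful as a preregistered secondary analysis.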

    1. nism aVital S

      I also thought about Semiotics of the Kitchen by Martha Rosler so much throughout this essay. Bertillon and Galton's legacy casts a long shadow over us as prospective archivists, and we need to think very carefully about how we operate in the world as archivists, especially in the age of AI. AI feels like it can reinforce the existing social biases and power structures that Galton birthed, and this is already happening, which scares me, especially because I have to hold myself accountable for making sure I don't let AI control or dominate me as an archivist. The last bit on Ernest Cole resonated with me heavily. As archivists, we must think about histories and the people whose histories we may be painstakingly collecting, which are constantly under threat of being eradicated, erased, and violently displaced. How do we make truth available to people in a way that they are the ones who get to tell their story?

    1. If the agent selects Male, my breasts are large enough, statistically speaking, in comparison to the normative male body-shape construct in the database, to trigger an anomaly warning and a highlight around my chest area. If they select Female, my groin area deviates enough from the statistical female norm to trigger the risk alert. In other words, I can’t win. This sociotechnical system is sure to mark me as “risky,” and that will trigger an escalation to the next level in the TSA security protocol.

      I think this is an interesting example of how technology with limited options benefits those who fit within societal standards and binary categories. However, those who do not fit the norm may be harmed by technologies like these. Because the system only had male or female as the options, this limited the narrator as they would be flagged either way. I think this says a lot about the way our sociocultural beliefs and gender norms are embedded within the very technology we deploy around us.

    1. Author response:

      (1) General Statements

      Our manuscript studies mechanisms of planar polarity establishment in vivo in the Drosophila pupal wing. Specifically we seek to understand mechanisms of ‘cell-scale signalling’ that is responsible for segregating core pathway planar polarity proteins to opposite cell edges. This is an understudied question, in part because it is difficult to address experimentally.

      We use conditional and restrictive expression tools to spatiotemporally manipulate core protein activity, combined with quantitative measurement of core protein distribution, polarity and stability. Our results provide evidence for a robust cell-scale signal, while arguing against mechanisms that depend on depletion of a limited pool of a core protein or polarised transport of core proteins on microtubules. Furthermore, we show that polarity propagation across a tissue is hard, highlighting the strong intrinsic capacity of individual cells to establish and maintain planar polarity.

      The original manuscript received three fair and thorough peer-reviews, which raised many important points. In response, we decided to embark on a full revision that attempts to answer all of the points. We have included new data to support our conclusions in Supplemental Figures 1, 2 and 5.

      Additionally in response to the reviewers we have revised the manuscript title, which is now ‘Characterisation of cell-scale signalling by the core planar polarity pathway during Drosophila wing development’.

      (2) Point-by-point description of the revisions

      We thank all of the reviewers for their thorough and thoughtful review of our manuscript. They raise many helpful points which have been extremely useful in assisting us to revise the manuscript.

      In response we have carried out a major revision of the manuscript, making numerous changes and additions to the text and also adding new experimental data. Specific changes are listed after our detailed response to each comment.

      Reviewer #1:

      […] Major points:

      The exact meaning of cell-scale signaling is not defined, but I infer that the authors use this term to describe how what happens on one side of a cell affects another side. The remainder of my critique depends on this understanding of the intended meaning.

      As the reviewer points out, it is important that the meaning of the term ‘cell-scale signalling’ is clear to the reader and in response to their comment we have had another go at defining it explicitly in the Introduction to the manuscript.

      Specifically, we use the term ‘cell-scale signalling’ to describe possible intracellular mechanisms acting on core protein segregation to opposite cell membranes during core pathway dependent planar polarisation. For example, this could be a signal from distal complexes at one side of the cell leading to segregation of proximal complexes to the opposite cell edge, or vice versa. See also our response to Reviewer #2 regarding the distinction between ‘molecular-scale’ and ‘cell-scale’ signalling. 

      Changes to manuscript: Revised definition of ‘cell-scale signalling’ in Introduction.

      The authors state that any tissue wide directional information comes from pre-existing polarity and its modification by cell flow, such that the de novo signaling paradigm "bypasses" these events and should therefore not be responsive to any further global cues. It is my understanding that this is not a universally accepted model, and indeed, the authors' data seem to suggest otherwise. For example, the image in Fig 5B shows that de novo induction restores polarity orientation to a predominantly proximal to distal orientation. If no global cue is active, how is this orientation explained?

      We assume that the reviewer’s point is that it is not universally accepted that de novo induction after hinge contraction leads to uncoupling from global cues (rather than that it is not accepted that hinge contraction remodels radial polarity to a proximodistal pattern). We are (we believe) the only lab that has used de novo induction as a tool, and we’re not aware of any debate in the literature about whether this bypasses global cues. Nevertheless, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewer is proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation for this is that induction at these stages fails to establish the early radial pattern of core pathway polarity and hence hinge contraction cannot reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest (Etournay et al., 2015)).

      In this manuscript, our earliest de novo experiments begin with Fz induction at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B, referred to by the reviewer, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing.

      We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig. 5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region. We apologise for this oversight and the confusion caused and appreciate the feedback.

      The 6 hr condition, that has only partial polarity magnitude, is quite disordered. Do the patterns at 8 and 10 hrs become more proximally-distally oriented? It is stated that they all show swirls, but please provide adult wing images, and the corresponding orientation outputs from QuantifyPolarity to help validate the notion that the global cues are indeed bypassed by this paradigm.

      In all three ‘normal’ de novo conditions (6h, 8h and 10h), regardless of the time of induction, the polarity orientation patterns of Fz-mKate2 in pupal and adult wings are very similar in the experimentally analysed region (Fig. S5B-E). The strong local hair swirling agrees with the previous published data (Strutt and Strutt, 2002; Strutt and Strutt 2007). Overall, we don’t see any evidence that the 10h de novo induction results in more proximodistally coordinated polarity than the 8h or 6h conditions. This is consistent with our contention that there is no global cue present at these stages, which presumably would have a stronger effect when core pathway activity was induced at earlier stages.

      Changes to manuscript: Added additional explanation of the ‘de novo induction’ paradigm and why we believe the resulting polarity patterns are unlikely to be influenced by any global signals in Introduction and Results section ‘Induced core protein relocalisation…’. Added quantification of polarity in the experiment region proximal to the anterior cross-vein in pupal wings (Fig.S5E-E’’’) and zoomed-out images of the surrounding region in adult wings showing that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is not overall proximodistal polarity in proximal regions of the wing (Fig.S5B-D), arguing against an unknown proximodistal polarity cue at these stages of development.

      In the de novo paradigm, polarization is initiated immediately or shortly after heat shock induction. However, the results should be differently interpreted if the level of available Fz protein does not rise rapidly and then stabilize before the 6 hr time point, and instead continues to rise throughout the experiment. Western blots of the Fz::mKate2-sfGFP at time points after induction should be performed to demonstrate steady state prior to measurements. Otherwise, polarity magnitude could simply reflect the total available pool of Fz at different times after induction. Interpreting stability is complex, and could depend on the same issue, as well as the amount of recycling that may occur. Prior work from this lab using FRAP suggested that turnover occurs, and could result from recycling as well as replenishment from newly synthesized protein. 

      The reviewer raises an important point, which we agree could confound our experimental interpretations. As suggested we have now carried out western blotting and quantitation for Fz::mKate2-sfGFP levels and added these data to Fig.S1 (Fig. S1C,D). Quantified Fz is not significantly different between the three de novo polarity induction timings and not significantly different compared to constitutive Fz::mKate2-sfGFP expression (although there is a trend towards increasing Fz::mKate2-sfGFP protein levels with increasing induction times). These data are consistent with Fz::mKate2-sfGFP being at steady state in our experiments and that levels are sufficient to achieve normal polarity (as constitutive Fz::mKate2-sfGFP does so). Therefore it is unlikely that differing protein levels explain the differing polarity magnitudes at the different induction times. Interestingly, Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, possibly due to lower expression or increased turnover/reduced recycling.

      Changes to manuscript: Added western blot analysis of Fz::mKate2-sfGFP expression under 10h, 8h and 6h induction conditions vs endogenous Fz expression and constitutive Fz::mKate2-sfGFP expression (Fig.S1C-D) and discussed in Results section ‘Planar polarity establishment is…’.

      From the Fig 3 results, the authors claim that limiting pools of core proteins do not explain cell-scale signaling, a result expected based on the lack of phenotypes in heterozygotes, but of course they do not test the possibility that Fz is limiting. They do note that some other contributing protein could be.

      Previously published results from our lab (Strutt et al., 2016 Cell Reports; Supplemental Fig. S6E) show that in a heterozygous fz mutant background, Fz protein levels are not affected by halving the gene dosage when compared to wt, suggesting that Fz is most likely produced in excess and is not normally limiting, but that protein that cannot form complexes may be rapidly degraded. We have now added this information to the text.

      Changes to manuscript: Added explanation in text that Fz levels had previously been shown to not be dosage sensitive in Results section ‘Planar polarity establishment is…’ and also added a caveat to the Discussion about not directly testing Fz.

      In Fig 3, it is unclear why the authors chose to test dsh1/+ rather than dsh[null]/+. In any case, the statistically significant effect of Dsh dose reduction is puzzling, and might indicate that the other interpretation is correct. Ideally, a range including larger and smaller reductions would be tested. As is, I don't think limiting Dsh is ruled out. 

      Concerning the choice of dsh allele, we appreciate the query of the reviewer regarding use of dsh[1] instead of a null, as there might be a concern that dsh[1] would give a less strong phenotype. The answer is that over more than two decades we and others have never found any evidence that dsh[1] does not act as a ‘null’ for planar polarity in the pupal wing, and furthermore use of dsh[1] preserves function in Wg signalling – and we would prefer to rule out any phenotypic effects due to any potential cross-talk between the two pathways that might be seen using a complete null. To expand on this point, dsh[1] mutant protein is never seen at cell junctions (Axelrod 2001; Shimada et al., 2001; our own work), and by every criterion we have used, planar polarity is completely disrupted in hemizygous or homozygous mutants e.g. see quantifications of polarity in (Warrington et al., 2017 Curr Biol).

      In terms of the broader point, whether we can rule out Dsh being limiting, we were very careful to be clear that we did not see evidence for Dsh (or other core proteins) being limiting in terms of ‘rates of core pathway de novo polarisation’. When the reviewer says ‘the statistically significant effect of Dsh dose reduction is puzzling’ we believe they are referring to the data in Fig. 3J, showing a small but significantly different reduction in stable Fz in de novo 6h conditions (also seen in 8h de novo conditions, Fig. S3I). As Dsh is known to stabilise Fz in complexes (Strutt et al., 2011 Dev Cell; Warrington et al., 2017 Curr Biol), in itself this result is not wholly surprising. Nevertheless, while this shows that halving Dsh levels does modestly reduce Fz stability, it does not alter our conclusion that halving Dsh levels does not affect Fz polarisation rate under either 6h or 8h de novo conditions.

      Unfortunately, we do not have available to us a practical way of achieving consistent intermediate reductions in Dsh levels (e.g. a series of verified transgenes expressing at different levels). Levels of all the core proteins could be dialled down using transgenes, to see when the system breaks, and indeed we have previously published that lower levels of polarity are seen if Fmi levels are <<50% or if animals are transheterozygous for pk, stbm, dgo or dsh, pk, stbm, dgo simultaneously (Strutt et al., 2016 Cell Reports). However, it seems to be a trivial result that eventually the ability to polarise is lost if insufficient core proteins are present at the junctions. For this reason we have focused on a simple set of experiments reducing gene dosage singly by 50% under two de novo induction conditions, and have been careful to state our results cautiously. The assays we carried out were a great deal of work even for just the 5 heterozygous conditions tested.

      We believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels (see above), the system might be expected to be sensitised to further dosage reductions, and despite this we failed to see an effect on rate of polarisation.

      We note that Reviewer #3 made a similar point about whether we can rule out dosage sensitivity on the basis of 50% reductions in protein level. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      In a similar vein, Reviewer #2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added more explanation of our choice of dsh[1] as an appropriate mutant allele to use in Results section ‘Planar polarity establishment is…’. Added some narrative and caveats regarding whether lowering levels more than 50% would add to our findings in the Discussion. Revised conclusions to be more cautious including altering section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage in Results section ‘Planar polarity establishment is…’ and Fig.S2.

      The data in Fig 5 are somewhat internally inconsistent, and inconsistent with the authors' interpretation. In both repolarization conditions, the authors claim that repolarization extends only to row 1, and row 1 is statistically different from non-repolarized row 1, but so too is row 3. Row 2 is not. This makes no sense, and suggests either that the statistical tests are inappropriate and/or the data is too sparse to be meaningful. 

      As we’re sure the reviewer appreciates, this was an extremely complex experiment to perform and analyse. We spent a lot of time trying to find the best way to illustrate the results (finally settling on a 2D vector representation of polarity) and how to show the paired statistical comparisons between different groups. Moreover, in the end we were only able to detect generally quite modest (statistically significant) changes in cell polarity under the experimental conditions.

      However, we note that failure to see large and consistent changes in polarity is exactly the expected result if it is hard to repolarise from a boundary – and this is of course the conclusion that we draw. Conversely, if repolarisation were easy, which was our expectation at least under de novo conditions without existing polarity, then we would have expected large and highly statistically significant changes in polarity across multiple cell rows. Hence we stand by our conclusion that ‘it is hard to repolarise from a boundary of Fz overexpression in both control and de novo polarity conditions’.

      Overall, we were trying to establish three points:

      (1) to demonstrate that repolarisation occurs from a boundary of overexpression i.e. from boundary 0 to row 0

      (2) to establish whether a wave of repolarisation occurs across rows 1, 2 and 3

      (3) to determine if in repolarisation in de novo condition it is easier to repolarise than in repolarisation in the control (already polarised) condition

      Taking each in turn:

      (1) To detect repolarisation from a boundary relative to the control condition, we have to compare row 0 in repolarisation condition (Fig.5G,K) vs control condition (Fig.5F,J). This comparison shows a significant repolarisation (p=0.0014). From now on, row 0 in repolarisation condition is our reference for repolarisation occurring.

      (2) To determine if there is a wave of repolarisation in the repolarisation condition we have to compare row 0 vs row 1 to 3 in the repolarisation condition (Fig.5K). Row 1 is not significantly different to row 0, but rows 2 and 3 are different and the vectors show obviously lower polarity than row 0. Hence no wave of repolarisation is detected over rows 1 to 3.

      (3) To determine if it is easier to repolarise in the de novo condition, our reference for establishment of a repolarisation pattern is the repolarisation condition in rows 0 to 3. So, we compare repolarisation condition vs repolarisation in de novo condition, row 0 vs row 0, row 1 vs row 1, row 2 vs row 2 and row 3 vs row 3 – in each case no significant difference in polarity is detected, supporting our conclusion that it is not easier to repolarise in the de novo condition.

      We agree that the variations in row 3 are puzzling, but there is no evidence that this is due to propagation of polarity from row 0, and so in terms of our three questions, it does not alter our conclusions.

      Changes to manuscript: We have extensively revised the text describing the results in Fig.5 to hopefully make the reasons for our conclusions clearer and also be more cautious in our conclusions in Results section ‘Induced core protein relocalisation…’. 

      For the related boundary intensity data in Fig 6, the authors need to describe exactly how boundaries were chosen or excluded from the analysis. Ideally, all boundaries would be classified as either medio-lateral (meaning anterior-posterior) or proximal-distal depending on angle.

      We thank the reviewer for pointing out that this was not clear.

      All boundaries were classified following their orientation compared to the Fz over-expression boundary using hh-GAL4 expressed in the wing posterior compartment. Horizontal junctions were defined as parallel to the Fz over-expression boundary (between 0 and 45 degrees) and mediolateral junctions as junctions linking two horizontal boundaries (between 45 and 90 degrees).

      Changes to manuscript: The boundary classification detailed above has been added in the Materials and Methods.
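
      As an illustration only (this is not the analysis code used for the manuscript), the angle-based rule described above could be sketched as follows. The function names, the representation of a junction by its two vertex coordinates, and the handling of a junction lying at exactly 45 degrees are assumptions made for the example; the response specifies only the 0-45 and 45-90 degree ranges.

```python
import math

def junction_angle_to_boundary(p1, p2, boundary_angle_deg=0.0):
    """Acute angle (0-90 degrees) between a junction and the boundary axis.

    p1, p2: (x, y) coordinates of the two junction vertices (hypothetical input format).
    boundary_angle_deg: orientation of the Fz over-expression boundary in the image.
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    angle = math.degrees(math.atan2(dy, dx)) - boundary_angle_deg
    angle = abs(angle) % 180.0        # a junction has an orientation, not a direction
    return min(angle, 180.0 - angle)  # fold into the range [0, 90]

def classify_junction(p1, p2, boundary_angle_deg=0.0, threshold_deg=45.0):
    """Label a junction 'horizontal' (parallel to the boundary) or 'mediolateral'."""
    angle = junction_angle_to_boundary(p1, p2, boundary_angle_deg)
    # Assumption: a junction at exactly the threshold is assigned to 'mediolateral'.
    return "horizontal" if angle < threshold_deg else "mediolateral"

if __name__ == "__main__":
    print(classify_junction((0, 0), (1, 0)))     # junction parallel to the boundary
    print(classify_junction((0, 0), (0, 1)))     # junction perpendicular to the boundary
    print(classify_junction((0, 0), (-1, 0.1)))  # nearly parallel, drawn right to left
```

      Folding the angle into the 0-90 degree range means that the order in which the two vertices are recorded does not change the classification.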

      If the authors believe their Fig 5 and 6 analyses, how do they explain that hairs are reoriented well beyond where the core proteins are not? This would be a dramatic finding, because as far as I know, when core proteins are polarized, prehair orientation always follows the core protein distribution. Surprisingly, the authors do not so much as comment about this. The authors should age their wings just a bit more to see whether the prehair pattern looks more like the adult hair pattern or like that predicted by their protein orientation results.

      Again the reviewer makes an interesting point, and we agree that this is something that we should have more directly addressed in the manuscript.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for polarisation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al., 2000 in the wing and Lawrence et al., 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging taking in both initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC)  are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      Minor points:

      As the authors know, there is a model in the literature that suggests microtubule trafficking provides a global cue to orient PCP. The authors' repolarization data in Fig 4 make a reasonably convincing case against a role for microtubules in cell-scale signaling, but do not rule out a role as a global cue. The authors should be careful of language such as "...MTs and core proteins being oriented independently of each other" that would appear to possibly also refer to a role as a global cue.

      Thank you for pointing out that this was not clear. We have now modified the text to hopefully address this.

      Changes to manuscript: Text updated in Results section ‘Microtubules do not provide…’.

      Significance:

      There are two negative conclusions and one positive conclusion made by the authors. Provided the above points are addressed, the negative conclusions, that core proteins are not limiting and that microtubules are not involved in cell-scale signaling are solid. The positive conclusion is more nebulous - the authors say that cell-scale signaling is strong relative to cell-cell signaling - but how strong is strong? Strong relative to their prior expectations? I'm not sure how to interpret such a conclusion. Overall, we learn something from these results, though it fails to reveal anything about mechanism. These results will be of some interest to those studying PCP.

      The reviewer raises an interesting point, which is how do you compare the strength of two different processes, even if both processes affect the same outcome (in this case cell polarity). Repolarisation from a boundary has not been carefully studied at the level of core protein localisation in any previous study to our knowledge – this is one of the important novel aspects of this study. Hence there is not a baseline for defining strong repolarisation. Similarly, there has been no investigation of the nature of ‘cell-scale signalling’. This was a considerable challenge for us in writing the manuscript, and we have done our best to find appropriate language that hopefully conveys our message adequately. Minimally our work may provide a baseline for helping to define the ‘strengths’ of these processes in future studies.

      One of our main points is that we can generate an artificial boundary of Fz expression, where Fz levels are at least several fold higher than in the neighbouring cell (e.g. compare Fig.4N’ and O’) and only two rows of cells show a significant change in polarity relative to controls. Even when the tissue next to the overexpression domain is still in the process of generating polarity (de novo condition) then the boundary has little effect on polarity in neighbouring cell rows. This was a result that surprised us, and we tried to convey that by using language to suggest cell-scale signalling was stronger than cell-cell signalling i.e. stronger in terms of the ability to define the final direction of polarity.

      Changes to manuscript: In the revised manuscript we have reviewed our use of language and now avoid saying ‘strong’ but instead use terms such as ‘effective’ and ‘robust’ in e.g. Results section ‘Induced core protein relocalisation…’, the Discussion and we have also changed the title of the manuscript to avoid claiming a ‘strong’ signal.

      Reviewer #2:

      […] Critique

      The experiments described in this paper are of high quality with a sophisticated level of design and analysis. However, there needs to be some recalibration of the extent of the conclusions that can be drawn (see below). Moreover, a limitation of this paper is that, despite the quality of their data, they cannot give a molecular hint about the nature of their proposed cell-scale signal. Below are two key points that the authors may want to clarify.

      (1) The first set of repolarisation experiment is performed after the global cell rearrangements that have been shown to act as global signal. However, this approach does not exclude the possible contribution of an unknown diffusible global signal.

      A similar point was raised by Reviewer 1. For the convenience of this reviewer, we’ll summarise the arguments against such an unknown cue again below. More broadly, both reviewers asking a similar question indicates that we have failed to lay out the evidence in sufficient detail. In our defence, we have used the same ‘de novo’ paradigm in three previous publications (Strutt and Strutt 2002, 2007; Brittle et al 2022) without attracting (overt) controversy. We have now added text to the Introduction and Results that goes into more detail, as well as more experimental evidence (Fig.S5).

      Firstly, it is worth noting that the global cues acting in the wing are poorly understood, with mostly negative evidence against particular cues accruing in recent years. This makes it a hard subject to succinctly discuss. Secondly, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewers are proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation of this is that induction at these stages fails to result in the early radial pattern of core pathway polarity being established and hence a failure of hinge contraction to reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest, Etournay et al., 2015).

      In this manuscript, our earliest de novo experiments begin at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B referred to by Reviewer 1, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing. We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig.S5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region.

      Changes to manuscript: Text extended in Introduction and Results to better explain why we believe the de novo conditions that we use most likely result in a polarity pattern that is not significantly influenced by ‘global cues’. Now show zoomed-out images of the surrounding region around the experiment region proximal to the anterior cross-vein region in adult wings, showing that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is not overall proximodistal polarity in proximal regions of the wing, arguing against an unknown proximodistal polarity cue at these stages of development (Fig.S5B-E’’’).

      (2) The putative non-local cell scale signal must be more precisely defined (maybe also given a better name). It is not clear to me that one can separate cell-scale from molecular-scale signal.

      Local signals can redistribute within a cell (or membrane) so local signals are also cell-scale. Without a clear definition, it is difficult to interpret the results of the gene dosage experiments. The link between gene dosage and cell-scale signal is not rigorously stated. Related to this, the concluding statement of the introduction is too cryptic.

      We thank the reviewer for raising this, as again a similar comment was made by Reviewer 1, so we are clearly falling short in defining the term. We have now had another attempt in the Introduction.

      To more specifically answer the point made by the reviewer regarding molecular vs cellular, we are essentially being guided here by the prior computational modelling work, as at the biological level the details are still being worked out. A specific class of previous models only allowed ‘signals’ between core proteins to act ‘locally’, meaning within a cell junction, and within the models there was no explicit mechanism by which proteins on other junctions could ‘detect’ the polarity of a neighbouring junction (e.g. Amonlirdviman et al., 2005; Le Garrec et al., 2006; Fischer et al., 2013). Other models implicitly or explicitly encode a mechanism by which cell junctions can be influenced by the polarity of other junctions (e.g. Meinhardt, 2007; Burak and Shraiman, 2009; Abley et al., 2013; Shadkhoo and Mani, 2019), for instance by diffusion of a factor produced by localisation of particular planar polarity proteins.

      We agree with the reviewer that a cell-scale signal will depend on ‘molecules’ and thus could be called ‘molecular-scale’, but here by ‘molecular-scale’ we mean signals that act at the range of the sizes of molecules i.e. nanometers, rather than cell-scale signals that act at the size of cells i.e. micrometers. A caveat to our definition is that we implicitly include interactions that occur locally on cell junctions (<1 µm range) within ‘molecular-scale’, but this is a shorter range than ‘cellular-scale’ which requires signals acting over the diameter of a cell (3-5 µm). Nevertheless, we think the concept of ‘molecular-scale’ vs ‘cell-scale’ is a helpful one in this context, and have attempted to address the issue through a more careful definition of the terms.

      Changes to manuscript: Text revised in Introduction and legend to Fig.1 to more carefully define ‘cell-scale signalling’ and to distinguish it from ‘molecular-scale signalling’. Final sentence of Introduction also altered so we no longer cryptically speculate on the nature of the cell-scale signal but leave this to the Discussion.

      Minor comments. 

      Some of the (clever) genetic manipulation may need more details in the text. For example:

      - Need to specify if the hs-flp approach induces expression throughout the tissue.

      We apologise for the lack of clarity. In all the experiments, the hs-FLP transgene is present in all cells, and heat-shock results in ubiquitous expression. 

      Changes to manuscript: We have clarified this in the Results and Materials and Methods.

      - Need to specify in the text that in the unpolarised condition the tissue is both dsh and fz mutant.

      The reviewer is of course correct and we have updated this point in the text. The full genotype for the unpolarised condition is: w dsh[1] hsFLP22/y;; Act>>fz-mKate2sfGFP, fz[P21]/fz[P21] (see Table S1). So this line is mutant for dsh and fz with induced expression of Fz-mKate2sfGFP.

      Changes to manuscript: We have clarified this in the relevant part of the Results.

      - Need to specify in the text that the experiment illustrated in Fig 5 is with hh-gal4. 

      As noted by the reviewer, we continued to use the same hh-GAL4 repolarisation paradigm as in Fig.4 and this information was in the legend to Fig.5. However, we agree it is helpful to be explicit about this in the main text.

      Changes to manuscript: We have added this to this section of the Results.

      - Need to address a possible shortcoming of the hh experiment, that the AP boundary is a region of high tension.

      It is true that the AP boundary is under high tension in the wing disc (e.g. Landsberg et al., 2009). But we are not aware of any evidence that this higher tension persists into the pupal wing. In separate studies we have labelled for Myosin II in pupal wings (Trinidad et al 2025 Curr Biol; Tan & Strutt 2025 Nature Comms), and as far as we have noticed have not seen preferentially higher levels on the AP boundary. We think if tension were higher, the cell boundaries would appear straighter than in surrounding cells (as seen in the wing disc) and this is not evident in our images.

      - Need to dispel the possibility that there is no residual polarisation (e.g. of other components) in fz1 mutant (I assume this is the case).

      We use the null allele fz[P21] throughout this work, and we and others have consistently reported a complete loss of polarisation of other core proteins or downstream components in this background. The caveat to this is that core proteins that persist at cell junctions always appear at least slightly punctate in mutant backgrounds for other core proteins, and so any automated detection algorithm will always find evidence of individual cell polarity above a baseline level of uniform distribution. Hence we tend to use lack of local coordination of polarity (variance of cell polarity angle) as an additional measure of loss of polarisation, in addition to direct measures of average cell polarity. (We discuss this in the QuantifyPolarity manuscript Tan et al 2021 e.g. Fig.S6).

      Changes to manuscript: We now include in the Materials and Methods section ‘Fly genetics…’ a much more extensive explanation of the evidence for specific mutant alleles being ‘null’ for planar polarity function (including dsh1 as raised by Reviewer 1), specifically that they result in no detectable planar polarisation of either other core proteins or downstream effectors, and added appropriate references.
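
      To illustrate what a ‘variance of cell polarity angle’ measure can look like, the sketch below computes a standard circular variance for axial angles. It is a generic textbook formulation and is not taken from QuantifyPolarity or from the manuscript; the function name and the assumption that polarity angles are axial (equivalent modulo 180 degrees) are ours for the purpose of the example.

```python
import math

def axial_circular_variance(angles_deg):
    """Circular variance of axial angles: 0 = perfectly aligned, 1 = no net alignment.

    angles_deg: per-cell polarity angles in degrees (hypothetical input).
    Assumption: angles are axial, i.e. 0 and 180 degrees describe the same axis.
    """
    if not angles_deg:
        raise ValueError("need at least one angle")
    n = len(angles_deg)
    # Doubling the angles maps axes that differ by 180 degrees onto the same direction.
    c = sum(math.cos(math.radians(2.0 * a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(2.0 * a)) for a in angles_deg) / n
    mean_resultant_length = math.hypot(c, s)
    return 1.0 - mean_resultant_length

if __name__ == "__main__":
    coordinated = [10.0, 12.0, 8.0, 11.0]      # neighbouring cells with similar angles
    uncoordinated = [0.0, 45.0, 90.0, 135.0]   # angles spread evenly over all axes
    print(axial_circular_variance(coordinated))
    print(axial_circular_variance(uncoordinated))
```

      With a measure of this kind, a tissue in which each cell has some detectable polarity but neighbouring cells are not aligned gives a value close to 1, whereas a coordinated tissue gives a value close to 0.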

      - Need to provide evidence that 50% gene dosage commensurately affect protein level. 

      This is a good suggestion. In the case of Stbm, we have already published a western blot showing that a reduction in gene dosage results in reduced protein levels (Strutt et al 2016, Fig.S6). We have now performed western blots to quantify protein levels upon reduction of fmi, pk and dgo levels (we actually used EGFP-dgo for the latter, as we don’t have antibodies that can detect endogenous Dgo on western blots).

      Changes to manuscript: When presenting the dosage reduction experiments, we now refer back to Strutt et al., 2016 explicitly for Stbm, and have added western blot data for Fmi, Pk and EGFP-Dgo in new Fig.S2.

      - I am surprised that the relationship with microtubule polarity was never investigated. Is this true? 

      We agree this is a point that needed further clarification, as Reviewer 1 made a related point regarding the two possible roles for microtubules, one being as a mediator of a global cue upstream of the core pathway, and the second (which we investigate in this manuscript) as a mediator of a cell-scale signal downstream of the core pathway.

      Both the Uemura and Axelrod groups have published on potential upstream function as a global cue mediator in the Drosophila wing (e.g. Shimada et al., 2006; Harumoto et al., 2010; Matis et al., 2014).

      Both groups have also looked at whether core pathway components could affect orientation of microtubules (Harumoto et al., 2010; Olofsson et al., 2014; Sharp and Axelrod 2016). Notably Harumoto et al., 2010 observed that in 24h APF wings, loss of Fz or Stbm did not alter microtubule polarity from a proximodistal orientation consistent with the microtubules aligning along the long cell axis in the absence of other cues. However, this did not rule out an instructive effect of Fz or Stbm on microtubule polarity during core pathway cell-scale signalling. The Axelrod lab manuscripts saw interesting effects of Pk protein isoforms on microtubule polarity, albeit not throughout the entire wing, which hinted at a potential role in cell-scale signalling. Taken together this prior work was the motivation for our directed experiments to specifically test whether the core pathway might generate cell-scale polarity by instructing microtubule polarity.

      Changes to manuscript: We have revised the Results section ‘Microtubules do not…’ to make a clearer distinction regarding possible ‘upstream’ and ‘downstream’ roles of microtubules in Drosophila core pathway planar polarity and the motivation for our experiments investigating the latter.

      - The authors suggest that polarity does not propagate as a wave. And yet the range measured in adult is longer than in the pupal wing. Explain. 

      Again an excellent point, also made by Reviewer 1, which we have now addressed explicitly in the manuscript. For the convenience of this reviewer, we lay out the reasons why we think the propagation of polarity seen in the adult is further than seen for core protein localisation.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).  

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for polarisation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al 2000 in the wing and Lawrence et al 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging taking in both initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC)  are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      - The discussion states that the cell-intrinsic system remains to be fully characterised, implying that it has been partially characterised. What do we know about it? 

      As the reviewer probably realises, we were attempting to side-step a long speculative discussion about the various hints and ideas in the literature by grouping them under the umbrella of ‘remaining to be fully characterised’. We would argue that this current manuscript is the first to attempt to systematically investigate the nature of ‘cell-scale signalling’. The lack of prior work is probably due to two factors (i) pioneering theoretical work showed that a sufficiently strong global signal coupled with ‘local’ (i.e. confined to one cell junction) protein interactions was sufficient to polarise cells without the need to invoke the existence of a cell-scale signal; (ii) there is no easy way to identify cell-scale signals as their loss results in loss of polarity which will also occur if other (i.e. more locally acting) core pathway functions are compromised.

      The main investigation of the potential for cell-scale signalling has been another set of theory studies (Burak and Shraiman 2009; Abley et al., 2013; Shadkhoo and Mani 2019) which have considered the possibility of diffusible signals. In our present work we have further considered the possibility of a ‘depletion’ model, based on the pioneering theory work of Hans Meinhardt, and as discussed above the possibility that microtubules could mediate a cell-scale signal.

      Changes to manuscript: We have revised the Discussion to hopefully be clearer about the current state of knowledge.

      Reviewer #3:

      […] Major comments

      The data are clearly presented and the manuscript is well written. The conclusions are well supported by the data. 

      (1) The authors use a system to de novo establish PCP, which has the advantage of excluding global cues orienting PCP and thus to focus on the cell-intrinsic mechanisms. At the same time, the system has the limitation that it is unclear to what extent de novo PCP establishment reflects 'normal' cell scale PCP establishment, in particular because the Gal4/UAS expression system that is used to induce Fz expression will likely result in much higher Fz levels compared with the endogenous levels. The authors should briefly discuss this limitation. 

      We apologise if this wasn’t clear. We only used GAL4/UAS overexpression when we were generating an artificial boundary of Fz expression with hh-GAL4 to induce repolarisation. The de novo induction system involves Fz::mKate2-sfGFP being expressed directly under an Act5C promoter without use of GAL4/UAS. In response to a comment from Reviewer 1 we have now carried out western blot analysis which shows that Fz::mKate2-sfGFP levels under Act5C are actually lower than endogenous Fz levels. As we achieve normal levels of polarity, similar to those in wild-type conditions when measured using QuantifyPolarity, we therefore assume that Fz levels are not limiting under these conditions. However, we note that lower than normal levels of Fz might sensitise the system to perturbation, which in fact would be advantageous in our study, as it might for instance have been expected to more readily reveal dosage sensitivity of other components.

      Changes to manuscript: We now describe the levels of expression achieved using the de novo induction system (Fig.S1C-D) and discuss possible consequences in the relevant Results sections and Discussion.

      (2) Fig. 3. The authors use heterozygous mutant backgrounds to test the robustness of de novo PCP establishment towards (partial) depletion in core PCP proteins. The authors conclude that de novo polarization is 'extremely robust to variation in protein level'. Since the authors (presumably) lowered protein levels by 50%, this conclusion appears to be somewhat overstated. The authors should tune down their conclusion. 

      Reviewer 1 makes a similar point about whether we can argue that the lack of sensitivity to a 50% reduction in protein levels actually rules out the depletion model. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      We nevertheless believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, the system might be expected to be sensitised to further dosage reductions, and despite this we fail to see an effect on rate of polarisation.

      In a similar vein, Reviewer 2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added some narrative and caveats regarding whether lowering levels more than 50% would add to our findings in the Discussion. Revised conclusions to be more cautious, including altering the section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage in Results section ‘Planar polarity establishment is…’ and Fig.S2.

      Minor comments :

      (1) Page 3. The authors mention and reference that they used the PCA method to quantify cell polarity magnification and magnitude. It would help the unfamiliar reader, if the authors would briefly describe the principle of this method. 

      Changes to manuscript: More details have been added in Materials & Methods.

      Significance:

      The manuscript contributes to our understanding of how planar cell polarity is established. It extends previous work by the authors (Strutt and Strutt, 2002,2007) that already showed that induction of core PCP pathway activity by itself is sufficient to induce de novo PCP. This manuscript further explores the underlying mechanisms. The authors test whether de novo PCP establishment depends on an 'inhibitory signal', as previously postulated (Meinhardt, 2007), but do not find evidence. They also test whether core PCP proteins help to orient microtubules (which could enhance cell intrinsic polarization of core PCP proteins), but, again, do not find evidence, corroborating previous work (Harumoto et al, 2010). The most significant finding of this manuscript, perhaps, is the observation that local de novo PCP establishment does not propagate far through the tissue. A limitation of the study is that the mechanisms establishing intrinsic cell scale polarity remain unknown. The work will likely be of interest to specialists in the field of PCP.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The study by Yu et al., which investigated the role of protein N-glycosylation in regulating T-cell activation and functions, is an interesting work. By using genome-wide CRISPR/Cas9 screenings, the authors found that B4GALT1 deficiency could activate expression of PD-1 and enhance functions of CD8+ T cells both in vitro and in vivo, suggesting the important roles of protein N-glycosylation in regulating functions of CD8+ T cells, which indicates that B4GALT1 is a potential target for tumor immunotherapy.

      Strengths:

      The strengths of this study are the findings of novel function of B4GALT1 deficiency in CD8 T cells.

      Weaknesses:

      However, authors did not directly demonstrate that B4GALT1 deficiency regulates the interaction between TCR and CD8, as well as functional outcomes of this interaction, such as TCR signaling enhancements.

      We are very sorry that we did not sufficiently highlight our results in Fig. 5f-h. In those figures, we demonstrated by FRET assays that the interaction between TCR and CD8 increased significantly in B4GALT1-deficient T cells. To confirm the importance of the TCR-CD8 interaction in mediating the effects of B4GALT1 on T-cell functions, such as in vitro killing of target cells, we artificially tethered TCR and CD8 with a CD8β-CD3ε fusion protein and tested its function in both WT and B4GALT1 knockout CD8<sup>+</sup> T cells. Our results demonstrate that this fusion protein could bypass the effect of B4GALT1 knockout in CD8<sup>+</sup> T cells (Fig. 5g-h). Together with the finding that B4GALT1 directly regulates the galactosylation of TCR and CD8, these results strongly support the model that B4GALT1 modulates T-cell functions mainly through galactosylation of TCR and CD8, which interferes with their interaction.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors identify the N-glycosylation factor B4GALT1 as an important regulator of CD8 T-cell function.

      Strengths:

      (1) The use of complementary ex vivo and in vivo CRISPR screens is commendable and provides a useful dataset for future studies of CD8 T-cell biology.

      (2) The authors perform multiple untargeted analyses (RNAseq, glycoproteomics) to hone their model on how B4GALT1 functions in CD8 T-cell activation.

      (3) B4GALT1 is shown to be important in both in vitro T-cell killing assays and a mouse model of tumor control, reinforcing the authors' claims.

      Weaknesses:

      (1) The authors did not verify the efficiency of knockout in their single-gene KO lines.

      We thank the reviewer for the reminder. We verified the efficiency of some gRNAs by FACS and Surveyor assay, and we will add these data to the supplementary results in the revised version.

      (2) As B4GALT1 is a general N-glycosylation factor, the phenotypes the authors observe could formally be attributable to indirect effects on glycosylation of other proteins.

      Please see our response to Reviewer #1.

      (3) The specific N-glycosylation sites of TCR and CD8 are not identified, and would be helpful for site-specific mutational analysis to further the authors' model.

      We thank the reviewer for the suggestion. Unfortunately, multiple sites on TCR and CD8 are N-glycosylated (https://glycosmos.org/glycomeatlas). We are concerned that mutating all of these sites might affect not only the glycosylation of TCR and CD8 but also other essential functions of these proteins.

      (4) The study could benefit from further in vivo experiments testing the role of B4GALT1 in other physiological contexts relevant to CD8 T cells, for example, autoimmune disease or infectious disease.

      We thank the reviewer for this great suggestion to extend the study of B4GALT1 to autoimmune and infectious diseases. However, since the current manuscript focuses mainly on tumor immunology, we think these studies are best left for future work.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Response to Reviewer’s Comments

      We thank all three reviewers for their thoughtful and detailed comments, which will help us to improve the quality and clarity of our manuscript.


      __Reviewer #1 (Evidence, reproducibility and clarity (Required)):__ Summary: In this work, Tripathi et al address the open question of how the Fat/Ds pathway affects organ shape, using the Drosophila wing as a model. The Fat/Ds pathway is a conserved but complex pathway, interacting with Hippo signalling to affect growth and providing planar cell polarity that can influence cellular dynamics during morphogenesis. Here, authors use genetic perturbations combined with quantification of larval, pupal, and adult wing shape and laser ablation to conclude that the Ft/Ds pathway affects wing shape only during larval stages in a way that is at least partially independent of its interaction with Hippo and rather due to an effect on tissue tension and myosin II distribution. Overall the work is clearly written and well presented. I only have a couple of major comments on the limitations of the work.

      Major comments: 1. Authors conclude from data in Figures 1 and 2 that the Fat/Ds pathway only affects wing shape during larval stages. When looking at the pupal wing shape analysis in Figure 2L, however, it looks like there is a difference in wt over time (6h-18h, consistent with literature), but that difference in time goes away in RNAi-ds, indicating that actually there is a role for Ds in changing shape during pupal stages, although the phenotype is clearly less dramatic than that of larval stages. No statistical test was done over time (within the genotype), however, so it's hard to say. I recommend the authors test over time - whether 6h and 18h are different in wild type and in ds mutant. I think this is especially important because there is proximal overgrowth in the Fat/Ds mutants, much of which is contained in the folds during larval stages. That first fold, however, becomes the proximal part of the pupal wing after eversion and contracts during pupal stages to elongate the blade (Aigouy 2010, Etournay 2015). Also, according to Trinidad Curr Biol 2025, there is a role for Fat/Ds pathway in pupal stages. All of that to say that it seems likely that there would be a phenotype in pupal stages. It's true it doesn't show up in the adult wing in the experiments in Fig 1, but looking at the pupal wing itself is more direct - perhaps the very proximal effect is less prominent later, as there is potential for further development after 18hr before adulthood and the most proximal parts are likely anyway excluded in the analysis.

      Response: Our main purpose in examining pupal wing shape was to emphasize that wings lacking ds are visibly abnormal even at early pupal stages. The reviewer makes the point that the change in shape from 6h to 18h APF is greater in control wings than in RNAi-ds wings. We have added quantitation of this to the revised manuscript as suggested. This difference could be interpreted as indicating that Ds-Fat signaling actively contributes to wing shape during pupal morphogenesis. However, given the genetic evidence that Ds-Fat signaling influences wing shape only during larval growth, we favor the interpretation that it reflects consequences of Ds-Fat action during larval stages – eg, overgrowth of the wing, particularly the proximal wing and hinge as occurs in ds and fat mutants, could result in relatively less elongation during the pupal hinge contraction phase. This wouldn’t change our key conclusions, but it is something that we discuss in a revised manuscript.

      I think there needs to be a mention and some discussion of the fact that the wing is not really flat. While it starts out very flat at 72h, by 96h and beyond, there is considerable curvature in the pouch that may affect measurements of different axis and cell shape. It is not actually specified in the methods, so I assume the measurements were taken using a 2D projection. Not clear whether the curvature of the pouch was taken into account, either for cell shape measurements presented in Fig 4 or for the wing pouch dimensional analysis shown in Fig 3, 6, and supplements. Do perturbations in Ft/Ds affect this curvature? Are they more or less curved in one or both axes? Such a change could affect the results and conclusions. The extent to which the fat/ds mutants fold properly is another important consideration that is not mentioned. For example, maybe the folds are deeper and contain more material in the ds/fat mutants, and that's why the pouch is a different shape? At the very least, this point about the 3D nature of the wing disc must be raised in discussion of the limitations of the study. For the cell shape analysis, you can do a correction based on the local curvature (calculated from the height map from the projection). For the measurement of A/P, D/V axes of the wing pouch, best would be to measure the geodesic distance in 3D, but this is not reasonable to suggest at this point. One can still try to estimate the pouch height/curvature, however, both in wild type and in fat/ds mutants.

      Response: The wing pouch measurements were done on 2D projections of wing discs that were already slightly flattened by coverslips, so there is not much curvature outside of the folds. We will revise the methods to make sure this is clear. While we recognize that the absolute values measured can be affected by this, our conclusions are based on the qualitative differences in proportions between genotypes and time points, and we wouldn’t expect these to differ significantly even if 3D distances were measured. Obtaining accurate 3D measures is technically more challenging - it requires having spacers matching the thickness of the wing disc, which varies at different time points and genotypes, and then measuring distances across curved surfaces. To address this, we propose to make a limited set of 3D measurements on wild-type and ds mutant wing discs at early and late stages. We expect these to confirm that the conclusions of our analysis are unaffected, while also indicating how much curvature affects the values obtained. We will also make sure the issue of wing disc curvature and folds is discussed in the text.
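As an illustration of the curvature point discussed above, the following minimal sketch compares a length read off a 2D projection with the length measured along a curved surface profile. The height profile, units, and function names are hypothetical and chosen only for the example; this is not the measurement pipeline used for the figures.

```python
import math

def projected_length(xs):
    """Length of the path as seen in a 2D projection (height is ignored)."""
    return abs(xs[-1] - xs[0])

def surface_length(xs, zs):
    """Arc length along the surface, summing straight segments between
    consecutive (x, z) samples of a height profile."""
    points = list(zip(xs, zs))
    return sum(
        math.hypot(x1 - x0, z1 - z0)
        for (x0, z0), (x1, z1) in zip(points, points[1:])
    )

# Hypothetical height profile along one axis of the pouch (arbitrary units).
xs = [0.0, 3.0, 6.0]
zs = [0.0, 4.0, 0.0]

flat = projected_length(xs)          # distance measured on the projection
curved = surface_length(xs, zs)      # distance measured along the surface
underestimate = 1 - flat / curved    # fraction of length lost by projecting
```

If the same fractional underestimate applied to every genotype and time point, ratios between measurements would be unchanged; the proposed 3D measurements would indicate how far that holds in practice.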

      Minor comments: 1. The analysis of the laser ablation is not really standard - usually one looks at recoil velocity or a more complicated analysis of the equilibrium shape using a model (e.g. Shivakumar and Lenne 2016, Piscitello-Gomez 2023, Dye et al 2021). One may be able to extract more information from these experiments - nevertheless, I doubt the conclusions would change, given that there seems to be a pretty clear difference between wt and ds (OPTIONAL).

      Response: We will add measurements of recoil velocities to complement our current analysis of circular cuts.
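For readers unfamiliar with the measure, one common way to estimate an initial recoil velocity is the slope of the separation between cut edges over the first few frames after ablation. The sketch below assumes that approach; the time points, separations, and the choice of three frames are hypothetical, and it is not a description of the analysis that will appear in the revised manuscript.

```python
def initial_recoil_velocity(times, separations, n_points=3):
    """Least-squares slope of separation versus time over the first
    n_points samples after the cut."""
    t = times[:n_points]
    d = separations[:n_points]
    n = len(t)
    t_mean = sum(t) / n
    d_mean = sum(d) / n
    num = sum((ti - t_mean) * (di - d_mean) for ti, di in zip(t, d))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den

# Hypothetical data: time after ablation (s) and distance between
# the cut edges (micrometres). The opening slows as the tissue relaxes.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
separations = [2.0, 3.0, 4.0, 4.5, 4.8]

v0 = initial_recoil_velocity(times, separations)  # micrometres per second
```

Restricting the fit to the earliest frames matters because the separation curve flattens as the tissue approaches its new equilibrium shape.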

      Figure 7G: I think you also need a statistical test between RNAi-ds and UAS-rokCA+RNAi-ds.

      Response: We include this statistical test in the revised manuscript (it shows that they are significantly different).

      In the discussion, there is a statement: "However, as mutation or knock down of core PCP components, including pk or sple, does not affect wing shape... 59." Reference 59 is quite old and as far as I can tell shows neither images nor quantifications of the wing shape phenotype (not sure it uses "knockdown" either - unless you mean hypomorph?). A more recent publication Piscitello-Gomez et al Elife 2023 shows a very subtle but significant wing shape phenotype in core PCP mutants. It doesn't change your logic, but I would change the statement to be more accurate by saying "mutation of core PCP components has only subtle changes in adult wing shape"

      Response: Thank you for pointing this out; we have revised the manuscript accordingly.

      **Referee cross-commenting**

      Reviewer2: Reviewer 2 makes the statement: "The distance along the AP boundary from the pouch border to DV midline is topologically comparable to the PD length of the adult wing. The distance along the DV boundary from A border to P border is topologically comparable to the AP length of the adult wing."

      I disagree - the DV boundary wraps around the entire margin of the adult wing (as correctly drawn with the pink line in Fig 2A). It is not the same as the wide axis of the adult wing (perpendicular to the AP boundary). It is not trivial to map the proximal-distal axis of the larval wing to the proximal-distal axis of the adult, due to the changes in shape that occur during eversion. Thus, I find it much easier to look at the exact measurement that the authors make, and it is much more standard in the field, rather than what the reviewer suggests. Alternatively, one could I guess measure in the adult the ratio of the DV margin length (almost the circumference of the blade?) to the AP boundary length. That may be a more direct comparison. Actually the authors leave out the term "boundary" - what they call AP is actually the AP boundary, not the AP axis, and likewise for the DV - what they measure is DV boundary, but I only noticed that in the second read-through now. Just another note, these measurements of the pouch really only correspond to the very distal part of the wing blade, as so much of the proximal blade comes from the folds in the wing disc. Therefore, a measurement of only distal wing shape would be more comparable.

      Response: We thank Reviewer 1 for their comments here. In terms of the region measured, we measure to the inner Wg ring in the disc, the location of this ring in the adult is actually more proximal than described above (eg see Fig 1B of Liu, X., Grammont, M. & Irvine, K. D. Roles for scalloped and vestigial in regulating cell affinity and interactions between the wing blade and the wing hinge. Developmental Biology 228, 287–303 (2000)), and this defines roughly the region we have measured in adult wings (with the caveat noted above that the measurements in the disc can be affected by curvature and the hinge/pouch fold, which we will address).

      Reviewer 2 states that authors cannot definitively conclude anything about mechanical tension from their reported cutting data because the authors have not looked at initial recoil velocity. I strongly disagree. __The wing disc tissue is elastic on much longer timescales than what's considered after laser ablation (even hours), and the shape of the tissue after it equilibrates from a circular cut (1-2min) can indeed be used to infer tissue stresses (see Dye et al Elife 2021, Piscitello-Gomez et al eLife 2023, Tahaei et al arXiv 2024).__ In the wing disc, the directions of stresses inferred from initial recoil velocity are correlated with the directions of stresses inferred from analysing the equilibrium shape after a circular cut. Rearrangements, a primary mechanism of fluidization in epithelia, do not occur within 1'. Analysing the equilibrium shape after circular ablation may be more accurate for assessing tissue stresses than initial recoil velocity - in Piscitello-Gomez et al 2023, the authors found that a prickle mutation (PCP pathway) affected initial recoil velocity but not tissue stresses in the pupal wing. Such equilibrium circular cuts have also been used to analyze stresses in the avian embryo, where it correlates with directions of stress gathered from force inference methods (Kong et al Scientific Reports 2019). The Tribolium example noted by the reviewer is on the timescale of tens to hundreds of minutes - much longer than the timescale of laser ablation retraction. It is true the analysis of the ablation presented in this paper is not at the same level as those other cited papers and could be improved. But I don't think the analysis would be improved by additional experiments doing timelapse of initial retraction velocity.

      Response: Thank you, we agree with Reviewer 1 here.

      Reviewer 2 states "If cell anisotropy is caused by polarized myosin activity, that activity is typically polarized along the short edges not long edges" Not true in this case. Myosin II accumulates along long boundaries (Legoff and Lecuit 2013). "Therefore, interpreting what causes the cell anisotropy and how DS regulates it is difficult," Agreed - but this is well beyond the scope of this manuscript. The authors clearly show that there is a change of cell shape, at least in these two regions. Better would be to quantify it throughout the pouch and across multiple discs. Similar point for myosin quantifications - yes, polarity would be interesting and possible to look at in these data, and it would be better to do so on multiple discs, but the lack of overall myosin on the junctions shown here is not nothing. Interpreting what Ft/Ds does to influence tension and myosin and eventually tissue shape is a big question that's not answered here. I think the authors do not claim to fully understand this though, and maybe further toning down the language of the conclusions could help.

      Response: We agree with Reviewer 1 here and will also add quantitation of myosin across multiple discs and will include higher magnification myosin images and polarity tests.

      Reviewer 3: I agree with many of the points raised by Reviewer 3, in particular that relevant for Fig 1. The additional experiments looking at myosin II localization and laser ablation in the other perturbations (Hippo and Rok mutants/RNAi) would certainly strengthen the conclusions.

      Response: Reviewer 3's comment on Fig 1 requests Ab stains to assess recovery of expression after downshift, which we will do.

      We will add examination of myosin localization in hpo RNAi wing discs, and in the ds/rok combinations. We note that the effects of Rok manipulations on myosin and on recoil velocity have been described previously (eg Rauskolb et al 2014).

      Reviewer #1 (Significance (Required)): I think the work provides a clear conceptual advance, arguing that the Ft/Ds pathway can influence mechanical stress independently of its interaction with Hippo and growth. Such a finding, if conserved, could be quite important for those studying morphogenesis and Fat function in this and other organisms. For this point, the genetic approach is a clear strength. Previous work in the Drosophila wing has already shown an adult wing phenotype for Ft/Ds mutations that was attributed to its role in the larval growth phase, as marked clones show aberrant growth in mutants. The novelty of this work is the dissection of the temporal progression of this phenotype and how it relates to Hippo and myosin II activation. It remains unclear exactly how Ft/Ds may affect tissue tension, except that it involves a downregulation of myosin II - the mechanism of that is not addressed here and would involve considerably more work. I think the temporal analysis of the wing pouch shape was quite revealing, providing novel information about how the phenotype evolves in time, in particular that there is already a phenotype quite early in development. As mentioned above, however, the lack of consideration of the wing disc as a 3D object is a potential limitation. While the audience is likely mostly developmental biologists working in basic research, it may also interest those studying the pathway in other contexts, including in vertebrates given its conservation and role in other processes.

      __Reviewer #2 (Evidence, reproducibility and clarity (Required)):__ The manuscript begins with very nice data from a ts sensitive period experiment. Instead of a ts mutation, the authors induced RNAi in a temperature dependent manner. The results are striking and strong. Knockdown of FT or DS during larval stages to late L3 changed shape while knockdown of FT or DS during later pupal stages did not. This indicates they are required during larval, not pupal stages of wing development for this shape effect. They did shift-up or shift-down at "early pupa stage" but precisely what stage that means was not described anywhere in the manuscript. White prepupal? Time? Likewise a shift-down was done at "late L3" but that meaning is also vague. Moreover, I was surprised to see they did not do a shift-up at the late L3 stage, to give completeness to the experiment. Why?

      Response: We have added more precise descriptions of the timing, and we will also add the requested late L3 shift-up experiment.

      Looking at the "shape" of the larval wing pouch they see a difference in the mutants. The pouch can be approximated as an ellipse, but with differing topology to the adult wing. Here, they muddled the analysis. The adult wing surface is analogous to one hemisphere of the larval wing pouch, i.e., either dorsal or ventral compartment. The distance along the AP boundary from the pouch border to DV midline is topologically comparable to the PD length of the adult wing. The distance along the DV boundary from A border to P border is topologically comparable to the AP length of the adult wing. They confusingly call this latter metric the "DV length" and the former metric the "AP length", and in fact they do not measure the PD length but PD+DP length. Confusing. Please change to make this consistent with earlier analysis of the adult and invert the reported ratio and divide by two.

      Then you would find the larval PD/AP ratio is smaller in the FT and DS mutants than wildtype, which resembles the smaller PD/AP ratio seen in the mutant adult wings. Totally consistent and also provides further evidence with the ts experiments that FT and DS exert shape effects in the larval phase of life.

      Response: As noted by Reviewer 1 in cross-referencing, some of the statements made by Reviewer 2 here are incorrect, eg “The distance along the DV boundary from A border to P border is topologically comparable to the AP length of the adult wing.” They are correct where they note that the A-P length we measure in the discs is actually equivalent to 2x the adult wing length, since we are measuring along both the dorsal and ventral wing, but this makes no difference to the analysis as the point is to compare shape between time points and genotypes, not to make inferences based on the absolute numbers obtained. The numerical manipulations suggested are entirely feasible but we think they are unnecessary.
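The arithmetic behind this exchange can be sketched as follows. The example assumes that the reported disc ratio is DV-boundary length over AP-boundary length, as implied by Reviewer 2's request to "invert the reported ratio and divide by two"; the lengths and function names are hypothetical and are not data from the manuscript.

```python
def reported_ratio(ap_boundary, dv_boundary):
    """Disc ratio as assumed here: DV-boundary length / AP-boundary length."""
    return dv_boundary / ap_boundary

def suggested_pd_ap_ratio(ap_boundary, dv_boundary):
    """Reviewer 2's suggested ratio: half the AP-boundary length
    (one wing surface) divided by the DV-boundary length."""
    return (ap_boundary / 2.0) / dv_boundary

# Hypothetical pouch measurements (arbitrary units), for illustration only.
control = {"ap_boundary": 300.0, "dv_boundary": 200.0}
mutant = {"ap_boundary": 240.0, "dv_boundary": 240.0}

r_control = reported_ratio(**control)
r_mutant = reported_ratio(**mutant)
s_control = suggested_pd_ap_ratio(**control)
s_mutant = suggested_pd_ap_ratio(**mutant)

# The suggested ratio is 1 / (2 * reported ratio), a strictly decreasing
# function, so any ordering between genotypes is reversed in direction
# but never lost.
same_comparison = (r_control < r_mutant) == (s_control > s_mutant)
```

Because the conversion is a fixed monotonic transformation, it changes the numbers reported but not which genotype or time point differs from which.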

      The remainder of the manuscript has experimental results that are more problematic, and really the authors do not figure out how the shape effect in larval stages is altered. I outline below the main problems.

      1. They compare the FT DS shape phenotypes to those of mutants or knockdowns in Hippo pathway genes (Hippo is known to be downstream of FT and DS). They find these Hippo perturbations do have shape effects trending in same direction as FT and DS effects. Knockdown reduces the PD/AP ratio while overexpressing WARTS increases the PD/AP ratio. The effect magnitudes are not as strong, but then again, they are using hypomorphic alleles and RNAi, which often induces partial or hypomorphic phenotypes. The effect strength is comparable when wing pouches are young but then dissipates over time, while FT and DS effects do not dissipate over time. The complexity of the data do not negate the idea that Hippo signaling is also playing some role and could be downstream of FT and DS in all of this. But the authors really downplay the data to the point of stating "These results imply that Ds-Fat influences wing pouch shape during wing disc growth separately from its effects on Hippo signaling." I think a more expansive perspective is needed given the caveats of the experiments.

      Response: Our results emphasize that the effects of Ds-Fat on wing shape cannot be explained solely by effects on Hippo signaling, eg as we stated on page 7 “These observations suggest that Hippo signaling contributes to, but does not fully explain, the influence of ds or fat on adult wing shape.” We also note that impairment of Hippo signaling has similar effects in younger discs, but very different effects in older discs, which clearly indicates that they are having very different effects during disc growth; we will revise the text to make sure our conclusions are clear.

      The reviewer wonders whether some of the differences could be due to the nature of the alleles or gene knockdown. First, the *ex*, *ds*, and *fat* alleles that we use are null alleles (eg see FlyBase), so it is not correct to say that we use only hypomorphic alleles and RNAi. We do use a hypomorphic allele for wts, and RNAi for hpo, for the simple reason that null alleles in these genes are lethal, so adult wings could not be examined. A further issue that is not commented on by the reviewer, but is more relevant here, is that there are multiple inputs into Hippo signaling, so of course even a null allele for ex, ds or fat is not a complete shutdown of Hippo signaling. Nonetheless, one can estimate the relative impairment of Hippo signaling by measuring the increased size of the wings, and from this perspective the knockdown conditions that we use are associated with roughly comparable levels of Hippo pathway impairment, so we stand by our results. We do, however, recognize that these issues could be discussed more clearly in the text, and will do so in a revised manuscript.
      

      Puzzlingly, this lack of taking seriously a set of complex results does not transfer to another set of experiments in which they inhibit or activate ROK, the rho kinase. When ROK is perturbed, they also see weak effects on shape when compared to FT or DS perturbation. This weakness is seen in adults, larvae, clones and in epistasis experiments. The epistasis experiment in particular convincingly shows that constitutive ROK activation is not epistatic to loss of DS; in fact if anything the DS phenotype suppresses the ROK phenotype. These results also show that one cannot simply explain what FT and DS are doing with some single pathway or effector molecule like ROK. It is more complex than that.

      What I really think was needed were experiments combining FT and DS knockdown with other mutants or knockdowns in the Hippo and Rho pathways, and even combining Hippo and Rho pathway mutants with FT or DS intact, to see if there are genetic interactions (additive, synergistic, epistatic) that could untangle the phenotypic complexity.

      Response: We’re puzzled by these comments. First, we never claimed that what Fat or Ds do could be explained simply by manipulation of Rok (eg, see Discussion). Moreover, examination of wings and wing discs where ds is combined with Rho manipulations is in Fig 7, and Hippo and Rho pathway manipulation combinations are in Fig S5. We don’t think that combining ds or fat mutations with other Hippo pathway mutations would be informative, as it is well established that Ds-Fat are upstream regulators of Hippo signaling.

      Laser cutting experiments were done to see if there is anisotropy in tissue tension within the wing pouch. This was to test a favored idea that FT and DS activity generates anisotropy in tissue tension, thereby controlling overall anisotropic shape of the pouch. However there is a fundamental flaw to their laser cutting analysis. Laser cutting is a technique used to measure mechanical tension, with initial recoil velocity directly proportional to the tissue's tension. By cutting a small line and observing how quickly the edges of the cut snap apart, people can quantify the initial recoil velocity and infer the stored mechanical stress in the tissue at the time of ablation. Live imaging with high-speed microscopy is required to capture the immediate response of the tissue to the cut since initial recoil velocity occurs in the first few seconds. A kymograph is created by plotting the movement of the tissue edges over this time scale, perpendicular to the cut. The initial recoil velocity is the slope of the kymograph at time zero, representing how fast the severed edges move apart. A higher recoil velocity indicates higher mechanical tension in the tissue. However, the authors did not measure this initial recoil velocity but instead measured the distance between the severed edges at one time point: 60 seconds after cutting. This is much later than the time point at which the recoil usually begins to dissipate or decay. This decay phase typically lasts a minute or two, during which time the edges continue to separate but at a progressively slower rate. This time-dependent decay of the recoil reveals whether the tissue behaves more like a viscous fluid or an elastic solid. Therefore, the distance metric at 60 seconds is a measurement of both tension and the material properties of the cells. One cannot know then whether a difference in the distance is due to a difference in tension or fluidity of the cells. 
      If the authors made measurements of edge separation at several time points in the first 10 seconds after ablation, they could deconvolute the two. Otherwise their analysis is inconclusive. Anisotropy in recoil could be caused by greater tissue fluidity along one axis. A gradient of cell fluidity along one axis of a tissue has been observed in the amnioserosa of Tribolium, for example. (Related and important point: was the anisotropy of recoil oriented along the PD or AP axis, or not oriented to either axis? This key point was never stated.)
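      The distinction drawn here between initial recoil velocity and a single late measurement of edge separation can be illustrated numerically. The sketch below assumes a simple Kelvin-Voigt-like relaxation, in which separation approaches a plateau exponentially; this model, the function names, and all parameter values are assumptions made for illustration and are not taken from the manuscript or its data.

```python
import math


def edge_separation(t, plateau, tau):
    """Separation of cut edges at time t under an assumed exponential relaxation."""
    return plateau * (1.0 - math.exp(-t / tau))


def initial_recoil_velocity(times, separations):
    """Least-squares slope through the origin over the earliest time points."""
    numerator = sum(t * d for t, d in zip(times, separations))
    denominator = sum(t * t for t in times)
    return numerator / denominator


# Two hypothetical tissues with the same plateau but different relaxation times.
fast = {"plateau": 4.0, "tau": 2.0}  # relaxes quickly
slow = {"plateau": 4.0, "tau": 8.0}  # relaxes slowly

early_times = [0.2, 0.4, 0.6, 0.8, 1.0]  # seconds after ablation

v_fast = initial_recoil_velocity(
    early_times, [edge_separation(t, **fast) for t in early_times]
)
v_slow = initial_recoil_velocity(
    early_times, [edge_separation(t, **slow) for t in early_times]
)

# A single measurement taken 60 s after the cut
d60_fast = edge_separation(60.0, **fast)
d60_slow = edge_separation(60.0, **slow)

print(f"early slope: fast={v_fast:.2f}, slow={v_slow:.2f}")
print(f"separation at 60 s: fast={d60_fast:.3f}, slow={d60_slow:.3f}")
```

      Under these assumed parameters the early slopes differ several-fold while the separations at 60 s are nearly identical, so in this toy model a single late time point cannot distinguish the two cases. Whether wing disc tissue relaxes in this way on that time scale is exactly the point under dispute in the exchange.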

      The authors cannot definitively conclude anything about mechanical tension from their reported cutting data.

      Response: As noted by Reviewer 1 in cross-commenting, there is no fluidity on a time scale of 1 minute in the wing disc, and circular ablations are an established method to investigate tissue stress. We chose the circular ablation method in part because it interrogates stress over a larger area, whereas cutting individual junctions is subject to more variability, particularly as the orientation of the junction (eg radial vs tangential) impacts the tension detected in the wing disc. Nonetheless, we will add recoil measurements to the revised manuscript to complement our circular ablations, which we expect will provide independent confirmation of our results and address the Reviewer’s concern here.

      They measured the eccentricity of wing pouch cells near the pouch border, and found they were highly anisotropic compared to DS mutant cells at comparable locations. Cells were elongated, but again, which axis (PD or AP) they were elongated along was never stated. If cell anisotropy is caused by polarized myosin activity, that activity is typically polarized along the short edges, not the long edges. Thus, recoil velocity after laser cutting would be stronger along the axis aligned with short cell edges. It looks like the cutting anisotropy they see is greater along the axis aligned with long cell edges. Of course, if the cell anisotropy is caused by a pulling force exerted by the pouch boundary, then it would stretch the cells. This would in fact fit their cutting data. But then again, the observed cell anisotropy could also be caused by variation in the fluid-solid properties of the wing cells, as discussed earlier. Compression of the cells would then deform them anisotropically and produce the anisotropic shapes that were observed. Therefore, interpreting what causes the cell anisotropy and how DS regulates it is difficult.

      Response: As noted by Reviewer 1 in cross-commenting, it is well established that tension and myosin are higher along long edges in the proximal wing. However, we acknowledge that we could do a better job of making the location and orientation of the regions shown in these experiments clear, and we will address this in a revised manuscript.

      The imaging and analysis of the myosin RLC by GFP tagging is also flawed. SQH-GFP is a tried and true proxy for myosin activity in Drosophila. Although the authors image the wing pouch of wildtype and DS mutants, they did so under low magnification to image the entire pouch. This gives a "low-res" perspective of overall myosin, but what they needed to do was image at high magnification in that proximal region of the pouch and see if Sqh-GFP is polarized in wildtype cells along certain cell edges aligned with an axis. And if such a polarity is observed, is it present or absent in the DS mutant? From the data shown in Figure 5, I cannot see any significant difference between wildtype and knocked down samples at this low resolution. Any difference, if there is any, is not really interpretable.

      Response: We agree that examination of myosin localization at high resolution to see if it is polarized is a worthwhile experiment. We did in fact do this, and myosin (Sqh:GFP) appeared unpolarized in ds mutants. However, the levels of myosin were so low that we didn’t feel confident in our assessment, so we didn’t include it. We now recognize that this was a mistake, and we will include high resolution myosin images and assessments of (lack of) polarity in a revised manuscript to address this comment.

      In conclusion, the manuscript has multiple problems that make it impossible for the authors to make the claims they make in the current manuscript. And even if they calibrated their interpretations to fit the data, there is not much of a simple clear picture as to how FT and DS regulate pouch eccentricity in the larval wing.

      Response: We think that the legitimate issues raised are addressable, as described above, while some of the criticisms are incorrect (as noted by Reviewer 1).

      Reviewer #2 (Significance (Required)): This manuscript describes experiments studying the role that the protocadherins FAT and DACHSOUS play in determining the two dimensional "shape" of the fruit fly wing. By "shape", the manuscript really means how much the wing's outline, when approximated as an ellipse, deviates from a circle. The elliptical approximations of FT and DS mutant wings more closely resemble a circle compared to the more eccentric wildtype wings. This suggests the molecules contribute to anisotropic growth in some way. A great deal of attention has been paid on how FT and DS regulate overall organ growth and planar cell polarity, and the Irvine lab has made extensive contributions to these questions over the years. Somewhat understudied is how FT and DS regulate wing shape, and this manuscript focuses on that. It follows up on an interesting result that the Irvine lab published in 2019, in which mud mutants randomized spindle pole orientation in wing cells but did not change the eccentricity of wings, ruling out biased cell division orientation as a mechanism for the anisotropic growth.
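      The notion of "shape" described here, namely how far an elliptical approximation of the wing outline deviates from a circle, can be expressed with two standard measures. The sketch below is illustrative only: the axis lengths are hypothetical, and which measure the manuscript actually reports is not specified in this exchange beyond references to a PD/AP ratio and to eccentricity.

```python
import math


def aspect_ratio(long_axis, short_axis):
    """Ratio of the long to the short axis; equals 1 for a circle."""
    return long_axis / short_axis


def eccentricity(long_axis, short_axis):
    """Ellipse eccentricity; 0 for a circle, approaching 1 as the ellipse elongates."""
    if short_axis > long_axis:
        long_axis, short_axis = short_axis, long_axis
    return math.sqrt(1.0 - (short_axis / long_axis) ** 2)


# Hypothetical outlines in arbitrary units; not measurements from the manuscript.
elongated = (5.0, 3.0)  # more eccentric outline
rounder = (5.0, 4.0)    # outline closer to a circle

for name, (a, b) in {"elongated": elongated, "rounder": rounder}.items():
    print(name, round(aspect_ratio(a, b), 3), round(eccentricity(a, b), 3))
```

      Both measures decrease as an outline becomes rounder, so either can rank genotypes by elongation, although they are not linearly related to each other.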

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)): __ Summary The authors investigate the mechanisms underlying epithelial morphogenesis using the Drosophila wing as a model system. Specifically, they analyze the contribution of the conserved Fat/Ds pathway to wing shape regulation. The main claim of the manuscript is that Ds/Fat controls wing shape by regulating tissue mechanical stress through MyoII levels, independently of Hippo signaling and tissue growth.

      Major Comments To support their main conclusions, the authors should address the following major points and consider additional experiments where indicated. Most of the suggested experiments are feasible within a reasonable timeframe, while a few are more technically demanding but would substantially strengthen the manuscript's central claims.

      Figure 1: The authors use temperature-sensitive inactivation of Fat or Ds to determine the developmental window during which these proteins regulate wing shape. To support this claim, it is essential to demonstrate that upon downshift during early pupal stages, Ds or Fat protein levels are restored to normal. For consistency, please include statistical analyses in Figure 1P and ensure that all y-axis values in shape quantifications start at 1.

      Response: We will do the requested antibody stains for Fat (Ds antibody is unfortunately no longer available, but the point made by the reviewer can be addressed by Fat as the approach and results are the same for both genes). We have also added the requested statistical analysis to Fig 1P, and adjusted the scales as requested.

      Figure 2: The authors propose that wing shape is regulated by Fat/Ds during larval development. However, Figure 2L suggests that wing elongation occurs in control conditions between 6 and 12 h APF, while this elongation is not observed upon Ds RNAi. The authors should therefore perform downshift experiments while monitoring wing shape during the pupal stage to substantiate their main claim. In addition, equivalent data for Fat loss of function should be included to support the assertion that Fat and Ds act similarly.

      Response: As noted in our response to point 1 of Reviewer 1, we agree that there does seem to be relatively more elongation in control wings than in ds RNAi wings, but we think this likely reflects effects of ds on growth during larval stages, and we will revise the manuscript to comment on this.

      We will also add the suggested examination of fat RNAi pupal wings.

      The suggested examination of pupal wing shape in downshift experiments is unfortunately not feasible. Our temperature shift experiments expressing ds or fat RNAi are done using the UAS-Gal4-Gal80ts system. We also use the UAS-Gal4 system to mark the pupal wing. If we do a downshift experiment, then expression of the fluorescent marker will be shut down in parallel with the shutdown of ds or fat RNAi, so the pupal wings would no longer be visible.

      Figure 3: The authors state that "These observations indicate that Ds-Fat signaling influences wing shape during the initial formation of the wing pouch, in addition to its effects during wing growth." This conclusion is not fully supported, as the authors only examine wing shape at 72 h AEL. At this stage, fat or ds mutant wings already display altered morphology. The authors could only make this claim if earlier time points were fully analyzed. In fact, the current data rather suggest that Ds function is required before 72 h AEL, as a rescue of wing shape is observed between 72 and 120 h AEL.

      Response: First, we think we are largely in agreement with the Reviewer, as the basis for our saying that Ds-Fat are likely required during initial formation of the wing pouch is that our data show they must be required before 72 h AEL. Second, 72 h is the earliest that we can look using Wg expression as a marker, as at earlier stages it is in a ventral wedge rather than a ring around the future wing pouch + DV line (eg see Fig 8 of Tripathi, B. K. & Irvine, K. D. The wing imaginal disc. Genetics (2022) doi:10.1093/genetics/iyac020.). We can revise the text to make sure this is clear.

      Figure 4: The authors state that "The influence of Ds-Fat on wing shape is not explained by Hippo signaling." However, this conclusion is not supported by their data, which show that partial loss of ex or hippo causes clear defects in wing shape. In addition, the initial wing shape is affected in wts and ex mutants, and hypomorphic alleles were used for these experiments. Therefore, the main conclusion requires revision. It would be useful to include a complete dataset for hippo RNAi, ex, and wts conditions in Figure S1. The purpose and interpretation of the InR^CA experiments are also unclear. While InR^CA expression can increase tissue growth, Hippo signaling has functions beyond growth control. Whether Hippo regulates tissue shape through InR^CA-dependent mechanisms remains to be clarified.

      Response: As noted in our response to point 1 of Reviewer 2 - our results emphasize that the effects of Ds-Fat on wing shape cannot be explained solely by effects on Hippo signaling, eg as we stated on page 7 “These observations suggest that Hippo signaling contributes to, but does not fully explain, the influence of ds or fat on adult wing shape.” We also note that impairment of Hippo signaling has similar effects in younger discs, but very different effects in older discs, which clearly indicates that they are having very different effects during disc growth. We will make some revisions to the text to make sure that our conclusions are clear throughout.

      While we used a hypomorphic allele for wts, because null alleles are lethal, the ex allele that we used is described in Flybase as an amorph, not a hypomorph, and as noted in our response to Reviewer 2, we will add some discussion about relative strength of effects on Hippo signaling.

      In Fig S1, we currently show adult wings for ex[e1] and RNAi-Hpo, and wing discs for wts[P2]/wts[x1], and for ex[e1]. The wts combination does not survive to adult so we can’t include this. We will however, add hpo RNAi wing discs as requested.

      The purpose of including InR^CA experiments is to try to separate effects of Hippo signaling from effects of growth, because InR signaling manipulation provides a distinct mechanism for increasing growth. We will revise the text to try to make sure this is clearer.
      

      Figure 5: This figure presents images of MyoII distribution, but no quantification across multiple samples is provided. Moreover, the relationship between changes in tissue stress and MyoII levels remains unclear. Performing laser ablation and MyoII quantification on the same samples would provide stronger support for the proposed conclusions.

      Response: We will revise the quantitation so that it presents analysis of averages across multiple discs, rather than representative examples of single discs.

      Both the myosin imaging, and the laser ablation were done on the same genotypes (wildtype and ds) at the same ages (108 h AEL) so we think it is valid to directly compare them. Moreover, the imaging conditions for laser ablation and myo quantification are different, so it’s not feasible to do them at the same time (For ablations we do a single Z plane and a single channel (has to include Ecad, or an equivalent junctional marker) on live discs, so that fast imaging can be done. For Myo imaging we do multiple Z stacks and multiple channels (eg Ecad and Myo), which is not compatible with the fast imaging needed for analysis of laser ablations).

      Figure 6: It is unclear when Rok RNAi and Rok^CA misexpression were induced. To substantiate their claims, the authors should measure both MyoII levels and mechanical tension under the different experimental conditions in which wing shape was modified through Rok modulation (i.e. the condition shown in Fig. 7G). For comparison, fat and ds data should be added to Fig 6H. Overall, the effects of Rok modulation appear milder than those of Fat manipulation. Given that Dachs has been shown to regulate tension downstream of Fat/Ds, it would be informative to determine whether tissue tension is altered in dachs mutant wings and to assess the relative contribution of Dachs- versus MyoII-mediated tension to wing shape control. It would also be interesting to test whether Rok activation can rescue dachs loss-of-function phenotypes.

      Response: In these Rok experiments there was no separate temporal control of Rok RNAi or Rok^CA expression, they were expressed under nub-Gal4 control throughout development.

      We will add examination of myosin in combinations of ds RNAi and rok manipulation as in Fig 7G to a revised manuscript.

      Data for fat and ds comparable to that shown in Fig 6H is already presented in Fig 3D, and we don’t think it’s necessary to reproduce this again in Fig 6H.

      We agree that the effects of Rok manipulations are milder than those of Fat manipulations; as we try to discuss, this could be because the pattern or polarity of myosin is also important, not just the absolute level, and we will add assessment of myosin polarity.

      The suggestion to also look at dachs mutants is reasonable, and we will add this. In addition, we plan to add an "activated" Dachs (a Zyxin-Dachs fusion protein previously described in Pan et al 2013) that we anticipate will provide further evidence that the effects of Ds-Fat are mediated through Dachs. We will also add the suggested experiment combining Rok activation with dachs loss-of-function.

      Figure 7: The authors use genetic interactions to support their claim that Fat controls wing shape independently of Hippo signaling. However, these interactions do not formally exclude a role for Hippo. Moreover, previous work has shown that tissue tension regulates Hippo pathway activity, implying that any manipulation of tension could indirectly affect Hippo and growth. To provide more direct evidence, the authors should further analyze MyoII localization and tissue tension under the various experimental conditions tested (as also suggested above).

      Response: As discussed above, our data clearly show that Fat has effects independently of Hippo signaling that are crucial for its effects on wing shape, but we did not mean to imply that the regulation of Hippo signaling by Fat makes no contribution to wing shape control, and we will revise the text to make this clearer. We will also add additional analysis of Myosin localization, as described above.

      Reviewer #3 (Significance (Required)): How organ growth and shape are controlled remains a fundamental question in developmental biology, with major implications for our understanding of disease mechanisms. The Drosophila wing has long served as a powerful and informative model to study tissue growth and morphogenesis. Work in this system has been instrumental in delineating the conserved molecular and mechanical processes that coordinate epithelial dynamics during development. The molecular regulators investigated by the authors are highly conserved, suggesting that the findings reported here are likely to be of broad biological relevance.

      Previous studies have proposed that anisotropic tissue growth regulates wing shape during larval development and that such anisotropy induces mechanical responses that promote MyoII localization (Legoff et al., 2013, PMID: 24046320; Mao et al., 2013, PMID: 24022370). The Ds/Fat system has also been shown to regulate tissue tension through the Dachs myosin, a known modulator of the Hippo/YAP signaling pathway. As correctly emphasized by the authors, the respective contributions of anisotropic growth and mechanical tension to wing shape control remain only partially understood. The current study aims to clarify this issue by analyzing the role of Fat/Ds in controlling MyoII localization and, consequently, wing shape. This represents a potentially valuable contribution. However, the proposed mechanistic link between Fat/Ds and MyoII localization remains insufficiently explored. Moreover, the role of MyoII is not fully discussed in the broader context of Dachs function and its known interactions with MyoII (Mao et al., 2011, PMID: 21245166; Bosveld et al., 2012, PMID: 22499807; Trinidad et al., 2024, PMID: 39708794). Most importantly, the experimental evidence supporting the authors' conclusions would benefit from further strengthening. It should also be noted that disentangling the relative contributions of anisotropic growth and MyoII polarization to tissue shape and size remains challenging, as MyoII levels are known to increase in response to anisotropic growth (Legoff et al., 2013; Mao et al., 2013), and mechanical tension itself can modulate Hippo/YAP signaling (Rauskolb et al., 2014, PMID: 24995985).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Overall Response.

      We would like to thank the reviewers for their analysis of the manuscript. From their comments it is clear that our manuscript was not clear. We completely rewrote the manuscript to focus on the central question, which was how Adam13 regulates gene expression in general and TFap2a in particular, leading to the expression of Calpain8, a protein required for CNC migration.

      The following model will be the central line of our story. It will address all of the proteins involved and the mechanistic evidence that links Adam13 to one of its proven effector targets, Calpain8.

      **Reviewer #1 (Evidence, reproducibility and clarity (Required)):**

      *In this manuscript, Pandey et al. show that the ADAM13 protein modulates histone modifications in cranial neural crest and that the Arid3a protein binds the Tfap2a promoter in an Adam13-dependent manner and has promoter-specific effects on transcription. Furthermore, they show that the Adam13 and human ADAM9 proteins associate with histone modifiers as well as proteins involved in RNA splicing. Although the manuscript is mostly clearly written and the figures well assembled, it reads like a couple of separate and unfinished stories.*

      I believe that our story line was not clear and that the overarching question was not well stated. We have made every effort to change this in the revised manuscript. I would like to include a figure that explains the story.

      In short:

      1 We knew that Adam13 could regulate gene expression in CNC via its cytoplasmic domain.

      2 We also knew that this required Adam13 interaction with Arid3a and that a direct target was the transcription factor TFAP2a, which in turn regulates functional targets that we had identified, including the protocadherin PCNS and the protease Calpain8.

      Our goal was to understand the mechanism allowing Adam13 to regulate gene expression.

      3 The first part of this manuscript shows how Adam13 modulates histone modification in vivo in the CNC, globally as well as specifically on the Tfap2a promoter. This results in open chromatin.

      4 Using ChIP we show that Adam13 and Arid3a both bind to the Tfap2a promoter and that Arid3a binding to the first ATG depends on Adam13.

      5 Using Luciferase reporter we show that both Adam13 and Arid3a can induce expression at the first ATG.

      *They show using immunocytochemistry and qPCR that ADAM13 knockouts in CNCs affect histone modifications. Here ChIP-seq or Cut-n-Run experiments would be more appropriate and would result in a more comprehensive understanding of the changes mediated.*

      I agree, but we did not have the funds, and now I have nobody working in the lab to do this experiment. These are also likely to overlap with the RNAseq data that we have and would simply add more open leads. We chose to go after the only direct target that we know, which is TFAP2a, and to focus on this gene to understand the mechanism.

      We believe that the ChIP-PCR experiments are sufficient for this story.

      *The immunohistochemistry assays should at least be verified further using western blotting or other more quantitative methods.*

      Immunofluorescence with statistical analysis is a valid quantification method. Western blot of CNC explants is not trivial and requires a large amount of material. Given the small overall change, we also would not expect to be able to detect the change over the noise of western blot. The ChIP-PCR confirms our finding in a completely independent manner.

      *The authors then show that ADAM13 interacts with a number of histone modifiers such as KDM3B, KDM4B and KMT2A but strangely they do not follow up this interesting observation to map the interactions further (apart from a co-ip with KMT2A), the domains involved, the functional role of the interactions or how they mediate the changes in chromatin modifications.*

      We selected KMT2a because it is expressed in the Hek293T cells. KMT2D has been shown to regulate CNC development in Xenopus and is responsible for Kabuki syndrome in humans. We used AlphaFold to predict interactions and found that Adam13 interacts with the SET domain. In addition, we see multiple SET domain-containing proteins in our mass spec data. The mass spec was done on human Hek293T cells, which express a subset of KMT proteins. We now include evidence that Adam13 interacts with the KMT2D SET domain (new Figure 5D).

      The authors then show that ADAM13 affects expression of the TFAP2a gene in a promoter specific manner - affecting expression from S1 but not S2.

      It is the S1 but not S3. Adam13 has no effect on S2.

      *They further show that ADAM13 affects the binding of the Arid3 transcription factor to the S1 promoter but not to the S3 promoter. However, ADAM13 was present at both promoters. Absence of ADAM13 resulted in increased H3K9me2/3 and decreased H3K4me3 at the S1 promoter whereas only H3K4me3 was changed at the S2 promoter.*

      S3 not S2.

      *Unfortunately, they do not show how this is mediated or through which binding elements this takes place. Why is ADAM13 present at both promoters but only affects Arid3 binding at S1?*

      We agree this is a very interesting question that could be the subject of an entire publication. Promoter deletion and mutation to identify which sites are bound and modulated by Adam13/Arid3a is not trivial.

      *The authors claim that transfecting Arid3a and Adam13 together further increases expression from a reporter (Fig 4E) but this is not true as no statistical comparison is done between the singly transfected and double transfected cells.*

      This is correct; there is a small increase with both that is not significant. The fact that both proteins can induce the promoter suggests, but does not prove, that they can have additive roles. The loss-of-function experiment shows that the human Arid3a expressed in Hek293T cells is important for the Adam13-induced increase of S1. It is possible that the dose of the endogenous Arid3a is sufficient to get full activity of Adam13.

      *Then the authors surprisingly start investigating association of proteins with the two isoforms of TFAP2a which in the mind of this reviewer is a different question entirely.*

      We agree and have removed this part of the manuscript.

      *They find a number of proteins involved in splicing. And the observation that ADAM13 also interacts with splicing factors is really irrelevant in terms of the story that they are trying to tell. Transcription regulation and splicing are different processes and although both affect the final outcome, mRNA, they need to be investigated separately. The link is at least not very clear from the manuscript. Again, the effects on splicing are not further investigated through functional analysis and as presented the data presented is too open-ended and lacking in clarity.*

      We agree that, besides the different activities of the TFap2a isoforms, the rest of the splicing regulation could be a separate study. We were interested in understanding how these two isoforms could activate Calpain8 so differently, which is why we looked at LC/MS/MS. We have removed this part of the story from the manuscript.

      *Additional points: 1. In the abstract they propose that the ADAMs may act as extracellular sensors. This is not substantiated by the results.*

      As an extracellular protein translocating into the nucleus, it is a possibility that we propose, but I agree this is not investigated in this manuscript. We will modify the text.

      *2. Page 5, line 16: what is referred to by 6 samples 897 proteins? Were 6 samples analyzed for each condition? The number of repeats for the mass spec analysis is not clear from the text nor are the statistical parameters used to analyse the data. This is also true for the mass spec presented in the part on TFAP2aL-S1 and Adam13 regulate splicing. Statistics and repeats are not presented.*

      In general, we provide biological triplicates and use the statistical function of Scaffold to identify proteins that are significantly enriched or absent in each sample.

      When we specify 6 samples, it means that 6 independent protein samples were analyzed and used for our statistics. We use the Scaffold t-test with a p-value of less than 0.05. Peptides were identified with 95% confidence and proteins with 99% confidence.

      *3. Page 6, line 19: set domain should be SET domain.*

      Yes* 4. The number of repeats in the RNA sequencing of the CNCs is not clear from the text. *

      Three biological replicates (different batches of embryos from different females).

      *5. The explanation of Figure C is a bit lacking. There are two forms of TFAP2a, L and S, but only one is presented in the figure. Do both forms have the extra S1-3 exons? Also, at the top of the figure it is not clear that the boxes are part of a continuous DNA sequence. Also, it is not clear which codon is not coding.*

      Xenopus laevis is pseudotetraploid, giving in most cases L and S genes in addition to the 2 alleles from being diploid. The TFAP2a gene structure is conserved between both alloalleles and is similar to the human gene. For promoter analysis and ChIP-PCR we chose one of the alloalleles (L), given that the RNAseq data showed that both genes and variants behave the same in response to Adam13. This only becomes important in loss-of-function experiments, in which both the L and S versions need to be knocked down or knocked out.

      * In the sashimi plot there are green and pink shaded areas. What do they denote? What exactly is lacking in the MO13 mutant - seems that a particular exon is missing suggesting skipping?*

      MO13 is a morpholino that blocks the translation of Adam13 (already characterized, with >90% of the protein absent) but does not affect Adam13 mRNA expression.

      *7. Page 11, line 9: „with either MbC or MbC and MO13" needs to be rephrased.*

      Will do.

      *8. Page 11, line 19: „the c-terminus of....and S3) and" should be „the C-terminus of...and S3 and".*

      *9. Page 15, line 10: substrateS*

      *10. Page 16, line 23: the sentence „increases H3K9 to the promoter of the most upstream" needs revision.*

      *11. Page 26, line 12: Here the authors say: „for two samples two-tail unpaired". What does this mean? Statistics should not be performed on fewer than three samples. In legend to Figure 6 it indicates that T-test was performed on two samples.*

      *12. The discussion should be shortened and simplified.*

      *13. Figure 1 legend. How many images were quantitated for each condition?*

      At least 3 images per condition for each of 3 independent experiments (at least 9 images per condition).

      *14. Figure 2 has a strange order of panels where G is below B.*

      *15. Figure 6 legend, line 12. „proteins that were significantly enriched in either of the 2 samples" is not very clear. What exactly does this mean?*

      Reviewer #1 (Significance (Required)):

      If the authors follow up on either the transcription-part of the story, or the splicing part of the story, they are likely to have important results to present. However, in the present format the paper is lacking in focus as both issues are mixed together without a clear end-result. *

      We have entirely changed the paper according to these comments.

      *Reviewer #2 (Evidence, reproducibility and clarity (Required)): **

      Panday et al seeks to determine the function of ADAM13 in regulating histone modifications, gene expression and splicing during cranial neural crest development. Specifically, the authors tested how Adam13, a metalloprotease, could modify chromatin by interaction with Arid3a and Tfap2a and RNA splicing and gene expression. They then utilize knockouts in Xenopus and HEK293T cells followed by immunofluorescence, IPs, BioID, luciferase assays, Mass spec and RNA assays. Although there is some strong data in the BioID and luciferase experiments, the manuscript tells multiple stories, linking together too many things to make a compelling story. The result is a paper that is very difficult to read and understand the take home message. In addition, some of the conclusions are not supported by the data. This unfortunately means it is not ready for publication. However, I have added below some suggestions that would strengthen the manuscript. My comments are below: *

      Clarity is clearly an issue here. The new version is entirely re-written.

      Here is the take home message:

      We knew that Adam13 could regulate gene expression via its cytoplasmic domain. One of the key targets was identified as Calpain8, a protein critical for CNC migration. We subsequently showed that Adam13 and Arid3a regulated Tfap2a expression which in turn regulated Calpain8.

      In this manuscript we investigated 1) how Adam13 regulates TFAP2a and 2) how Tfap2a controls Calpain8 expression.

      The take-home message is that Adam13 binds to a histone methyltransferase and changes the histone methylation code overall in the CNC, and in particular at the TFAP2a promoter. This results in more open chromatin. We further find that Adam13 binds to the Tfap2a promoter in vivo and is important for Arid3a binding to the first start. Tfap2a that includes this N-terminal sequence regulates Capn8 expression.

      Major comments: 1. I think it would be better to split out the chromatin modification function from the splicing in two separate papers. While there is a connection, having it all together makes the story difficult to follow. *

      We agree, but we believe that the S1 vs S3 story of Tfap2a is important for the overall story. The new paper does not emphasize splicing.

      *2. The immunofluorescence of H3K9me2/3, in Figure 1, 2, 3 following Adam13 knockdown is not convincing. There seems to be a strong edge effect especially in Figure 2 and 3.*

      The statistical analysis shows that the results, while modest, are significant (three independent experiments using 3 different females and 3 explants for each condition were analyzed). The edge effect is eliminated by the mask that we use, which normalizes the expression to either DAPI or Snai2. The edge effect is also seen in both control and KD explants. These results are further confirmed by ChIP-PCR on one direct target.
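The mask-based normalization described above can be illustrated with a minimal sketch. It assumes that the quantification is a per-pixel ratio of the modification signal to a reference channel (for example DAPI), averaged over the pixels selected by a marker mask. The function name `masked_ratio` and the toy pixel values are hypothetical, and the actual image-analysis pipeline used for the manuscript may differ.

```python
def masked_ratio(signal, reference, mask):
    """Mean signal/reference ratio over the pixels selected by the mask.

    signal, reference and mask are equally sized 2-D lists; pixels with a
    zero reference value are skipped to avoid division by zero.
    """
    ratios = [
        s / r
        for s_row, r_row, m_row in zip(signal, reference, mask)
        for s, r, m in zip(s_row, r_row, m_row)
        if m and r > 0
    ]
    if not ratios:
        raise ValueError("mask selects no usable pixels")
    return sum(ratios) / len(ratios)


# Toy 2x3 "images" (hypothetical values): modification signal, DAPI reference,
# and a mask marking marker-positive pixels.
signal = [[10.0, 20.0, 5.0], [8.0, 0.0, 30.0]]
dapi = [[20.0, 40.0, 10.0], [16.0, 0.0, 15.0]]
mask = [[True, True, True], [True, False, False]]

normalized = masked_ratio(signal, dapi, mask)
print(normalized)
```

Because each pixel is divided by its own reference value, a region where both channels are uniformly dimmer or brighter (such as an explant edge) contributes the same ratio as the rest of the explant.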

      Similarly the Arid3a expression in Supp Figure 1 if anything seems increased.

      We have previously shown that Arid3a expression is not affected by Adam13 KD (Khedgikar et al). Our point here is simply that the difference in Tfap2a cannot be explained by a decrease in Arid3a expression. It is not a critical figure and was eliminated in the new manuscript.

      *It would be better to quantify by western blot and not by fluorescent intensity since it is difficult to determine what a small change in fluorescent intensity means in vivo. *

      Not all antibodies used here work by western blot and the quantity of material required for western blot is much larger than IF. Given the small overall changes and the variability observed in Western blot it is not a viable alternative.

      IF is a quantitative method that has been used widely to assess increases or decreases in protein level or post-translational modification. The fact that the same post-translational modification that we see in cranial neural crest explants can also be seen by ChIP-PCR on the Tfap2a promoter confirms this observation.

      *Also, it does not say in the text or the figure legend what these are, Xenopus explants of CNC? *

      These are CNC explants. It is now clearly stated in the figure legend.* 3. The rationale for isolating KMT2A from the other chromatin modifiers in the dataset is not clear. *

      The new manuscript clarifies that point. Because we are using Hek293T cells in this assay, which are derived from human embryonic kidney rather than Xenopus cranial neural crest cells, we are not interested in a specific protein but rather in a family of proteins that can modify histones (KMT and KDM). Our rationale is that if Adam13 can bind to KMT2 via the SET domain, it is likely to interact with the KMT2 proteins that are expressed in the CNC. KMT2A and D are expressed in the CNC. This is why we selected KMT2A here (Hek293T). We now include 1 co-IP with the SET domain of Xenopus KMT2D (new figure 5D).

      From the RNA-seq in Supp Figure 2 it is not changed as much as likely some of the others.

      The new manuscript addresses this point. We did not show or expect that the loss of Adam13 would affect mRNA expression of Kmt2.

      *Also, the arrow seems to indicate that it is right above the cutoff. What about other proteins with ATPase activity? That is the top hit in the Dot plot nuclear function. Would be helpful to write out Adam13 cytoplasm/nucleus here. *

      We have used another set of proteomics data that does not include the cytoplasmic/nuclear extract to simplify the results. We hope that the changes make it more obvious.

      Given that we are looking at chromatin-remodeling enzymes here, we chose not to investigate the ATPases further in this report. This is such a wide category that it could lead us away from the main story.

      *4. The splicing information, while interesting would be better as a different manuscript. The sashimi plot requires more explanation as written.*

      We agree and think that a simple representation of the fold change of the different isoform is more obvious. It is now a minor part of figure 1 and the legend has been improved to describe the method here.

      How do you tell if the interactions are changed from this?

      I do not understand this question. The sashimi plot indicates the reads from the mRNA that span from one exon to the next, quantifying the specific exon usage. It can therefore be quantified and compared between different conditions.
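As an illustration of how junction-spanning reads from a sashimi plot can be turned into a number that is comparable between conditions, the sketch below computes a percent-spliced-in (PSI) value for a single cassette exon. This is a commonly used summary and not necessarily the quantification used in the manuscript. The function name and the read counts are hypothetical.

```python
def percent_spliced_in(upstream_junction, downstream_junction, skipping_junction):
    """PSI for one cassette exon from junction-spanning read counts.

    Inclusion is supported by two junctions (into and out of the exon), so
    their counts are averaged before being compared with the single junction
    that skips the exon.
    """
    inclusion = (upstream_junction + downstream_junction) / 2
    total = inclusion + skipping_junction
    if total == 0:
        raise ValueError("no junction reads for this exon")
    return inclusion / total


# Hypothetical read counts for a control and a knockdown sample.
psi_control = percent_spliced_in(40, 60, 50)
psi_knockdown = percent_spliced_in(10, 10, 90)
delta_psi = psi_knockdown - psi_control
print(psi_control, psi_knockdown, delta_psi)
```

A negative `delta_psi` in this toy example would indicate that the exon is included less often in the knockdown than in the control.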

      • The authors argue there is a reduction of Tfap2a in Figure 3H but half the explant is not expressing sox9 in the Adam13 knockdown. How is this kind of experiment controlled when measure areas that don't have any fluorescence because of the nature of the explants? *

      We have removed this figure, as we had already shown previously by western blot that Tfap2a protein decreased in MO13 embryos. As noted on the histogram, the fluorescence is only measured in Sox9-positive cells in each explant (three independent experiments with 3 explants for each). We have also seen a decrease by western blot and in mRNA expression (both RNAseq and real-time PCR). In most of our explants, the vast majority of the cells are positive for Snai2 and Sox9, while those that are negative are positive for Sox3 (data not shown here). There is always less signal in the center of the explant, possibly due to the penetration of the antibody or interference with the signal by the cells' pigment or yolk autofluorescence. Our control explants show the same effect, so our quantification is valid.

      *5. The use of a germ line Xenopus mutant for Adam13 is great but how were these knockouts validated?*

      All of the KO were validated by sequencing, RNAseq and protein expression. These are now included in the supplemental figure 1.

      *More information is required here. The Chip-qPCR has a lot of variability between the samples, especially in the H3K9me2/3. *

      All ChipPCR were performed on Xenopus embryos. The variability is tested by statistical analysis and is either significant or not.

      Because these are in cell lines, this should be more consistent.

      They are not in cell lines but in Xenopus embryos.

      • In addition, it is difficult to understand what this means for cranial neural crest cells when assaying in HEK293T cells with the luciferase assay. *

      We use Luciferase assay in Hek293T cells to test if Xenopus protein can induce a specific reporter (Gain of function). We also use luciferase reporter in Xenopus to test if they can perceive the loss of a specific protein (For example Adam13).

      Our results show that Adam13 or Arid3a expression in Hek293T cells can induce the TFAP2S1 reporter.

      *6. The migration assay shows only an example of what it looks like to have defective migration. But it would be better to show control embryos, embryos with Adam13 knockdown and what the rescues look like so the reader can make their own conclusion.*

      We can certainly include this, but we have published this assay in multiple publications before. The picture is a single example; the histogram shows the statistical validation.

      • The argument from the section above suggests the S1 isoform is the primary one but S3 in this assay also rescues, please explain what this result means since it seems to suggest that even though these isoforms have different activity the function is similar in terms of the ability to rescue defective migration. *

      The result in Hek293T cells shows that only TFAP2aS1 can induce Calpain8, while both S1 and S3 can partially rescue CNC migration in embryos lacking Adam13. The issue here is the dose of mRNA injected for each variant might be too high. Adam13 proteolytic activity is also critical, so we do not expect a complete rescue. The fact that S1 is significantly better at rescuing than S3 is relevant here. It is possible that if we were to decrease the dose of each mRNA we would find one in which S3 no longer rescues but S1 does.

      * The next section again talks about yet another protein Calpain-8. Here the authors use MO13 for luciferase assays instead of HEK293 cells. The authors do not explain why they decided to switch from cells to MO.*

      Calpain8 is one of the validated targets of Adam13 that can rescue CNC migration (Cousin et al Dev Cell). We use the Xenopus Capn8 luciferase reporter to show first, in vivo, that loss of Adam13 reduces its expression (similar to the Capn8 gene). We then went in vitro, using Hek293T cells for gain-of-function experiments, which show that only the TFAP2aS1 variant can induce it while S3 does not.

      We hope that the graphical summary and the new manuscript make this clear.* 8. The experiment to IP RNA supports only the correlation that Adam9 and Adam13 bind RNA and RNA binding proteins to regulate splicing. This conclusion presented is not supported by the data presented here. While there is a sentence about why Adam9 was chosen here, it would be preferred to focus on Adam13 as the rest of the manuscript is focused on Adam13. The conclusions are generalized to all ADAMs, but ADAM13 and ADAM9 are the only ADAMs investigated in the manuscript *

      This figure is no longer included. For each of the protein classes that we identify by mass spectrometry, we try to find a validation. RNA-IP is simply a validation that Adam13 and Adam9 can bind to complexes that include RNA in a cytoplasmic domain-dependent fashion. The conclusion that Adam13 and possibly ADAM9 might be involved in regulating splicing rests on three observations: 1) the proteins associated with Adam13 include multiple splicing factors, 2) the RNAseq analysis shows abnormal splicing in CNC missing Adam13, and 3) the form of TFAP2a induced by Adam13 (S1) associates significantly more with splicing factors than the S3 isoform.

      We agree that the generalization to other ADAMs is not demonstrated here but only suggested. We selected ADAM9 and ADAM19 because we have shown that they can each rescue Adam13 function in the CNC. Unfortunately, there is no ADAM19 antibody on the market that works by IP. We have tested multiple companies and multiple cell lines.

      We believe that the ADAM9 experiment is critical to show that the proteins associated with Adam13 are not simply the result of overexpressing a protein from a different species, since ADAM9 is the endogenous protein.

      Minor comments 1. The manuscript using a lot of abbreviations (PCNS, NI, MO, SH3) and lingo that are unclear to a general reader. Please define acronyms when first used, as well as be clear on which model is being used in each experiment. *

      We have corrected this* 2. Similarly, the figures are not labeled such that a reader would be able to understand ie MO13 should be Adam13 knockdown etc. *

      We have corrected this in the legend

      • Please identify the genes on the heatmap and some highlighted genes from volcano plot from the RNA-seq.*

      The volcano plot is from MS/MS, not RNAseq. We have lists of all of the genes and/or proteins corresponding to each figure in tables.

      We now have a figure from the RNAseq, and a subset of genes of interest is shown.

      *4. Why use the flag tag in Figure 5?*

      We used Flag-tagged constructs to immunoprecipitate only the variants and not the endogenous TFAP2a in these experiments. We also used RFP-Flag to eliminate any protein that bound to the tag or the antibody.

      This figure is no longer in the manuscript.* 5. Is the data in figure 4A-D the same as Supp. Figure 4A-D? *

      These are independent biological replicates of the same experiment.* 6. Please italicize gene symbols - e.g. "key transcription factors that exemplify CNC, such as the SOX9, FOXD3, SNAI1, SNAI2, and TFAP2 family". *

      We clearly have missed some. We use italics for genes and regular type for proteins. It might not be clear in the text when we are referring to genes and when to proteins. We will correct this in the rewrite.

      *7. Please review the manuscript for grammatical and typographical errors.*

      We have used all available software, including Word and Grammarly. We will try to improve on the next version.

      **Cross-commenting**

      I think the two reviewers are on the same page on this manuscript.

      Reviewer #2 (Significance (Required)):

      If more solid, would be a conceptual advance in role of Adam13 in mediating chromatin modification and transcription factors, adds to existing work from this lab, good for a specialized audience, my expertise is in neural crest development, non-mammalian models, epigenetic regulators.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors describe a new computational method (SegPore), which segments the raw signal from nanopore direct RNA-Seq data to improve the identification of RNA modifications. In addition to signal segmentation, SegPore includes a Gaussian Mixture Model approach to differentiate modified and unmodified bases. SegPore uses Nanopolish to define a first segmentation, which is then refined into base and transition blocks. SegPore also includes a modification prediction model that is included in the output. The authors evaluate the segmentation in comparison to Nanopolish and Tombo (RNA002) as well as f5c and Uncalled 4 (RNA004), and they evaluate the impact on m6A RNA modification detection using data with known m6A sites. In comparison to existing methods, SegPore appears to improve the ability to detect m6A, suggesting that this approach could be used to improve the analysis of direct RNA-Seq data.

      Strengths:

      SegPore addresses an important problem (signal data segmentation). By refining the signal into transition and base blocks, noise appears to be reduced, leading to improved m6A identification at the site level as well as for single read predictions. The authors provide a fully documented implementation, including a GPU version that reduces run time. The authors provide a detailed methods description, and the approach to refine segments appears to be new.

      Weaknesses:

      The authors show that SegPore reduces noise compared to other methods, however the improvement in accuracy appears to be relatively small for the task of identifying m6A. To run SegPore, the GPU version is essential, which could limit the application of this method in practice.

      As discussed in Paragraph 4 of the Discussion, we acknowledge that the improvement of SegPore combined with m6Anet over Nanopolish+m6Anet in bulk in vivo analysis is modest. This outcome is likely influenced by several factors, including alignment inaccuracies caused by pseudogenes or transcript isoforms, the presence of additional RNA modifications that can affect signal baselines, and the fact that m6Anet is specifically trained on Nanopolish-derived events. Additionally, the absence of a modification-free (in vitro transcribed) control sample in the benchmark dataset makes it challenging to establish true k-mer baselines.

      Importantly, these challenges do not exist for in vitro data, where the signal is cleaner and better defined. As a result, SegPore achieves a clear and substantial improvement at the single-molecule level, demonstrating the strength of its segmentation approach and its potential to significantly enhance downstream analyses. These results indicate that SegPore is particularly well suited for benchmarking and mechanistic studies of RNA modifications under controlled experimental conditions, and they provide a strong foundation for future developments.

      We also recognize that the current requirement for GPU acceleration may limit accessibility in some computational environments. To address this, we plan to further optimize SegPore in future versions to support efficient CPU-only execution, thereby broadening its applicability and impact.

      Reviewer #2 (Public review):

      Summary:

      The work seeks to improve detection of RNA m6A modifications using Nanopore sequencing through improvements in raw data analysis. These improvements are said to be in the segmentation of the raw data, although the work appears to position the alignment of raw data to the reference sequence and some further processing as part of the segmentation, and result statistics are mostly shown on the 'data-assigned-to-kmer' level.

      As such, the title, abstract and introduction stating the improvement of just the 'segmentation' does not seem to match the work the manuscript actually presents, as the wording seems a bit too limited for the work involved.

      The work itself shows minor improvements in m6Anet when replacing Nanopolish' eventalign with this new approach, but clear improvements in the distributions of data assigned per kmer. However, these assignments were improved well enough to enable m6A calling from them directly, both at site-level and at read-level.

      A large part of the improvements shown appear to stem from the addition of extra, non-base/kmer specific, states in the segmentation/assignment of the raw data, removing a significant portion of what can be considered technical noise for further analysis. Previous methods enforced assignment of (almost) all raw data, forcing a technically optimal alignment that may lead to suboptimal results in downstream processing as datapoints could be assigned to neighbouring kmers instead, while random noise that is assigned to the correct kmer may also lead to errors in modification detection.

      For an optimal alignment between the raw signal and the reference sequence, this approach may yield improvements for downstream processing using other tools.

      Additionally, the GMM used for calling the m6A modifications provides a useful, simple and understandable logic to explain the reason a modification was called, as opposed to the black-box models that are nowadays often employed for these types of tasks.
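To make the interpretability argument concrete, the following is a minimal sketch of a two-component, one-dimensional Gaussian mixture fitted by expectation-maximization to per-read signal means at one site. It is a generic textbook GMM and not SegPore's implementation. The function names, the initialization, and the toy current values are assumptions made for illustration, and which component corresponds to the modified state would have to be decided from known k-mer baselines.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior(x, weights, means, stds):
    """Posterior probability of each mixture component given one value."""
    p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(p)
    if total == 0:  # value far from both components: fall back to uniform
        return [0.5, 0.5]
    return [pk / total for pk in p]


def fit_two_component_gmm(values, n_iter=50):
    """Fit a two-component 1-D GMM with plain EM (illustrative only)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                      # initialize at the extremes
    spread = (hi - lo) / 4 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = [posterior(x, weights, means, stds) for x in values]
        # M-step: re-estimate weight, mean and standard deviation
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, stds


# Hypothetical per-read signal means (pA) at one site: two populations.
reads = [99.0, 100.0, 101.0, 100.5, 99.5, 109.0, 110.0, 111.0, 110.5, 109.5]
weights, means, stds = fit_two_component_gmm(reads)
read_posterior = posterior(110.2, weights, means, stds)
print(weights, means, stds, read_posterior)
```

In this kind of model, the fitted component weight gives a site-level estimate of the fraction of reads in each population, and the per-read posterior gives a read-level call. Both can be traced back to an explicit mean and standard deviation, which is the explainability the reviewer refers to.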

      Weaknesses:

      The manuscript suggests the eventalign results are improved compared to Nanopolish. While this is believably shown to be true (Table 1), the effect on the use case presented, downstream differentiation between modified and unmodified status on a base/kmer, is likely limited, since during downstream modification calling the noisy distributions are often 'good enough'. E.g. Nanopolish uses the main segmentation+alignment for a first alignment and follows up with a form of targeted local realignment/HMM test for modification calling (and for training too), decreasing the need for the near-perfect segmentation+alignment this work attempts to provide. Any tool applying a similar strategy probably largely negates the problems this manuscript aims to improve upon. Should a use-case come up where this downstream optimisation is not an option, SegPore might provide the necessary improvements in raw data alignment.

      Thank you for this thoughtful comment. We agree that many current state-of-the-art (SOTA) methods perform well on benchmark datasets, but we believe there is still substantial room for improvement. Most existing benchmarks are based on limited datasets, primarily focusing on DRACH motifs in human and mouse transcriptomes. However, m6A modifications can also occur in non-DRACH motifs, where current models tend to underperform. Furthermore, other RNA modifications, such as pseudouridine, inosine, and m5C, remain less studied, and their detection is likely to benefit from more accurate and informative signal modeling.

      It is also important to emphasize that raw signal segmentation and RNA modification detection are fundamentally distinct tasks. SegPore focuses on improving the segmentation step by producing a cleaner and more interpretable signal, which provides a stronger foundation for downstream analyses. Even if RNA modification detection algorithms such as m6Anet can partially compensate for noisy segmentation in specific cases, starting from a more accurate signal alignment can still lead to improved accuracy, robustness, and interpretability—particularly in challenging scenarios such as non-canonical motifs or less characterized modifications.

      Scientific progress in this field is often incremental, and foundational improvements can have a significant long-term impact. By enhancing raw signal segmentation, SegPore contributes an essential building block that we expect will enable the development of more accurate and generalizable RNA modification detection algorithms as the community integrates it into more advanced workflows.

      Appraisal:

      The authors have shown their method's ability to identify noise in the raw signal and remove its values from the segmentation and alignment, reducing its influence on further analyses. Figures directly comparing the values per kmer do show a visibly improved assignment of raw data per kmer. As a replacement for Nanopolish' eventalign it seems to have a rather limited, but improved, effect on m6Anet results. At the single-read level of modification calling, this work does appear to improve upon CHEUI.

      Impact:

      With the current developments for Nanopore based modification calling largely focusing on Artificial Intelligence, Neural Networks and the like, improvements made in interpretable approaches provide an important alternative that enables deeper understanding of the data rather than providing a tool that plainly answers the question of whether a base is modified or not, without further explanation. The work presented is best viewed in context of a workflow where one aims to get an optimal alignment between raw signal data and the reference base sequence for further processing. For example, as presented, as a possible replacement for Nanopolish' eventalign. Here it might enable data exploration and downstream modification calling without the need for local realignments or other approaches that re-consider the distribution of raw data around the target motif, such as a 'local' Hidden Markov Model or Neural Networks. These possibilities are useful for a deeper understanding of the data and further tool development for modification detection works beyond m6A calling.

      Reviewer #3 (Public review):

      Summary:

      Nucleotide modifications are important regulators of biological function, however, until recently, their study has been limited by the availability of appropriate analytical methods. Oxford Nanopore direct RNA sequencing preserves nucleotide modifications, permitting their study, however many different nucleotide modifications lack an available base-caller to accurately identify them. Furthermore, existing tools are computationally intensive, and their results can be difficult to interpret.

      Cheng et al. present SegPore, a method designed to improve the segmentation of direct RNA sequencing data and boost the accuracy of modified base detection.

      Strengths:

      This method is well described and has been benchmarked against a range of publicly available base callers that have been designed to detect modified nucleotides.

      Weaknesses:

      However, the manuscript has a significant drawback in its current version. The most recent nanopore RNA base callers can distinguish between different ribonucleotide modifications, however, SegPore has not been benchmarked against these models.

      The manuscript would be strengthened by benchmarking against the rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0 dorado models, which are reported to detect m5C, m6A_DRACH, inosine_m6A and PseU.

      A clear demonstration that SegPore also outperforms the newer RNA base caller models will confirm the utility of this method.

      Thank you for highlighting this important limitation. While Dorado, the new ONT basecaller, is publicly available and supports modification-aware basecalling, suitable public datasets for benchmarking m5C, inosine, m6A, and PseU detection on RNA004 are currently lacking. Dorado’s modification-aware models are trained on ONT’s internal data, which is not publicly released. Therefore, it is currently not feasible to directly evaluate or compare SegPore’s performance against Dorado for these RNA modifications.

      We would also like to emphasize that SegPore’s primary contribution lies in raw signal segmentation, which is an upstream and foundational step in the RNA modification detection pipeline. As more publicly available datasets for RNA004 modification detection become accessible, we plan to extend our work to benchmark and integrate SegPore with modification detection tasks on RNA004 data in future studies.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Comments based on Author Response

      “However, it is valid to compare them on the segmentation task, where SegPore exhibits better performance (Table 1).”

      This dodges the point of the actual use case of this approach, as Nanopolish indeed does not support calling modifications for this kind of data, but the general approach it uses might, if adapted for this data, nullify the gains made in the examples presented.

      We respectfully disagree with the comment that the advantages demonstrated by SegPore could be “nullified”. Although SegPore’s performance is indeed more modest in in vivo datasets, it shows substantially better performance than CHEUI in in vitro data, clearly demonstrating that improved segmentation directly contributes to more accurate RNA modification estimation.

      It is worth noting that CHEUI relies on Nanopolish’s segmentation results for m6A detection. Despite this, SegPore outperforms CHEUI, further supporting the conclusion that segmentation quality has a meaningful impact on downstream modification calling.

      In conclusion, based on our current experimental results, SegPore is particularly well suited for RNA modification analysis from in vitro transcribed data, where its improved segmentation provides a clear advantage over existing methods.

      Further comments

      (2) “(2) Page 3  employ models like Hidden Markov Models (HMM) to segment the signal, but they are prone to noise and inaccuracies”

      “That's the alignment/calling part, not the segmentation?”

      “Current methods, such as Nanopolish, employ models like Hidden Markov Models (HMM) to segment the signal”

      I get the impression the word 'segment' has a different meaning in this work than what I'm used to based on my knowledge around Nanopolish and Tombo, see the deeper code examples further down below.

      Additionally, in Nanopolish there is a clear segmentation step (or event detection) without any HMM, then a sort of dynamic time warping step that aligns the segments and afterwards re-combines some segments into a single segment where necessary. I believe the HMM in Nanopolish is not used at all except for modification calling, but if you can point out otherwise I'm open to proof.

      Now I believe it is the meaning of 'segmenting the signal' that confuses me, and now the clarification makes it a bit odd as well:

      “Nanopolish and Tombo align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation task, referred to as "eventalign" in Nanopolish.”

      So now it's clearly stated the raw signal is being 'aligned' and then the process is suddenly defined as the 'segmentation task', and again referred to as "eventalign". Why is it not referred to as the 'alignment task' instead?

      I understand the segmentation and alignment parts are closely connected but to me, it seems this work picks the wrong word for the problem being solved.

      “Unlike Nanopolish and Tombo, which directly align the raw signal to the reference sequence,…”

      Looking at their code, I believe both Nanopolish and Tombo actually do segment the data first (or "event detection"), then they align the segments/events they found, and finally multiple events aligned to the same section are merged. See for yourself:

      Nanopolish:

      https://github.com/jts/nanopolish/blob/master/src/nanopolish_squiggle_read.cpp

      Line 233:

      cpp

      trim_and_segment_raw(fast5_data.rt, trim_start, trim_end, varseg_chunk, varseg_thresh);

      event_table et = detect_events(fast5_data.rt, *ed_params);

      Line 270:

      cpp

      // align events to the basecalled read

      std::vector<AlignedPair> event_alignment = adaptive_banded_simple_event_align(*this, *this->base_model[strand_idx], read_sequence);

      Where event detection is further defined at line 268 here:

      https://github.com/jts/nanopolish/blob/master/src/thirdparty/scrappie/event_detection.c

      Tombo:

      https://github.com/nanoporetech/tombo/blob/master/tombo/resquiggle.py

      line 1162 and onwards shows a ‘segment_signal’ call and the results are used in a ‘find_adaptive_base_assignment’ call, where ‘segment_signal’ starting at line 1057 tries to find where the signal jumps from a series of similar values to another (start of a base change in the pore), stored in ‘valid_cpts’, and the ‘find_adaptive_base_assignment’ tries to align the resulting segment values to the expected series of values:

      python

      valid_cpts, norm_signal, new_scale_values = segment_signal(

      map_res, num_events, rsqgl_params, outlier_thresh, const_scale)

      event_means = ts.compute_base_means(norm_signal, valid_cpts)

      dp_res = find_adaptive_base_assignment(

      valid_cpts, event_means, rsqgl_params, std_ref, map_res.genome_seq,

      start_clip_bases=map_res.start_clip_bases,

      seq_samp_type=seq_samp_type, reg_id=map_res.align_info.ID)

      These implementations are also why I find the choice of words for what is segmentation and what is alignment a bit confusing in this work, as both Tombo and Nanopolish do a similar, clear segmentation step (or an "event detection" step), followed by the alignment of the segments they determined. The terminology in this work appears to deviate from these.

      We thank the reviewer for the detailed comments!

      First of all, we sincerely apologize for our earlier misunderstanding regarding how Nanopolish and Tombo operate. Based on a closer examination of their source code, we now recognize that both tools indeed include a segmentation step based on change-point detection, after which the resulting segments are aligned to the reference sequence. We have revised the relevant text in the manuscript accordingly:

      - “Current methods, such as Nanopolish, employ change-point detection methods to segment the signal and use dynamic programming methods and HMM to align the derived segments to the reference sequence,”

      - “We define this process as the segmentation and alignment task (abbreviated as the segmentation task), which is referred to as “eventalign” in Nanopolish.”

      - “In SegPore, we segment the raw signal into small fragments using a Hierarchical Hidden Markov Model (HHMM) and align the mean values of these fragments to the reference, where each fragment corresponds to a sub-state of a k-mer. By contrast, Nanopolish and Tombo use change-point–based methods to segment the signal and employ dynamic programming approaches together with profile HMMs to align the resulting segments to the reference sequence.”

      Regarding terminology, we originally borrowed the term “segmentation” from speech processing, where it refers to dividing continuous audio signals into meaningful units. In the context of nanopore signal analysis, segmentation and alignment are often tightly coupled steps. Because of this and because our initial focus was on methodological development rather than terminology, we used the term “segmentation task” to describe the combined process of signal segmentation and alignment.

      However, we now recognize that this terminology may cause confusion. Changing every instance of “segmentation” to “segmentation and alignment” or “alignment” would require substantial rewriting of the manuscript. Therefore, in this revision, we have clearly defined “segmentation task” as referring to the combined process of segmentation and alignment. We apologize for any earlier confusion and will adopt the term “alignment” in future work for greater clarity.

      (3) I think I do understand the meaning, but I do not understand the relevance of the Aj bit in the last sentence. What is it used for?

      Based on the response and another close look at Fig1, it turns out the j refers to the extremely small numbers 1 and 2 in step 3. You may want to improve readability for these.

      Thank you for the suggestion. We have added subscripts to all nucleotides in the reference sequence in Figure 1A and revised the legend to clarify the notation and improve readability. Specifically, we now include the following explanation:

      “For example, A<sub>j</sub> denotes the base ‘A’ at the j-th position on the reference sequence. In this example, A<sub>1</sub> and A<sub>2</sub> refer to the first and second occurrences of ‘A’ in the reference sequence, respectively. Accordingly, μ<sub>1</sub> and μ<sub>2</sub> are aligned to A<sub>1</sub>, while μ<sub>3</sub> is aligned to A<sub>2</sub>”.

      (6) “We chose to use the poly(A) tail for normalization because it is sequence-invariant, i.e., all poly(A) tails consist of identical k-mers, unlike transcript sequences which vary in composition. In contrast, using the transcript region for normalization can introduce biases: for instance, reads with more diverse k-mers (having inherently broader signal distributions) would be forced to match the variance of reads with more uniform k-mers, potentially distorting the baseline across k-mers.”

      While the next part states there was a benchmark showing SegPore still works without this normalization, I think this answer does not touch upon the underlying issue I'm trying to point out here.

      - The biases mentioned here, due to more diverse (or different) subsets of k-mers in a read, do indeed affect the variance of the signal overall.

      - As I pointed out in my earlier remark here, this can be resolved using an approach of 'general normalization', 'mapping to expected signal', 'theil-sen fitting of scale and offset', 're-mapping to expected signal', as Tombo and Nanopolish have implemented.

      - Alternatively, one could use the reference sequence (using the read mapping information) and base the expected signal mean and standard deviation on that instead.

      - The polyA tail stability as an indicator for the variation in the rest of the signal seems a questionable assumption to me. A 'noisy' pore could introduce a large standard deviation using the polyA tail without increasing the deviations on the signal induced by the variety of k-mers, rather it would be representative for the deviations measured within a single k-mer segment. I thought this possible discrepancy is to be expected from a worn out pore, hence I'd imagine reads sequenced later in a run to provide worse results using this method.

      In the current version it is not the statement that is unclear, it is the underlying assumption of how this works that I question.

      We thank the reviewer for raising this important point and for the insightful discussion. Our choice of using the poly(A) tail for normalization is based on the working hypothesis that the poly(A) signal reflects overall pore-level variability and provides a stable reference for signal scaling. We find this to be a practical and effective approach in most experimental settings.

      We agree that more sophisticated strategies, such as “general normalization” or iterative fitting to the expected signal (as implemented in Tombo and Nanopolish), could in principle generate a "better" normalization. However, these approaches are significantly more challenging to implement in practice. This is because signal normalization and alignment are mutually dependent processes: baseline estimates for k-mers influence alignment accuracy, while alignment accuracy, in turn, affects baseline calculation. This interdependence becomes even more complex in the presence of RNA modifications, which alter signal distributions and further confound model fitting.

      It is worth noting that this limitation is already evident in our results. As shown in Figure 4B (first and second k-mers), Nanopolish produces more dispersed baselines than SegPore, even for these unmodified k-mers, suggesting inherent limitations in its normalization strategy. Ideally, baselines for the same k-mer should remain highly consistent across different reads.

      In contrast, poly(A)-based normalization offers a simpler and more robust solution that avoids this circular dependency. Because poly(A) sequences are compositionally homogeneous, they enable reliable estimation of scaling parameters without assumptions about k-mer composition or modification state. Regarding the reviewer’s concern about pore instability, we mitigate this issue by including only high-quality, confidently mapped reads in our analysis, which reduces the likelihood of incorporating signals from degraded or “noisy” pores.
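
      As a rough illustration of the idea, a per-read standardization against the poly(A) region could look like the sketch below. This is a minimal sketch under stated assumptions, not the SegPore implementation: the function name `standardize_by_polya`, the index arguments, and the target standard deviation are hypothetical, and the target mean simply reuses the ~108.9 pA poly(A) level quoted for RNA002 elsewhere in this response. The exact procedure is the one described in Supplementary Note 1, Section 3.

```python
from statistics import fmean, pstdev


def standardize_by_polya(signal, polya_start, polya_end,
                         target_mean=108.9, target_std=1.0):
    """Rescale one read so its poly(A) region has a fixed mean and std.

    signal       -- raw current values of a single read
    polya_start  -- index of the first poly(A) sample (assumed known)
    polya_end    -- index one past the last poly(A) sample
    target_mean  -- desired poly(A) mean (hypothetical default)
    target_std   -- desired poly(A) standard deviation (hypothetical default)
    """
    polya = signal[polya_start:polya_end]
    if len(polya) < 2:
        raise ValueError("poly(A) region is too short to estimate scale")
    mu = fmean(polya)
    sigma = pstdev(polya)
    if sigma == 0:
        raise ValueError("poly(A) region has zero variance")
    # The same linear transform is applied to the whole read, so the
    # transcript region inherits the scale estimated from the poly(A) tail.
    return [(x - mu) / sigma * target_std + target_mean for x in signal]
```

      Because the scale and offset come only from the poly(A) samples, no k-mer assignment is needed before normalization, which is the circular dependency the paragraph above refers to.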

      We fully agree that exploring more advanced normalization strategies is an important direction for future work, and we plan to investigate such approaches as the field progresses.

      (8) “In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis or the segmentation task.”

      Picking only one descriptor rather than two alternatives would be easier to follow (and I'd prefer the first).

      Thank you for the suggestion. We have revised the sentence to:

      “In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis, which also represents the final output of the segmentation and alignment task.”

      (9) “Additionally, a complete explanation of how the weighted mean is computed is provided in Section 5.3 of Supplementary Note 1. It is derived from signal points that are assigned to a given 5mer.”

      I believe there's no more mention of a weighted mean, and I don't get any hits when searching for 'weight'. Is that intentional?

      We apologize for the misplacement of the formulas. We have updated Section 5.3 of Supplementary Note 1 to clarify the definition of the weighted mean. Because multiple current signal segments may be aligned to a single k-mer, we computed the weighted mean for each k-mer across these segments, where the weight corresponds to the number of data points assigned to “curr” state in each event.
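
      The computation described above amounts to a count-weighted average, i.e. the sum of (event mean × number of “curr” points) divided by the total number of “curr” points. A minimal sketch follows; the function name and the `(mean, n_points)` tuple layout are illustrative assumptions, not the layout used in Supplementary Note 1.

```python
def weighted_kmer_mean(events):
    """Weighted mean current for one k-mer.

    events -- list of (event_mean, n_curr_points) tuples, one per signal
              segment aligned to this k-mer; n_curr_points is the number
              of data points assigned to the "curr" state in that event.
    """
    total_points = sum(n for _, n in events)
    if total_points == 0:
        raise ValueError("no data points assigned to this k-mer")
    return sum(mean * n for mean, n in events) / total_points
```

      For example, two events with means 100.0 and 110.0 supported by 10 and 30 points give (100.0 × 10 + 110.0 × 30) / 40 = 107.5, so the longer event dominates the estimate.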

      (17) Response: We revised the sentence to clarify the selection criteria: “For selected 5mers that exhibit both a clearly unmodified and a clearly modified signal component, SegPore reports the modification rate at each site, as well as the modification state of that site on individual reads.”

      So is this the same set described on page 13 ln 343 or not?

      “Due to the differences between human (Supplementary Fig. S2A) and mouse (Supplementary Fig. S2B), only six 5mers were found to have m6A annotations in the test data's ground truth (Supplementary Fig. S2C). For a genomic location to be identified as a true m6A modification site, it had to correspond to one of these six common 5mers and have a read coverage of greater than 20.”

      I struggle to interpret the 'For selected 5mers' part, as I'm not sure if this is a selection I'm supposed to already know at this point in the text or if it's a set just introduced here. If the latter, removing the word 'selected' would clear it up for me.

      We apologize for the confusion. What we mean is that when pooling signals aligned to the same k-mer across different genomic locations and reads, only a subset of k-mers exhibit a bimodal distribution — one peak corresponding to the unmodified state and another to the modified state. Other k-mers show a unimodal distribution, making it impossible to reliably estimate modification levels. We refer to the subset of k-mers that display a bimodal distribution as the “selected” k-mers.
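
      To make the unimodal versus bimodal distinction concrete, the sketch below fits a two-component one-dimensional Gaussian mixture with plain EM to the pooled current means of one k-mer and reports how far apart the two components are. It is a toy illustration under stated assumptions: the initialization from the lower and upper halves of the data, the fixed iteration count, and the `separation` score are choices made for this example and are not taken from the SegPore code.

```python
import math


def fit_two_component_gmm(values, n_iter=100):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    xs = sorted(values)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four values")
    half = n // 2
    # Initialise the two means from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    pooled_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-6)
    var = [pooled_var, pooled_var]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [
                weight[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = dens[0] + dens[1]
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue  # component has collapsed; keep its old parameters
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                1e-6,
            )
    return weight, mu, var


def separation(mu, var):
    """Distance between component means in units of the larger std."""
    return abs(mu[1] - mu[0]) / math.sqrt(max(var))
```

      A k-mer whose two fitted components are many standard deviations apart would be a candidate for the bimodal subset, whereas a small separation indicates that a single peak describes the data; any threshold on this score would be a further assumption.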

      The “selected k-mers” described on page 13, line 343, must additionally have ground truth labels available in both the training and test datasets. There are 10 k-mers with ground truth annotations in the training data and 11 in the test data, and only 6 of these k-mers are shared between the two datasets, therefore only those 6 overlapping k-mers are retained for evaluation. These 6 k-mers satisfy both criteria: (1) exhibiting a bimodal distribution and (2) having ground truth annotations in both training and test sets.

      To improve clarity, we have removed the term “selected” from the sentence.

      (21) “Tombo used the "resquiggle" method to segment the raw signals, and we standardized the segments using the poly(A) tail to ensure a fair comparison (See preprocessing section in Materials and Methods).”

      In the Materials and Methods:

      “The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read.”

      I cannot find more detailed information here on what the standardization does, do you mean to refer to Supplementary Note 1, Section 3 perhaps?

      Thank you for pointing this out. Yes, the standardization procedure is described in detail in Supplementary Note 1, Section 3. Tombo itself does not segment and align the raw signal on the absolute pA scale, which can result in very large variance in the derived events if the raw signal is used directly. To ensure a fair comparison, we therefore applied the same preprocessing steps to Tombo’s raw signals as we did for SegPore, using only the event boundary information from Tombo while standardizing the signal in the same way.

      We have revised the sentence for clarity as follows:

      “Tombo used the "resquiggle" method to segment the raw signals, but the resulting signals are not reported on the absolute pA scale. To ensure a fair comparison with SegPore, we standardized the segments using the poly(A) tail in the same way as SegPore (See preprocessing section in Materials and Methods).”

      (22A) The table shown does help show that the benchmark is unlikely to be 'cheated'. However, I am surprised to see the Avg std for Nanopolish and Tombo going up instead of down, as I'd expect the transition values to increase the std, and hence, removing them should decrease these values. So why does this table show the opposite?

      I believe this table is not in the main text or the supplement, would it not be a good idea to cover this point somewhere in the work?

      Thank you for this insightful comment. In response, we carefully re-examined our analysis and identified a bug in the code related to boundary removal for Nanopolish. We have now corrected this issue and included the updated results in Supplementary Table S1 of the revised manuscript. As shown in the updated table, the average standard deviations decrease after removing the boundary regions for both Nanopolish and Tombo.

      We have also added the following clarification to the revised manuscript:

      “It is worth noting that the data points corresponding to the transition state between two consecutive 5-mers are not included in the calculation of the standard deviation in SegPore’s results in Table 1. However, their exclusion does not affect the overall conclusion, as there are on average only ~6 points per 5-mer in the transition state (see Supplementary Table S1 for more details).”

      (22B) As mentioned in 2), I'm happy there's a clear definition of what is meant but I found the chosen word a bit odd.

      We apologize for the earlier unclear terminology. We now refer to it as the segmentation and alignment task, abbreviated as the segmentation task.

      (23) Reading back I can gather that from the text earlier, but the summation of what is being tested is this:

      “including Tombo, MINES (31), Nanom6A (32), m6Anet, Epinano (33), and CHEUI (20).”

      next, the identifier "Nanopolish+m6Anet" is, aside from the figure itself, only mentioned in the discussion. Adding a line that explains that "Nanopolish+m6Anet" is the default method of running m6Anet and "SegPore+m6Anet" replaces the Nanopolish part for m6Anet with Segpore, rather than jumping straight to "SegPore+m6Anet", would clarify where this identifier came from.

      Thank you for the helpful suggestion. We have added the identifier to the revised manuscript as follows:

      “Given their comparable methodologies and input data requirements, we benchmarked SegPore against several baseline tools, including Tombo, MINES (31), Nanom6A (32), m6Anet, Epinano (33), and CHEUI (20). By default, MINES and Nanom6A use eventalign results generated by Tombo, while m6Anet, Epinano, and CHEUI rely on eventalign results produced by Nanopolish. In Fig. 3C, ‘Nanopolish+m6Anet’ refers to the default m6Anet pipeline, whereas ‘SegPore+m6Anet’ denotes a configuration in which Nanopolish’s eventalign results are replaced with those from SegPore.”

      (24) For completeness I'd expect tickmarks and values on the y-axis as well.

      Thank you for the suggestion. We have updated Figures 3A and 3B in the revised manuscript to include tick marks and values on the y-axis as requested.

      (25) Considering this statement and looking back at figure 3a and 3b, wouldn't this be easier to observe if the histograms/KDE's were plotted with overlap in a single figure?

      We appreciate the suggestion. However, we believe that overlaying Figures 3A and 3B into a single panel would make the visualization cluttered and more difficult to interpret.

      (29) Please change the sentence in the text to make that clear. As it is written now (while it's the same number of motifs, so one might guess it) it does not seem to refer to that particular set of motifs and could be a new selection of 6 motifs.

      We appreciate the suggestion and have revised the sentence for clarity as follows:

      “We evaluated m6A predictions using two approaches: (1) SegPore’s segmentation results were fed into m6Anet, referred to as SegPore+m6Anet, which works for all DRACH motifs and (2) direct m6A predictions from SegPore’s Gaussian Mixture Model (GMM), which is limited to the six selected 5-mers shown in Supplementary Fig. S2C that exhibit clearly separable modified and unmodified components in the GMM (see Materials and Methods for details). ”

      (31) I think we have a different interpretation of the word 'leverage', or perhaps what it applies to. I'd say it leverages the jiggling if there's new information drawn from the jiggling behaviour. It's taking it into account if it filters for it. The HHMM as far as I understand tries to identify the jiggles, and ignore their values for the segmentation etc. So while one might see this as an approach that "leverages the hypothesis", I don't see how this HHMM "leverages the jiggling property" itself.

      Thank you for the helpful suggestion. We have replaced the word “leverages” with “models” in the revised manuscript.

      New points

      pg6ln166: “…we extract the aligned raw signal segment and reference sequence segment from Nanopolish's events [...] we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish's poly(A) detection results.”

      It is not clear as to why this different approach is applied for these two cases in this part of the text.

      Thank you for pointing this out. The two approaches refer to different preprocessing strategies for in vivo and in vitro data.

      For in vivo data, a large proportion of reads do not span the full-length transcript and often map only to a portion of the reference sequence. Moreover, because a single gene can generate multiple transcript isoforms, a read may align equally well to several possible transcripts. Therefore, we extract only the raw signal segment that corresponds to the mapped portion of the transcript for each read.

      In contrast, for in vitro data, the transcript sequence is known precisely. As a result, we can directly extract all raw signals following the poly(A) tail and align them to the complete reference sequence.

      pg10ln259: “An important distinction from classical global alignment algorithms is that one or multiple base blocks may align with a single 5mer.”

      If there was usually a 1:1 mapping the alignment algorithm would be more or less a direct match, so I think the multiple blocks aligning to a 5mer thing is actually quite common.

      Thank you for the comment. The “classical global alignment algorithm” here refers to the Needleman–Wunsch algorithm used for sequence alignment. Our intention was to highlight the conceptual difference between traditional sequence alignment and nanopore signal alignment. In classical sequence alignment, each base typically aligns to a single position in the reference. In contrast, in nanopore signal alignment, one or multiple signal segments — corresponding to varying dwell times of the motor protein — can align to a single 5-mer.
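
      The many-to-one behaviour can be illustrated with a toy dynamic program in which consecutive segment means are assigned, in order, to reference k-mers, and a k-mer may absorb one or several segments. This is a deliberately simplified sketch and not the SegPore alignment algorithm: it assumes a squared-difference cost, assumes every k-mer receives at least one segment, and uses a hypothetical function name and made-up current levels.

```python
def align_segments(seg_means, kmer_levels):
    """Assign each segment to a k-mer index, monotonically, many-to-one.

    seg_means   -- mean current of each signal segment, in time order
    kmer_levels -- expected current of each reference k-mer, in order
    Returns a list with the k-mer index chosen for every segment.
    """
    n, m = len(seg_means), len(kmer_levels)
    if m == 0 or n < m:
        raise ValueError("need at least one segment per k-mer")
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    # back[i][j] is 1 if segment i is the first segment of k-mer j
    # (we advanced from k-mer j-1), and 0 if k-mer j continues.
    back = [[0] * m for _ in range(n)]
    cost[0][0] = (seg_means[0] - kmer_levels[0]) ** 2
    for i in range(1, n):
        for j in range(min(i, m - 1) + 1):
            stay = cost[i - 1][j]
            advance = cost[i - 1][j - 1] if j > 0 else inf
            if stay <= advance:
                best, back[i][j] = stay, 0
            else:
                best, back[i][j] = advance, 1
            if best < inf:
                cost[i][j] = best + (seg_means[i] - kmer_levels[j]) ** 2
    # Trace back from the last segment, which must end on the last k-mer.
    path = [0] * n
    j = m - 1
    for i in range(n - 1, -1, -1):
        path[i] = j
        j -= back[i][j]
    return path
```

      With segment means `[100.1, 99.8, 110.2, 95.0, 95.3]` and k-mer levels `[100.0, 110.0, 95.0]`, the first and last k-mers each receive two segments and the middle one receives a single segment, which is the situation the revised sentence describes.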

      We have revised the sentence as follows:

      “An important distinction from classical global alignment algorithms (Needleman–Wunsch algorithm)……”

      pg13ln356: "dwell time" is not defined or used before, I guess it's effectively the number of raw samples per segment but this should be clarified.

      Thank you for pointing this out. We have now added a clear definition of dwell time in the text as follows:

      "such as the normalized mean μ_i, standard deviation σ_i, dwell time l_i (number of data points in the event)."

      pg13ln358: “Feature vectors from 80% of the genomic locations were used for training, while the remaining 20% were set aside for validation.”

      I assume these are selected randomly but this is not explicitly stated here and should be.

      Yes, they are randomly selected. We have revised the sentence as follows:

      “Feature vectors from a randomly selected 80% of the genomic locations were used for training, while the remaining 20% were set aside for validation.”
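
      A random split at the level of genomic locations (rather than individual reads) can be sketched as follows. The function name, the fixed seed, and the `(chromosome, position)` tuples are assumptions made for this example; the manuscript does not state which random number generator or seed was used.

```python
import random


def split_locations(locations, train_fraction=0.8, seed=0):
    """Randomly split genomic locations into training and validation sets.

    locations      -- iterable of hashable location identifiers
    train_fraction -- share of locations used for training
    seed           -- seed for reproducibility (hypothetical default)
    """
    rng = random.Random(seed)
    shuffled = list(locations)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

      Splitting by location keeps all feature vectors from one site on the same side of the split, so the validation set contains only sites the model has not seen during training.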

      pg18ln488: The manuscript now evaluates RNA004 and compares against f5c and Uncalled4. It mentions the differences between RNA004 and RNA002, namely kmer size and current levels, but does not explain where the starting reference model values for the RNA004 model come from: In pg18ln492 they state "RNA004 provides reference values for 9mers", then later they seem to use a 5mer parameter table (pg19ln508), are they re-using the same table from RNA002 or did they create a 5mer table from the 9mer reference table?

      We apologize for the confusion. The reference model table for RNA004 9-mers is obtained from f5c (the array named ‘rna004_130bps_u_to_t_rna_9mer_template_model_builtin_data’ in https://raw.githubusercontent.com/hasindu2008/f5c/refs/heads/master/src/model.h).

      Author response image 1.

      We have revised the subsection header “5-mer parameter table” in the Methods to “5-mer & 9-mer parameter table” to highlight this and added a paragraph describing how the 9-mer parameter table was obtained:

      “In the RNA004 data analysis (Table 2), we obtained the 9-mer parameter table from the source code of f5c (version 1.5). Specifically, we used the array named ‘rna004_130bps_u_to_t_rna_9mer_template_model_builtin_data’ from the following file: https://raw.githubusercontent.com/hasindu2008/f5c/refs/heads/master/src/model.h (accessed on 17 October 2025).”

      Also, on page 18, line 195, we added the following sentence:

      “The 9-mer parameter table in pA scale for RNA004 data provided by f5c (see Materials and Methods) was used in the analysis.”

      pg19ln520: “Additionally, due to the differences of the k-mer motifs between human and mouse (Supplementary Fig. S2), six shared 5mers were selected to demonstrate SegPore's performance in modification prediction directly.”

      "the differences" - in occurrence rates, as I gather from the supplementary figure, but it would be good to explicitly state it in this sentence itself too.

      Thank you for the helpful suggestion. We agree that the original sentence was vague. The main reason for selecting only six 5-mers is the difference in the availability of ground truth labels for specific k-mer motifs between human and mouse datasets. We have revised the sentence accordingly:

      “Additionally, due to the differences in the availability of ground truth labels for specific k-mer motifs between human and mouse (Supplementary Fig. S2), six shared 5-mers were selected to directly demonstrate SegPore’s performance in modification prediction.”

      pg24ln654: “SegPore codes current intensity levels”

      "codes" is meant to be "stores" I guess? Perhaps "encodes"?

      Thank you for the suggestion. We have now replaced it with “encodes” in the revised manuscript.

      Lastly, looking at the feedback from the other reviewer's comment:

      The 'HMM' mentioned in line 184 looks fine to me; the HHMM is two HMMs in a hierarchical setup and the text now refers to one of these HMM layers. If this is to be changed it would need to state the layer (e.g. "the outer HHMM layer") throughout the text instead.

      We agree with this assessment and believe that the term “inner HMM” is accurate in this context, as it correctly refers to one of the two HMM layers within the HHMM structure. Therefore, we have decided to retain the current terminology.

      Reviewer #3 (Recommendations for the authors):

      I recommend the publication of this manuscript, provided that the following comments are addressed.

      Page 5, Preprocessing: You comment that the poly(A) tail provides a stable reference that is crucial for the normalisation of all reads. How would this step handle reads that have interrupted poly(A) tails (e.g. in the case of mRNA vaccines that employ a linker sequence)? Or cell types that express TENT4A/B, which can include transcripts with non-A residues in the poly(A) tail: https://www.science.org/doi/full/10.1126/science.aam5794.

      It depends on Nanopolish’s ability to reliably detect the poly(A) tail. In general, the poly(A) region produces a long stretch of signals fluctuating around a current level of ~108.9 pA (RNA002) with relatively stable variation, which allows it to be identified and used for normalization.

      For in vivo data, if the poly(A) tail is interrupted (e.g., due to non-A residues or linker sequences), two scenarios are possible:

      (1) The poly(A) tail may not be reliably detected, in which case the corresponding read will be excluded from our analysis.

      (2) Alternatively, Nanopolish may still recognize the initial uninterrupted portion of the poly(A) signal, which is typically sufficient in length and stability to be used for signal normalization.

      For in vitro data, the poly(A) tails are uninterrupted, so this issue does not arise.

      All analyses presented in this study are based exclusively on reads with reliably detected poly(A) tails.

      Page 7, 5mer parameter table: r9.4_180mv_70bps_5mer_RNA is an older kmer model (>2 years). How does your method perform with the newer RNA kmer models that do permit the detection of multiple ribonucleotide modifications? Addressing this comment would be beneficial, however I understand that it would require the generation of new data, as limited RNA004 datasets are available in the public domain.

      “r9.4_180mv_70bps_5mer_RNA” is the most widely used k-mer model for RNA002 data. Regarding the newer k-mer models, we believe the reviewer is referring to the “modification basecalling” models available in Dorado, which are specifically designed for RNA004 data. At present, SegPore can perform RNA modification estimation only on RNA002 data, as this is the platform for which suitable training data and ground truth annotations are available. Evaluating SegPore’s performance with the newer RNA004 modification models would require new datasets containing known modification sites generated with RNA004 chemistry. Since such data are currently unavailable, we have not yet been able to assess SegPore under these conditions. This represents an important future direction for extending and validating our method.

      The Methods and Results sections contain redundant information. Please streamline the information in these sections and reduce the redundancy.

      We thank the reviewer for this suggestion and acknowledge that there is some overlap between the Methods and Results sections. However, we feel that removing these parts could compromise the clarity and readability of the manuscript, especially given that Reviewer 2 emphasized the need for clearer explanations. We therefore decided to retain certain methodological descriptions in the Results section to ensure that key steps are understandable without requiring the reader to constantly cross-reference the Methods.

      Minor comments

      Please be consistent when referring to k-mers and 5-mers (sometimes denoted as 5mers - please change to 5-mers throughout).

      We have revised the manuscript to ensure consistency and now use “5-mers” throughout the text.

      Introduction

      Lines 80 - 112: Please condense this section to roughly half the length (1-2 paragraphs). In general, the results described in the introduction should be very brief, as they are described in full in the results section.

      Thank you for the suggestion. We have condensed the original three paragraphs into a single, more concise paragraph as follows:

      "SegPore is a novel tool for direct RNA sequencing (DRS) signal segmentation and alignment, designed to overcome key limitations of existing approaches. By explicitly modeling motor protein dynamics during RNA translocation with a Hierarchical Hidden Markov Model (HHMM), SegPore segments the raw signal into small, biologically meaningful fragments, each corresponding to a k-mer sub-state, which substantially reduces noise and improves segmentation accuracy. After segmentation, these fragments are aligned to the reference sequence and concatenated into larger events, analogous to Nanopolish’s “eventalign” output, which serve as the foundation for downstream analyses. Moreover, the “eventalign” results produced by SegPore enhance interpretability in RNA modification estimation. While deep learning–based tools such as m6Anet classify RNA modifications using complex, non-transparent features (see Supplementary Fig. S5), SegPore employs a simple Gaussian Mixture Model (GMM) to distinguish modified from unmodified nucleotides based on baseline current levels. This transparent modeling approach improves confidence in the predictions and makes SegPore particularly well-suited for biological applications where interpretability is essential."

      Line 104: Please change "normal adenosine" to "adenosine".

      We have revised the manuscript as requested and replaced all instances of “normal adenosine” with “adenosine” throughout the text.

      Materials and Methods

      Line 176: Please reword "...we standardize the raw current signals across reads, ensuring that the mean and standard deviation of the poly(A) tail are consistent across all reads." To "...we standardize the raw current signals for each read, ensuring that the mean and standard deviation are consistent across the poly(A) tail region."

      We have changed the sentence as requested.

      “Since the poly(A) tail provides a stable reference, we standardize the raw current signals for each read, ensuring that the mean and standard deviation are consistent across the poly(A) tail region.”
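      The per-read standardization described in the revised sentence can be sketched as follows. This is a minimal illustration, not SegPore's implementation: the function name, the index-based poly(A) boundaries, and the target values (mean 0, standard deviation 1) are all assumptions made for the example.

```python
from statistics import mean, pstdev


def standardize_read(signal, polya_start, polya_end,
                     target_mean=0.0, target_std=1.0):
    """Rescale one read so that its poly(A) segment has the target mean/std.

    `polya_start`/`polya_end` are hypothetical sample indices delimiting the
    poly(A) tail; the target values are placeholders, not published numbers.
    """
    polya = signal[polya_start:polya_end]
    mu = mean(polya)
    sigma = pstdev(polya)
    if sigma == 0:
        raise ValueError("poly(A) segment has zero variance")
    # Apply the same affine transform to the whole read.
    return [(x - mu) / sigma * target_std + target_mean for x in signal]
```

      After this step, the poly(A) regions of different reads share the same mean and standard deviation, so the remaining signal of each read is expressed on a common scale.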

      Line 182: Please describe the RNA translocation hypothesis, as this is the first mention of it in the text. Also, why is the Hierarchical Hidden Markov model perfect for addressing the RNA translocation hypothesis? Explain more about how the HHMM works and why it is a suitable choice.

      We have revised the sentence as requested:

      “The RNA translocation hypothesis (see details in the first section of Results) naturally leads to the use of a hierarchical Hidden Markov Model (HHMM) to segment the raw current signal.”

      The motivation for the HHMM is explained in detail in the first section of Results, “RNA translocation hypothesis”. As illustrated in Figure 2, the sequencing data suggest that RNA molecules may translocate back and forth (often referred to as jiggling) while passing through the nanopore. This behavior results in complex current fluctuations that are challenging to model with a simple HMM. The HHMM provides a natural framework to address this because it can model signal dynamics at two levels. The outer HMM distinguishes between two major states: base states (where the signal corresponds to a stable sub-state of a k-mer) and transition states (representing transitions from one base state to the next). Within each base state, an inner HMM models finer signal variation using three states, “curr”, “prev”, and “next”, corresponding to the current k-mer sub-state and its neighboring k-mer sub-states. This hierarchical structure captures both the stable signal patterns and the stochastic translocation behavior, enabling more accurate and biologically meaningful segmentation of the raw current signal.
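      To make the two-level structure concrete, the toy sketch below samples a state path from an outer HMM (base vs. transition) with an inner HMM (curr, prev, next) that is active only inside base states. All transition probabilities are invented for illustration, the reset to “curr” on re-entering a base state is an assumption, and no emission model is included; it is not the parameterization used by SegPore.

```python
import random

# Hypothetical outer-level transition probabilities (illustrative only).
OUTER = {
    "base": {"base": 0.9, "transition": 0.1},
    "transition": {"base": 0.8, "transition": 0.2},
}
# Hypothetical inner-level transition probabilities within a base state.
INNER = {
    "curr": {"curr": 0.8, "prev": 0.1, "next": 0.1},
    "prev": {"curr": 0.7, "prev": 0.3, "next": 0.0},
    "next": {"curr": 0.7, "prev": 0.0, "next": 0.3},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


def sample_path(n_steps, seed=0):
    """Sample (outer, inner) state pairs; inner is None in transition states."""
    rng = random.Random(seed)
    outer, inner = "base", "curr"
    path = []
    for _ in range(n_steps):
        path.append((outer, inner if outer == "base" else None))
        next_outer = draw(OUTER[outer], rng)
        if next_outer == "base":
            # Staying in a base state: the inner HMM takes one step.
            # Entering from a transition state: assume a reset to "curr".
            inner = draw(INNER[inner], rng) if outer == "base" else "curr"
        outer = next_outer
    return path
```

      Visits to “prev” and “next” in a sampled path correspond to the back-and-forth jiggling described above, while the outer level marks when the signal moves on to the next base state.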

      Line 184: do you mean HHMM? Please be consistent throughout the text.

      As explained in the previous response, the HHMM consists of two layers: an outer HMM and an inner HMM. The term “HMM” in line 184 is meant to be read together with “inner” at the end of line 183, forming the phrase “inner HMM.” It seems the reviewer may have overlooked this when reading the text.

      Line 203: please delete: "It is obviously seen that".

      We have removed the phrase “It is obviously seen that” from the sentence as requested. The revised sentence now reads:

      “The first part of Eq. 2 represents the emission probabilities, and the second part represents the transition probabilities.”

      Line 314, GMM for 5mer parameter table re-estimation: "Typically, the process is repeated three to five times until the 5mer parameter table stabilizes." How is the stabilisation of the 5mer parameter table quantified? What is a reasonable cut-off that would demonstrate adequate stabilisation of the 5mer parameter table? Please add details of this to the text.

      We have revised the sentence to clarify the stabilization criterion as follows:

      “Typically, the process is repeated three to five times until the 5-mer parameter table stabilizes (when the average change of mean values of all 5-mers is less than 5e-3).”
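      The stopping rule in the revised sentence can be written as a short check. This sketch assumes that “average change” means the mean absolute difference between successive estimates of each 5-mer's mean value, and that the table is held as a dictionary keyed by 5-mer; both are assumptions for illustration.

```python
def mean_abs_change(prev_means, new_means):
    """Average absolute change of the mean value over 5-mers in both tables."""
    kmers = prev_means.keys() & new_means.keys()
    if not kmers:
        raise ValueError("the two tables share no 5-mers")
    return sum(abs(new_means[k] - prev_means[k]) for k in kmers) / len(kmers)


def has_stabilized(prev_means, new_means, tol=5e-3):
    """True when the average change falls below the 5e-3 cut-off."""
    return mean_abs_change(prev_means, new_means) < tol
```

      In an iterative re-estimation loop, `has_stabilized` would be evaluated after each round, and iteration would stop at the first round for which it returns `True`.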

      Results

      Line 377: Please edit to read "Traditional base calling algorithms such as Guppy and Albacore assume that the RNA molecule is translocated unidirectionally through the pore by the motor protein."

      We have revised the sentence as:

      “In traditional basecalling algorithms such as Guppy and Albacore, we implicitly assume that the RNA molecule is translocated through the pore by the motor protein in a monotonic fashion, i.e., the RNA is pulled through the pore unidirectionally.”

      Line 555, m6A identification at the site level: "For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third best performance compared with deep learning methods m6Anet and CHEUI (Fig. 3D)." So SegPore performs third best of all deep learning methods. Do you recommend its use in conjunction with m6Anet for m6A detection? Please clarify in the text. This will help to guide users to possible best practice uses of your software.

      Thank you for the suggestion. We have added a clarification in the revised manuscript to guide users.

      “For practical applications, we recommend taking the intersection of m6A sites predicted by SegPore and m6Anet to obtain high-confidence modification sites, while still benefiting from the interpretability provided by SegPore’s predictions.”
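      The recommended intersection can be sketched as below. The representation of a site as a `(transcript_id, position)` tuple is an assumption; neither tool's actual output format is described here, so real predictions would first need to be parsed into a comparable form (same reference and same coordinate convention).

```python
def high_confidence_sites(segpore_sites, m6anet_sites):
    """Return sites called by both tools, sorted for reproducible output.

    Each site is a hypothetical (transcript_id, position) tuple.
    """
    return sorted(set(segpore_sites) & set(m6anet_sites))
```

      Taking the intersection trades sensitivity for precision: sites reported by only one tool are dropped from the high-confidence set.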

      Figures.

      Figure 1A please refer to poly(A) tail, rather than polyA tail.

      We have updated it to poly(A) tail in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The study by Pinho et al. presents a novel behavioral paradigm for investigating higher-order conditioning in mice. The authors developed a task that creates associations between light and tone sensory cues, driving mediated learning. They observed sex differences in task acquisition, with females demonstrating faster-mediated learning compared to males. Using fiber photometry and chemogenetic tools, the study reveals that the dorsal hippocampus (dHPC) plays a central role in encoding mediated learning. These findings are crucial for understanding how environmental cues, which are not directly linked to positive/negative outcomes, contribute to associative learning. Overall, the study is well-designed, with robust results, and the experimental approach aligns with the study's objectives. 

      Strengths: 

      (1) The authors develop a robust behavioral paradigm to examine higher-order associative learning in mice. 

      (2) They discover a sex-specific component influencing mediated learning, with females exhibiting enhanced learning abilities. 

      (3) Using fiber photometry and chemogenetic techniques, the authors identify the dorsal hippocampus, but not the ventral hippocampus, as playing a crucial role in encoding mediated learning.

      We appreciate the strengths highlighted by the Reviewer and the valuable and complete summary of our work.

      Weaknesses: 

      (1) The study would be strengthened by further elaboration on the rationale for investigating specific cell types within the hippocampus.  

      We thank the Reviewer for highlighting this important point. In the revised manuscript, we have added new information (Page 11, Lines 27-34) to explain the rationale for studying possible cell-type-specific involvement in sensory preconditioning.

      (2) The analysis of photometry data could be improved by distinguishing between early and late responses, as well as enhancing the overall presentation of the data.  

      Following the Reviewer’s comment, we have included new panels in Figure 3E and the whole of Supplementary Figure 4, which separate the photometry data across the different preconditioning and conditioning sessions, respectively. Overall, these data suggest that there are no major changes in cell activity in either hippocampal region across sessions, as a similar light-tone-induced enhancement of activity is observed. These findings have been incorporated into the Results section (Page 12, Lines 13-15, 19-20 and 35-36).

      (3) The manuscript would benefit from revisions to improve clarity and readability.

      Based on the fair comment, we have gone through the text to increase clarity and readability.

      Reviewer #2 (Public review): 

      Summary: 

      Pinho et al. developed a new auditory-visual sensory preconditioning procedure in mice and examined the contribution of the dorsal and ventral hippocampus to learning in this task. Using photometry they observed activation of the dorsal and ventral hippocampus during sensory preconditioning and conditioning. Finally, the authors combined their sensory preconditioning task with DREADDs to examine the effect of inhibiting specific cell populations (CaMKII and PV) in the DH on the formation and retrieval/expression of mediated learning. 

      Strengths: 

      The authors provide one of the first demonstrations of auditory-visual sensory preconditioning in male mice. Research on the neurobiology of sensory preconditioning has primarily used rats as subjects. The development of a robust protocol in mice will be beneficial to the field, allowing researchers to take advantage of the many transgenic mouse lines. Indeed, in this study, the authors take advantage of a PV-Cre mouse line to examine the role of hippocampal PV cells in sensory preconditioning. 

      We thank the Reviewer for their effort and for highlighting the strengths of our work.

      Weaknesses: 

      (1) The authors report that sensory preconditioning was observed in both male and female mice. However, their data only supports sensory preconditioning in male mice. In female mice, both paired and unpaired presentations of the light and tone in stage 1 led to increased freezing to the tone at test. In this case, fear to the tone could be attributed to factors other than sensory preconditioning, for example, generalization of fear between the auditory and visual stimulus.

      We thank the Reviewer for this comment. Initially, we hypothesized that female mice were somehow able to associate the light and tone even though they were presented separately during the preconditioning sessions. We therefore designed a new experiment (shown in Supplementary Figure 2D) to test whether the data would be congruent with our initial hypothesis or with fear generalization, as proposed by the Reviewer. This experiment compared a Paired group with two additional control groups: (i) an Unpaired group in which we increased the time between the light and tone presentations, and (ii) an experimental group in which the light was absent during conditioning. The new results clearly indicate the presence of fear generalization in female mice, as we found a significant cue-induced increase in freezing responses in all the experimental groups tested. In accordance with the Reviewer’s suggestion, we conclude that mediated learning is not properly observed in female mice using the protocol described (i.e. with 2 conditioning sessions). These new results led us to reorganize the structure and figures of the manuscript so that the Main Figures focus on male mice, while the data from female mice are shown in the Supplementary Figures. Overall, our data reveal the need for behavioral protocols adapted to each sex and demonstrate sex differences in sensory preconditioning; this point has been added to the Discussion section (Page 15, Lines 12-37).

      (2) In the photometry experiment, the authors report an increase in neural activity in the hippocampus during both phase 1 (sensory preconditioning) and phase 2 (conditioning). In the subsequent experiment, they inhibit neural activity in the DH during phase 1 (sensory preconditioning) and the probe test, but do not include inhibition during phase 2 (conditioning). It was not clear why they didn't carry forward investigating the role of the hippocampus during phase 2 conditioning. Sensory preconditioning could occur due to the integration of the tone and shock during phase two, or retrieval and chaining of the tone-light-shock memories at test. These two possibilities cannot be differentiated based on the data. Given that we do not know at which stage the mediated learning is occurring, it would have been beneficial to additionally include inhibition of the DH during phase 2.

      Following the Reviewer’s valuable comment, we have conducted a new experiment in which we chemogenetically inhibited the CaMKII-positive neurons of the dHPC during conditioning to explore their involvement in the formation of mediated learning. Notably, inhibition of principal neurons of the dHPC during conditioning does not impair the formation of mediated learning in our hands. These new results are now shown in Supplementary Figure 7G and have been added to the Results section (Page 13, Lines 19-23).

      (3) In the final experiment, the authors report that inhibition of the dorsal hippocampus during the sensory preconditioning phase blocked mediated learning. While this may be the case, the failure to observe sensory preconditioning at test appears to be due more to an increase in baseline freezing (during the stimulus off period), rather than a decrease in freezing to the conditioned stimulus. Given the small effect, this study would benefit from an experiment validating that administration of J60 inhibited DH cells. Further, given that the authors did not observe any effect of DREADD inhibition in PV cells, it would also be important to validate successful cellular silencing in this protocol.  

      Following the Reviewer’s comments, we have performed new experiments to validate the use of J60 to inhibit hippocampal cells. These are shown in Supplementary Figure 7 E-F for CaMKII-positive neurons, in which J60 administration tends to decrease the frequency of calcium events in both the dHPC and vHPC. Furthermore, in Supplementary Figure 8 B-C we show that J60 is also able to modify calcium events in PV-positive interneurons. Although the best method to validate the use of DREADDs (i.e. to confirm inhibition of hippocampal cell activity) would be electrophysiological recording, we lack this technique in our laboratory. Thus, to address the Reviewer’s comment, we combined DREADD modulation through J60 administration with photometry recordings, in which several tendencies are confirmed. In addition, a similar approach has been used in another preprint from the lab (https://doi.org/10.1101/2025.08.29.673009), which shows an increase in phospho-PDH, a marker of neuronal inhibition, upon J60 administration in the dHPC, as well as in experiments conducted by a collaborating lab, which observed modulation of SOM-positive interneuron activity upon J60 administration (PhD defense of Miguel Sabariego, University Pompeu Fabra, Barcelona).

      Reviewer #3 (Public review): 

      Summary: 

      Pinho et al. investigated the role of the dorsal vs ventral hippocampus and the gender differences in mediated learning. While previous studies already established the engagement of the hippocampus in sensory preconditioning, the authors here took advantage of freely-moving fiber photometry recording and chemogenetics to observe and manipulate sub-regions of the hippocampus (dorsal vs. ventral) in a cell-specific manner. The authors first found sex differences in the preconditioning phase of a sensory preconditioning procedure, where males required more preconditioning training than females for mediating learning to manifest, and where females displayed evidence of mediated learning even when neutral stimuli were never presented together within the session. 

      After validation of a sensory preconditioning procedure in mice using light and tone neutral stimuli and a mild foot shock as the unconditioned stimulus, the authors used fiber photometry to record from all neurons vs. parvalbumin-positive-only neurons in the dorsal hippocampus or ventral hippocampus of male mice during both preconditioning and conditioning phases. They found increased activity of all neurons, as well as PV+-only neurons in both sub-regions of the hippocampus during both preconditioning and conditioning phases. Finally, the authors found that chemogenetic inhibition of CaMKII+ neurons in the dorsal, but not ventral, hippocampus specifically prevented the formation of an association between the two neutral stimuli (i.e., light and tone cues), but not the direct association between the light cue and the mild foot shock. This set of data: (1) validates the mediated learning in mice using a sensory preconditioning protocol, and stresses the importance of taking sex effect into account; (2) validates the recruitment of dorsal and ventral hippocampi during preconditioning and conditioning phases; and (3) further establishes the specific role of CaMKII+ neurons in the dorsal but not ventral hippocampus in the formation of an association between two neutral stimuli, but not between a neutral stimulus and a mild foot shock.

      Strengths: 

      The authors developed a sensory preconditioning procedure in mice to investigate mediated learning using light and tone cues as neutral stimuli, and a mild foot shock as the unconditioned stimulus. They provide evidence of a sex effect in the formation of light-cue association. The authors took advantage of fiber-photometry and chemogenetics to target sub-regions of the hippocampus, in a cell-specific manner and investigate their role during different phases of a sensory conditioning procedure. 

      We thank the Reviewer for the extensive summary of our work and for giving interesting value to some of our findings.

      Weaknesses: 

      The authors went further than previous studies by investigating the role of sub-regions of the hippocampus in mediated learning, however, there are several weaknesses that should be noted: 

      (1) This work first validates mediated learning in a sensory preconditioning procedure using light and tone cues as neutral stimuli and a mild foot shock as the unconditioned stimulus, in both males and females. They found interesting sex differences at the behavioral level, but then only focused on male mice when recording and manipulating the hippocampus. The authors do not address sex differences at the neural level. 

      We appreciate the Reviewer’s comment. Indeed, thanks to other comments received during this revision process (see Point 1 of Reviewer #2), we performed an additional experiment revealing that, with the described protocol, female mice show fear generalization rather than mediated learning responding. These data point to the need for sex-specific changes in the behavioral protocols used to measure sensory preconditioning. The revised version of the manuscript, although it highlights these sex differences in behavioral performance (see Supplementary Figure 2), focuses more on male mice and, accordingly, all photometry and chemogenetic experiments were performed in male mice. In future studies, once we are certain that we have a sensory preconditioning paradigm that works in female mice, it will be very interesting to study whether the hippocampal mechanisms mediating this behavior in male mice are also observed in female mice.

      (2) As expected in fear conditioning, the range of inter-individual differences is quite high. Mice that didn't develop a strong light-->shock association, as evidenced by a lower percentage of freezing during the Probe Test Light phase, should manifest a low percentage of freezing during the Probe Test Tone phase. It would be interesting to test for a correlation between the level of freezing during mediated vs test phases.

      Thanks to the comment raised by the reviewer, we generated a new set of data correlating mediated and direct fear responses. As it can be observed in Supplementary Figure 3, there is a significant correlation between mediated and direct learning in male mice (i.e. the individuals that freeze more in the direct learning test, correlate with the individuals that express more fear response in the mediated learning test). In contrast, this correlation is absent in female mice, further confirming what we have explained above. We have highlighted this new analysis in the Results section (Page 11, Lines 20-24).
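      For readers who want to run the same kind of check on their own data, a per-animal correlation between freezing in the two probe tests can be computed as sketched below. The response does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the function and argument names are illustrative.

```python
import math


def pearson_r(direct_freezing, mediated_freezing):
    """Pearson correlation between two equal-length lists of per-animal values."""
    n = len(direct_freezing)
    if n != len(mediated_freezing) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(direct_freezing) / n
    mean_y = sum(mediated_freezing) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(direct_freezing, mediated_freezing))
    sxx = sum((x - mean_x) ** 2 for x in direct_freezing)
    syy = sum((y - mean_y) ** 2 for y in mediated_freezing)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

      Each list would hold one freezing percentage per animal, in the same animal order, for the Probe Test Light and Probe Test Tone phases respectively; a significance test would be needed on top of the coefficient itself.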

      (3) The use of a synapsin promoter to transfect neurons in a non-specific manner does not bring much information. The authors applied a more specific approach to target PV+ neurons only, and it would have been more informative to keep with this cell-specific approach, for example by looking also at somatostatin+ inter-neurons. 

      The idea behind using a pan-neuronal promoter was to assess in general terms how neuronal activity in the hippocampus is engaged during different phases of the light-tone sensory preconditioning. However, the comment of the Reviewer is very pertinent and, as suggested, we have generated some new data targeting CaMKII-positive neurons (see Point 4 below). Finally, although it could be extremely interesting, we believe that targeting different interneuron subtypes is out of the scope of the present work. However, we have added this in the Discussion Section as a future perspective/limitation of our study (Page 17, Lines 9-24).

      (4) The authors observed event-related Ca2+ transients on hippocampal pan-neurons and PV+ inter-neurons using fiber photometry. They then used chemogenetics to inhibit CaMKII+ hippocampal neurons, which does not logically follow. It does not undermine the main finding of CaMKII+ neurons of the dorsal, but not ventral, hippocampus being involved in the preconditioning, but not conditioning, phase. However, observing CaMKII+ neurons (using fiber photometry) in mice running the same task would be more informative, as it would indicate when these neurons are recruited during different phases of sensory preconditioning. Applying then optogenetics to cancel the observed event-related transients (e.g., during the presentation of light and tone cues, or during the foot shock presentation) would be more appropriate.  

      We have generated new photometry data to analyze the activity of CaMKII-positive neurons during the preconditioning phase and confirm their engagement during the light-tone pairings. To this end, we infused a CaMKII-GCAMP calcium sensor into the dHPC and vHPC of mice and recorded its activity during the 6 preconditioning sessions. The new results can be found in Figure 3 and are explained in the Results section (Page 12, Lines 26-36). They clearly show an engagement of CaMKII-positive neurons during the light-tone pairing in both the dHPC and vHPC. Finally, although the suggested optogenetic manipulations would be very elegant, we hope the Reviewer is convinced that our chemogenetic results are sufficient to demonstrate the involvement of the dHPC in the formation of mediated learning in the Light-Tone sensory preconditioning paradigm. However, we have added this in the Discussion Section as a future perspective/limitation of our study (Page 17, Lines 9-24).

      (5) Probe tests always start with the "Probe Test Tone", followed by the "Probe Test Light". "Probe Test Tone" consists of an extinction session, which could affect the freezing response during "Probe Test Light" (e.g., Polack et al. (http://dx.doi.org/10.3758/s13420-013-0119-5)). Preferably, adding a group of mice with a Probe Test Light with no Probe Test Tone could help clarify this potential issue. The authors should at least discuss the possibility that the tone extinction session prior to the "Probe Test Light" could have affected the freezing response to the light cue. 

      We appreciate the comment raised by the reviewer. However, we think that our direct learning responses are quite robust in all of our experiments and, thus, the impact of a possible extinction based on the tone presentation should not affect our direct learning. However, as it is an important point, we have discussed it in the Discussion Section (Page 17, Lines 12-14).  

      Reviewer #4 (Public review): 

      Summary 

      Pinho et al use in vivo calcium imaging and chemogenetic approaches to examine the involvement of hippocampal sub-regions across the different stages of a sensory preconditioning task in mice. They find clear evidence for sensory preconditioning in male but not female mice. They also find that, in the male mice, CaMKII-positive neurons in the dorsal hippocampus: (1) encode the audio-visual association that forms in stage 1 of the task, and (2) retrieve/express sensory preconditioned fear to the auditory stimulus at test. These findings are supported by evidence that ranges from incomplete to convincing. They will be valuable to researchers in the field of learning and memory. 

      We appreciate the summary of our work and all the constructive comments raised by the Reviewer, which have greatly improved the clarity and quality of our manuscript.  

      Abstract 

      Please note that sensory preconditioning doesn't require the stage 1 stimuli to be presented repeatedly or simultaneously. 

      The reviewer is right, and we have corrected and changed that information in the revised abstract.  

      "Finally, we combined our sensory preconditioning task with chemogenetic approaches to assess the role of these two hippocampal subregions in mediated learning."  This implies some form of inhibition of hippocampal neurons in stage 2 of the protocol, as this is the only stage of the protocol that permits one to make statements about mediated learning. However, it is clear from what follows that the authors interrogate the involvement of hippocampal sub-regions in stages 1 and 3 of the protocol - not stage 2. As such, most statements about mediated learning throughout the paper are potentially misleading (see below for a further elaboration of this point). If the authors persist in using the term mediated learning to describe the response to a sensory preconditioned stimulus, they should clarify what they mean by mediated learning at some point in the introduction. Alternatively, they might consider using a different phrase such as "sensory preconditioned responding". 

      Considering the arguments of the Reviewer, we have modified our text in the Abstract and throughout the main text. Moreover, based on a comment of Reviewer #2 (Point 2), we have generated new data demonstrating that the dHPC does not seem to be involved in mediated learning formation during Stage 2, as its inhibition does not impair sensory preconditioning responding. These new data can be seen in Supplementary Figure 7G.

      Introduction 

      "Low-salience" is used to describe stimuli such as tone, light, or odour that do not typically elicit responses that are of interest to experimenters. However, a tone, light, or odour can be very salient even though they don't elicit these particular responses. As such, it would be worth redescribing the "low-salience" stimuli in some other terms. 

      Throughout the revised version of the manuscript, we have replaced the term “low-salience” with “innocuous stimuli” or removed the adjective altogether where we considered it unnecessary.

      "These higher-order conditioning processes, also known as mediated learning, can be captured in laboratory settings through sensory preconditioning procedures2,6-11."  Higher-order conditioning and mediated learning are not interchangeable terms: e.g., some forms of second-order conditioning are not due to mediated learning. More generally, the use of mediated learning is not necessary for the story that the authors develop in the paper and could be replaced for accuracy and clarity. E.g., "These higher-order conditioning processes can be studied in the laboratory using sensory preconditioning procedures2,6-11." 

      Following the Reviewer’s proposal, we have modified the text.

      In reference to Experiment 2, it is stated that: "However, when light and tone were separated on time (Unpaired group), male mice were not able to exhibit mediated learning response (Figure 2B) whereas their response to the light (direct learning) was not affected (Figure 2D). On the other hand, female mice still present a lower but significant mediated learning response (Figure 2C) and normal direct learning (Figure 2E). Finally, in the No-Shock group, both male (Figure 2B and 2D) and female mice (Figure 2C and 2E) did not present either mediated or direct learning, which also confirmed that the exposure to the tone or light during Probe Tests do not elicit any behavioral change by themselves as the presence of the electric footshock is required to obtain a reliable mediated and direct learning responses."  The absence of a difference between the paired and unpaired female mice should not be described as "significant mediated learning" in the latter. It should be taken to indicate that performance in the females is due to generalization between the tone and light. That is, there is no sensory preconditioning in the female mice. The description of performance in the No-shock group really shouldn't be in terms of mediated or direct learning: that is, this group is another control for assessing the presence of sensory preconditioning in the group of interest. As a control, there is no potential for them to exhibit sensory preconditioning, so their performance should not be described in a way that suggests this potential. 

      All these comments are very pertinent and were also raised by Reviewer #2 (Point 1, see above). In the revised version of the manuscript, we have carefully changed our interpretation of the results where necessary (e.g. in the case of the No-Shock group). In addition, we have generated new data confirming that, under similar conditions (i.e. 2 conditioning sessions in our SPC), female mice show fear generalization rather than reliable sensory preconditioning responding. In our opinion, this does not rule out mediated learning in female mice but suggests that protocols adapted to each sex must be used. These results led us to change the organization of the Figures, and we hope the Reviewer will agree with all the proposed changes. In addition, we have rewritten a paragraph in the Discussion Section to explain these sex differences (see Page 15, lines 12-37).

      Methods - Behavior 

      I appreciate the reasons for testing the animals in a new context. This does, however, raise other issues that complicate the interpretation of any hippocampal engagement: e.g., exposure to a novel context may engage the hippocampus for exploration/encoding of its features - hence, it is engaged for retrieving/expressing sensory preconditioned fear to the tone. This should be noted somewhere in the paper given that one of its aims is to shed light on the broader functioning of the hippocampus in associative processes. 

      This general issue - that the conditions of testing were such as to force engagement of the hippocampus - is amplified by two further features of testing with the tone. The first is the presence of background noise in the training context and its absence in the test context. The second is the fact that the tone was presented for 30 s in stage 1 and then continuously for 180s at test. Both changes could have contributed to the engagement of the hippocampus as they introduce the potential for discrimination between the tone that was trained and tested. 

      We have now added these pertinent comments to a “Study limitations” paragraph in the Discussion Section (Page 17, Lines 9-24). Indeed, the different changes of context (including the presence of background noise) were motivated by the fact that, while setting up the paradigm, we encountered problems of fear generalization (also in male mice). Similarly, the differences in cue exposure between the preconditioning phase and the test phase were decided on the basis of important differences between how mice respond and the protocols previously used in rats. Specifically, mice were not able to adapt their behavioral responses when shorter cue-exposure windows were used, as clearly happens with rats [1].

      Results - Behavior 

      The suggestion of sex differences based on differences in the parameters needed to generate sensory preconditioning is interesting. Perhaps it could be supported through some set of formal analyses. That is, the data in supplementary materials may well show that the parameters needed to generate sensory preconditioning in males and females are not the same. However, there needs to be some form of statistical comparison to support this point. As part of this comparison, it would be neat if the authors included body weight as a covariate to determine whether any interactions with sex are moderated by body weight.  

      Regarding the comparison between male and female mice, although the Reviewer’s comments are pertinent and interesting, we think that, given the new data generated, it is not appropriate to compare the two sexes, as we still have to optimize the SPC protocol for female mice. 

      What is the value of the data shown in Figure 1 given that there are no controls for unpaired presentations of the sound and light? In the absence of these controls, the experiment cannot have shown that "Female and male mice show mediated learning using an auditory-visual sensory preconditioning task" as implied by its title. Minimally, this experiment should be relabelled. 

      Based on the new data generated with female mice, we have decided to remove Figure 1 and re-organize the structure of the manuscript. We hope that the Reviewer would agree that this has improved the clarity of the manuscript.  

      "Altogether, this data confirmed that we successfully set up an LTSPC protocol in mice and that this behavioral paradigm can be used to further study the brain circuits involved in higher-order conditioning."  Please insert the qualifier that LTSPC was successfully established in male mice. There is no evidence of LTSPC in female mice. 

      We fully agree with the Reviewer and our new findings further confirm this issue. Thus, we have changed the statement in the revised version of the manuscript.  

      Results - Brain 

      "Notably, the inhibition of CaMKII-positive neurons in the dHPC (i.e. J60 administration in DREADD-Gi mice) during preconditioning (Figure 4B), but not before the Probe Test 1 (Figure 4B), fully blocked mediated, but not direct learning (Figure  4D)." The right panel of Figure 4B indicates no difference between the controls and Group DPC in the percent change in freezing from OFF to ON periods of the tone. How does this fit with the claim that CaMKII-positive neurons in the dorsal hippocampus regulate associative formation during the session of tone-light exposures in stage 1 of sensory preconditioning? 

      To improve the quality of the figures and to avoid possible redundancies between panels, we have removed all the panels showing the percentage of change in the new version of the manuscript. However, regarding the issue raised by the Reviewer, in our opinion the inhibition of the dHPC clearly impaired mediated learning, as these animals, unlike the other two groups, do not change their behavior when the tone appears (i.e. there is no significant increase in freezing between OFF and ON periods). The graphs of the percentage of change (old version of the manuscript) were a different way to show the presence of tone- or light-induced responses in each experimental group. A significant effect (shown by the # symbol) meant that, in that specific experimental group, there was a significant change in behavior (freezing) when the cue (tone or light) appeared compared with when there was no cue (OFF period). Thus, in the old panel 4B mentioned by the Reviewer, the absence of significance in the group where the dHPC was inhibited during preconditioning, in contrast to the clear significant effect in the other groups, indicates in our opinion an impairment of mediated learning formation. However, to avoid any confusion, we have slightly modified the text to state strictly what is being analyzed and/or shown in the graphs and, as mentioned, the graphs of percentage of change have been removed.  

      Discussion 

      "When low salience stimuli were presented separated on time or when the electric footshock was absent, mediated and direct learning were abolished in male mice. In female mice, although light and tone were presented separately during the preconditioning phase, mediated learning was reduced but still present, which implies that female mice are still able to associate the two low-salience stimuli." 

      This doesn't quite follow from the results. The failure of the female unpaired mice to withhold their freezing to the tone should not be taken to indicate the formation of a light-tone association across the very long interval that was interpolated between these stimulus presentations. It could and should be taken to indicate that, in female mice, freezing conditioned to the light simply generalized to the tone (i.e., these mice could not discriminate well between the tone and light). 

      As discussed above, we fully agree with the Reviewer and all the manuscript has been modified as described above. 

      "Indeed, our data suggests that when hippocampal activity is modulated by the specific manipulation of hippocampal subregions, this brain region is not involved during retrieval."  Does this relate to the results that are shown in the right panel of Figure 4B, where there is no significant difference between the different groups? If so, how does it fit with the results shown in the left panel of this figure, where differences between the groups are observed? 

      "In line with this, the inhibition of CaMKII-positive neurons from the dorsal hippocampus, which has been shown to project to the restrosplenial cortex56, blocked the formation of mediated learning." 

      Is this a reference to the findings shown in Figure 4B and, if so, which of the panels exactly? That is, one panel appears to support the claim made here while the other doesn't. In general, what should the reader make of data showing the percent change in freezing from stimulus OFF to stimulus ON periods? 

      In our opinion, as pointed out above, the graphs of the percentage of change were a different way to show the presence of tone- or light-induced behavioral responses in each experimental group. A significant effect (shown by the # symbol) meant that, in that specific experimental group, there was a significant change in behavior (freezing) when the cue (tone or light) appeared compared with when there was no cue (OFF period). Thus, in the old panel 4B mentioned by the Reviewer, the absence of significance in the group where the dHPC was inhibited during preconditioning, in contrast to the clear significant effect in the other groups, indicates in our opinion an impairment of mediated learning formation. In the revised version of the manuscript, we have rephrased these sentences to stick to what the graphs show and, as explained, the graphs of percentage of change have been removed.
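
      As a minimal illustration of the two quantities discussed above, the following Python sketch computes the percentage of freezing in a period and a percentage change from the OFF to the ON period. The function names, the relative-change formula, and the example numbers are assumptions made for illustration only; they are not taken from the manuscript's analysis code or data.

```python
def percent_freezing(frozen_seconds, period_seconds):
    """Percentage of a period (ON or OFF) that the animal spent freezing."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100.0 * frozen_seconds / period_seconds


def percent_change(off_pct, on_pct):
    """Relative change in freezing from the OFF to the ON period.

    Assumed formula (hypothetical): 100 * (ON - OFF) / OFF.
    """
    if off_pct == 0:
        raise ValueError("relative change is undefined when OFF freezing is 0")
    return 100.0 * (on_pct - off_pct) / off_pct


# Hypothetical animal: 18 s of freezing in a 180 s OFF period,
# 45 s of freezing in a 180 s ON period.
off_pct = percent_freezing(18, 180)
on_pct = percent_freezing(45, 180)
print(off_pct, on_pct, percent_change(off_pct, on_pct))
```

      In this formulation the change is undefined when OFF-period freezing is zero, so the raw ON and OFF percentages carry the same information without that edge case.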

      Reviewer #1 (Recommendations for the authors): 

      The authors may address the following questions: 

      (1) The study identifies major sex differences in the conditioning phase, with females showing faster learning. Since hormonal fluctuations can influence learning and behavior, it would be helpful for the authors to comment on whether they tracked the estrous cycle of the females and whether any potential effects of the cycle on mediated learning were considered. 

      This is a relevant and important point raised by the Reviewer. In our study we did not track the estrous cycle to investigate whether it has any effect on mediated learning, which could be an interesting project in itself. Although in the revised version of the manuscript we provide new information regarding the mediated learning performance in male and female mice, we agree with the Reviewer that sex hormones may account for the observed sex differences. However, the aim of the present work was to explore potential sex differences in mediated learning responding rather than to investigate the specific mechanisms behind these potential sex differences. 

      For this reason and to avoid adding further complexity to our present study, we did not check the estrous cycle in the female mice, measure the testosterone levels in male mice or analyze the amount of sex hormones during different phases of the sensory preconditioning task. Indeed, we think that checking the estrous cycle in female mice would still not be enough to ascertain the role of sex hormones because checking the androgen levels in male mice would also be required. In line with this, meta-analyses of the neuroscience literature using mice as research subjects [2-4] have revealed that data collected from female mice (regardless of the estrous cycle) did not vary more than the data from males. In conclusion, we think that using randomized and mixed cohorts of male and female mice (as in the present study) would provide the same degree of variability in both sexes. Nevertheless, we have added a sentence to point to this possibility in the Discussion Section (Page 15, lines 32-37). 

      (2) The rationale for including parvalbumin (PV) cells in the study could be clarified. Is there prior evidence suggesting that this specific cell type is involved in mediated learning? This could apply to sensory stimuli not used in the current study.

      In the revised version of the manuscript, we have better clarified why we targeted PV interneurons, specifically mentioning previous studies [5] (see Page 11, Lines 27-34). 

      (3) The photometry recordings from the dHPC during the preconditioning phase, shown in Figure 3, are presented as average responses. It would be beneficial to separate the early vs. late trials to examine whether there is an increase in hippocampal activity as the associative learning progresses, rather than reporting the averaged data. Additionally, to clarify the dynamics of the dHPC in associative learning, the authors could compare the magnitude of photometry responses when light and tone stimuli are presented individually in separate sessions versus when they are presented closely in time to facilitate associative learning.

      As commented above, and following the Reviewer’s comment, we have now included a new Supplementary Figure 4, which splits the photometry data by preconditioning and conditioning session. Overall, these data suggest that there are no major changes in cell activity in both hippocampal regions across the different sessions, as a similar light-tone-induced enhancement of activity is observed. There is only an interesting trend in the activity of Pan-Neurons at the onset of light during conditioning sessions. All this is now included in the Results Section (Page 12, Lines 13-15).
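
      The kind of early-versus-late split requested by the Reviewer could be sketched as follows. This is only an illustrative Python outline under stated assumptions: the baseline z-scoring, the half-split of trials, the function names, and the toy traces are all hypothetical and do not reproduce the processing used for the Supplementary Figure.

```python
from statistics import mean, pstdev


def zscore_trace(trace, n_baseline):
    """Z-score one trial against its own pre-cue baseline samples (assumed scheme)."""
    baseline = trace[:n_baseline]
    mu = mean(baseline)
    sd = pstdev(baseline)
    if sd == 0:
        raise ValueError("baseline has zero variance")
    return [(x - mu) / sd for x in trace]


def mean_response(traces, n_baseline):
    """Average post-baseline z-score across a set of trials."""
    per_trial = []
    for trace in traces:
        z = zscore_trace(trace, n_baseline)
        per_trial.append(mean(z[n_baseline:]))
    return mean(per_trial)


def early_vs_late(traces, n_baseline):
    """Split trials (in presentation order) into halves and compare responses."""
    half = len(traces) // 2
    early = mean_response(traces[:half], n_baseline)
    late = mean_response(traces[half:], n_baseline)
    return early, late


# Toy traces (hypothetical): 4 baseline samples followed by 2 cue samples.
toy_trials = [
    [0, 2, 0, 2, 3, 3],
    [0, 2, 0, 2, 3, 3],
    [0, 2, 0, 2, 5, 5],
    [0, 2, 0, 2, 5, 5],
]
early, late = early_vs_late(toy_trials, n_baseline=4)
print(early, late)
```

      A larger late than early value in such a comparison would point to a response that grows across trials; similar values would be consistent with the absence of major changes reported above.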

      (4) The authors note that PV cell responses recorded with GCaMP were similar to general hippocampal neurons, yet chemogenetic manipulations of PV cells did not impact behavior. A more detailed discussion of this discrepancy would be helpful. 

      As suggested by the Reviewer, we have included additional Discussion to explain the potential discrepancy between the activity of PV interneurons assessed by photometry and its modulation by chemogenetics (see Page 16, Lines 27-33).   

      (5) All fiber photometry recordings were conducted in male mice. Given the sex differences observed in associative learning, the authors could expand the study to include dHPC responses in females during both preconditioning and conditioning sessions. 

      We appreciate the Reviewer’s comment. Indeed, in light of other comments made in this revision (see Point 1 of Reviewer #2), we are still not sure that we have an optimal protocol to study mediated learning in female mice, owing to sex-specific differences in fear generalization. Thus, the revised version of the manuscript, although highlighting these sex differences in behavioral performance (see Supplementary Figure 2), is more focused on male mice and, accordingly, all photometry and chemogenetic experiments were performed exclusively in male mice. In future studies, once we are sure that we have a sensory preconditioning paradigm that works in female mice, it will be very interesting to study whether the hippocampal mechanisms mediating this behavior in male mice are also observed in female mice. 

      Minor Comments: 

      (1) In the right panel of Figure 2A, females received only one conditioning session, so the "x2" should be corrected to "x1" conditioning to accurately reflect the data. 

      We thank the Reviewer for the comment that has been addressed in the revised version of the manuscript.  

      (2) The overall presentation of Figure 3 could be improved. For example, the y-axis in Panel B could be cut to a maximum of 3 rather than 6, which would better highlight the response data. Alternatively, including heatmap representations of the z-score responses could enhance clarity and visual impact.  

      We thank the Reviewer for this comment, which has been addressed by providing a new format for Figures 2 and 3 in the revised version of the manuscript.   

      (3) There are several grammatical errors throughout the manuscript. It is recommended that the authors use a grammar correction tool to improve the overall writing quality and readability.  

      We have tried to correct the grammar throughout the manuscript.  

      Reviewer #2 (Recommendations for the authors):  

      (1) In the abstract the authors write that sensory preconditioning requires the "repeated and simultaneous presentation of two low-salience stimuli such as a light and a tone". Previous research has shown that sensory preconditioning can still occur if the two stimuli are presented serially, rather than simultaneously. Further, the tone and the light are not necessarily "low-salience", for example, they can be loud or bright. It would be better to refer to them as innocuous. 

      In the revised version of the abstract, we have included the modifications suggested by the Reviewer.   

      (2) The authors develop a novel automated tool for assessing freezing behaviour in mice that correlates highly with both manual freezing and existing, open-source freeze estimation software (ezTrack). The authors should explain how the new program differs from ezTrack, or if it provides any added benefit over this existing software. 

      We have added new information in the Results Section (Page 10, Lines 13-20) to better explain how the new tool to quantify freezing could improve on existing software.  

      (3) In Experiment 1, the authors report a sex difference in levels of freezing between male and female mice when they are only given one session of sensory preconditioning. This should be supported by a statistical comparison of levels of freezing between male and female mice. 

      Based on the new results obtained with female mice, we have decided to remove the original Figure 1 of the manuscript as it is not meaningful to compare male and female mediated learning response if we do not have an optimal protocol in female mice.  

      (4) Why did the authors choose to vary the duration of the stimuli across preconditioning, conditioning, and testing? During preconditioning, the light-tone compound was 30s, in conditioning the light was 10s, and at test both stimuli were presented continuously for 3 min. Did the level of freezing vary across the three-minute probe session? There is some evidence that rodents can learn the timing of stimuli and it may be the case that freezing was highest at the start of the test stimulus, when it most closely resembled the conditioned stimulus. 

      Differences in cue exposure between the preconditioning phase and the test phase were decided based on important differences between how rats respond in previously used protocols and how mice respond. Indeed, mice were not able to adapt their behavioral responses when the cue was presented for shorter time windows, as rats clearly do [1]. In addition, we have added a new graph to show the time course of the behavioral responses (see Figures 1 and 4 and Supplementary Figure 2), which correlates with the quantification of freezing responses shown by the percentage of freezing during ON and OFF periods.   

      (5) The title of Experiment 1 "Female and male mice show mediated learning using an auditory-visual sensory preconditioning task" - this experiment does not demonstrate mediated learning; it merely shows that animals will freeze more in the presence of a stimulus as compared with no stimulus. This experiment lacks the necessary controls to claim mediated learning (which are presented in Experiment 2) and should therefore be retitled something more appropriate.

      As stated above, based on the new results obtained with female mice, we have decided to remove the original Figure 1 of the manuscript as it is not meaningful to compare male and female mediated learning response if we do not have an optimal protocol in female mice.   

      (6) In Figure 2, why does the unpaired group show less freezing to the tone than the paired group given that the tone was directly paired with the shock in both groups? 

      We believe the Reviewer may have referred to the tone in error (i.e. there are no differences in the freezing observed to the tone) and may instead be referring to the freezing induced by the light in the direct learning test. In this case, it is true that direct learning (e.g. percentage of freezing) seems to be slightly lower in the unpaired group than in the paired one, which could be due to a latent inhibition process caused by the different cue exposure in the paired and unpaired experimental groups. However, direct learning in both groups is clear and significant, and there are no significant differences between them, which makes it difficult to draw any further conclusion. 

      (7) The stimuli in the design schematics are quite small and hard to see, they should be enlarged for clarity. The box plots also looked stretched and the colour difference between the on and off periods is difficult to discern. 

      We have made some important modifications to the Figures in order to address the Reviewer’s comments and improve their quality.   

      (8) The authors do not include labels for the experimental groups (paired, unpaired, no shock) in Figures 2B, 2D, 2C, and 2E. This made it very difficult to interpret the figure.  

      Figure 2 has been changed according to this suggestion. 

      (9) The levels of freezing during conditioning should be presented for all experiments.  

      We have generated a new Supplementary Figure 9 to show the freezing levels during conditioning sessions. 

      (10) In the final experiment, the authors wrote that mice were injected with J60 or saline, but I could not find the data for the saline animals.  

      In the Results and Methods section, we have included a sentence to better explain this issue. In addition, we have added a new Supplementary Figure 7 to show the performance of all control groups.  

      (11) Please list the total number of animals (per group, per sex) for each experiment.  

      In the revised version of the manuscript, we have added this information in each Figure Legend.  

      Reviewer #3 (Recommendations for the authors): 

      I found this study very interesting, despite a few weaknesses. I have several minor comments to add, hoping that it would improve the manuscript: 

      (1) The terminology used is not always appropriate/consistent. I would use "freely moving fiber photometry" or simply "fiber photometry" as calcium imaging conventionally refers to endoscopic or 2-photon calcium imaging. 

      We thank the Reviewer for this comment that has been addressed and corrected in the revised version of the manuscript. 

      (2) "Dorsal hippocampus mediates light-tone sensory preconditioning task in mice" suggests that a brain region mediates a task. I would rather suggest, e.g. "Dorsal hippocampus mediates light-tone association in mice" 

      We thank the Reviewer for this comment that has been addressed and corrected in the revised version of the manuscript.

      (3) As you are using low-salience stimuli, it would be better to also inform the readership with the light intensity used for the light cue, for replicability purposes. 

      In the Methods section (Page 5, Line 30), we have added new information regarding the visual stimuli used. 

      (4) If the authors didn't use a background noise during the probe tests, the tone cue could have been perceived as being louder/clearer by mice. Couldn't it have inflated the freezing response for the tone cue?  

      This is an interesting comment by the Reviewer, although we do not have any data to directly address this suggestion. However, the background noise proved necessary to set up the protocol and to change different aspects of the context throughout the paradigm, which was needed to avoid fear generalization in mice. In addition, as demonstrated previously [6], background noise is important to prevent another auditory cue (i.e. the tone) from inducing fear responses by itself, as the transition from noise to silence is a danger signal for animals. 

      (5) "salience" is usually used for the intensity of a stimulus, not for an association or pairing. Rather, we usually refer to the strength of an association. 

      We thank the Reviewer for this comment that has been addressed and corrected in the revised version of the manuscript.

      (6) Figure 3, panel A. "RCaMP Neurons", maybe "Pan-Neurons" would be more appropriate, as PV+ inter-neurons are also neurons. 

      We thank the Reviewer for this comment that has been corrected accordingly.

      (7) Figure 4, panel A, please add the AAV injected, and the neurons labelled in your example slice. 

      We thank the Reviewer for this comment that has been corrected accordingly.

      References

      (1) Wong, F. S., Westbrook, R. F. & Holmes, N. M. 'Online' integration of sensory and fear memories in the rat medial temporal lobe. Elife 8 (2019). https://doi.org/10.7554/eLife.47085

      (2) Prendergast, B. J., Onishi, K. G. & Zucker, I. Female mice liberated for inclusion in neuroscience and biomedical research. Neurosci Biobehav Rev 40, 1-5 (2014). https://doi.org/10.1016/j.neubiorev.2014.01.001

      (3) Becker, J. B., Prendergast, B. J. & Liang, J. W. Female rats are not more variable than male rats: a meta-analysis of neuroscience studies. Biol Sex Differ 7, 34 (2016). https://doi.org/10.1186/s13293-016-0087-5

      (4) Shansky, R. M. Are hormones a "female problem" for animal research? Science 364, 825-826 (2019). https://doi.org/10.1126/science.aaw7570

      (5) Busquets-Garcia, A. et al. Hippocampal CB1 Receptors Control Incidental Associations. Neuron 99, 1247-1259 e1247 (2018). https://doi.org/10.1016/j.neuron.2018.08.014

      (6) Pereira, A. G., Cruz, A., Lima, S. Q. & Moita, M. A. Silence resulting from the cessation of movement signals danger. Curr Biol 22, R627-628 (2012). https://doi.org/10.1016/j.cub.2012.06.015

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review): 

      Summary: 

      This paper by Schommartz and colleagues investigates the neural basis of memory reinstatement as a function of both how recently the memory was formed (recent, remote) and its development (children, young adults). The core question is whether memory consolidation processes as well as the specificity of memory reinstatement differ with development. A number of brain regions showed a greater activation difference for recent vs. remote memories at the long versus shorter delay specifically in adults (cerebellum, PHG, LOC). A different set showed decreases in the same comparison, but only in children (precuneus, RSC). The authors also used neural pattern similarity analysis to characterize reinstatement, though still in this revised paper I have substantive concerns about how the analyses were performed. While scene-specific reinstatement decreased for remote memories in both children and adults, claims about its presence cannot be made given the analyses. Gist-level reinstatement was observed in children but not adults, but I also have concerns about this analysis. Broadly, the behavioral and univariate findings are consistent with the idea memory consolidation differs between children and adults in important ways, and takes a step towards characterizing how.

      Strengths: 

      The topic and goals of this paper are very interesting. As the authors note, there is little work on memory consolidation over development, and as such this will be an important data point in helping us begin to understand these important differences. The sample size is great, particularly given this is an onerous, multi-day experiment; the authors are to be commended for that. The task design is also generally well controlled, for example as the authors include new recently learned pairs during each session.  

      Weaknesses: 

      As noted above and in my review of the original submission, the pattern similarity analysis for both item and category-level reinstatement were performed in a way that is not interpretable given concerns about temporal autocorrelation within scanning run. Unfortunately these issues remain of concern in this revision because they were not rectified. Most of my review focuses on this analytic issue, though I also outline additional concerns. 

      (1) The pattern similarity analyses are largely uninterpretable due to how they were performed. 

      (a) First, the scene-specific reinstatement index: The authors have correlated a neural pattern during a fixation cross (delay period) with a neural pattern associated with viewing a scene as their measure of reinstatement. The main issue with this is that these events always occurred back-to-back in time. As such, the two patterns will be similar due simply to the temporal autocorrelation in the BOLD signal. Because of the issues with temporal autocorrelation within scanning run, it is always recommended to perform such correlations only across different runs. In this case, the authors always correlated patterns extracted from the same run, and which moreover have temporal lags that are perfectly confounded with their comparison of interest (i.e., from Fig 4A, the "scene-specific" comparisons will always be back-to-back, having a very short temporal lag; "set-based" comparisons will be dispersed across the run, and therefore have a much higher lag). The authors' within-run correlation approach also yields correlation values that are extremely high - much higher than would be expected if this analysis was done appropriately. The way to fix this would be to restrict the analysis to only cross-run comparisons, which is not possible given the design. 

      To remedy this, in the revision the authors have said they will refrain from making conclusions about the presence of scene-specific reinstatement (i.e., reinstatement above baseline). While this itself is an improvement from the original manuscript, I still have several concerns. First, this was not done thoroughly and at times conclusions/interpretations still seem to imply or assume the presence of scene reinstatement (e.g., line 979-985, "our research supports the presence of scene-specific reinstatement in 5-to-7-year-old children"; line 1138). 

      We thank the reviewers for pointing out the inconsistencies in our writing. We agree that we cannot make any claims about the baseline level of scene-specific reinstatement. To reiterate, our focus is on the changes in reinstatement over time (30 minutes, 24 hours, and two weeks after learning), which showed a robust decrease. Importantly, scene-specific reinstatement indices for recent items, which were tested on different days, did not significantly differ, as indicated by non-significant main effects of Session (all p > .323) and Session x ROI interactions (all p > .817) in either age group. This supports our claim that temporal autocorrelation is stable and consistent across conditions and that the observed decline in scene-specific reinstatement reflects a time-dependent change in remote retrieval. We have revised the highlighted passages accordingly, emphasizing the delay-related decrease in scene-specific reinstatement rather than its absolute magnitude. 
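
      The autocorrelation concern discussed above can be illustrated with simulated data. The Python sketch below is not the study's pipeline and uses no real data: it assumes that each voxel is pure first-order autoregressive (AR(1)) noise with a hypothetical coefficient of 0.8, and compares the similarity of voxel patterns taken one time step apart with that of patterns taken twenty steps apart. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_ar1_voxels(n_voxels, n_timepoints, phi, rng):
    """Independent AR(1) noise per voxel, started from its stationary distribution."""
    stationary_sd = 1.0 / math.sqrt(1.0 - phi ** 2)
    series = []
    for _ in range(n_voxels):
        x = rng.gauss(0.0, stationary_sd)
        ts = [x]
        for _ in range(n_timepoints - 1):
            x = phi * x + rng.gauss(0.0, 1.0)
            ts.append(x)
        series.append(ts)
    return series


def mean_pattern_similarity(series, lag):
    """Mean across-voxel correlation between patterns separated by `lag` steps."""
    n_timepoints = len(series[0])
    sims = []
    for t in range(n_timepoints - lag):
        pattern_a = [voxel[t] for voxel in series]
        pattern_b = [voxel[t + lag] for voxel in series]
        sims.append(pearson(pattern_a, pattern_b))
    return sum(sims) / len(sims)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
voxels = simulate_ar1_voxels(n_voxels=200, n_timepoints=120, phi=0.8, rng=rng)
adjacent = mean_pattern_similarity(voxels, lag=1)
distant = mean_pattern_similarity(voxels, lag=20)
print(f"lag 1: {adjacent:.2f}, lag 20: {distant:.2f}")
```

      Under these assumptions, back-to-back patterns are highly similar even though the simulation contains no signal at all, whereas widely separated patterns are close to uncorrelated. This is why the absolute magnitude of a within-run similarity index is not interpreted here, and why the argument rests on differences between conditions with matched temporal structure.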

      Second, the authors' logic for the neural-behavioural correlations in the PLSC analysis involved restricting to regions that showed significant reinstatement for the gist analysis, which cannot be done for the analogous scene-specific reinstatement analysis. This makes it challenging to directly compare these two analyses since one was restricted to a small subset of regions and only children (gist), while scene reinstatement included both groups and all ROIs. 

      We thank the reviewer for pointing this out and want to clarify that it was not our intention to directly compare these analyses. For the neural-behavioral correlations, we included only those regions that showed gist-like representations relative to baseline, whereas for scene-specific reinstatement, we included all regions due to the absence of such a baseline. The primary aim of the PLSC analysis was to identify a set of regions that, after a stringent permutation and bootstrapping procedure, form a latent variable that explains a significant proportion of variance in behavioral performance across all participants. 

      Third, it is also unclear whether children and adults' values should be directly comparable given pattern similarity can be influenced by many factors like motion, among other things. 

      We thank the reviewer for raising this important point. In our multivariate analysis, we included confounding regressors specifically addressing motion-related artefacts. Following recent best practices for mitigating motion-related confounding factors in both adult and pediatric fMRI data (Ciric et al., 2017; Esteban et al., 2020; Jones et al., 2021; Satterthwaite et al., 2013), we implemented the most effective motion correction strategies. 

      Importantly, our group × session interaction analysis focuses on relative changes in reinstatement over time rather than comparing absolute levels of pattern similarity between children and adults. This approach controls for potential baseline differences and instead examines whether the magnitude of delay-related changes differs across groups. We believe this warrants the comparison and ensures that our conclusions are not driven by group-level differences in baseline similarity or motion artifacts.

      My fourth concern with this analysis relates to the lack of regional specificity of the effects. All ROIs tested showed a virtually identical pattern: "Scene-specific reinstatement" decreased across delays, and was greater in children than adults. I believe control analyses are needed to ensure artifacts are not driving these effects. This would greatly strengthen the authors' ability to draw conclusions from the "clean" comparison of day 1 vs. day 14. (A) The authors should present results from a control ROI that should absolutely not show memory reinstatement effects (e.g., white matter?). Results from the control ROI should look very different - should not differ between children and adults, and should not show decreases over time. 

      (C) If the same analysis was performed comparing the object cue and immediately following fixation (rather than the fixation and the immediately following scene), the results should look very different. I would argue that this should not be an index of reinstatement at all since it involves something presented visually rather than something reinstated (i.e., the scene picture is not included in this comparison). If this control analysis were to show the same effects as the primary analysis, this would be further evidence that this analysis is uninterpretable and hopelessly confounded. 

      We appreciate the reviewer’s suggestion to strengthen the interpretation of our findings by including appropriate control analyses to rule out non-memory-related artifacts. In response, we conducted several control analyses, detailed below, which collectively support the specificity of the observed reinstatement effects. These results are reported in the manuscript (lines 593-619).

      We verified that item reinstatement for incorrectly remembered trials did not show a session-related decline in any ROI. This indicates that the reinstatement observed for correctly remembered items is memory-related (see Fig. S5 for details).

      We conducted additional analyses on three subregions of the corpus callosum (the body, genu, and splenium). The results of the linear mixed-effects models revealed no significant group effect (all p > .426), indicating no differences between children and adults. In contrast, all three ROIs showed a significant main effect of Session (all p < .001). However, post hoc analyses indicated that this effect was driven by differences between the recent and the Day 14 remote condition. The main contrasts of interest – recent vs. Day 1 remote and Day 1 remote vs. Day 14 remote – were not significant (all p > .080; see Table S10.4), suggesting that, unlike in other ROIs, there was no delay-related decrease in scene-specific reinstatement in these white matter regions.

      We then repeated our analysis using the same procedure but replaced the “scene” time window with the “object” time window. The rationale for this control is that comparing the object cue to the immediately following fixation period should not reflect scene reinstatement, as the object and the reinstated scene rely on distinct neural representations. Accordingly, we did not expect a delay-related decrease in the reinstatement index. Consistent with this expectation, the analysis using the object-fixation similarity index, though also influenced by temporal autocorrelation, did not reveal any significant effect of session or delay in any ROI (all p > .059; see Tables S9, S9.1).

      Together, these control analyses provide converging evidence that our findings are not driven by global or non-specific signal changes. We believe that they strengthen our interpretation of the delay-related decrease in the scene-specific reinstatement index.

      (B) Do the recent items from day 1 vs. day 14 differ? If so, this could suggest something is different about the later scans (and if not, it would be reassuring). 

      The recent items tested on day 1 and day 14 do not differ (all p > .323). This result is stable across all ROIs.

      (b) For the category-based neural reinstatement: (1) This suffers from the same issue of correlations being performed within run. Again, to correct this the authors would need to restrict comparisons to only across runs (i.e., patterns from run 1 correlated with patterns for run 2 and so on). The authors in their response letter have indicated that because the patterns being correlated are not derived from events in close temporal proximity, they should not suffer from the issue of temporal autocorrelation. This is simply not true. For example, see the paper by Prince et al. (eLife 2022; on GLMsingle). This is not the main point of Prince et al.'s paper, but it includes a nice figure that shows that, using standard modelling approaches, the correlation between (same-run) patterns can be artificially elevated for lags as long as ~120 seconds (and can even be artificially reduced after that; Figure 5 from that paper) between events. This would affect many of the comparisons in the present paper. The cleanest way to proceed is to simply drop the within-run comparisons, which I believe the authors can do and yet they have not. Relatedly, in the response letter the authors say they are focusing mainly on the change over time for reinstatement at both levels including the gist-type reinstatement; however, this is not how it is discussed in the paper. They in fact are mainly relying on differences from zero, as children show some "above baseline" reinstatement while adults do not, but I believe there were no significant differences over time (i.e., the findings the authors said they would lean on primarily, as they are arguably the most comparable).  

      We thank the reviewer for this important comment regarding the potential inflation of similarity values due to within-run comparisons.

      To address the reviewer’s concern, we conducted an additional cross-run analysis for all correctly retrieved trials. The approach restricted comparisons to non-overlapping runs (run1-run2, run2-run3, run1-run3). This analysis revealed robust gist-like reinstatement in children for remote Day 14 memories in the mPFC (p = .035) and vlPFC (p = .0007), in adults for remote Day 1 memories in the vlPFC (p = .029), and in both children and adults for remote Day 1 memories in the LOC (p < .02). A significant Session effect in both regions (mPFC: p = .026; vlPFC: p = .002) indicated increased reinstatement for the long delay (Day 14) compared to the short delay and the recent session (all p < .05). Given that the cross-run results largely replicate and reinforce the effects found previously with the within-run analysis, we believe that combining both sources of information is methodologically justified and statistically beneficial. Specifically, both approaches independently identified significant gist-like reinstatement in children’s mPFC and vlPFC, particularly for remote memories, although the within-run vlPFC effect (short delay: p = .038; long delay: p = .047) did not survive correction for multiple comparisons. Including both within-run and between-run comparisons increases the number of unique, non-repeated trial pairs, improving statistical power without introducing redundancy. While we acknowledge that same-run comparisons may be influenced by residual autocorrelation (as shown by Prince et al., 2022, eLife), we believe that our design mitigates this risk through consistency between within-run and cross-run results, long inter-trial intervals, and trial-wise estimation of activation. We have adjusted the manuscript accordingly and now report the combined analysis. We also report the cross-run and within-run analyses separately in the supplementary materials (Tables S12.1, S12.2), showing that the within-run results converge with the cross-run results and thus strengthen rather than dilute the findings.
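
      For illustration only, the restriction of comparisons to trial pairs from different runs can be sketched as follows. This is a minimal sketch under stated assumptions, not the analysis code used for the manuscript: the trial structure (dictionaries with `run` and `pattern` keys), the helper names, and the toy values are all hypothetical, and plain Pearson correlation stands in for whichever similarity measure was actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length activation patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_run_pairs(trials):
    """Yield every pair of trials whose patterns come from different runs.

    `trials` is a list of dicts with hypothetical keys 'run' and 'pattern'.
    """
    for a, b in combinations(trials, 2):
        if a["run"] != b["run"]:  # drop all same-run comparisons
            yield a, b


def mean_cross_run_similarity(trials):
    """Average pattern correlation over cross-run trial pairs only."""
    sims = [pearson(a["pattern"], b["pattern"]) for a, b in cross_run_pairs(trials)]
    return sum(sims) / len(sims)


# Toy data: four trials across three runs (values are made up).
trials = [
    {"run": 1, "pattern": [0.2, 0.5, 0.1, 0.9]},
    {"run": 1, "pattern": [0.3, 0.4, 0.2, 0.8]},
    {"run": 2, "pattern": [0.1, 0.6, 0.0, 0.7]},
    {"run": 3, "pattern": [0.9, 0.1, 0.8, 0.2]},
]
n_pairs = sum(1 for _ in cross_run_pairs(trials))
similarity = mean_cross_run_similarity(trials)
```

      In this toy example, the single same-run pair (the two trials from run 1) is excluded, so only pairs spanning run1-run2, run1-run3, and run2-run3 contribute to the average.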

      As suggested, we now explicitly highlight the change over time as the central finding. We observe a clear increase in gist-like reinstatement from recent to remote memories in children, particularly in mPFC and vlPFC. These effects, based on combined within- and cross-run comparisons, are now clearly stated in the main results and interpreted accordingly in the discussion.

      (2) This analysis uses a different approach of comparing fixations to one another, rather than fixations to scenes. In their response letter and the revised paper, the authors do provide a bit of reasoning as to why this is the most sensible. However, it is still not clear to me whether this is really "reinstatement" which (in my mind) entails the re-evoking of a neural pattern initially engaged during perception. Rather, could this be a shared neural state that is category specific? 

      We thank the reviewer for raising this important conceptual point about whether our findings reflect reinstatement in the classical sense — namely, the reactivation of perceptual neural patterns — or a shared, category-specific state.

      While traditional definitions of reinstatement emphasize item-specific reactivation (e.g., Ritchey et al., 2013; Xiao et al., 2017), it is increasingly recognized that memory retrieval can also involve the reactivation of abstracted, generalized, or gist-like representations, especially as memories consolidate. Our analysis follows this view and aims to capture how memory representations evolve over time, particularly in development.

      Several studies support this broader notion of gist-like reinstatement. For instance, Chen et al. (2017) showed that while event-specific patterns were reinstated across the default mode network and medial temporal lobe, inter-subject recall similarity exceeded encoding-retrieval similarity, suggesting transformation and abstraction beyond perceptual reinstatement. Zhuang et al. (2021) further showed that loss of neural distinctiveness in the MTL over time predicted false memories, linking neural similarity to representational instability. This aligns with our finding that greater gist-like reinstatement is associated with lower memory accuracy.

      Ye et al. (2020) discuss how memory representations are reshaped post-encoding, becoming more differentiated, integrated, or weakened depending on task goals and neural resources. While their work focuses on adults, our previous findings in the same sample (Schommartz et al., 2023) suggest that children’s neural systems are structurally immature, making them more likely to rely on gist-based consolidation (see Fandakova et al., 2019). Adults, by contrast, may retain more item-specific traces.

      Relatedly, St-Laurent & Buchsbaum (2019) show that with repeated encoding, neural memory representations become increasingly distinct from perception, suggesting that reinstatement need not mimic perception. We agree that reinstatement does not always reflect reactivation of low-level sensory patterns, particularly over long delays or in developing brains.

      Finally, while we did not correlate retrieval patterns directly with perceptual encoding patterns, we assessed neural similarity among retrieved items within vs. between categories, based on non-repeated, independently sampled trials. This approach is intended to capture the structure and delay-related transformation of mnemonic representations, especially in terms of how they become more schematic or gist-like over time. Our findings align conceptually with the results of Kuhl et al. (2012), who used MVPA to show that older and newer visual memories can be simultaneously reactivated during retrieval, with greater reactivation of older memories interfering with retrieval accuracy for newer memories. Their work highlights how overlapping category-level representations in ventral temporal cortex can reflect competition among similar memories, even in the absence of item-specific cues. In our developmental context, we interpret the increased neural similarity among category members in children as possibly reflecting such representational overlap or competition, where generalized traces dominate over item-specific ones. This pattern may reflect a shift toward efficient but less precise retrieval, consistent with developmental constraints on memory specificity and consolidation.
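
      As a purely illustrative sketch of the within- versus between-category logic described above, one way to express such an index is the mean Fisher-z-transformed similarity of same-category trial pairs minus that of different-category pairs. The exact index used in the manuscript is not specified in this response, so the formula, the function names, the `category` and `pattern` keys, and the toy values below are assumptions.

```python
from itertools import combinations
from math import atanh, sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length activation patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def fisher_z(r, limit=0.999999):
    """Fisher z-transform; r is clipped so that atanh stays finite."""
    return atanh(max(-limit, min(limit, r)))


def gist_index(trials):
    """Mean within-category similarity minus mean between-category similarity.

    `trials` is a list of dicts with hypothetical keys 'category' and 'pattern'.
    Each trial pair is used once, so no pair is repeated.
    """
    within, between = [], []
    for a, b in combinations(trials, 2):
        z = fisher_z(pearson(a["pattern"], b["pattern"]))
        if a["category"] == b["category"]:
            within.append(z)
        else:
            between.append(z)
    return sum(within) / len(within) - sum(between) / len(between)


# Toy data: two categories, two retrieved trials each (values are made up).
trials = [
    {"category": "forest", "pattern": [1, 2, 3, 5]},
    {"category": "forest", "pattern": [1, 2, 4, 5]},
    {"category": "indoor", "pattern": [5, 3, 2, 1]},
    {"category": "indoor", "pattern": [4, 3, 1, 2]},
]
index = gist_index(trials)
```

      A positive value indicates that patterns are more similar among members of the same category than across categories; a value near zero indicates no category structure in the retrieval-phase patterns.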

      In this context, we view our findings as evidence of memory trace reorganization — from differentiated, item-level representations toward more schematic, gist-like neural patterns (Sekeres et al., 2018), particularly in children. Our cross-run analyses further confirm that this is not an artifact of same-run correlations or low-level confounds. We have clarified this distinction and interpretation throughout the revised manuscript (see lines 144-158; 1163-1170).

      In any case, I think additional information should be added to the text to clarify that this definition differs from others in the literature. The authors might also consider using some term other than reinstatement. Again (as I noted in my prior review), the finding of no category-level reinstatement in adults is surprising and confusing given prior work and likely has to do with the operationalization of "reinstatement" here. I was not quite sure about the explanation provided in the response letter, as category-level reinstatement is quite widespread in the brain for adults and is robust to differences in analytic procedures etc. 

      We agree that our operationalization of "reinstatement" differs from more conventional uses of the term, which typically involve direct comparisons between encoding and retrieval phases, often with item-level specificity. As our analysis is based on similarity among retrieval-phase trials (fixation-based activation patterns) and focuses on within- versus between-category neural similarity, we agree that the term reinstatement may suggest a stronger encoding–retrieval mapping than we are claiming.

      To avoid confusion and overstatement, we have revised the terminology throughout the manuscript: we now refer to our measure as “gist-like representations” rather than “gist-like reinstatement.” This change better reflects the nature of our analysis — namely, that we are capturing shared neural patterns among category-consistent memories that may reflect reorganized or abstracted traces, especially after delay and in development.

      As the reviewer rightly points out, category-level reinstatement is well documented in adults (e.g., Kuhl & Chun, 2014; Tompary et al., 2020; Tompary & Davachi, 2017). The absence of such effects in our adult group may indeed reflect differences in study design, particularly our use of non-repeated, cross-trial comparisons based on fixation events. It may also reflect different consolidation strategies, with adults preserving more differentiated or item-specific representations, while children form more schematic or generalizable representations, a pattern consistent with our interpretation and supported by prior work (Fandakova et al., 2019; Sekeres et al., 2018).

      We have updated the relevant sections of the manuscript (Results, Discussion (particularly lines 1163-1184), and Figure captions) to clarify this terminology shift and explicitly contrast our approach with more standard definitions of reinstatement. We hope this revision provides the needed conceptual clarity while preserving the integrity of our developmental findings.

      (3) Also from a theoretical standpoint-I'm still a bit confused as to why gist-based reinstatement would involve reinstatement of the scene gist, rather than the object's location (on the screen) gist. Were the locations on the screen similar across scene backgrounds from the same category? It seems like a different way to define memory retrieval here would be to compare the neural patterns when cued to retrieve the same vs. similar (at the "gist" level) vs. different locations across object-scene pairs. This is somewhat related to a point from my review of the initial version of this manuscript, about how scene reinstatement is not necessary. The authors state that participants were instructed to reinstate the scene, but that does not mean they were actually doing it. The point that what is being measured via the reinstatement analyses is actually not necessary to perform the task should be discussed in more detail in the paper. 

      We appreciate the reviewer’s thoughtful theoretical question regarding whether our measure of “gist-like representations” might reflect reinstatement of spatial (object-location) gist, rather than scene-level gist. We would like to clarify several key points about our task design and interpretation:

      (1) Object locations were deliberately varied and context dependent.

      In our stimulus set, each object was embedded in a rich scene context, and the locations were distributed across six distinct possible areas within each scene, with three possible object placements per location. These placements were manually selected to ensure realistic and context-sensitive positioning of objects within the scenes. Importantly, locations were not fixed across scenes within a given category. For example, objects placed in “forest” scenes could appear in different screen locations across different scene exemplars (e.g., one in the bottom-left side, another floating above). Therefore, the task did not introduce a consistent spatial schema across exemplars from the same scene category that could give rise to a “location gist.”

      (2) Scene categories provided consistent high-level contextual information.

      By contrast, the scene categories (e.g., farming, forest, indoor, etc.) provided semantically coherent and visually rich contextual backgrounds that participants could draw upon during retrieval. This was emphasized in the instruction phase, where participants were explicitly encouraged to recall the whole scene based on the stories they created during learning (not just the object or its position). While we acknowledge that we cannot directly verify the reinstated content, this instruction aligns with prior studies showing that scene and context reinstatement can occur even without direct task relevance (e.g., Kuhl & Chun, 2014; Ritchey et al., 2013).

      (3) Our results are unlikely to reflect location-based reinstatement.

      If participants had relied on a “location gist” strategy, we would have expected greater neural similarity across scenes with similar spatial layouts, regardless of category. However, our design avoids this confound by deliberately varying locations across exemplars within categories. Additionally, our categorical neural similarity measure contrasted within-category vs. between-category comparisons — making it sensitive to shared contextual or semantic structure, not simply shared screen positions.

      Considering this, we believe that the neural similarity observed in the mPFC and vlPFC in children at long delay reflects the emergence of scene-level, gist-like representations, rather than low-level spatial regularities. Nevertheless, we now clarify this point in the manuscript and explicitly discuss the limitation that reinstatement of scene context was encouraged but not required for successful task performance.

      Future studies could dissociate spatial and contextual components of reinstatement more directly by using controlled spatial overlap or explicit location recall conditions. However, given the current task structure, location-based generalization is unlikely to account for the category-level similarity patterns we observe.

      (2) Inspired by another reviewer's comment, it is unclear to me the extent to which age group differences can be attributed to differences in age/development versus memory strength. I liked the other reviewer's suggestions about how to identify and control for differences in memory strength, which I don't think the authors actually did in the revision. They instead showed evidence that memory strength does seem to be lower in children, which indicates this is an interpretive confound. For example, I liked the reviewer's suggestion of performing analyses on subsets of participants who were actually matched in initial learning/memory performance would have been very informative. As it is, the authors didn't really control for memory strength adequately in my opinion, and as such their conclusions about children vs. adults could have been reframed as people with weak vs. strong memories. This is obviously a big drawback given what the authors want to conclude. Relatedly, I'm not sure the DDM was incorporated as the reviewer was suggesting; at minimum I think the authors need to do more work in the paper to explain what this means and why it is relevant. (I understand putting it in the supplement rather than the main paper, but I still wanted to know more about what it added from an interpretive perspective.) 

      We appreciate the reviewer’s thoughtful concerns regarding potential confounding effects of memory strength on the observed age group differences. This is indeed a critical issue when interpreting developmental findings.

      While we agree that memory strength differs between children and adults — and our own DDM-based analysis confirms this, mirroring differences observed in accuracy — we would like to emphasize that these differences are not incidental but rather reflect developmental changes in the underlying memory system. Given the known maturation of both structural and functional memory-related brain regions, particularly the hippocampus and prefrontal cortex, we believe it would be theoretically inappropriate to control for memory strength entirely, as doing so would remove variance that is central to the age-related neural effects we aim to understand.

      To address the reviewer's concern empirically, we conducted an additional control analysis in which we subsampled children to include only those who reached the learning criterion after two cycles (N = 28 out of 49 children; see Tables S1.1, S1.2, Figure S1, Table S9.1), thereby selecting a high-performing subgroup. Importantly, the behavioral and neural results in this subsample replicated those of the full group. This further suggests that the observed age group differences are not merely driven by differences in memory strength.

      As mentioned above, the results of the DDM support our behavioral findings, showing that children have lower drift rates for evidence accumulation, consistent with weaker or less accessible memory representations. While these results are reported in the Supplementary Materials (section S2.1, Figure S2, Table S2), we agree that their interpretive relevance should be more clearly explained in the main text. We have therefore updated the Discussion section to explicitly state how the DDM results provide converging evidence for our interpretation that developmental differences in memory quality, not merely strategy or task performance, underlie the observed neural differences (see lines 904-926).

      In sum, we view memory strength not as a confound to be removed, but as a meaningful and theoretically relevant factor in understanding the emergence of gist-like representations in children. We have clarified this interpretive stance in the revised manuscript and now discuss the role of memory strength more explicitly in the Discussion.

      (3) Some of the univariate results reporting is a bit strange, as they are relying upon differences between retrieval of 1- vs. 14-day memories in terms of the recent vs. remote difference, and yet don't report whether the regions are differently active for recent and remote retrieval. For example in Figure 3A, neither anterior nor posterior hippocampus seem to be differentially active for recent vs. remote memories for either age group (i.e., all data is around 0). Precuneus also interestingly seems to show numerically recent>remote (values mostly negative), whereas most other regions show the opposite. This difference from zero (in either direction) or lack thereof seems important to the message. In response to this comment on the original manuscript, the authors seem to have confirmed that hippocampal activity was greater during retrieval than implicit baseline. But this was not really my question - I was asking whether hippocampus is (and other ROIs in this same figure are) differently engaged for recent vs. remote memories.

      We thank the reviewer for bringing up this important point. Our previous analysis showed that both anterior and posterior regions of the hippocampus, the anterior parahippocampal gyrus, and the precuneus exhibited activation significantly above zero in children and adults for correctly remembered items (see Fig. S2, Table S7 in Supplementary Materials). Based on your suggestion, our additional analysis showed: 

      (i) The linear mixed-effects model for correctly remembered items showed no significant interaction effects (group x session x memory age (recent, remote)) for the anterior hippocampus (all p > .146; see Table S7.1).

      (ii) For the posterior hippocampus, we observed a significant main effect of group (F(1,85) = 5.62, p = .038), showing significantly lower activation in children compared to adults (b = .03, t = -2.34, p = .021). No other main or interaction effects were significant (all p > .08; see Table S7.1).

      (iii) For the anterior PHG, which also showed no significant remote > recent difference, the model confirmed that there was no difference between remote and recent items across age groups and delays (all p > .194; Table S7.1). 

      Moreover, when comparing recent and remote hippocampal activation directly, there were no significant differences in either group (all FDR-adjusted p > .116; Table S7.2), supporting the conclusion that hippocampal involvement was stable across delays for successfully retrieved items. 

      In contrast, analysis of unsuccessfully remembered items showed that hippocampal activation was not significantly different from zero in either group (all FDR-adjusted p > .052; Fig. S2.1, Table S7.1), indicating that hippocampal engagement was specific to successful memory retrieval.
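
      For readers unfamiliar with FDR-adjusted p-values, a minimal sketch of one common adjustment, the Benjamini-Hochberg procedure, is given below. This response does not state which FDR procedure was applied, so the choice of Benjamini-Hochberg is an assumption, and the function name and example p-values are hypothetical rather than taken from the reported analyses.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
```

      Each raw p-value is scaled by the number of tests divided by its rank, and the backward pass ensures that a test with a smaller raw p-value never receives a larger adjusted value than a test with a larger one.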

      To formally test whether hippocampal activation differs between remembered and forgotten items, we ran a linear mixed-effects model with Group, Memory Success (remembered vs. forgotten), and ROI (anterior vs. posterior hippocampus) as fixed effects. This model revealed a robust main effect of memory success (F(1,1198) = 128.27, p < .001), showing that hippocampal activity was significantly higher for remembered compared to forgotten items (b = .06, t(1207) = 11.29, p < .001; Table S7.3). 

      As the reviewer noted, precuneus activation was numerically higher for recent vs. remote items, and this was confirmed in our analysis. While both recent and remote retrieval elicited significantly above-zero activation in the precuneus (Table S7.2), activation for recent items was significantly higher than for remote items, consistent across both age groups.

      Taken together, these analyses support the conclusion that hippocampal involvement in successful retrieval is sustained across delays, while other ROIs such as the precuneus may show greater engagement for more recent memories. We have now updated the manuscript text (lines 370-390) and supplementary materials to reflect these findings more clearly, as well as to clarify the distinction between activation relative to baseline and memory-age-related modulation.

      (4) Related to point 3, the claims about hippocampus with respect to multiple trace theory feel very unsupported by the data. I believe the authors want to conclude that children's memory retrieval shows reliance on hippocampus irrespective of delay, presumably because this is a detailed memory task. However the authors have not really shown this; all they have shown is that hippocampal involvement (whatever it is) does not vary by delay. But we do not have compelling evidence that the hippocampus is involved in this task at all. That hippocampus is more active during retrieval than implicit baseline is a very low bar and does not necessarily indicate a role in memory retrieval. If the authors want to make this claim, more data are needed (e.g., showing that hippocampal activity during retrieval is higher when the upcoming memory retrieval is successful vs. unsuccessful). In the absence of this, I think all the claims about multiple trace theory supporting retrieval similarly across delays and that this is operational in children are inappropriate and should be removed. 

      We thank the reviewer for pointing this out. We agree that additional analysis of hippocampal activity during successful and unsuccessful memory retrieval is warranted, as it provides stronger support for our claim that strong, detailed memories during retrieval rely on the hippocampus in both children and adults. Our previously presented results on the remote > recent univariate signal difference in the hippocampus (p. 14-18; lines 433-376, Fig. 3A) show that this difference does not vary between children and adults, or between Day 1 and Day 14. Our further analysis showed that both anterior and posterior regions of the hippocampus exhibited activation significantly above zero in children and adults for correctly remembered items (see Fig. S2, Table S7 in Supplementary Materials). Based on your suggestion, our recent additional analysis showed:

      (i) For forgotten items, we did not observe any activation significantly higher than zero in either the anterior or posterior hippocampus for recent and remote memory on Day 1 and Day 14 in either age group (all p > .052 FDR corrected; see Table S7.1, Fig. S2.1).

      (ii) After establishing no difference between recent and remote activation across and between sessions (Day 1, Day 14), we conducted another linear mixed-effects model with group x memory success (remembered, forgotten) x region (anterior hippocampus, posterior hippocampus), with subject as a random effect. The model showed no significant effects for the memory success x region interaction (F(1,1198) = 1.12, p = .289) and no significant group x memory success x region interaction (F(1,1198) = .017, p = .895). However, we observed a significant main effect of memory success (F(1,1198) = 128.27, p < .001), indicating significantly higher hippocampal activation for remembered compared to forgotten items (b = .06, t = 11.29, p < .001; see Table S7.3).

      (iii) Considering the comparatively low number of incorrect trials for recent items in the adult group, we reran this analysis only for remote items. Similarly, the model showed no significant effects for the memory success x region interaction (F(1,555) = .72, p = .398) and no significant group x memory success x region interaction (F(1,555) = .14, p = .705). However, we observed a significant main effect of memory success (F(1,555) = 68.03, p < .001), indicating significantly higher hippocampal activation for remote remembered compared to forgotten items (b = .07, t = 8.20, p < .001; see Table S7.3).

      Taken together, our results indicate that significant hippocampal activation was observed only for correctly remembered items in both children and adults, regardless of memory age and session. For forgotten items, we did not observe any significant hippocampal activation in either group or delay. Moreover, hippocampal activation was significantly higher for remembered compared to forgotten memories. This evidence supports our conclusions regarding the Multiple Trace and Trace Transformation Theories, suggesting that the hippocampus supports retrieval similarly across delays, and provides novel evidence that this process is operational in both children and adults. This aligns also with Contextual Bindings Theory, as well as empirical evidence by Sekeres, Winokur, & Moscovitch (2018), among others. We have added this information to the manuscript.

      (5) There are still not enough methodological details in the main paper to make sense of the results. Some of these problems were addressed in the revision but others remain. For example, a couple of things that were unclear: that initially learned locations were split, where half were tested again at day 1 and the other half at day 14; what specific criterion was used to determine to pick the 'well-learned' associations that were used for comparisons at different delay periods (object-scene pairs that participants remembered accurately in the last repetition of learning? Or across all of learning?). 

      We thank the reviewer for pointing this out. The object-scene associations initially learned on Day 0 were split into two halves based on their categories before testing. Specifically, half of the pairs from the first set and half of the pairs from the second set of 30 object-scene associations were used to create the set of 30 remote pairs for Day 1 testing. A similar procedure was repeated for the remaining pairs to create a set of remote object-scene associations for Day 14 retrieval. We tried to distribute the categories of pairs equally between the testing sets. We added this information to the methods section of the manuscript (see p. 47, lines 1237-1243). In addition, the sets of associations for the delayed tests on Day 1 and Day 14 were not selected based on learning accuracy. Of note, an analysis of variance revealed no difference in learning accuracy between the two sets created for the delayed tests in either age group (children: p = .23; adults: p = .06). These results indicate that the sets comprised items learned with comparable accuracy in both age groups. 
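
      The splitting procedure described above can be sketched roughly as follows. This is an illustrative sketch only: the number of categories, the number of pairs per category, the random shuffling, and all names (`split_by_category`, `learning_sets`, the `id` and `category` keys) are assumptions made for the example and do not describe the actual assignment procedure.

```python
import random
from collections import Counter, defaultdict


def split_by_category(pairs, seed=0):
    """Split one learning set into two halves with matched category counts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)
    first_half, second_half = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        half = len(items) // 2
        first_half.extend(items[:half])
        second_half.extend(items[half:])
    return first_half, second_half


# Two hypothetical learning sets of 30 pairs each (3 categories x 10 pairs).
categories = ["farming", "forest", "indoor"]
learning_sets = {
    name: [{"id": f"{name}-{c}-{k}", "category": c} for c in categories for k in range(10)]
    for name in ("set1", "set2")
}

day1, day14 = [], []
for pairs in learning_sets.values():
    half_a, half_b = split_by_category(pairs)
    day1 += half_a    # half of each learning set is tested on Day 1
    day14 += half_b   # the remaining half is tested on Day 14

day1_counts = Counter(p["category"] for p in day1)
day14_counts = Counter(p["category"] for p in day14)
```

      With an even number of pairs per category, as in this toy example, both test sets contain 30 pairs with identical category counts, and no pair appears in both sets.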

      (6) I still find the revised Introduction a bit unclear. I appreciated the added descriptions of different theories of consolidation, though the order of presented points is still a bit hard to follow. Some of the predictions I also find a bit confusing as laid out in the introduction. (1) As noted in the paper multiple trace theory predicts that hippocampal involvement will remain high provided memories retained are sufficiently high detail. The authors however also predict that children will rely more on gist (than detailed) memories than adults, which would seem to imply (combined with the MTT idea) that they should show reduced hippocampal involvement over time (while in adults, it should remain high). However, the authors' actual prediction is that hippocampus will show stable involvement over time in both kids and adults. I'm having a hard time reconciling these points. (2) With respect to the extraction of gist in children, I was confused by the link to Fuzzy Trace Theory given the children in the present study are a bit young to be showing the kind of gist extraction shown in the Brainerd & Reyna data. Would 5-7 year olds not be more likely to show reliance on verbatim traces under that framework? Also from a phrasing perspective, I was confused about whether gist-like information was something different from just gist in this sentence: "children may be more inclined to extract gist information at the expense of detailed or gist-like information." (p. 8) - is this a typo?

      We thank the reviewer for this thoughtful observation. 

      Our hypothesis of stable hippocampal engagement over time was primarily based on Contextual Binding Theory (Yonelinas et al., 2019) and MTT, supported by the evidence provided by Sekeres et al. (2018). These accounts posit that the hippocampus continues to support retrieval when contextual information is preserved, even for older, consolidated memories. Given that our object-location associations were repeatedly encoded and tied to specific scene contexts, we believe that retrieval success for both recent and remote memories likely involved contextual reinstatement, leading to sustained hippocampal activity. Also in accordance with MTT and the related TTT, different memory representations may coexist, including detailed and gist-like memories. Therefore, we suggest that children may not rely on highly detailed item-specific memory, but rather on sufficiently contextualized schematic traces, which still engage the hippocampus. This distinction is now made clearer in the Introduction (see lines 223-236).

      We appreciate the reviewer’s point regarding Fuzzy Trace Theory (Brainerd & Reyna, 2002). Indeed, in classic FTT, young children are thought to rely more on verbatim traces due to immature gist extraction mechanisms (primarily from verbal material). However, we use the term “gist-like representations” to refer to schematic or category-level retrieval that emerges through structured, repeated learning (as in our task). This form of abstraction may not require full semantic gist extraction in the FTT sense but may instead reflect consolidation-driven convergence onto shared category-level representations, especially when strategic resources are limited. We now clarify this distinction and have revised the ambiguous sentence containing the typo (“at the expense of detailed or gist-like information”) to better reflect our intended meaning (see p. 8).

      (7) For the PLSC, if I understand this correctly, the profiles were defined for showing associations with behaviour across age groups. (1) As such, is it not "double dipping" to then show that there is an association between brain profile and behaviour-must this not be true by definition? If I am mistaken, it might be helpful to clarify this in the paper. (2) In addition, I believe for the univariate and scene-specific reinstatement analyses these profiles were defined across both age groups. I assume this doesn't allow for separate definition of profiles across the two group (i.e., a kind of "interaction"). If this is the case, it makes sense that there would not be big age differences... the profiles were defined for showing an association across all subjects. If the authors wanted to identify distinct profiles in children and adults they may need to run another analysis. 

      We thank the reviewer for this thoughtful comment. 

      (1) We agree that showing the correlation between the latent variable and behavior may be redundant, as the relationship is already embedded in the PLSC solution and quantified by the explained variance. Our intention was merely to visualize the strength of this relationship. In hindsight, we agree that this could be misinterpreted, and we have removed the additional correlation figure from the manuscript.

      We also see the reviewer’s point that, given the shared latent profile across groups, it is expected that the strength of the brain-behavior relationship does not differ between age groups. Instead, to investigate group differences more appropriately, we examined whether children and adults differed in their expression of the shared latent variable (i.e., brain scores). This analysis revealed that children showed significantly lower brain scores than adults at both the short delay, t(83) = -4.227, p = .0001, and the long delay, t(74) = -5.653, p < .001, suggesting that while the brain-behavior profile is shared, its expression varies by group. We have added this clarification to the Results section (pp. 19-20) of the revised manuscript.

      (2) Regarding the second point, we agree with the reviewer that defining the PLS profiles across both age groups inherently limits the ability to detect group-specific associations, as the resulting latent variables represent a shared pattern across the full sample. To address this, we conducted additional PLS analyses separately within each age group to examine whether distinct neural upregulation profiles (remote > recent) emerge for the short and long delay conditions.

      These within-group analyses, however, were based on smaller subsamples, which reduced statistical power, especially when using bootstrapping to assess the stability of the profiles. For the short delay, although some regions reached significance, the overall latent variables did not reach conventional thresholds for stability (all p > .069), indicating that the profiles were not robust. This suggests that within-group PLS analyses may be underpowered to detect subtle effects, particularly when modelling neural upregulation (remote > recent), which may be inherently small.

      Nonetheless, when we applied PLSC in an exploratory manner separately within each group, relating recent and remote activity levels against the implicit baseline (rather than the contrast remote > recent) to memory performance, we observed significant and stable latent variables in both children and adults. This implies that such contrasts (vs. baseline) may be more sensitive and better suited to detecting meaningful brain–behavior relationships within age groups. We have added this clarification to the Results section of the manuscript to highlight the limitations of within-group contrasts for neural upregulation.

      Author response image 1.

      (3) Also, as for differences between short delay brain profile and long delay brain profile for the scene-specific reinstatement - there are 2 regions that become significant at long delay that were not significant at a short delay (PC, and CE). However, given there are ceiling effects in behaviour at the short but not long delay, it's unclear if this is a meaningful difference or just a difference in sensitivity. Is there a way to test whether the profiles are statistically different from one another?

      We thank the reviewer for this comment. To better illustrate differential profiles, including under conditions of high memory accuracy, we added the immediate condition (30-minute delay) as a third reference point, given the availability of scene-specific reinstatement data at this time point. Interestingly, the immediate reinstatement profile revealed a different set of significant regions, with distinct expression patterns compared to both the short and long delay conditions. This supports the view that scene-specific reinstatement is not static but dynamically reorganized over time.

      Regarding the ceiling effect at short delay, we acknowledge this as a potential limitation. However, we note that our primary analyses were conducted across both age groups combined, and not solely within high-performing individuals. As such, the grouping may mitigate concerns that ceiling-level performance in a subset of participants unduly influenced the overall reinstatement profile. Moreover, we observed variation in neural reinstatement despite ceiling-level behavior, suggesting that the neural signal retains sensitivity to consolidation-related processes even when behavioral accuracy is near-perfect.

      While we agree that formal statistical comparisons of reinstatement profiles across delays (e.g., using representational profile similarity or interaction tests) could be an informative direction, we feel that this goes beyond the scope of the current manuscript. 

      (4) As I mentioned above, it also was not ideal in my opinion that all regions were included for the scene-specific reinstatement due to the authors' inability to have an appropriate baseline and therefore define above-chance reinstatement. It makes these findings really challenging to compare with the gist reinstatement ones. 

      We appreciate the reviewer’s comment and agree that the lack of a clearly defined baseline for scene-specific reinstatement limits our ability to determine whether these values reflect above-chance reinstatement. However, we would like to clarify that we do not directly compare the magnitude of scene-specific reinstatement to that of gist-like reinstatement in our analyses or interpretations. These two analyses serve complementary purposes: the scene-specific analysis captures trial-unique similarity (within-item reinstatement), while the gist-like analysis captures category-level representational structure (across items). Because they differ not only in baseline assumptions but also in analytical scope and theoretical interpretation, our goal was not to compare them directly, but rather to explore distinct but co-existing representational formats that may evolve differently across development and delay.

      (8) I would encourage the authors to be specific about whether they are measuring/talking about memory representations versus reinstatement, unless they think these are the same thing (in which case some explanation as to why would be helpful). For example, especially under the Fuzzy Trace framework, couldn't someone maintain both verbatim and gist traces of a memory yet rely more on one when making a memory decision? 

      We thank the reviewer for pointing out the importance of conceptual clarity when referring to memory representations versus reinstatement. We agree that these are distinct but related concepts: in our framework, memory representations refer to the neural content stored as a result of encoding and consolidation, whereas reinstatement refers to the reactivation of those representations during retrieval. Thus, reinstatement serves as a proxy for the underlying memory representation — it is how we measure or infer the nature (e.g., specificity, abstraction) of the stored content.

      Under Fuzzy Trace Theory, it is indeed possible for both verbatim and gist representations to coexist. Our interpretation is not that children lack verbatim traces, but rather that they are more likely to rely on schematic or gist-like representations during retrieval, especially after a delay. Our use of neural pattern similarity (reinstatement) reflects which type of representation is being accessed, not necessarily which traces exist in parallel.

      To avoid ambiguity, we have revised the manuscript to more explicitly distinguish between reinstatement (neural reactivation) and the representational format (verbatim vs. gist-like), especially in the framing of our hypotheses and interpretation of age group differences.

      (9) With respect to the learning criteria - it is misleading to say that "children needed between two to four learning-retrieval cycles to reach the criterion of 83% correct responses" (p. 9). Four was the maximum, and looking at the Figure 1C data it appears as though there were at least a few children who did not meet the 83% minimum. I believe they were included in the analysis anyway? Please clarify. Was there any minimum imposed for inclusion?

      We thank the reviewer for pointing this out. As stated in the Methods section (p. 50, lines 1326-1338): “These cycles ranged from a minimum of two to a maximum of four.<…> The cycles ended when participants provided correct responses to 83% of the trials or after the fourth cycle was reached.” We have corrected the corresponding wording in the Results section (lines 286-289) to reflect this more accurately. Indeed, five children did not reach the 83% criterion but achieved final performance between 70 and 80% after the fourth learning cycle. These participants were included in the analysis for two main reasons:

      (1) The 83% threshold was established during piloting as a guideline for how many learning-retrieval cycles to allow, not a strict learning criterion. It served to standardize task continuation, rather than to exclude participants post hoc.

      (2) The performance of these five children was still well above chance level (33%), indicating meaningful learning. Excluding them would have biased the sample toward higher-performing children and reduced the ecological validity of our findings. Including them ensures a more representative view of children’s performance under extended learning conditions.

      (10) For the gist-like reinstatement PLSC analysis, results are really similar at short and long delays and yet some of the text seems to imply specificity to the long delay. One is a trend and one is significant (p. 31), but surely these two associations would not be statistically different from one another?

      We agree with the reviewer that the associations at short and long delays appeared similar. While a formal comparison (e.g., using a Z-test for dependent correlations) would typically be warranted, in the reanalyzed dataset only the long delay profile remains statistically significant, which limits the interpretability of such a comparison. 

      (11) As a general comment, I had a hard time tying all of the (many) results together. For example adults show more mature neocortical consolidation-related engagement, which the authors say is going to create more durable detailed memories, but under multiple trace theory we would generally think of neocortical representations as providing more schematic information. If the authors could try to make more connections across the different neural analyses, as well as tie the neural findings in more closely with the behaviour & back to the theoretical frameworks, that would be really helpful.  

      We thank the reviewer for this valuable suggestion. We have revised the discussion section to more clearly link the behavioral and neural findings and to interpret them in light of existing consolidation theories for better clarity. 

      Reviewer #2 (Public Review): 

      Schommartz et al. present a manuscript characterizing neural signatures of reinstatement during cued retrieval of middle-aged children compared to adults. The authors utilize a paradigm where participants learn the spatial location of semantically related item-scene memoranda which they retrieve after short or long delays. The paradigm is especially strong as the authors include novel memoranda at each delayed time point to make comparisons across new and old learning. In brief, the authors find that children show more forgetting than adults, and adults show greater engagement of cortical networks after longer delays as well as stronger item-specific reinstatement. Interestingly, children show more category-based reinstatement, however, evidence supports that this marker may be maladaptive for retrieving episodic details. The question is extremely timely both given the boom in neurocognitive research on the neural development of memory, and the dearth of research on consolidation in this age group. Also, the results provide novel insights into why consolidation processes may be disrupted in children. 

      We thank the reviewer for the positive evaluation.

      Comments on the revised version: 

      I carefully reviewed not only the responses to my own reviews as well as those raised by the other reviewers. While they addressed some of the concerns raised in the process, I think many substantive concerns remain. 

      Regarding Reviewer 1: 

      The authors point that the retrieval procedure is the same over time and similarly influenced by temporal autocorrelations, which makes their analysis okay. However, there is a fundamental problem as to whether they are actually measuring reinstatement or they are only measuring differences in temporal autocorrelation (or some non-linear combination of both). The authors further argue that the stimuli are being processed more memory wise rather than perception wise, however, I think there is no evidence for that and that perception-memory processes should be considered on a continuum rather than as discrete processes. Thus, I agree with reviewer 1 that these analyses should be removed. 

      We thank the reviewer for raising this important question. We would like to clarify a few key points regarding temporal autocorrelation and reinstatement.

      During the fixation window, participants were instructed to reinstate the scene and location associated with the cued object from memory. This task was familiar to them, as they had been trained in retrieving locations within scenes. Our analysis aims to compare the neural representations during this retrieval phase with those when participants view the scene, in order to assess how these representations change in similarity over time, as memories become less precise.

      We acknowledge that temporal proximity can lead to temporal autocorrelation. However, evidence suggests that temporal autocorrelation is consistent and stable across conditions (Gautama & Van Hulle, 2004; Woolrich et al., 2004). Shinn & Lagalwar (2021) further demonstrated that temporal autocorrelation is highly reliable at both the subject and regional levels. Given that we analyze regions of interest (ROIs) separately, potential spatial variability in temporal autocorrelation is not a major concern.

      The absence of a difference in item-specific reinstatement for recent items between Day 1 and Day 14 (which were merged for further delay-related comparisons) also suggests that the reinstatement measure was stable for recent items, even when sampled on two different testing days.

      Importantly, we interpret the relative change in the reinstatement index rather than its absolute value.

      In addition, when we conducted the same analysis for incorrectly retrieved memories, we did not observe any delay-related decline in reinstatement (see p. 25, lines 623-627). This suggests that the delay-related changes in reinstatement are specific to correctly retrieved memories. 

      Finally, our control analysis examining reinstatement between object and fixation time points (as suggested by Reviewer 1) revealed no delay-related effects in any ROI (see p.24, lines 605-612), further highlighting the specificity of the observed delay-related change in item reinstatement.

      We emphasize that temporal autocorrelation should be similar across all retrieval delays due to the identical task design and structure. Therefore, any observed decrease in reinstatement with increasing delay likely reflects a genuine change in the reinstatement index, rather than differences in temporal autocorrelation. Since our analysis includes only correctly retrieved items, and there is no perceptual input during the fixation window, this process is inherently memory-based, relying on mnemonic retrieval rather than sensory processing.

      We respectfully disagree with the reviewer's assertion that retrieval during the fixation period cannot be considered more memory-driven than perception-driven. At this time point, participants had no access to actual images of the scene, making it necessary for them to rely on mnemonic retrieval. The object cue likely triggered pattern completion for the learned object-scene association, forming a unique memory if remembered correctly (Horner & Burgess, 2013). This process is inherently mnemonic, as it is based on reconstructing the original neural representation of the scene (Kuhl et al., 2012; Staresina et al., 2013).

      While perception and memory processes can indeed be viewed as a continuum, some cognitive processes are predominantly memory-based, involving reconstruction rather than reproduction of previous experiences (Bartlett, 1932; Ranganath & Ritchey, 2012). In our task, although the retrieved material is based on previously encoded visual information, the process of recalling this information during the fixation period is fundamentally mnemonic, as it does not involve visual input. Our findings indicate that the similarity between memory-based representations and those observed during actual perception decreases over time, suggesting a relative change in the quality of the representations. However, this does not imply that detailed representations disappear; they may still be robust enough to support correct memory recall. Previous studies examining encoding-retrieval similarity have shown similar findings (Pacheco Estefan et al., 2019; Ritchey et al., 2013).

      We do not claim that perception and memory processes are entirely discrete, nor do we suggest that only perception is involved when participants see the scene. Viewing the scene indeed involves recognition processes, updating retrieved representations from the fixation period, and potentially completing missing or unclear information. This integrative process demonstrates the interrelation of perception and memory, especially in complex tasks like the one we employed.

      In conclusion, our task design and analysis support the interpretation that the fixation period is primarily characterized by mnemonic retrieval, facilitated by cue-triggered pattern completion, rather than perceptual processing. We believe this approach aligns with the current understanding of memory retrieval processes as supported by the existing literature.

      The authors seem to have a design that would allow for across run comparisons, however, they did not include these additional analyses. 

      Thank you for pointing this out. We ran an additional cross-run comparison. These results and the subsequent steps are reported in our response to Reviewer 1.

      To address the reviewer’s concern, we conducted an additional cross-run analysis for all correctly retrieved trials. The approach restricted comparisons to non-overlapping runs (run1-run2, run2-run3, run1-run3). This analysis revealed robust gist-like reinstatement in children for remote Day 14 memories in the mPFC (p = .035) and vlPFC (p = .0007), in adults for remote Day 1 memories in the vlPFC (p = .029), and in both children and adults for remote Day 1 memories in the LOC (p < .02). A significant Session effect in both regions (mPFC: p = .026; vlPFC: p = .002) indicated increased reinstatement for the long delay (Day 14) compared to the short-delay and recent sessions (all p < .05). Given that the cross-run results largely replicate and reinforce the effects found previously with the within-run approach, we believe that combining both sources of information is methodologically justified and statistically beneficial. Specifically, both approaches independently identified significant gist-like reinstatement in children’s mPFC and vlPFC, particularly for remote memories, although the within-run vlPFC effect (short delay: p = .038; long delay: p = .047) did not survive correction for multiple comparisons. Including both within-run and between-run comparisons increases the number of unique, non-repeated trial pairs, improving statistical power without introducing redundancy. While we acknowledge that same-run comparisons may be influenced by residual autocorrelation (Prince et al., 2022), we believe that our design mitigates this risk through consistency between within-run and cross-run results, long inter-trial intervals, and trial-wise estimation of activation. We have adjusted the manuscript accordingly, reporting the combined analysis. We also report the cross-run and within-run analyses separately in the supplementary materials (Tables S12.1, S12.2), showing that the within-run results converge with the cross-run results and thus strengthen rather than dilute the findings.
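
To make the run restriction concrete, the Python sketch below shows one way of selecting trial pairs by run membership before averaging pattern similarity. It illustrates only the pair-selection step, not the full gist-like reinstatement index; the function names `select_pairs` and `mean_similarity`, the trial layout, and the example patterns are hypothetical and use made-up values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length activation patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_pairs(trials, mode="cross"):
    """Return index pairs of trials.

    mode="cross":  only pairs from different runs (e.g., run1-run2)
    mode="within": only pairs from the same run
    mode="all":    every unique pair (combined analysis)
    """
    pairs = []
    for i, j in combinations(range(len(trials)), 2):
        same_run = trials[i]["run"] == trials[j]["run"]
        if mode == "cross" and same_run:
            continue
        if mode == "within" and not same_run:
            continue
        pairs.append((i, j))
    return pairs


def mean_similarity(trials, mode="cross"):
    """Average pattern similarity over the selected trial pairs."""
    pairs = select_pairs(trials, mode)
    sims = [pearson(trials[i]["pattern"], trials[j]["pattern"])
            for i, j in pairs]
    return sum(sims) / len(sims)


# Made-up example: three runs with two correctly retrieved trials each.
trials = [
    {"run": 1, "pattern": [0.2, 0.5, 0.1, 0.9]},
    {"run": 1, "pattern": [0.3, 0.4, 0.2, 0.8]},
    {"run": 2, "pattern": [0.1, 0.6, 0.2, 0.7]},
    {"run": 2, "pattern": [0.4, 0.5, 0.0, 0.9]},
    {"run": 3, "pattern": [0.2, 0.3, 0.1, 0.8]},
    {"run": 3, "pattern": [0.3, 0.6, 0.2, 0.9]},
]

cross_run_similarity = mean_similarity(trials, mode="cross")
combined_similarity = mean_similarity(trials, mode="all")
```

In this toy layout the cross-run restriction keeps 12 of the 15 unique trial pairs, which shows why adding within-run pairs increases the number of comparisons entering the combined analysis.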

      As suggested, we now explicitly highlight the change over time as the central finding. We observe a clear increase in gist-like reinstatement from recent to remote memories in children, particularly in the mPFC and vlPFC. These effects, based on combined within- and cross-run comparisons, are now clearly stated in the main results and interpreted in the discussion accordingly.

      (1) The authors did not satisfy my concerns about different amounts of re-exposures to stimuli as a function of age, which introduces a serious confound in the interpretation of the neural data. 

      (2) Regarding Reviewer 1's point about different number of trials being entered into analysis, I think a more formal test of sub-sampling the adult trials is warranted. 

      (1) We thank the reviewer for pointing this out. Overall, children needed 2 to 4 learning cycles to improve their performance and reach the learning criteria, compared to 2 learning cycles in adults. To address the different amounts of re-exposure to stimuli between the age groups, we subsampled the child group to only those children who reached the learning criteria after 2 learning cycles. For this purpose, we excluded 21 children from the analysis who needed 3 or 4 learning cycles. This resulted in 39 young adults and 28 children being included in the subsequent analysis. 

      (i) We reran the behavioral analysis with the subsampled dataset (see Supplementary Materials, Table S1.1, Fig. S1, Table S1.2). This analysis replicated the previous findings of less robust memory consolidation in children across all time delays.

      (ii) We reran the univariate analysis (see Supplementary Materials, Table S9.1). This analysis also fully replicated the previous findings. This indicates that including child participants with greater material exposure during learning in the analysis of neural retrieval patterns did not affect the group differences in the univariate neural results.

      These subsampled results demonstrate that the amount of re-exposure to stimuli during encoding does not affect consolidation-related changes in memory retrieval at the behavioral and neural levels in children and adults across all time delays. We have added this information to the manuscript (lines 343-348, 420-425).

      (2) We appreciate Reviewer 1's suggestion to perform a formal test by sub-sampling the adult trials to match the number of trials in the child group. However, we believe that this approach may not be optimal for the following reasons:

      (i) Loss of Statistical Power: Sub-sampling the adult trials would result in a reduced sample size, potentially leading to a significant loss of statistical power and the ability to detect meaningful effects, particularly in a context where the adult group is intended to serve as a robust control or comparison group.

      (ii) Added Variability: Sub-sampling could introduce variability that complicates the interpretation of results, particularly if the trial sub-sampling process does not fully capture the variability inherent in the original adult data.

      (iii) Robustness of Existing Findings: We have already addressed potential concerns about unequal trial numbers by conducting analyses that control for the number of learning cycles, as detailed in our supplementary materials. These analyses have shown that the observed effects are consistent, suggesting that the differences in trial numbers do not critically influence our findings.

      Given these considerations, we hope the reviewer understands our rationale and agrees that the current analysis is robust and appropriate for addressing the research questions.

      I also still fundamentally disagree with the use of global signals when comparing children to adults, and think this could very much skew the results. 

      We thank the reviewer for raising this important issue. To address this concern comprehensively, we have taken the following steps:

      (1) Overview of the literature support for global signal regression (GSR). A growing body of methodological and empirical research supports the inclusion of global signal regression as part of best-practice denoising pipelines, particularly when analyzing pediatric fMRI data. Studies such as Ciric et al. (2017), Parkes et al. (2018), Power et al. (2012, 2014), and Thompson et al. (2016) show that GSR improves motion-related artifact removal. Critically, pediatric-specific studies (Disselhoff et al., 2025; Graff et al., 2022) conclude that pipelines including GSR are most effective for signal recovery and artifact removal in younger children. Graff et al. (2021) demonstrated that, among various pipelines, GSR yielded the best noise reduction in 4–8-year-olds. Additionally, Li et al. (2019) and Qing et al. (2015) emphasized that GSR reduces artifactual variance without distorting the spatial structure of neural signals. Ofoghi et al. (2021) demonstrated that global signal regression helps mitigate non-neuronal noise sources, including respiration, cardiac activity, motion, vasodilation, and scanner-related artifacts. Based on this and other recent findings, we consider GSR particularly beneficial for denoising pediatric fMRI data in our study.

      (2) Empirical comparison of pipelines with and without GSR. We re-ran the entire first-level univariate analysis using a pipeline that excluded global signal regression. The resulting activation maps (see Supplementary Figures S3.2, S4.2, S5.2, S9.2) differed notably from those of the original pipeline. Specifically, group differences in regions such as the mPFC, cerebellum, and posterior PHG no longer reached significance, and the overall pattern of results appeared noisier.

      (3) Evaluation of the pipeline differences. To further evaluate the impact of GSR, we conducted the following analyses:

      (a) Global signal is stable across groups and sessions. A linear mixed-effects model showed no significant main effects or interactions involving group or session on the global signal (F-values < 2.62, p > .11), suggesting that the global signal was not group- or session-dependent in our sample. 

      (b) Noise Reduction Assessment via Contrast Variability. We compared the variability (standard deviation and IQR) of contrast estimates across pipelines. Both SD (b = .070, p < .001) and IQR (b = .087, p < .001) were significantly reduced in the GSR pipeline, especially in children (p < .001) compared to adults (p = .048). This suggests that GSR reduces inter-subject variability in children, likely reflecting improved signal quality.

      (c) Residual Variability After Regressing Global Signal. We regressed out global signal post hoc from both pipelines and compared the residual variance. Residual standard deviation was significantly lower for the GSR pipeline (F = 199, p < .001), with no interaction with session or group, further indicating that GSR stabilizes the signal and attenuates non-neuronal variability.
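
The variability comparison in point (b) can be sketched as follows. This is only a minimal illustration of computing the standard deviation and interquartile range of contrast estimates per pipeline; the function name `variability` and the contrast values are hypothetical, the numbers are made up, and the sketch does not reproduce the mixed-effects modelling reported above.

```python
import statistics


def variability(contrast_estimates):
    """Return (standard deviation, interquartile range) of contrast estimates."""
    sd = statistics.stdev(contrast_estimates)
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = statistics.quantiles(contrast_estimates, n=4)
    return sd, q3 - q1


# Made-up per-participant contrast estimates for one region of interest.
with_gsr = [0.42, 0.51, 0.47, 0.55, 0.49, 0.44, 0.53]
without_gsr = [0.12, 0.71, 0.35, 0.88, 0.49, 0.20, 0.66]

sd_gsr, iqr_gsr = variability(with_gsr)
sd_no_gsr, iqr_no_gsr = variability(without_gsr)

# Lower values indicate less inter-subject variability in that pipeline.
gsr_is_less_variable = sd_gsr < sd_no_gsr and iqr_gsr < iqr_no_gsr
```

With real data, these per-pipeline values would be computed for each group and session and then entered into a statistical model, rather than compared by simple inequality as in this sketch.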

      Conclusion

      In summary, while we understand the reviewer’s concern, we believe the empirical and theoretical support for GSR, especially in pediatric samples, justifies its use in our study. Nonetheless, to ensure full transparency, we provide full results from both pipelines in the Supplementary Materials and have clarified our reasoning in the revised manuscript.

      Reviewer #1 (Recommendations For The Authors): 

      (1) Some figures are still missing descriptions of what everything on the graph means; please clarify in captions. 

      We thank the reviewer for pointing this out. We have made the necessary adjustments to the graph annotations.

      (2) The authors conclude they showed evidence of neural reorganization of memory representations in children (p. 41). But the gist is not greater in children than adults, and also does not differ over time-so, I was confused about what this claim was based on? 

      We thank the reviewer for raising this question. Our results suggest that gist-like reinstatement in the mPFC was significantly higher in children than in adults, and that the children's gist-like reinstatement indices were significantly higher than zero (see p. 27-28). These results support our claim of neural reorganization of memory representations in children. We hope this clarifies the issue.

      References

      Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology. Cambridge University Press.

      Brainerd, C. J., & Reyna, V. F. (2002). Fuzzy-Trace Theory: Dual Processes in Memory, Reasoning, and Cognitive Neuroscience (pp. 41–100). https://doi.org/10.1016/S0065-2407(02)80062-3

      Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature Neuroscience, 20(1), 115–125. https://doi.org/10.1038/nn.4450

      Ciric, R., Wolf, D. H., Power, J. D., Roalf, D. R., Baum, G. L., Ruparel, K., Shinohara, R. T., Elliott, M. A., Eickhoff, S. B., Davatzikos, C., Gur, R. C., Gur, R. E., Bassett, D. S., & Satterthwaite, T. D. (2017). Benchmarking of participant-level confound regression strategies for the control of motion artifact in studies of functional connectivity. NeuroImage, 154, 174–187. https://doi.org/10.1016/j.neuroimage.2017.03.020

      Disselhoff, V., Jakab, A., Latal, B., Schnider, B., Wehrle, F. M., Hagmann, C. F., Held, U., O’Gorman, R. T., Fauchère, J.-C., & Hüppi, P. (2025). Inhibition abilities and functional brain connectivity in school-aged term-born and preterm-born children. Pediatric Research, 97(1), 315–324. https://doi.org/10.1038/s41390-024-03241-0

      Esteban, O., Ciric, R., Finc, K., Blair, R. W., Markiewicz, C. J., Moodie, C. A., Kent, J. D., Goncalves, M., DuPre, E., Gomez, D. E. P., Ye, Z., Salo, T., Valabregue, R., Amlien, I. K., Liem, F., Jacoby, N., Stojić, H., Cieslak, M., Urchs, S., … Gorgolewski, K. J. (2020). Analysis of task-based functional MRI data preprocessed with fMRIPrep. Nature Protocols, 15(7), 2186–2202. https://doi.org/10.1038/s41596-020-0327-3

      Fandakova, Y., Leckey, S., Driver, C. C., Bunge, S. A., & Ghetti, S. (2019). Neural specificity of scene representations is related to memory performance in childhood. NeuroImage, 199, 105–113. https://doi.org/10.1016/j.neuroimage.2019.05.050

      Gautama, T., & Van Hulle, M. M. (2004). Optimal spatial regularisation of autocorrelation estimates in fMRI analysis. NeuroImage, 23(3), 1203–1216. https://doi.org/10.1016/j.neuroimage.2004.07.048

      Graff, K., Tansey, R., Ip, A., Rohr, C., Dimond, D., Dewey, D., & Bray, S. (2022). Benchmarking common preprocessing strategies in early childhood functional connectivity and intersubject correlation fMRI. Developmental Cognitive Neuroscience, 54, 101087. https://doi.org/10.1016/j.dcn.2022.101087

      Horner, A. J., & Burgess, N. (2013). The associative structure of memory for multi-element events. Journal of Experimental Psychology: General, 142(4), 1370–1383. https://doi.org/10.1037/a0033626

      Jones, J. S., the CALM Team, & Astle, D. E. (2021). A transdiagnostic data-driven study of children’s behaviour and the functional connectome. Developmental Cognitive Neuroscience, 52, 101027. https://doi.org/10.1016/j.dcn.2021.101027

      Kuhl, B. A., Bainbridge, W. A., & Chun, M. M. (2012). Neural Reactivation Reveals Mechanisms for Updating Memory. Journal of Neuroscience, 32(10), 3453–3461. https://doi.org/10.1523/JNEUROSCI.5846-11.2012

      Kuhl, B. A., & Chun, M. M. (2014). Successful Remembering Elicits Event-Specific Activity Patterns in Lateral Parietal Cortex. Journal of Neuroscience, 34(23), 8051–8060. https://doi.org/10.1523/JNEUROSCI.4328-13.2014

      Li, J., Kong, R., Liégeois, R., Orban, C., Tan, Y., Sun, N., Holmes, A. J., Sabuncu, M. R., Ge, T., & Yeo, B. T. T. (2019). Global signal regression strengthens association between resting-state functional connectivity and behavior. NeuroImage, 196, 126–141. https://doi.org/10.1016/j.neuroimage.2019.04.016

      Ofoghi, B., Chenaghlou, M., Mooney, M., Dwyer, D. B., & Bruce, L. (2021). Team technical performance characteristics and their association with match outcome in elite netball. International Journal of Performance Analysis in Sport, 21(5), 700–712. https://doi.org/10.1080/24748668.2021.1938424

      Pacheco Estefan, D., Sánchez-Fibla, M., Duff, A., Principe, A., Rocamora, R., Zhang, H., Axmacher, N., & Verschure, P. F. M. J. (2019). Coordinated representational reinstatement in the human hippocampus and lateral temporal cortex during episodic memory retrieval. Nature Communications, 10(1), 2255. https://doi.org/10.1038/s41467-019-09569-0

      Parkes, L., Fulcher, B., Yücel, M., & Fornito, A. (2018). An evaluation of the efficacy, reliability, and sensitivity of motion correction strategies for resting-state functional MRI. NeuroImage, 171, 415–436. https://doi.org/10.1016/j.neuroimage.2017.12.073

      Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L., & Petersen, S. E. (2012). Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage, 59(3), 2142–2154. https://doi.org/10.1016/j.neuroimage.2011.10.018

      Power, J. D., Mitra, A., Laumann, T. O., Snyder, A. Z., Schlaggar, B. L., & Petersen, S. E. (2014). Methods to detect, characterize, and remove motion artifact in resting state fMRI. NeuroImage, 84, 320–341. https://doi.org/10.1016/j.neuroimage.2013.08.048

      Power, S. D., Kushki, A., & Chau, T. (2012). Intersession Consistency of Single-Trial Classification of the Prefrontal Response to Mental Arithmetic and the No-Control State by NIRS. PLoS ONE, 7(7), e37791. https://doi.org/10.1371/journal.pone.0037791

      Prince, J. S., Charest, I., Kurzawski, J. W., Pyles, J. A., Tarr, M. J., & Kay, K. N. (2022). Improving the accuracy of single-trial fMRI response estimates using GLMsingle. eLife, 11. https://doi.org/10.7554/eLife.77599

      Qing, Z., Dong, Z., Li, S., Zang, Y., & Liu, D. (2015). Global signal regression has complex effects on regional homogeneity of resting state fMRI signal. Magnetic Resonance Imaging, 33(10), 1306–1313. https://doi.org/10.1016/j.mri.2015.07.011

      Ranganath, C., & Ritchey, M. (2012). Two cortical systems for memory-guided behaviour. Nature Reviews Neuroscience, 13(10), 713–726. https://doi.org/10.1038/nrn3338

      Ritchey, M., Wing, E. A., LaBar, K. S., & Cabeza, R. (2013). Neural Similarity Between Encoding and Retrieval is Related to Memory Via Hippocampal Interactions. Cerebral Cortex, 23(12), 2818–2828. https://doi.org/10.1093/cercor/bhs258

      Satterthwaite, T. D., Elliott, M. A., Gerraty, R. T., Ruparel, K., Loughead, J., Calkins, M. E., Eickhoff, S. B., Hakonarson, H., Gur, R. C., Gur, R. E., & Wolf, D. H. (2013). An improved framework for confound regression and filtering for control of motion artifact in the preprocessing of resting-state functional connectivity data. NeuroImage, 64, 240–256. https://doi.org/10.1016/j.neuroimage.2012.08.052

      Schommartz, I., Lembcke, P. F., Pupillo, F., Schuetz, H., de Chamorro, N. W., Bauer, M., Kaindl, A. M., Buss, C., & Shing, Y. L. (2023). Distinct multivariate structural brain profiles are related to variations in short- and long-delay memory consolidation across children and young adults. Developmental Cognitive Neuroscience, 59. https://doi.org/10.1016/J.DCN.2022.101192

      Sekeres, M. J., Winocur, G., & Moscovitch, M. (2018). The hippocampus and related neocortical structures in memory transformation. Neuroscience Letters, 680, 39–53. https://doi.org/10.1016/j.neulet.2018.05.006

      Shinn, L. J., & Lagalwar, S. (2021). Treating Neurodegenerative Disease with Antioxidants: Efficacy of the Bioactive Phenol Resveratrol and Mitochondrial-Targeted MitoQ and SkQ. Antioxidants, 10(4), 573. https://doi.org/10.3390/antiox10040573

      Staresina, B. P., Alink, A., Kriegeskorte, N., & Henson, R. N. (2013). Awake reactivation predicts memory in humans. Proceedings of the National Academy of Sciences, 110(52), 21159–21164. https://doi.org/10.1073/pnas.1311989110

      St-Laurent, M., & Buchsbaum, B. R. (2019). How Multiple Retrievals Affect Neural Reactivation in Young and Older Adults. The Journals of Gerontology: Series B, 74(7), 1086–1100. https://doi.org/10.1093/geronb/gbz075

      Thompson, G. J., Riedl, V., Grimmer, T., Drzezga, A., Herman, P., & Hyder, F. (2016). The Whole-Brain “Global” Signal from Resting State fMRI as a Potential Biomarker of Quantitative State Changes in Glucose Metabolism. Brain Connectivity, 6(6), 435–447. https://doi.org/10.1089/brain.2015.0394

      Tompary, A., & Davachi, L. (2017). Consolidation Promotes the Emergence of Representational Overlap in the Hippocampus and Medial Prefrontal Cortex. Neuron, 96(1), 228-241.e5. https://doi.org/10.1016/j.neuron.2017.09.005

      Tompary, A., Zhou, W., & Davachi, L. (2020). Schematic memories develop quickly, but are not expressed unless necessary. PsyArXiv.

      Woolrich, M. W., Behrens, T. E. J., Beckmann, C. F., Jenkinson, M., & Smith, S. M. (2004). Multilevel linear modelling for FMRI group analysis using Bayesian inference. NeuroImage, 21(4), 1732–1747. https://doi.org/10.1016/j.neuroimage.2003.12.023

      Xiao, X., Dong, Q., Gao, J., Men, W., Poldrack, R. A., & Xue, G. (2017). Transformed Neural Pattern Reinstatement during Episodic Memory Retrieval. The Journal of Neuroscience, 37(11), 2986–2998. https://doi.org/10.1523/JNEUROSCI.2324-16.2017

      Ye, Z., Shi, L., Li, A., Chen, C., & Xue, G. (2020). Retrieval practice facilitates memory updating by enhancing and differentiating medial prefrontal cortex representations. eLife, 9, 1–51. https://doi.org/10.7554/ELIFE.57023

      Yonelinas, A. P., Ranganath, C., Ekstrom, A. D., & Wiltgen, B. J. (2019). A contextual binding theory of episodic memory: systems consolidation reconsidered. Nature Reviews Neuroscience, 20(6), 364–375. https://doi.org/10.1038/s41583-019-0150-4

      Zhuang, L., Wang, J., Xiong, B., Bian, C., Hao, L., Bayley, P. J., & Qin, S. (2021). Rapid neural reorganization during retrieval practice predicts subsequent long-term retention and false memory. Nature Human Behaviour, 6(1), 134–145. https://doi.org/10.1038/s41562-021-01188-4

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The Major Histocompatibility Complex (MHC) region is a collection of numerous genes involved in both innate and adaptive immunity. MHC genes are famed for their role in rapid evolution and extensive polymorphism in a variety of vertebrates. This paper presents a summary of gene-level gain and loss of orthologs and paralogs within MHC across the diversity of primates, using publicly available data.

      Strengths:

      This paper provides a strong case that MHC genes are rapidly gained (by paralog duplication) and lost over millions of years of macroevolution. The authors are able to identify MHC loci by homology across species, and from this infer gene duplications and losses using phylogenetic analyses. There is a remarkable amount of genic turnover, summarized in Figure 6 and Figure 7, either of which might be a future textbook figure of immune gene family evolution. The authors draw on state-of-the-art phylogenetic methods, and their inferences are robust insofar as the data might be complete enough to draw such conclusions.

      Weaknesses:

      One concern about the present work is that it relies on public databases to draw inferences about gene loss, which is potentially risky if the publicly available sequence data are incomplete. To say, for example, that a particular MHC gene copy is absent in a taxon (e.g., Class I locus F absent in Guenons according to Figure 1), we need to trust that its absence from the available databases is an accurate reflection of its absence in the genome of the actual organisms. This may be a safe assumption, but it rests on the completeness of genome assembly (and gene annotations?) or people uploading relevant data. This reviewer would have been far more comfortable had the authors engaged in some active spot-checking, doing the lab work to try to confirm absences at least for some loci and some species. Without this, a reader is left to wonder whether gene loss is simply reflecting imperfect databases, which then undercuts confidence in estimates of rates of gene loss.

      Indeed, just because a locus has not been confirmed in a species does not necessarily mean that it is absent. As we explain in the Figure 1 caption, only a few species have had their genomes extensively studied (gray background), and only for these species does the absence of a point in this figure mean that a locus is absent. The white background rows represent species that are not extensively studied, and we point out that the absence of a point does not mean that a locus is absent from the species, but rather that it has not yet been discovered. We have also added a parenthetical to the text to explain this (line 156): “Only species with rows highlighted in gray have had their MHC regions extensively studied (and thus only for these rows is the absence of a gene symbol meaningful).”

      While we agree that spot-checking may be a helpful next step, one of the goals of this manuscript is to collect and synthesize the enormous volume of MHC evolution research in the primates, which will serve as a jumping-off point for other researchers to perform important wet lab work.

      Some context is useful for comparing rates of gene turnover in MHC, to other loci. Changing gene copy numbers, duplications, and loss of duplicates, are common it seems across many loci and many organisms; is MHC exceptional in this regard, or merely behaving like any moderately large gene family? I would very much have liked to see comparable analyses done for other gene families (immune, like TLRs, or non-immune), and quantitative comparisons of evolutionary rates between MHC versus other genes. Does MHC gene composition evolve any faster than a random gene family? At present readers may be tempted to infer this, but evidence is not provided.

      Our companion paper (Fortier and Pritchard, 2025) demonstrates that the MHC is a unique locus in many regards, such as its evidence for deep balancing selection and its excess of disease associations. Thus, we expect that it is evolving faster than any random gene family. It would be interesting to repeat this analysis for other gene families, but that is outside of the scope of this project. Additionally, allele databases for other gene families are not nearly as developed, but as more alleles become available for other polymorphic families, a comparable analysis could become possible.

      We have added a paragraph to the discussion (lines 530-546) to clarify that we do not know for certain whether the MHC gene family is evolving rapidly compared to other gene families.

      While on the topic of making comparisons, the authors make a few statements about relative rates. For instance, lines 447-8 compare gene topology of classical versus non-classical genes; and line 450 states that classical genes experience more turnover. But there are no quantitative values given to these rates to provide numerical comparisons, nor confidence intervals provided (these are needed, given that they are estimates), nor formal statistical comparisons to confirm our confidence that rates differ between types of genes.

      More broadly, the paper uses sophisticated phylogenetic methods, but without taking advantage of macroevolutionary comparative methods that allow model-based estimation of macroevolutionary rates. I found the lack of quantitative measurements of rates of gene gain/loss to be a weakness of the present version of the paper, and something that should be readily remedied. When claiming that MHC Class I genes "turn over rapidly" (line 476) - what does rapidly mean? How rapidly? How does that compare to rates of genetic turnover at other families? Quantitative statements should be supported by quantitative estimates (and their confidence intervals).

      These statements refer to qualitative observations, so we cannot provide numerical values. We simply conclude that certain gene groups evolve faster or slower based on the species and genes present in each clade. It is difficult to provide estimates because of the incomplete sampling of genes that survived to the present day. In addition, the presence or absence of various orthologs in different species still needs to be confirmed, at which point it might be useful to be more quantitative. We have also added a paragraph to the discussion to address this concern and advocate for similar analyses of other gene families in the future when more data is available (lines 530-546).

      The authors refer to 'shared function of the MHC across species' (e.g. line 22); while this is likely true, they are not here presenting any functional data to confirm this, nor can they rule out neofunctionalization or subfunctionalization of gene duplicates. There is evidence in other vertebrates (e.g., cod) of MHC evolving appreciably altered functions, so one may not safely assume the function of a locus is static over long macroevolutionary periods, although that would be a plausible assumption at first glance.

      Indeed, we cannot assume that the function of a locus is static across time, especially for the MHC region. In our research, we read hundreds of papers that each focused on a small number of species or genes and gathered some information about them, sometimes based on functional experiments and sometimes on measures such as dN/dS. These provide some indication of a gene’s broad classification in a species or clade, even if the evidence is preliminary. Where possible, we used this preliminary evidence to give genes descriptors “classical,” “non-classical,” “dual characteristics,” “pseudogene,” “fixed”, or “unfixed.” Sometimes multiple individuals and haplotypes were analyzed, so we could even assign a minimum number of gene copies present in a species. We have aggregated all of these references into Supplementary Table 1 (for Class I/Figure 1) and Supplementary Table 2 (for Class II/Figure 2) along with specific details about which data points in these figures that each reference supports. We realize that many of these classifications are based on a small number of individuals or indirect measures, so they may change in the future as more functional data is generated.

      Reviewer #2 (Public review):

      Summary:

      The authors aim to provide a comprehensive understanding of the evolutionary history of the Major Histocompatibility Complex (MHC) gene family across primate species. Specifically, they sought to:

      (1) Analyze the evolutionary patterns of MHC genes and pseudogenes across the entire primate order, spanning 60 million years of evolution.

      (2) Build gene and allele trees to compare the evolutionary rates of MHC Class I and Class II genes, with a focus on identifying which genes have evolved rapidly and which have remained stable.

      (3) Investigate the role of often-overlooked pseudogenes in reconstructing evolutionary events, especially within the Class I region.

      (4) Highlight how different primate species use varied MHC genes, haplotypes, and genetic variation to mount successful immune responses, despite the shared function of the MHC across species.

      (5) Fill gaps in the current understanding of MHC evolution by taking a broader, multi-species perspective using (a) phylogenomic analytical computing methods such as Beast2, Geneconv, BLAST, and the much larger computing capacities that have been developed and made available to researchers over the past few decades, (b) literature review for gene content and arrangement, and genomic rearrangements via haplotype comparisons.

      (6) The authors' overall conclusions based on their analyses and results are that 'different species employ different genes, haplotypes, and patterns of variation to achieve a successful immune response'.

      Strengths:

      Essentially, much of the information presented in this paper is already well-known in the MHC field of genomic and genetic research, with few new conclusions and with insufficient respect to past studies. Nevertheless, while MHC evolution is a well-studied area, this paper potentially adds some originality through its comprehensive, cross-species evolutionary analysis of primates, focus on pseudogenes and the modern, large-scale methods employed. Its originality lies in its broad evolutionary scope of the primate order among mammals with solid methodological and phylogenetic analyses.

      The main strengths of this study are the use of large publicly available databases for primate MHC sequences, the intensive computing involved, the phylogenetic tool Beast2 to create multigene Bayesian phylogenetic trees using sequences from all genes and species, separated into Class I and Class II groups to provide a backbone of broad relationships to investigate subtrees, and the presentation of various subtrees as species and gene trees in an attempt to elucidate the unique gene duplications within the different species. The study provides some additional insights with summaries of MHC reference genomes and haplotypes in the context of a literature review to identify the gene content and haplotypes known to be present in different primate species. The phylogenetic overlays or ideograms (Figures 6 and 7) in part show the complexity of the evolution and organisation of the primate MHC genes via the orthologous and paralogous gene and species pathways progressively from the poorly-studied NWM, across a few moderately studied ape species, to the better-studied human MHC genes and haplotypes.

      Weaknesses:

      The title 'The Primate Major Histocompatibility Complex: An Illustrative Example of Gene Family Evolution' suggests that the paper will explore how the Major Histocompatibility Complex (MHC) in primates serves as a model for understanding gene family evolution. The term 'Illustrative Example' in the title would be appropriate if the paper aimed to use the primate Major Histocompatibility Complex (MHC) as a clear and representative case to demonstrate broader principles of gene family evolution. That is, the MHC gene family is not just one instance of gene family evolution but serves as a well-studied, insightful example that can highlight key mechanisms and concepts applicable to other gene families. However, this is not the case; this paper only covers specific details of primate MHC evolution without drawing broader lessons for any other gene families. So, the term 'Illustrative Example' is too broad or generalizing. In this case, a term like 'Case Study' or simply 'Example' would be more suitable. Perhaps, 'An Example of Gene Family Diversity' would be more precise. Also, an explanation or 'reminder' is suggested that this study is not about the origins of the MHC genes from the earliest jawed vertebrates per se (~600 mya), but it is an extension within a subspecies set that has emerged relatively late (~60 mya) in the evolutionary divergent pathways of the MHC genes, systems, and various vertebrate species.

      Thank you for your input on the title; we have changed it to “A case study of gene family evolution” instead.

      Thank you also for pointing out the potential confusion about the time span of our study. We have added “Having originated in the jawed vertebrates,” to a sentence in the introduction (lines 38-39). We have also added the sentence “Here, we focus on the primates, spanning approximately 60 million years within the over 500-million-year evolution of the family \citep{Flajnik2010}.” to be more explicit about the context for our work (lines 59-61).

      Phylogenomics. Particular weaknesses in this study are the limitations and problems associated with providing phylogenetic gene and species trees to try and solve the complex issue of the molecular mechanisms involved with imperfect gene duplications, losses, and rearrangements in a complex genomic region such as the MHC that is involved in various effects on the response and regulation of the immune system. A particular deficiency is drawing conclusions based on a single exon of the genes. Different exons present different trees. Which are the more reliable? Why were introns not included in the analyses? The authors attempt to overcome these limitations by including genomic haplotype analysis, duplication models, and the supporting or contradictory information available in previous publications. They succeed in part with this multidiscipline approach, but much is missed because of biased literature selection. The authors should include a paragraph about the benefits and limitations of the software that they have chosen for their analysis, and perhaps suggest some alternative tools that they might have tried comparatively. How were problems with Bayesian phylogeny such as computational intensity, choosing probabilities, choosing particular exons for analysis, assumptions of evolutionary models, rates of evolution, systemic bias, and absence of structural and functional information addressed and controlled for in this study?

      We agree that different exons have different trees, which is exactly why we repeated our analysis for each exon in order to compare and contrast them. In particular, the exons encoding the binding site of the resulting protein (exons 2 and 3 for Class I and exon 2 for Class II) show evidence for trans-species polymorphism and gene conversion. These phenomena lead to trees that do not follow the species tree and are fascinating in and of themselves, which we explore in detail in our companion paper (Fortier and Pritchard, 2025). Meanwhile, the non-peptide-binding extracellular-domain-encoding exon (exon 4 for Class I and exon 3 for Class II) is comparably sized to the binding-site-encoding exons and provides an interesting functional contrast. As this exon is likely less affected by trans-species polymorphism, gene conversion, and convergent evolution, we present results from it most often in the main text, though we occasionally touch on differences between the exons. See lines 191-196, 223-226, and 407-414 for some examples of how we discuss the exons in the text. Additionally, all trees from all of these exons can be found in the supplement. 

      We agree that introns would be valuable to study in this context. Even though the non-binding-site-encoding exons are probably *less* affected by trans-species polymorphism, gene conversion, and convergent evolution, they are still functional. The introns, however, experience much more relaxed selection, if any, and comparing their trees to those for the exons would be valuable and illuminating. We did not generate intron trees for two reasons. Most importantly, there is a dearth of data available for the introns; in the databases we used, there was often intron data available only for human, chimpanzee, and sometimes macaque, and only for a small subset of the genes. This limitation is at odds with the comprehensive, many-gene-many-species approach which we feel is the main novelty of this work. Secondly, the introns that *are* available are difficult to align. Even aligning the exons across such a highly diverged set of genes and pseudogenes was difficult and required manual effort. The introns proved even more difficult to align across genes. In the future, when more intron data is available and sufficient effort is put into aligning them, it will be possible and desirable to do a comparable analysis. We also added a sentence to the “Data” section to briefly explain why we did not include introns (lines 134-135).

      We explain our Bayesian phylogenetics approach in detail in the Methods (lines 650-725), including our assumptions and our solutions to challenges specific to this application. For further explanation of the method itself, we suggest reading the original BEAST and BEAST2 papers (Drummond & Rambaut (2007), Drummond et al. (2012), Bouckaert et al. (2014), and Bouckaert et al. (2019)). Known structural and functional information helped us validate the alignments we used in this study, but the fact that such information is not fully known for every gene and species should not affect the method itself.

      Gene families as haplotypes. In the Introduction, the MHC is referred to as a 'gene family', and in paragraph 2, it is described as being united by the 'MHC fold', despite exhibiting 'very diverse functions'. However, the MHC region is more accurately described as a multigene region containing diverse, haplotype-specific Conserved Polymorphic Sequences, many of which are likely to be regulatory rather than protein-coding. These regulatory elements are essential for controlling the expression of multiple MHC-related products, such as TNF and complement proteins, a relationship demonstrated over 30 years ago. Non-MHC fold loci such as TNF, complement, POU5F1, lncRNA, TRIM genes, LTA, LTB, NFkBIL1, etc, are present across all MHC haplotypes and play significant roles in regulation. Evolutionary selection must act on genotypes, considering both paternal and maternal haplotypes, rather than on individual genes alone. While it is valuable to compile databases for public use, their utility is diminished if they perpetuate outdated theories like the 'birth-and-death model'. The inclusion of prior information or assumptions used in a statistical or computational model, typically in Bayesian analysis, is commendable, but they should be based on genotypic data rather than older models. A more robust approach would consider the imperfect duplication of segments, the history of their conservation, and the functional differences in inheritance patterns. Additionally, the MHC should be examined as a genomic region, with ancestral haplotypes and sequence changes or rearrangements serving as key indicators of human evolution after the 'Out of Africa' migration, and with disease susceptibility providing a measurable outcome. There are more than 7000 different HLA-B and -C alleles at each locus, which suggests that there are many thousands of human HLA haplotypes to study. In this regard, the studies by Dawkins et al (1999 Immunol Rev 167,275), Shiina et al. (2006 Genetics 173,1555) on human MHC gene diversity and disease hitchhiking (haplotypes), and Sznarkowska et al. (2020 Cancers 12,1155) on the complex regulatory networks governing MHC expression, both in terms of immune transcription factor binding sites and regulatory non-coding RNAs, should be examined in greater detail, particularly in the context of MHC gene allelic diversity and locus organization in humans and other primates.

      Thank you for these comments. To clarify that the MHC “region” is different from (and contains) the MHC “gene family” as we describe it, we changed a sentence in the abstract (lines 8-10) from “One large gene family that has experienced rapid evolution is the Major Histocompatibility Complex (MHC), whose proteins serve critical roles in innate and adaptive immunity.” to “One large gene family that has experienced rapid evolution lies within the Major Histocompatibility Complex (MHC), whose proteins serve critical roles in innate and adaptive immunity.” We know that the region is complex and contains many other genes and regulatory sequences; Figure 1 of our companion paper (Fortier and Pritchard, 2025) depicts these in order to show the reader that the MHC genes we focus on are just one part of the entire region.

      We love the suggestion to look at the many thousands of alleles present at each of the classical loci. This is the focus of our complementary paper (Fortier and Pritchard, 2025), which explores variation at the allele level. In the current paper, we look mainly at the differences between genes and the use of different genes in different species.

      Diversifying and/or concerted evolution. Both this and past studies highlight that diversifying or balancing selection is the dominant force in MHC evolution. This is primarily because the extreme polymorphism observed in MHC genes is advantageous for populations in terms of pathogen defence. Diversification increases the range of peptides that can be presented to T cells, enhancing the immune response. The peptide-binding regions of MHC genes are highly variable, and this variability is maintained through selection for immune function, especially in the face of rapidly evolving pathogens. In contrast, concerted evolution, which typically involves the homogenization of gene duplicates through processes like gene conversion or unequal crossing-over, seems to play a minimal role in MHC evolution. Although gene duplication events have occurred in the MHC region leading to the expansion of gene families, the resulting paralogs often undergo divergent evolution rather than being kept similar or homozygous by concerted evolution. Therefore, unlike gene families such as ribosomal RNA genes or histone genes, where concerted evolution leads to highly similar copies, MHC genes display much higher levels of allelic and functional diversification. Each MHC gene copy tends to evolve independently after duplication, acquiring unique polymorphisms that enhance the repertoire of antigen presentation, rather than undergoing homogenization through gene conversion. Also, in some populations with high polymorphism or genetic drift, allele frequencies may become similar over time without the influence of gene conversion. This similarity can be mistaken for gene conversion when it is simply due to neutral evolution or drift, particularly in small populations or bottlenecked species. Moreover, gene conversion might contribute to greater diversity by creating hybrids or mosaics between different MHC genes.
In this regard, can the authors indicate what percentage of the gene numbers in their study have been homogenised by gene conversion compared to those that have been diversified by gene conversion?

      We appreciate the summary, and we feel we have appropriately discussed both gene conversion and diversifying selection in the context of the MHC genes. Because we cannot know for sure when and where gene conversion has occurred, we cannot quantify percentages of genes that have been homogenized or diversified.  

      Duplication models. The phylogenetic overlays or ideograms (Figures 6 and 7) show considerable imperfect multigene duplications, losses, and rearrangements, but the paper's Discussion provides no in-depth consideration of the various multigenic models or mechanisms that can be used to explain the occurrence of such events. How do their duplication models compare to those proposed by others? For example, their text simply says on line 292, 'the proposed series of events is not always consistent with phylogenetic data'. How, why, when? Duplication models for the generation and extension of the human MHC class I genes as duplicons (extended gene or segmental genomic structures) by parsimonious imperfect tandem duplications with deletions and rearrangements in the alpha, beta, and kappa blocks were already formulated in the late 1990s and extended to the rhesus macaque in 2004 based on genomic haplotypic sequences. These studies were based on genomic sequences (genes, pseudogenes, retroelements), dot plot matrix comparisons, and phylogenetic analyses of gene and retroelement sequences using computer programs. It already was noted or proposed in these earlier 1999 studies that (1) the ancestor of HLA-P(90)/-T(16)/W(80) represented an old lineage separate from the other HLA class I genes in the alpha block, (2) HLA-U(21) is a duplicated fragment of HLA-A, (3) HLA-F and HLA-V(75) are among the earliest (progenitor) genes or outgroups within the alpha block, (4) distinct Alu and L1 retroelement sequences adjoining HLA-L(30), and HLA-N genomic segments (duplicons) in the kappa block are closely related to those in the HLA-B and HLA-C in the beta block; suggesting an inverted duplication and transposition of the HLA genes and retroelements between the beta and kappa regions. None of these prior human studies were referenced by Fortier and Pritchard in their paper. How does their human MHC class I gene duplication model (Fig. 6) such as gene duplication numbers and turnovers differ from those previously proposed and described by Kulski et al (1997 JME 45,599), (1999 JME 49,84), (2000 JME 50,510), Dawkins et al (1999 Immunol Rev 167,275), and Gaudieri et al (1999 GR 9,541)? Is this a case of reinventing the wheel?

      Figures 6 and 7 are intended to synthesize and reconcile past findings and our own trees, so they do not strictly adhere to the findings of any particular study and cannot fully match all studies. In the supplement, Figure 6 - figure supplement 1 and Figure 7 - figure supplement 1 duly credit all of the past work that went into making these trees. Most previous papers focus on just one aspect of these trees, such as haplotypes within a species, a specific gene or allelic lineage relationship, or the branching pattern of particular gene groups. We believe it was necessary to bring all of these pieces of evidence together. Even among papers with the same focus (to understand the block duplications that generated the current physical layout of the MHC), results differ. For example, Geraghty (1992), Hughes (1995), Kulski (2004)/Kulski (2005), and Shiina (1999) all disagree on the exact branching order of the genes MHC-W, -P, and -T, and of MHC-G, -J, and -K. While the Kulski studies you pointed out were very thorough for their era, they still only relied on data from three species and one haplotype per species. Our work is not intended to replace or discredit these past works, but simply to build upon them with a larger set of species and sequences. We hope the hypotheses we propose in Figures 6 and 7 can help unify existing research and provide a more easily accessible jumping-off-point for future work.

      Results. The results are presented as new findings, whereas most if not all of the results' significance and importance already have been discussed in various other publications. Therefore, the authors might do better to combine the results and discussion into a single section with appropriate citations to previously published findings presented among their results for comparison. Do the trees and subsets differ from previous publications, albeit that they might have fewer comparative examples and samples than the present preprint? Alternatively, the results and discussion could be combined and presented as a review of the field, which would make more sense and be more honest than the current format of essentially rehashing old data.

      In starting this project, we found that a large barrier to entry to this field of study is the immense amount of published literature over 30+ years. It is both time-consuming and confusing to read up on the many nuances of the MHC genes, their changing names, and their evolution, making it difficult to start new, innovative projects. While we acknowledge that our results are not entirely novel, the main advantage of our work is that it provides a thorough, comprehensive starting point for others to learn about the MHC quickly and dive into new research. We feel that we have appropriately cited past literature in the main text, appendices, and supplement, so that readers may dive into a particular area with ease.

      Minor corrections:

      (1) Abstract, line 19: 'modern methods'. Too general. What modern methods?

      To keep the abstract brief, the methods are introduced in the main text when each becomes relevant as well as in the methods section.

      (2) Abstract, line 25: 'look into [primate] MHC evolution.' The analysis is on the primate MHC genes, not on the entire vertebrate MHC evolution with a gene collection from sharks to humans. The non-primate MHC genes are often differently organised and structurally evolved in comparison to primate MHC.

      Thank you! We have added the word “primate” to the abstract (line 25).

      (3) Introduction, line 113. 'In a companion paper (Fortier and Pritchard, 2024)' This paper appears to be unpublished. If it's unpublished, it should not be referenced.

      This paper is undergoing the eLife editorial process at the same time; it will have a proper citation in the final version.

      (4) Figures 1 and 2. Use the term 'gene symbols' (circle, square, triangle, inverted triangle, diamond) or 'gene markers' instead of 'points'. 'Asterisks "within symbols" indicate new information.'

      Thank you, the word “symbol” is much clearer! We have changed “points” to “symbols” in the captions for Figure 1, Figure 1 - figure supplement 1, Figure 2, and Figure 2 - figure supplement 1. We also changed this in the text (lines 157-158 and 170).

      (5) Figures. A variety of colours have been applied for visualisation. However, some coloured texts are so light in colour that they are difficult to read against a white background. Could darker colours or black be used for all or most texts?

      With such a large number of genes and species to handle in this work, it was nearly impossible to choose a set of colors that were distinct enough from each other. We decided to prioritize consistency (across this paper, its supplement, and our companion paper) as well as at-a-glance grouping of similar sequences. Unfortunately, this means we had to sacrifice readability on a white background, but readers may turn to the supplement if they need to access specific sequence names.

      (6) Results, line 135. '(Fortier and Pritchard, 2024)' This paper appears to be unpublished. If it's unpublished, it should not be referenced.

      Repeat of (3). This paper is undergoing the eLife editorial process at the same time; it will have a proper citation in the final version.

      (7) Results, lines 152 to 153, 164, 165, etc. 'Points with an asterisk'. Use the term 'gene symbols' (circle, square, triangle, inverted triangle, diamond) or 'gene markers' instead of 'points'. A point is a small dot such as those used in data points for plotting graphs .... The figures are so small that the asterisks in the circles, squares, triangles, etc, look like points (dots) and the points/asterisks terminology that is used is very confusing visually.

      Repeat of (4). Thank you, the word “symbol” is much clearer! We have changed “points” to “symbols” in the captions for Figure 1, Figure 1 - figure supplement 1, Figure 2, and Figure 2 - figure supplement 1. We also changed this in the text (lines 157-158 and 170).

      (8) Line 178 (BEA, 2024) is not listed alphabetically in the References.

      Thank you for catching this! This reference maps to the first bibliography entry, “SUMMARIZING POSTERIOR TREES.” We are unsure how to cite a webpage that has no explicit author within the eLife Overleaf template, so we will consult with the editor.

      (9) Lines 188-190. 'NWM MHC-G does not group with ape/OWM MHC-G, instead falling outside of the clade containing ape/OWM MHC-A, -G, -J and -K.' This is not surprising given that MHC-A, -G, -J, and -K are paralogs of each other and that some of them, especially in NWM have diverged over time from the paralogs and/or orthologs and might be closer to one paralog than another and not be an actual ortholog of OWM, apes or humans.

      We included this sentence to clarify the relationships between genes and to help describe what is happening in Figure 6. Figure 6 - figure supplement 1 includes all of the references that go into such a statement and Appendix 3 details our reasoning for this and other statements.

      (10) Line 249. Gene conversion: This is recombination between two different genes where a portion of the genes are exchanged with one another so that different portions of the gene can group within one or other of the two gene clades. Alternatively, the gene has been annotated incorrectly if the gene does not group within either of the two alternative clades. Another possibility is that one or two nucleotide mutations have occurred without a recombination resulting in a mistaken interpretation or conclusion of a recombination event. What measures are taken to avoid false-positive conclusions? How many MHC gene conversion (recombination) events have occurred according to the authors' estimates?

      All of these possibilities are certainly valid. We used the program GENECONV to infer gene conversion events, but there is considerable uncertainty owing to the ages of the genes and the inevitable point mutations that have occurred post-event. Gene conversion was not the focus of our paper, so we did our best to acknowledge it (and the resulting differences between trees from different exons) without spending too much time diving into it. A list of inferred gene conversion events can be found in Figure 3 - source data 1 and Figure 4 - source data 1.

      (11) Lines 284-286. 'The Class I MHC region is further divided into three polymorphic blocks-alpha, beta, and kappa blocks-that each contains MHC genes but are separated by well-conserved non-MHC genes.' The MHC class I region was first designated into conserved polymorphic duplication blocks, alpha and beta by Dawkins et al (1999 Immunol Rev 167,275), and kappa by Kulski et al (2002 Immunol Rev 190,95), and should be acknowledged (cited) accordingly.

      Thank you for catching this! We have added these citations (lines 302-303)!

      (12) Lines 285-286. 'The majority of the Class I genes are located in the alpha-block, which in humans includes 12 MHC genes and pseudogenes.' This is not strictly correct for many other species, because the majority of class I genes might be in the beta block of new and old-world monkeys, and the authors haven't provided respective counts of duplication numbers to show otherwise. The alpha block in some non-primate mammalian species such as pigs, rats, and mice has no MHC class I genes or only a few. Most MHC class I genes in non-primate mammalian species are found in other regions. For example, see Ando et al (2005 Immunogenetics 57,864) for the pig alpha, beta, and kappa regions in the MHC class I region. There are no pig MHC genes in the alpha block.

      Yes, which is exactly why we use the phrase “in humans” in that particular sentence. The arrangement of the MHC in several other primate reference genomes is shown in Figure 1 - figure supplement 2.

      (13) Line 297 to 299. 'The alpha-block also contains a large number of repetitive elements and gene fragments belonging to other gene families, and their specific repeating pattern in humans led to the conclusion that the region was formed by successive block duplications (Shiina et al., 1999).' There are different models for successive block duplications in the alpha block and some are more parsimonious based on imperfect multigenic segmental duplications (Kulski et al 1999, 2000) than others (Shiina et al., 1999). In this regard, Kulski et al (1999, 2000) also used duplicated repetitive elements neighbouring MHC genes to support their phylogenetic analyses and multigenic segmental duplication models. For comparison, can the authors indicate how many duplications and deletions they have in their models for each species?

      We have added citations to this sentence to show that there are different published models to describe the successive block duplications (line 307). Our models in Figure 6 and Figure 7 are meant to aggregate past work and integrate our own, and thus they were not built strictly by parsimony. References can be found in Figure 6 - figure supplement 1 and Figure 7 - figure supplement 1.

      (14) Lines 315-315. 'Ours is the first work to show that MHC-U is actually an MHC-A-related gene fragment.' This sentence should be deleted. Other researchers had already inferred that MHC-U is actually an MHC-A-related gene fragment more than 25 years ago (Kulski et al 1999, 2000) when the MHC-U was originally named MHC-21.

      While these works certainly describe MHC-U/MHC-21 as a fragment in the 𝛼-block, any relation to MHC-A was by association only and very few species/haplotypes were examined. So although the idea is not wholly novel, we provide convincing evidence that not only is MHC-U related to MHC-A by sequence, but also that it is a very recent partial duplicate of MHC-A. We show this with Bayesian phylogenetic trees as well as an analysis of haplotypes across many more species than were included in those papers.  

      (15) Lines 361-362. 'Notably, our work has revealed that MHC-V is an old fragment.' This is not a new finding or hypothesis. Previous phylogenetic analysis and gene duplication modelling had already inferred HLA-V (formerly HLA-75) to be an old fragment (Kulski et al 1999, 2000).

      By “old,” we mean older than previous hypotheses suggest. Previous work has proposed that MHC-V and -P were duplicated together, with MHC-V deriving from an MHC-A/H/V ancestral gene and MHC-P deriving from an MHC-W/T/P ancestral gene (Kulski (2005), Shiina (1999)). However, our analysis (Figure 5A) shows that MHC-V sequences form a monophyletic clade outside of the MHC-W/P/T group of genes as well as outside of the MHC-A/B/C/E/F/G/J/K/L group of genes, which is not consistent with MHC-A and -V being closely related. We therefore conclude that MHC-V split off before these other gene groups differentiated and is older than previously thought. We explain this in the text as well (lines 317-327) and in Appendix 3.

      (16) Line 431-433. 'the Class II genes have been largely stable across the mammals, although we do see some lineage-specific expansions and contractions (Figure 2 and Figure 2-gure Supplement 2).' Please provide one or two references to support this statement. Is 'gure' a typo?

      We corrected this typo, thank you! This conclusion is simply drawn from the data presented in Figure 2 and Figure 2 - figure supplement 2. The data itself comes from a variety of sources, which are already included in the supplement as Figure 2 - source data 1.

      (17) Line 437. 'We discovered far more "specific" events in Class I, while "broad-scale" events were predominant in Class II.' Please define the difference between 'specific' and 'broad-scale'.

      These terms are defined in the previous sentence (lines 466-469).

      (18) Lines 450-451. 'This shows that classical genes experience more turnover and are more often affected by long-term balancing selection or convergent evolution.' Is balancing selection a form of divergent evolution that is different from convergent evolution? Please explain in more detail how and why balancing selection or convergent evolution affects classical and nonclassical genes differently.

      Balancing selection acts to keep alleles at moderate frequencies, preventing any from fixing in the population. In contrast, convergent evolution describes sequences or traits becoming similar over time even though they are not similar by descent. While we cannot know exactly what selective forces have occurred in the past, we observe different patterns in the trees for each type of gene. In Figures 1 and 2, viewers can see at first glance that the nonclassical genes (which are named throughout the text and thoroughly described in Appendix 3) appear to be longer-lived than the classical genes. In addition, lines 204-222 and 475-488 describe topological differences in the BEAST2 trees of these two types of genes. However, we acknowledge that it could be helpful to have additional, complementary information about the classical vs. non-classical genes. Thus, we have added a sentence and reference to our companion paper (Fortier and Pritchard, 2025), which focuses on long-term balancing selection and draws further contrast between classical and non-classical genes. In lines 481-484, we added “We further explore the differences between classical and non-classical genes in our companion paper, finding ancient trans-species polymorphism at the classical genes but not at the non-classical genes \citep{Fortier2025b}.”

      References

      Some references in the supplementary materials such as Alvarez (1997), Daza-Vamenta (2004), Rojo (2005), Aarnink (2014), Kulski (2022), and others are missing from the Reference list. Please check that all the references in the text and the supplementary materials are listed correctly and alphabetically.

      We will make sure that these all show up properly in the proof.

      Reviewer #3 (Public review):

      Summary:

      The article provides the most comprehensive overview of primate MHC class I and class II genes to date, combining published data with an exploration of the available genome assemblies in a coherent phylogenetic framework and formulating new hypotheses about the evolution of the primate MHC genomic region.

      Strengths:

      I think this is a solid piece of work that will be the reference for years to come, at least until population-scale haplotype-resolved whole-genome resequencing of any mammalian species becomes standard. The work is timely because there is an obvious need to move beyond short amplicon-based polymorphism surveys and classical comparative genomic studies. The paper is data-rich and the approach taken by the authors, i.e. an integrative phylogeny of all MHC genes within a given class across species and the inclusion of often ignored pseudogenes, makes a lot of sense. The focus on primates is a good idea because of the wealth of genomic and, in some cases, functional data, and the relatively densely populated phylogenetic tree facilitates the reconstruction of rapid evolutionary events, providing insights into the mechanisms of MHC evolution. Appendices 1-2 may seem unusual at first glance, but I found them helpful in distilling the information that the authors consider essential, thus reducing the need for the reader to wade through a vast amount of literature. Appendix 3 is an extremely valuable companion in navigating the maze of primate MHC genes and associated terminology.

      Weaknesses:

      I have not identified major weaknesses and my comments are mostly requests for clarification and justification of some methodological choices.

      Thank you so much for your kind and supportive review!

      Reviewer #1 (Recommendations for the authors):

      (1) Line 151: How is 'extensively studied' defined?

      Extensively studied is not a strict definition, but a few organisms clearly stand apart from the rest in terms of how thoroughly their MHC regions have been studied. For example, the macaque is a model organism, and individuals from many different species and populations have had their MHC regions fully sequenced. This is in contrast to the gibbon, for example, in which there is some experimental evidence for the presence of certain genes, but no MHC region has been fully sequenced from these animals.

      (2) Can you clarify how 'classical' and 'non-classical' MHC genes are being determined in your analysis?

      Classical genes are those whose protein products perform antigen presentation to T cells and are directly involved in adaptive immunity, while non-classical genes are those whose protein products do not do this. For example, these non-classical genes might code for proteins that interact with receptors on Natural Killer cells and influence innate immunity. The roles of these proteins are not necessarily conserved between closely related species, and experimental evidence is needed to evaluate this. However, in the absence of such evidence, wherever possible we have provided our best guess as to the roles of the orthologous genes in other species, presented in Figure 1 - source data 1 and Figure 2 - source data 1. This is based on whatever evidence is available at the moment, sometimes experimental but typically based on dN/dS ratios and other indirect measures.

      (3) I find the overall tone of the paper to be very descriptive, and at times meandering and repetitive, with a lot of similar kinds of statements being repeated about gene gain/loss. This is perhaps inevitable because a single question is being asked of each of many subsets of MHC gene types, and even exons within gene types, so there is a lot of repetition in content with a slightly different focus each time. This does not help the reader stay focused or keep track. I found myself wishing for a clearly defined question or hypothesis, or some rate parameter in need of estimation. I would encourage the authors to tighten up their phrasing, or consider streamlining the results with some better signposting to organize ideas within the results.

      We totally understand your critique, as we talk about a wide range of specific genes and gene groups in this paper. To improve readability, we have added many more signposting phrases and sentences:

      “Aside from MHC-DRB, …” (line 173)

      “Now that we had a better picture of the landscape of MHC genes present in different primates, we wanted to understand the genes’ relationships. Treating Class I, Class IIA, and Class IIB separately, ...” (lines 179-180)

      “We focus first on the Class I genes.” (line 191)

      “... for visualization purposes…” (line 195)

      “We find that sequences do not always assort by locus, as would be expected for a typical gene.” (lines 196-197)

      “... rather than being directly orthologous to the ape/OWM MHC-G genes.” (lines 201-202)

      “Appendix 3 explains each of these genes in detail, including previous work and findings from this study.“ (lines 202-203)

      “... (but not with NWM) …” (line 208)

      “While genes such as MHC-F have trees which closely match the overall species tree, other genes show markedly different patterns, …” (lines 212-213)

      “Thus, while some MHC-G duplications appear to have occurred prior to speciation events within the NWM, others are species-specific.” (lines 218-219)

      “... indicating rapid evolution of many of the Class I genes” (lines 220-221)

      “Now turning to the Class II genes, …“ (line 223)

      “(see Appendix 2 for details on allele nomenclature) “ (line 238)

      “(e.g. MHC-DRB1 or -DRB2)” (line 254)

      “...  meaning their names reflect previously-observed functional similarity more than evolutionary relatedness.” (lines 257-258)

      “(see Appendix 3 for more detail)” (line 311)

      “(a 5'-end fragment)” (line 324)

      “Therefore, we support past work that has deemed MHC-V an old fragment.” (lines 326-327)

      “We next focus on MHC-U, a previously-uncharacterized fragment pseudogene containing only exon 3.” (lines 328-329)

      “However, it is present on both chimpanzee haplotypes and nearly all human haplotypes, and we know that these haplotypes diverged earlier---in the ancestor of human and gorilla. Therefore, ...” (lines 331-333)

      “Ours is the first work to show that MHC-U is actually an MHC-A-related gene fragment and that it likely originated in the human-gorilla ancestor.” (lines 334-336)  

      “These pieces of evidence suggest that MHC-K and -KL duplicated in the ancestor of the apes.” (lines 341-342)

      “Another large group of related pseudogenes in the Class I $\alpha$-block includes MHC-W, -P, and -T (see Appendix 3 for more detail).” (lines 349-350)

      “...to form the current physical arrangement” (line 354)

      “Thus, we next focus on the behavior of this subgroup in the trees.” (line 358)

      “(see Appendix 3 for further explanation).” (line 369)

      “Thus, for the first time we show that there must have been three distinct MHC-W-like genes in the ape/OWM ancestor.” (lines 369-371)

      “... and thus not included in the previous analysis. ” (lines 376-377)

      “MHC-Y has also been identified in gorillas (Gogo-Y) (Hans et al., 2017), so we anticipate that Gogo-OLI will soon be confirmed. This evidence suggests that the MHC-Y and -OLI-containing haplotype is at least as old as the human-gorilla split. Our study is the first to place MHC-OLI in the overall story of MHC haplotype evolution“ (lines 381-384)

      “Appendix 3 explains the pieces of evidence leading to all of these conclusions (and more!) in more detail.” (lines 395-396)

      “However, looking at this exon alone does not give us a complete picture.” (lines 410-411)

      “...instead of with other ape/OWM sequences, …” (lines 413-414)

      “Figure 7 shows plausible steps that might have generated the current haplotypes and patterns of variation that we see in present-day primates. However, some species are poorly represented in the data, so the relationships between their genes and haplotypes are somewhat unclear.” (lines 427-429)

      “(and more-diverged)” (line 473)

      “(of both classes)” (line 476)

      “..., although the classes differ in their rate of evolution.” (lines 487-488)

      “Including these pseudogenes in our trees helped us construct a new model of $\alpha$-block haplotype evolution. “ (lines 517-518)

      (4) Line 480-82: "Notably...." why is this notable? Don't merely state that something is notable, explain what makes it especially worth drawing the reader's attention to: in what way is it particularly significant or surprising?

      We have changed the text from “Notably” to “In particular” (line 390) so that readers are expecting us to list some specific findings. Similarly, we changed “Notably” to “Specifically” (line 515).

      (5) The end of the discussion is weak: "provide context" is too vague and not a strong statement of something that we learned that we didn't know before, or its importance. This is followed by "This work will provide a jumping-off point for further exploration..." such as? What questions does this paper raise that merit further work?

      We have made this paragraph more specific and added some possible future research directions. It now reads “By treating the MHC genes as a gene family and including more data than ever before, this work enhances our understanding of the evolutionary history of this remarkable region. Our extensive set of trees incorporating classical genes, non-classical genes, pseudogenes, gene fragments, and alleles of medical interest across a wide range of species will provide context for future evolutionary, genomic, disease, and immunologic studies. For example, this work provides a jumping-off-point for further exploration of the evolutionary processes affecting different subsets of the gene family and the nuances of immune system function in different species. This study also provides a necessary framework for understanding the evolution of particular allelic lineages within specific MHC genes, which we explore further in our companion paper \citep{Fortier2025b}. Both studies shed light on MHC gene family evolutionary dynamics and bring us closer to understanding the evolutionary tradeoffs involved in MHC disease associations.” (lines 576-586)

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 1 et seq. Classifying genes as having 'classical', 'non-classical' and 'dual' properties is notoriously difficult in non-model organisms due to the lack of relevant information. As you have characterised a number of genes for the first time in this paper and could not rely entirely on published classifications, please indicate the criteria you used for classification.

      The roles of these proteins are not necessarily conserved between closely related species, and experimental evidence is needed to evaluate this. However, in the absence of such evidence, wherever possible we have provided our best guess as to the roles of the orthologous genes in other species, presented in Figure 1 - source data 1 and Figure 2 - source data 1. This is based on whatever evidence is available at the moment, sometimes experimental but typically based on dN/dS ratios and other indirect measures.

      (2) Line 61 It's important to mention that classical MHC molecules present antigenic peptides to T cells with variable alphabeta T cell receptors, as non-classical MHC molecules may interact with other T cell subsets/types.

      Thank you for pointing this out; we have updated the text to make this clearer (lines 63-65). We changed “‘Classical’ MHC molecules perform antigen presentation to T cells---a key part of adaptive immunity---while ‘non-classical’ molecules have niche immune roles.” to “‘Classical’ MHC molecules perform antigen presentation to T cells with variable alphabeta TCRs---a key part of adaptive immunity---while ‘non-classical’ molecules have niche immune roles.”

      (3) Perhaps it's worth mentioning in the introduction that you are deliberately excluding highly divergent non-classical MHC molecules such as CD1.

      Thank you, it’s worth clarifying exactly what molecules we are discussing. We have added a sentence to the introduction (lines 38-43): “Having originated in the jawed vertebrates, this group of genes is now involved in diverse functions including lipid metabolism, iron uptake regulation, and immune system function (proteins such as zinc-𝛼2-glycoprotein (ZAG), human hemochromatosis protein (HFE), MHC class I chain–related proteins (MICA, MICB), and the CD1 family) \citep{Hansen2007,Kupfermann1999,Kaufman2022,Adams2013}. However, here we focus on…”

      (4) Line 94-105 This material presents results, it could be moved to the results section as it now somewhat disrupts the flow.

      We feel it is important to include a “teaser” of the results in the introduction, which can be slightly more detailed than that in the abstract.

      (5) Line 118-131 This opening section of the results sets the stage for the whole presentation and contains important information that I feel needs to be expanded to include an overview and justification of your methodological choices. As the M&M section is at the end of the MS (and contains limited justification), some information on two aspects is needed here for the benefit of the reader. First, as far as I understand, all phylogenetic inferences were based entirely on DNA sequences of individual (in some cases concatenated) exons. It would be useful for the reader to explain why you've chosen to rely on DNA rather than protein sequences, even though some of the genes you include in the phylogenetic analysis are highly divergent. Second, a reader might wonder how the "maximum clade credibility tree" from the Bayesian analysis compares to commonly seen trees with bootstrap support or posterior probability values assigned to particular clades. Personally, I think that the authors' approach to identifying and presenting representative trees is reasonable (although one might wonder why "Maximum clade credibility tree" and not "Maximum credibility tree" https://www.beast2.org/summarizing-posterior-trees/), since they are working with a large number of short, sometimes divergent and sometimes rather similar sequences - in such cases, a requirement for strict clade support could result in trees composed largely of polytomies. However, I feel it's necessary to be explicit about this and to acknowledge that the relationships represented by fully resolved bifurcating representative trees and interpreted in the study may not actually be highly supported in the sense that many readers might expect. In other words, the reader should be aware from the outset of what the phylogenies that are so central to the paper represent.

      We chose to rely on DNA rather than protein sequences because convergent evolution is likely to happen in regions that code for extremely important functions such as adaptive and innate immunity. Convergent evolution acts upon proteins while trans-species polymorphism retains ancient nucleotide variation, so studying the DNA sequence can help tease apart convergent evolution from trans-species polymorphism.

      As for the “maximum clade credibility tree”, this is a matter of confusing nomenclature. In the online reference guide (https://www.beast2.org/summarizing-posterior-trees/), the tree with the maximum product of the posterior clade probabilities is called the “maximum credibility tree” while the tree that has the maximum sum of posterior clade probabilities is called the “Maximum credibility tree”. The “Maximum credibility tree” (referring to the sum) appears to have only been named in this way in the first version of TreeAnnotator. However, the version of TreeAnnotator that I used lists the options “maximum clade credibility tree” and “maximum sum of clade probabilities”. So the context suggests that the “maximum clade credibility tree” option is actually maximizing the product. This “maximum clade credibility tree” is the setting I used for this project (in TreeAnnotator version 2.6.3).
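
      To make the distinction between the two summary criteria concrete, here is a minimal sketch that scores each sampled tree by either the product or the sum of its posterior clade probabilities. It is a toy illustration only, not TreeAnnotator's implementation: the tree representation (a set of clades, each a frozenset of taxon names) and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def clade_frequencies(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def best_tree(trees, criterion="product"):
    """Index of the sampled tree that maximizes the chosen score.

    criterion="product": product of clade probabilities (computed as a sum of logs).
    criterion="sum":     sum of clade probabilities.
    """
    freq = clade_frequencies(trees)

    def score(tree):
        if criterion == "product":
            return sum(math.log(freq[clade]) for clade in tree)
        return sum(freq[clade] for clade in tree)

    return max(range(len(trees)), key=lambda i: score(trees[i]))


# Hypothetical posterior sample of three rooted trees on taxa A-D.
t1 = frozenset({frozenset("AB"), frozenset("ABC")})  # ((A,B),C),D
t2 = frozenset({frozenset("BC"), frozenset("ABC")})  # ((B,C),A),D
sample = [t1, t1, t2]

print(best_tree(sample, "product"), best_tree(sample, "sum"))
```

      In this tiny sample both criteria select the same tree; with larger and more heterogeneous posterior samples they can disagree, which is why the setting is worth reporting.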

      We agree that readers may not fully grasp what the collapsed trees represent upon first read. We have added a sentence to the beginning of the results (line 188-190) to make this more explicit.

      (6) Line 224, you're referring to the DPB1*09 lineage, not the DRB1*09 lineage.

      Indeed! We have corrected these typos.

      (7) Line 409, why "Differences between MHC subfamilies" and not "Differences between MHC classes"?

      We chose the word “subfamilies” because we discuss the difference between classical and non-classical genes in addition to differences between Class I and Class II genes.

      (8) Line 529-544 This might work better as a table.

      We agree! This information is now presented as Table 1.

      (9) Line 547 MHC-DRB9 appears out of the blue here - please say why you are singling it out.

      Great point! We added a paragraph (lines 614-623) to explain why this was necessary.

      (10) Line 550-551 Even though you've screened the hits manually, it would be helpful to outline your criteria for this search.

      Thank you! We’ve added a couple of sentences to explain how we did this (lines 607-610).

      (11) Line 556-580 please provide nucleotide alignments as supplementary data so that the reader can get an idea of the actual divergence of the sequences that have been aligned together.

      Thank you! We’ve added nucleotide alignments as supplementary files.

      (12) Line 651-652 Why "Maximum clade credibility tree" and not "Maximum credibility tree"? 

      Repeat of (5). This is a matter of confusing nomenclature. In the online reference guide (https://www.beast2.org/summarizing-posterior-trees/), the tree with the maximum product of the posterior clade probabilities is called the “maximum credibility tree” while the tree that has the maximum sum of posterior clade probabilities is called the “Maximum credibility tree”. The “Maximum credibility tree” (referring to the sum) appears to have only been named in this way in the first version of TreeAnnotator. However, the version of TreeAnnotator that I used lists the options “maximum clade credibility tree” and “maximum sum of clade probabilities”. So the context suggests that the “maximum clade credibility tree” option is actually maximizing the product. This “maximum clade credibility tree” is the setting I used for this project (in TreeAnnotator version 2.6.3).

      (13) In the appendices, links to references do not work as expected.

      We will make sure these work properly when we receive the proofs.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      but see Franzius, Sprekeler, Wiskott, PLoS Computational Biology, 2007

      We have discussed the differences with this work in the response to Editor recommendations above.

      While the findings reported here are interesting, it is unclear whether they are the consequence of the specific model setting, and how well they would generalize.

      We have considered deep vision models across different architectures in our paper, which include traditional feedforward convolutional neural networks (VGG-16), convolutional neural networks with skip connections (ResNet-50) and the Vision Transformer (VIT) which employs self-attention instead of convolution as its core information processing unit.

      In particular, examining the pictures shown in Fig. 1A, it seems that local walls of the ’box’ contain strong oriented features that are distinct across different views. Perhaps the response of oriented visual filters can leverage these features to uniquely determine the spatial variable. This is concerning because this is a very specific setting that is unlikely to generalize.

      The experimental setup is based on experimental studies of spatial cognition in rodents, which typically forage in square or circular environments. Indeed, square environments will have more borders and corners that provide information about the spatial environment, which is true in both empirical studies and our simulations. In any navigation task, and especially in more realistic environments, visual information such as borders or landmarks likely plays a major role in the spatial information available to the agent. In fact, studies that do not consider sensory information to contribute to spatial information are likely missing a major part of how animals navigate.

      The prediction would be that place cells/head direction cells should go away in darkness. This implies that key aspects of functional cell types in the spatial cognition are missing in the current modeling framework.

      We addressed this comment in our response to the editor’s highlight. To briefly recap, we do not intend to propose a comprehensive model of the brain that captures all spatial phenomena, as we would not expect this from an object recognition network. Instead, we show that such a simple and nonspatial model can reproduce key signatures of spatial cells, raising important questions about how we interpret spatial cell types that dominate current research.

      Reviewer #2 (Public Review):

      The network used in the paper is still guided by a spatial error signal [...] one could say that the authors are in some way hacking this architecture and turning it into a spatial navigation one through learning.

      To be clear, the base networks we use do not undergo spatial error training. They have either been pre-trained on image classification tasks or are untrained. We used a standard neuroscience approach: training linear decoders on representations to assess the spatial information present in the network layers. The higher decoding errors in early layer representations (Fig. 2A) indicate that spatial information differs across layers—an effect that cannot be attributed to the linear decoder alone.

      My question is whether the paper is fighting an already won battle.

      Intuitive cell type discoveries are still being celebrated. Concentrating on this kind of cell type discovery has broader implications that could be deleterious to the future of science. One point to note is that this issue depends on the area or subfield of neuroscience. In some subfields, papers that claim to find cell types with a strong claim of specific functions are relatively rare, and population coding is common (e.g., cognitive control in primate prefrontal cortex, neural dynamics of motor control). Although rodent neuroscience as a field is increasingly adopting population approaches, influential researchers and labs are still publishing “cell types” in top journals (here are a few from 2017-2024: Goal cells (Sarel et al., 2017), Object-vector cells (Høydal et al., 2019), 3D place cells (Grieves et al., 2020), Lap cells (Sun et al., 2020), Goal-vector cells (Ormond and O’Keefe, 2022), Predictive grid cells (Ouchi and Fujisawa, 2024)).

      In some cases, identification of cell types is only considered a part of the story, and there are analyses of behavior, neural populations, and inactivation-based studies. However, our view (which we suggest is shared among most researchers) is that a major reason these papers are reviewed and accepted in top journals is that they have a simple, intuitive “cell type” discovery headline, even if it is not the key finding or analysis that supports the insightful aspects of the work. This is unnecessary and misleading to students of neuroscience, related fields, and the public; it affects private and public funding priorities and, in turn, the future of science. Worse, it could lead the field down the wrong path, or at the least divert attention and resources away from methods and papers that could be providing deeper insights. Consistent with the central message of our work, we believe the field should prioritize theoretical and functional insights over the discovery of new “cell types”.

      Reviewer #3 (Public Review):

      The ability to linearly decode position from a large number of units is not a strong test of spatial information, nor is it a measure of spatial cognition

      Using a linear decoder to test what information is contained in a population of neurons and available to downstream areas is a common technique in neuroscience (Tong and Pratte, 2012; DiCarlo et al., 2012), including for spatial cells (e.g., Diehl et al. 2017; Horrocks et al. 2024). A linear decoder is used because it is a direct mapping from neurons to potential output behavior. In other words, it only needs to learn some mapping that links one set of neurons to another set which can “read out” the information. As such, it is a measure of the information contained in the population, and it is a lower bound on that information, since both biological and artificial neurons can perform more complex nonlinear operations (as the activation function is nonlinear).

      We recognize that the reviewer may already be familiar with this concept, but we explain it here to justify our position and for completeness of this public review.
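
      The linear read-out idea can be sketched in a few lines. This is an illustrative example under stated assumptions, not the decoding pipeline used in the paper: the synthetic “units”, the ordinary least-squares fit, and all variable names are hypothetical stand-ins for network activations and agent positions.

```python
import numpy as np


def fit_linear_decoder(activations, targets):
    """Least-squares linear map (with a bias term) from unit activations to targets."""
    design = np.hstack([activations, np.ones((activations.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def decode(activations, weights):
    """Apply a fitted linear decoder to new activations."""
    design = np.hstack([activations, np.ones((activations.shape[0], 1))])
    return design @ weights


rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 1.0, size=(500, 2))  # hypothetical (x, y) locations
mixing = rng.normal(size=(2, 64))                 # hypothetical unit tuning
activations = np.tanh(positions @ mixing) + 0.05 * rng.normal(size=(500, 64))

# Fit on the first 400 samples, evaluate on the held-out 100.
weights = fit_linear_decoder(activations[:400], positions[:400])
predicted = decode(activations[400:], weights)
mean_error = np.linalg.norm(predicted - positions[400:], axis=1).mean()
print(f"mean held-out location error: {mean_error:.3f}")
```

      Because the decoder is restricted to a weighted sum of unit activities, a low held-out error indicates that the information is linearly accessible in the population, which is the sense of “lower bound” used above.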

      For example, consider the head direction cells in Figure 3C. In addition to increased activity in some directions, these cells also have a high degree of spatial nonuniformity, suggesting they are responding to specific visual features of the environment. In contrast, the majority of HD cells in the brain are only very weakly spatially selective, if at all, once an animal’s spatial occupancy is accounted for (Taube et al 1990, JNeurosci). While the preferred orientations of these cells are anchored to prominent visual cues, when they rotate with changing visual cues the entire head direction system rotates together (cells’ relative orientation relationships are maintained, including those that encode directions facing AWAY from the moved cue), and thus these responses cannot be simply independent sensory-tuned cells responding to the sensory change (Taube et al 1990 JNeurosci, Zugaro et al 2003 JNeurosci, Ajbi et al 2023).

      As we have noted in our response to the editor, one of the main issues is that the criteria used to assess the cells of interest are created in a subjective, biased, and circular way (seeing spatial-like responses, developing criteria to determine a spatial response, then selecting a threshold).

      All the examples the reviewer provides concentrate on strict criteria developed after finding such cells. What is the purpose of these cells for function, for behavior? Just finding a cell that looks like it is tuned to something does not explain its function. Neuroscience began with tuning curves in part due to methodological constraints, which was a promising start, but we propose that this is not the way forward.

      The metrics used by the authors to quantify place cell tuning are not clearly defined in the methods, but do not seem to be as stringent as those commonly used in real data. (e.g. spatial information, Skaggs et al 1992 NeurIPS).

      We identified place cells following the definition from Tanni et al. (2022), by one of the leading labs in the field. Since neurons in DNNs lack spikes, we adapted their criteria by focusing on the number of spatial bins in the ratemap rather than spike-based measures. However, our central argument is that the very act of defining spatial cells is problematic. Researchers set out to find place cells to study spatial representations, find spatially selective cells with subjective, qualitative criteria (sometimes combined with prior quantitative criteria, also subjectively defined), then try to fine-tune the criteria to make them more “stringent”, depending on the experimental data at hand. It is not uncommon to see methodological sections that use qualitative judgments, such as: “To avoid bias ... we applied a loose criteria for place cells” (Tanaka et al., 2018), which reflects the lack of clarity and the subjectivity of place cell selection criteria.

      A simple literature survey reveals inconsistent criteria across studies. For place field selection, Dombeck et al. (2010) required mean firing rates exceeding 25% of peak rate, while Tanaka et al. (2018) used a 20% threshold. Speed thresholds also vary dramatically: Dombeck et al. (2010) calculated firing rates only when mice moved faster than 8.3 cm/s, whereas Tanaka et al. (2018) used 2 cm/s. Additional criteria differ further: Tanaka et al. (2018) required firing rates between 1-10 Hz and excluded cells with place fields larger than 1/3 of the area, while Dombeck et al. (2010) selected fields above 1.5 Hz, and Tanni et al. (2022) required place fields spanning between 10 spatial bins and 1/2 of the area. As Dombeck et al. (2010) noted, differences in recording methods and place field definitions lead to varying numbers of identified place cells. Moreover, Grijseels et al. (2021) demonstrated that different detection methods produce vastly different place cell counts with minimal overlap between identified populations.
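
      A toy example shows how sensitive classification can be to such thresholds. The sketch below is a deliberately simplified illustration, not a reimplementation of any cited study's pipeline: the ratemap, the minimum bin count, and the way the criteria are combined are all hypothetical, and only the 25% and 20% peak fractions and the 1/3 area cap echo the figures quoted above.

```python
def place_field_bins(ratemap, fraction_of_peak):
    """Count bins whose activity exceeds a fraction of the map's peak value."""
    peak = max(max(row) for row in ratemap)
    return sum(1 for row in ratemap for value in row if value > fraction_of_peak * peak)


def is_place_cell(ratemap, fraction_of_peak, min_bins, max_area_fraction):
    """Label a unit a 'place cell' if its field size lies within the allowed range."""
    total_bins = sum(len(row) for row in ratemap)
    field = place_field_bins(ratemap, fraction_of_peak)
    return min_bins <= field <= max_area_fraction * total_bins


# Hypothetical 5x5 ratemap for a single unit (arbitrary activity units).
ratemap = [
    [0.5, 0.5, 2.2, 0.5, 0.5],
    [0.5, 6.0, 6.0, 2.2, 0.5],
    [2.2, 6.0, 10.0, 6.0, 2.2],
    [0.5, 2.2, 6.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5, 0.5],
]

# Same unit, same size limits, two different peak-fraction thresholds.
print(is_place_cell(ratemap, 0.25, min_bins=4, max_area_fraction=1 / 3))
print(is_place_cell(ratemap, 0.20, min_bins=4, max_area_fraction=1 / 3))
```

      Lowering the threshold from 25% to 20% of peak enlarges the detected field from 6 to 11 bins, so the same unit passes under one setting and is excluded as too broad under the other.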

      This reflects a deeper issue. Unlike structurally and genetically defined cell types (e.g., pyramidal neurons, interneurons, dopaminergic neurons, cFos-expressing neurons), spatial cells lack such clarity in terms of structural or functional specialization, and it is unclear whether such “cell types” should be considered cell types in the same way. While scientific progress requires standardized definitions, the question remains whether defining spatial cells through myriad different criteria advances our understanding of spatial cognition. Are researchers finding the same cells? Could they be targeting different populations? Are they missing cells crucial for spatial cognition that they exclude due to the criteria used? We think this is likely. The inconsistency matters because different criteria may capture genuinely different neural populations or computational processes.

      Variability in definitions and criteria is an issue in any field. However, as we have stated, the deeper issue is whether we should be defining and selecting these cells at all before commencing analysis. By defining and restricting to spatial “cell types”, we risk comparing fundamentally different phenomena across studies, and worse, missing the fundamental unit of spatial cognition (e.g., the population).

      We have added a paragraph in Discussion (lines 357-366) noting the inconsistency in place cell selection criteria in the literature and the consequences of using varying criteria.

      We have also added a sentence (lines 354-356) raising the comparison of functionally defined spatial cell types with structurally and genetically defined cell types in the Discussion.

      Thus, the question is not whether spatially tuned cells are influenced by sensory information, but whether feed-forward sensory processing alone is sufficient to account for their observed tuning properties and responses to sensory manipulations.

      These issues indicate a more significant underlying issue of scientific methodology relating to the interpretation of their result and its impact on neuroscientific research. Specifically, in order to make strong claims about experimental data, it is not enough to show that a control (i.e. a null hypothesis) exists, one needs to demonstrate that experimental observations are quantitatively no better than that control.

      Where the authors state that ”In summary, complex networks that are not spatial systems, coupled with environmental input, appear sufficient to decode spatial information.” what they have really shown is that it is possible to decode *some degree* of spatial information. This is a null hypothesis (that observations of spatial tuning do not reflect a ”spatial system”), and the comparison must be made to experimental data to test if the so-called ”spatial” networks in the brain have more cells with more reliable spatial info than a complex-visual control.

      We agree that good null hypotheses with quantitative comparisons are important. However, it is not clear that researchers in the field have been using a null hypothesis; rather, they assume that these cell types exist and function in the way they expect. We provide one null hypothesis. The field can and should develop more and stronger null hypotheses.

      In our work, we mainly focus on the criteria for finding spatial cells, and argue that simply doing this is misleading. Researchers develop criteria and find such cells, but often do not go further to assess whether they are real cell “types”. This can be especially misleading when the criteria exclude other cells that also play a role in the function of interest.

      But from many other experiments including causal manipulations (e.g. Robinson et al 2020 Cell, DeLauilleon et al 2015 Nat Neuro), which the authors conveniently ignore. Thus, I do not find their argument, as strongly stated as it is, to be well-supported.

      We acknowledge that several inactivation studies suggest a strong role for place cells in spatial behavior. However, most studies do not conduct comprehensive analyses to confirm that their place cells are in fact crucial for the behavior at hand.

      One question is how the criteria were determined. Did the researchers make their criteria based on what “worked”, so they did not exclude cells relevant to the behavior? What if their criteria were different, then the argument could have been that non-place cells also contribute to behavior.

      Another question is whether these cells are the same kinds of cells across studies and animals, given the varied criteria across studies. As most studies do not follow the same procedures, it is unclear whether we can generalize these results across cells and, indeed, across tasks and spatial environments.

      Finally, does the fact that the place cells – the strongly selective cells with a place field – have a strong role in navigation provide any insight into the mechanism? Identifying cells by itself does not contribute to our understanding of how they work. Consistent with our main message, we argue that performing analyses and building computational models that uncover how the function of interest works is more valuable than simply naming cells.

      Finally, I find a major weakness of the paper to be the framing of the results in opposition to, as opposed to contributing to, the study of spatially tuned cells. For example, the authors state that ”If a perception system devoid of a spatial component demonstrates classically spatially-tuned unit representations, such as place, head-direction, and border cells, can ”spatial cells” truly be regarded as ’spatial’?” Setting aside the issue of whether the perception system in question does indeed demonstrate spatially-tuned unit representations comparable to those in the brain, I ask ”Why not?” This seems to be a semantic game of reading more into a name than is necessarily there. The names (place cells, grid cells, border cells, etc) describe an observation (that cells are observed to fire in certain areas of an animal’s environment). They need not be a mechanistic claim... This is evidenced by the fact that even within e.g. the place cell community, there is debate about these cells’ mechanisms and function (eg memory, navigation, etc), or if they can even be said to serve only a single function. However, they are still referred to as place cells, not as a statement of their function but as a history-dependent label that refers to their observed correlates with experimental variables. Thus, the observation that spatially tuned cells are ”inevitable derivatives of any complex system” is itself an interesting finding which *contributes to*, rather than contradicts, the study of these cells. It seems that the authors have a specific definition in mind when they say that a cell is ”truly” ”spatial” or that a biological or artificial neural network is a ”spatial system”, but this definition is not stated, and it is not clear that the terminology used in the field presupposes their definition.

      We have to agree to disagree with the reviewer on this point. Although researchers may reflect on their work and discuss what the mechanistic role of these cells is, cell type discovery is widely perceived as important to journals and funders due to its intuitive appeal and easy-to-understand impact, even if there is no finding of interest to be reported. As noted in the comment above, papers claiming cell type discovery continue to be published in top journals and continue to be funded.

      Our argument is that “cell type” discovery research should perhaps not be celebrated in the way it is, and that such cells should not be presented as discoveries when they are not genuine cell types in the way structural or genetic cell types are. Using this term makes them appear to be something they are not, which is misleading. They may be important cells, but a name like “place” cell also suggests that other cells are not encoding space, which is very unlikely to be true.

      In sum, our view is that finding and naming cells that may not actually function as their names suggest, through a flawed theoretical lens, can lead us down the wrong path and be detrimental to science.

      Reviewer #1 (Recommendations For The Authors):

      The novelty of the current study relative to the work by Franzius, Sprekeler, Wiskott (PLoS Computational Biology, 2007) needs to be carefully addressed. That study also modeled the spatial correlates based on visual inputs.

      Our work differs from Franzius et al. (2007) on both theoretical and experimental fronts. While both studies challenge the mechanisms underlying spatial cell formation, our theoretical contributions diverge. Franzius et al. (2007) assume spatial cells are inherently important for spatial cognition and propose a sensory-driven computational mechanism as an alternative to mainstream path integration frameworks for how spatial cells arise and support spatial cognition. In contrast, we challenge the notion that spatial cells are special at all. Using a model with no spatial grounding, we demonstrate that 1) spatial cells naturally emerge from complex non-linear processing and 2) they are not particularly useful for spatial decoding tasks, suggesting they are not crucial for spatial cognition.

      Our approach employs null models with fixed weights—either pretrained on classification tasks or entirely random—that process visual information non-sequentially. These models serve as general-purpose information processors without spatial grounding. In contrast, Franzius et al. (2007)’s model learns directly from environmental visual information, and the emergence of spatial cells (place or head-direction cells) in their framework depends on input statistics, such as rotation and translation speeds. Notably, their model does not simultaneously generate both place and head-direction cells; the outcome varies with the relative speed of rotation versus translation. Their sensory-driven model indirectly incorporates motion information through learning, exhibiting a time-dependence influenced by slow-feature analysis.

      Conversely, our model simultaneously produces units with place and head-direction cell profiles by processing visual inputs sampled randomly across locations and angles, independent of temporal or motion-related factors. This positions our model as a more general and fundamental null hypothesis, ideal for challenging prevailing theories on spatial cells due to its complete lack of spatial or motion grounding.

      Finally, unlike Franzius et al. (2007), who do not evaluate the functional utility of their spatial representations, we test whether the emergent spatial cells are useful for spatial decoding. We find that not only do spatial cells emerge in our non-spatial model, but they also fail to significantly aid in location or head-direction decoding. This is the central contribution of our work: spatial cells can arise without spatial or sensory grounding, and their functional relevance is limited. We have updated the manuscript to clarify the novelty of the current contribution to previous work (lines 324-335).

      In Fig. 2, it may be useful to plot the error in absolute units, rather than the normalized error. The direction decoding can be quantified in terms of degrees. Also, it would be helpful to compare the accuracy of spatial localization to that of the actual place cells in rodents.

      We argue that normalizing the error, by dividing it by the maximal error possible under each task, makes more sense and puts the comparison in perspective. For transparency, we plot the errors in the absolute physical units used by the Unity game engine in the updated Appendix (Fig. 1).
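
      As a rough illustration of this normalization, the sketch below divides each error by the largest error possible for that task. The choice of maximal error is an assumption made for the example (the arena diagonal for location, 180 degrees for direction), and the function names are hypothetical; the manuscript's exact definitions may differ.

```python
import math


def normalized_location_error(predicted, actual, width, height):
    """Euclidean error divided by the arena diagonal (assumed maximal error)."""
    return math.dist(predicted, actual) / math.hypot(width, height)


def normalized_direction_error(predicted_deg, actual_deg):
    """Smallest angular difference divided by 180 degrees (assumed maximal error)."""
    diff = abs(predicted_deg - actual_deg) % 360.0
    return min(diff, 360.0 - diff) / 180.0


# Opposite corners of a 3 x 4 arena give the largest possible location error.
print(normalized_location_error((0.0, 0.0), (3.0, 4.0), width=3.0, height=4.0))
# Headings of 350 and 10 degrees differ by 20 degrees once wrap-around is handled.
print(normalized_direction_error(350.0, 10.0))
```

      Expressing both errors on a common 0 to 1 scale is what allows location and direction decoding to be compared in a single figure, while the absolute-unit plots remain available for readers who prefer physical units.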

      Reviewer #2 (Recommendations For The Authors):

      Regarding the involvement of ’classified cells’ in decoding, I think a useful way to present the results would be to show the relationship between ’placeness’, ’directioness’ and ’borderness’ and the strength of the decoder weights. Either as a correlation or as a full scatter plot.

      We appreciate your suggestion to visualize the relationship between units’ spatial properties and their corresponding decoder weights. We believe it would be an important addition to our existing results. Based on the exclusion analyses, we anticipated the correlation to be low, and the additional results support this expectation.

      As an example, we present unit plots below for VGG-16 (pre-trained and untrained, at its penultimate layer with a sampling rate of 0.3; Author response images 1 and 2). Additional plots for various layers and across models are included in the supplementary materials (Fig. S12-S28). Consistently across conditions, we observed no significant correlations between units’ spatial properties (e.g., placeness) and their decoding weight strengths. These results further corroborate the conclusions drawn from our exclusion analyses.
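
      The kind of comparison described here can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code behind the figures: the per-unit “placeness” scores and decoder weights are random placeholders, and summarizing weight strength as the mean absolute weight per unit is a hypothetical choice.

```python
import numpy as np


def weight_strength(decoder_weights):
    """Mean absolute decoder weight per unit (rows are units, columns are outputs)."""
    return np.abs(decoder_weights).mean(axis=1)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(1)
placeness = rng.uniform(size=200)            # hypothetical per-unit spatial scores
decoder_weights = rng.normal(size=(200, 2))  # hypothetical (units x outputs) weights

r = pearson_r(placeness, weight_strength(decoder_weights))
print(f"correlation between placeness and weight strength: {r:.3f}")
```

      A correlation near zero in such a plot would mean that units scoring highly on a spatial criterion are not the ones the decoder relies on most, which is the pattern reported above.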

      Reviewer #3 (Recommendations For The Authors):

      My main suggestions are that the authors: -perform manipulations to the sensory environment similar to those done in experimental work, and report if their tuned cells respond in similar ways -quantitatively compare the degree of spatial tuning in their networks to that seen in publicly available data -re-frame the discussion of their results to critically engage with and contribute to the field and its past work on sensory influences to these cells

      As we noted in our opening section, our model is not intended as a model of the brain. It is a non-spatial null model, and we present the surprising finding that even such a model contains spatial cell-like units if identified using criteria typically used in the field. This raises the question whether simply finding cells that show spatial properties is sufficient to grant the special status of “cell type” that is involved in the brain function of interest.

      Author response image 1.

      VGG-16 (pre-trained), penultimate layer units, show no apparent relationship between spatial properties and their decoder weight strengths.

      Author response image 2.

      VGG-16 (untrained), penultimate layer units, show no apparent relationship between spatial properties and their decoder weight strengths.

      Furthermore, our main simulations were designed to be compared to experimental work where rodents foraged around square environments in the lab. We did not run an extensive set of simulations, as the purpose of our study is not to show that we capture every single experimental finding exactly, but rather to raise issues with the functional cell type definition and identification approach for progressing neuroscientific knowledge.

      Finally, as we note in more detail below, different labs use different criteria for identifying spatial cells, which depend both on the lab and the experimental design. Our point is that we can identify such cells using criteria set by neuroscientists, and that such cell types may not reflect any special status in spatial processing. Additional simulations that show less alignment with certain datasets will not provide support for or against our general message.

      References

      Banino A, Barry C, Uria B, Blundell C, Lillicrap T, Mirowski P, Pritzel A, Chadwick MJ, Degris T, Modayil J, Wayne G, Soyer H, Viola F, Zhang B, Goroshin R, Rabinowitz N, Pascanu R, Beattie C, Petersen S, Sadik A, Gaffney S, King H, Kavukcuoglu K, Hassabis D, Hadsell R, Kumaran D (2018) Vector-based navigation using grid-like representations in artificial agents. Nature 557(7705):429–433, DOI 10.1038/s41586-018-0102-6, URL http://www.nature.com/articles/s41586-018-0102-6

      DiCarlo JJ, Zoccolan D, Rust NC (2012) How Does the Brain Solve Visual Object Recognition? Neuron 73(3):415–434, DOI 10.1016/J.NEURON.2012.01.010, URL https://www.cell.com/neuron/fulltext/S0896-6273(12)00092-X

      Diehl GW, Hon OJ, Leutgeb S, Leutgeb JK (2017) Grid and Nongrid Cells in Medial Entorhinal Cortex Represent Spatial Location and Environmental Features with Complementary Coding Schemes. Neuron 94(1):83–92.e6, DOI 10.1016/j.neuron.2017.03.004, URL https://linkinghub.elsevier.com/retrieve/pii/S0896627317301873

      Dombeck DA, Harvey CD, Tian L, Looger LL, Tank DW (2010) Functional imaging of hippocampal place cells at cellular resolution during virtual navigation. Nature Neuroscience 13(11):1433–1440, DOI 10.1038/nn.2648, URL https://www.nature.com/articles/nn.2648

      Ebitz RB, Hayden BY (2021) The population doctrine in cognitive neuroscience. Neuron 109(19):3055–3068, DOI 10.1016/j.neuron.2021.07.011, URL https://linkinghub.elsevier.com/retrieve/pii/S0896627321005213

      Grieves RM, Jedidi-Ayoub S, Mishchanchuk K, Liu A, Renaudineau S, Jeffery KJ (2020) The place-cell representation of volumetric space in rats. Nature Communications 11(1):789, DOI 10.1038/s41467-020-14611-7, URL https://www.nature.com/articles/s41467-020-14611-7

      Grijseels DM, Shaw K, Barry C, Hall CN (2021) Choice of method of place cell classification determines the population of cells identified. PLOS Computational Biology 17(7):e1008835, DOI 10.1371/journal.pcbi.1008835, URL https://dx.plos.org/10.1371/journal.pcbi.1008835

      Horrocks EAB, Rodrigues FR, Saleem AB (2024) Flexible neural population dynamics govern the speed and stability of sensory encoding in mouse visual cortex. Nature Communications 15(1):6415, DOI 10.1038/s41467-024-50563-y, URL https://www.nature.com/articles/s41467-024-50563-y

      Høydal ØA, Skytøen ER, Andersson SO, Moser MB, Moser EI (2019) Object-vector coding in the medial entorhinal cortex. Nature 568(7752):400–404, DOI 10.1038/s41586-019-1077-7, URL https://www.nature.com/articles/s41586-019-1077-7

      Ormond J, O’Keefe J (2022) Hippocampal place cells have goal-oriented vector fields during navigation. Nature 607(7920):741–746, DOI 10.1038/s41586-022-04913-9, URL https://www.nature.com/articles/s41586-022-04913-9

      Ouchi A, Fujisawa S (2024) Predictive grid coding in the medial entorhinal cortex. Science 385(6710):776–784, DOI 10.1126/science.ado4166, URL https://www.science.org/doi/10.1126/science.ado4166

      Sarel A, Finkelstein A, Las L, Ulanovsky N (2017) Vectorial representation of spatial goals in the hippocampus of bats. Science 355(6321):176–180, DOI 10.1126/science.aak9589, URL https://www.science.org/doi/10.1126/science.aak9589

      Sun C, Yang W, Martin J, Tonegawa S (2020) Hippocampal neurons represent events as transferable units of experience. Nature Neuroscience 23(5):651–663, DOI 10.1038/s41593-020-0614-x, URL https://www.nature.com/articles/s41593-020-0614-x

      Tanaka KZ, He H, Tomar A, Niisato K, Huang AJY, McHugh TJ (2018) The hippocampal engram maps experience but not place. Science 361(6400):392–397, DOI 10.1126/science.aat5397, URL https://www.science.org/doi/10.1126/science.aat5397

      Tanni S, De Cothi W, Barry C (2022) State transitions in the statistically stable place cell population correspond to rate of perceptual change. Current Biology 32(16):3505–3514.e7, DOI 10.1016/j.cub.2022.06.046, URL https://linkinghub.elsevier.com/retrieve/pii/S0960982222010089

      Tong F, Pratte MS (2012) Decoding Patterns of Human Brain Activity. Annual Review of Psychology 63(1):483–509, DOI 10.1146/annurev-psych-120710-100412, URL https://www.annualreviews.org/doi/10.1146/annurev-psych-120710-100412

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer 1, point 1: In general, the statistical analysis is not transparent. The size of the sample, i.e. the number of observations or data points, is never specified. This information is essential for further evaluation of the statistical details.

      The size of each sample quantified, given as number of ommatidia/number of retinas, is indicated in the figure legends. This must have escaped the attention of reviewer 1, so we have added a sentence in the legend of Fig. 2 to state it more clearly. We think that the figure legends are the best place to put this information for ease of comparison to the figures.

      *Reviewer 1, point 2: To gain a better understanding of chitin deposition, it would be beneficial to have data on Kkv overexpression in cone cells versus outer pigment cells. Does it cause reb/exp-like effects on chitin deposition and corneal lens formation? Furthermore, can the authors rule out the involvement of chitin synthase 2 in chitin matrix formation and the retention of the matrix in kkv knockdowns? *

      We will generate clones of cells that over-express Kkv in either central cells (cone and primary pigment cells) or lattice cells (secondary and tertiary pigment cells), using the same drivers that we used to over-express Reb, and will examine chitin secretion at 54 h after puparium formation (APF) and in adults.

      As there are no available mutations in Chitin synthase 2 (Chs2), we will knock it down with RNAi in all retinal cells using lGMR-GAL4 and look for corneal lens defects. However, we think that Chs2 is unlikely to contribute chitin to the corneal lens, because its expression is restricted to the digestive system, and because kkv knockdown essentially eliminates chitin from the corneal lens.

      *Reviewer 1, point 3: Recent results published by the authors regarding ZP domain proteins, such as dusky-like (dyl), have not been adequately discussed in the context of chitin secretion and Kkv expression, a matter that must be addressed. It has been demonstrated that dyl mutants do not affect Kkv expression, but chitin levels are reduced. Does Dyl exhibit Kkv-like phenotypes? Furthermore, what is the expression of Dyl or Dumpy in Kkv knockdowns? Is there any interaction between the ZP domain protein matrix and the chitin matrix required for lens formation? *

      In dyl mutants, chitin deposition is delayed, but it does accumulate later in development, so the phenotype is different from kkv mutants. We have clarified this in the manuscript (p. 6). To address the other points, we will examine the expression of Dyl and of Dumpy-YFP in mid-pupal and late pupal retinas in which kkv is knocked down in all cells with lGMR-GAL4. The ZP protein matrix is originally deposited before chitin secretion begins, so we will examine whether loss of chitin affects its later maintenance.

      *Reviewer 1, point 4: What is retained in the chitin matrix if chitin is missing in kkv knockdown? Is it the ZP domain matrix (see the above question) or are the chitin matrix proteins also involved, such as Obst-A, Obst-C (Gasp), Knk and others? Obst proteins are particularly essential for the regular packaging of chitin and thus for the formation of the chitin layer, which is shown in Fig. 1. Beyond this story, it would also be interesting to see how the aforementioned chitin matrix proteins (Obst-A, Obst-C (Gasp), Knk and others) impact lens formation. *

      Adult corneal lenses derived from kkv knockdown retinas do not contain chitin, but there is remaining corneal lens material. We do not think that this is the ZP domain matrix, as this is normally lost in late pupal development, but we will check whether Dpy-YFP is retained in kkv knockdown adults. We will try to detect Obst-A and Gasp proteins using available antibodies. However, this may not be successful, as we have found that antibodies do not penetrate the corneal lens well. Our transcriptomic studies have identified numerous secreted proteins that are expressed at high levels in the mid-pupal retina and could be components of the corneal lens. We may be able to detect some of these using fluorescently tagged forms, but it is possible that the currently available tools will not be sufficient to answer this question.

      We have begun to work on how some of these proteins affect corneal lens structure, but this will take a significant amount of time and we think it would work better as a separate manuscript. We see our current manuscript as a short and focused story about the importance of the source of chitin in determining corneal lens shape.

      *Reviewer 1, minor comment 1: Figure 1 is not easily comprehensible for those who are not already familiar with the subject of eye development. Fig. 1A': please label the cone cells and pigment cells. *

      We have labeled these cells in Fig. 1A’’.

      *Reviewer 1, minor comment 2: Fig. 1H - The meaning of the abbreviations and numbers is not given in the legend. It would also be beneficial to include a meaningful cartoon illustrating the corneal lens situation before and after chitin secretion, as shown in Figure 3. *

      We have defined the abbreviations in the figure legend. Fig. 1H did show the corneal lens situation before, during and after chitin secretion, but we have added the cone and pigment cells to the 72 h APF and adult diagrams to make them more meaningful (now Fig. 1I).

      *Reviewer 1, minor comment 3: Fig. 1F: when do the authors recognize the first chitin assembly as an initial corneal lens in the eye, and what does it look like? Chitin expression is already high at 54 h APF, which means 20 hours earlier. *

      We think that the reviewer is asking when the chitin first starts to form a dome shape. We have added an orthogonal view of chitin in a 54 h APF retina viewed with LIGHTNING microscopy, showing that the external curvature is already present at this stage (new Fig. 1F).

      *Reviewer 1, minor comment 4: Page 6 / Fig 2E: cells autonomously synthesize chitin and no lateral diffusion. Please label which lens contains chitin and which not *

      Fig. 2E shows part of a retina in which kkv has been knocked down in all cells, so none of the corneal lenses contain chitin. We have clarified this in the legend to Fig. 2.

      *Reviewer 1, minor comment 5: Page 7: The authors state that reb/exp knockdown affects external and internal curvature. However, the statistics in Fig. S1 do not support this statement. *

      We were referring to the double knockdown, which Fig. 2L, M show is significant, and not to the single knockdowns quantified in Fig. S1. We have clarified this in the text.

      *Reviewer 1, minor comment 6: Fig.2 and Fig. S1: what is Chp (Chaoptin)? *

      We have stated in the legend to Fig. 2 that Chaoptin is a component of photoreceptor rhabdomeres.

      *Reviewer 1, minor comment 7: Fig. S1E,I: which part of the eye is marked by the chitin staining outside the cone and pigment cells? *

      Chitin is still present in the mechanosensory bristles in Fig.S1I, as these do not express lGMR-GAL4. We have stated this in the figure legend.

      *Reviewer 1, minor comment 8: Fig. 2L, M: Why do exp/reb show different statistical results at the outer angle in exp and reb knockdown when compared with the lGMR driver line, although chitin reduction is eliminated in exp knockdown already from 54 h APF onwards? *

      The double knockdown of exp and reb has a more significant effect on the adult corneal lens outer angle than the single exp knockdown, even though the exp knockdown lacks chitin at 54 h APF. We believe that this is because Reb is sufficient for some chitin synthesis at later stages of development. This was mentioned in the text (p. 6) and we have added further clarification in the legend to Fig. S1.

      *Reviewer 1, minor comment 9: Fig. 3G-H: please clarify where the chitin reduction can be observed at the edge of the adult corneal lens and provide comparable wt stainings. Fig. S2D: What was the normalization and the sample number? *

      We have added a high magnification image of a mosaic ommatidium with one wild-type and one kkv knockdown edge, showing the region at the edge of the corneal lens in which chitin fluorescence was quantified and the central region used for the normalization (Fig. 3I). The sample numbers are given in the legend to Fig. S2D.

      *Reviewer 1, minor comment 10: Page 6, last paragraph: I fully agree that ZP domain proteins may retain other corneal lens components. But deeper discussion is missing. It should be noted that the authors' hypothesis fits well with the proposed function of the ZP matrix in providing chitin matrix adhesion to the underlying cell surface. A loss of the ZP domain protein Piopio causes loss of the chitin matrix, as shown recently in trachea and at epidermal tendon cells (Göpfert et al., 2025; https://www.sciencedirect.com/science/article/pii/S1742706125003733). Furthermore, a recent publication identifies ZPD proteins as modular units that establish the mechanical environment essential for nanoscale morphogenesis (Itakura et al., https://www.biorxiv.org/content/10.1101/2024.08.20.608778v1.full.pdf). This should be cited and discussed accordingly.*

      *It could be that the outer and inner parts of the chitin differ in ultrastructure due to the expression pattern. In dragonflies, surface morphology analysis by scanning electron microscopy revealed that the outer part of corneal lenses consisted of long chitin fibrils with regular arrays of papillary structures, while the smoother inner part had concentric lamellated chitin formation with shorter chitin nanofibrils (Kaya et al., 2016; https://www.sciencedirect.com/science/article/pii/S0141813016303646?via%3Dihub#fig0020). Thus, an ultrastructural analysis would be very beneficial, or at least a detailed discussion.*

      We have added a discussion of these points and papers to the text (p. 6 and 9). Although we are not specifically addressing differences between the inner and outer parts of the corneal lens in this manuscript, we have now included a high-resolution LIGHTNING image showing how the layered structure of the corneal lens is affected when chitin production by central cells is increased (Fig. 4F).

      *Reviewer 2, point 1: Adult corneal lenses lacking chitin still form a thin structure in kkv RNAi. The authors suggest that this may be due to the presence of the ZP domain proteins Dyl, Dpy and Pio. Immunostaining for these ZP domain proteins could provide supporting evidence. *

      To clarify, we meant to say that the earlier presence of the ZP domain matrix could retain components other than chitin in the corneal lens. The ZP domain proteins are no longer present in the adult. We have made this clearer in the text. As described under reviewer 1, points 3 and 4, we will examine Dyl and Dpy-YFP expression in kkv knockdown retinas at mid-pupal and adult stages, and we will also look at the expression of another ZP domain protein, Piopio.

      *Reviewer 2, minor comment 1: At 50 h APF, Kkv (Fig. 2B, B') and Reb (Fig. S1A, A') appear to be expressed at higher levels in lattice cells than in central cells, even though chitin is mainly present in the central cells at this time (Fig. 1B-B'). Discuss possible explanation for their expression pattern and their roles at this stage. *

      We agree that this is a surprising result. We have added a discussion of possible explanations, such as the lack of another component necessary for chitin secretion in lattice cells at this stage, or the presence of high levels of chitinases (p. 7).

      *Reviewer 2, minor comment 2: Fig. 1F and G: Indicate that the cryosection images represent single ommatidia, and label "external" and "internal" to help orient readers. *

      We have made these changes to the figure panels (now G and H), and indicated in the legend that they are single ommatidia.

      *Reviewer 2, minor comment 3: Figure 2. The cartoon diagram showing the angle measurement (currently Fig S1K) should be moved to the main figure to help readers understand the quantifications. *

      We have moved this diagram to Figure 2L.

      *Reviewer 2, minor comment 4: Figure 3H. It would be helpful to clearly mark the edge of the corneal lens in the chitin intensity image. *

      As described under reviewer 1, minor comment 9, we have added a high magnification picture showing the edge region used for chitin quantification (Fig. 3I), which should also address reviewer 2’s concern.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Wang et al. studied an old, still unresolved problem: Why are reaching movements often biased? Using data from a set of new experiments and from earlier studies, they identified how the bias in reach direction varies with movement direction, and how this depends on factors such as the hand used, the presence of visual feedback, the size and location of the workspace, the visibility of the start position and implicit sensorimotor adaptation. They then examined whether a visual bias, a proprioceptive bias, a bias in the transformation from visual to proprioceptive coordinates and/or biomechanical factors could explain the observed patterns of biases. The authors conclude that biases are best explained by a combination of transformation and visual biases.

      A strength of this study is that it used a wide range of experimental conditions with also a high resolution of movement directions and large numbers of participants, which produced a much more complete picture of the factors determining movement biases than previous studies did. The study used an original, powerful, and elegant method to distinguish between the various possible origins of motor bias, based on the number of peaks in the motor bias plotted as a function of movement direction. The biomechanical explanation of motor biases could not be tested in this way, but this explanation was excluded in a different way using data on implicit sensorimotor adaptation. This was also an elegant method as it allowed the authors to test biomechanical explanations without the need to commit to a certain biomechanical cost function.

      We thank the reviewer for their enthusiastic comments.

      (1) The main weakness of the study is that it rests on the assumption that the number of peaks in the bias function is indicative of the origin of the bias. Specifically, it is assumed that a proprioceptive bias leads to a single peak, a transformation bias to two peaks, and a visual bias to four peaks, but these assumptions are not well substantiated. Especially the assumption that a transformation bias leads to two peaks is questionable. It is motivated by the fact that biases found when participants matched the position of their unseen hand with a visual target are consistent with this pattern. However, it is unclear why that task would measure only the effect of transformation biases, and not also the effects of visual and proprioceptive biases in the sensed target and hand locations. Moreover, it is not explained why a transformation bias would lead to this specific bias pattern in the first place.

      We would like to clarify two things.

      First, the measurements of the transformation bias are not entirely independent of proprioceptive and visual biases. Specifically, we define transformation bias as the misalignment between the internal representation of a visual target and the corresponding hand position. By this definition, the transformation error entails both visual and proprioceptive biases (see Author response image 1). Transformation biases have been empirically quantified in numerous studies using matching tasks, where participants either aligned their unseen hand to a visual target (Wang et al., 2021) or aligned a visual target to their unseen hand (Wilson et al., 2010). Indeed, those tasks are always considered to measure proprioceptive biases, under the assumption that visual bias is small given the minimal visual uncertainty.

      Author response image 1.

      Second, the critical difference between models is in how these biases influence motor planning rather than how those biases are measured. In the Proprioceptive bias model, a movement is planned in visual space. The system perceives the starting hand position in proprioceptive space and transforms this into visual space (Vindras & Viviani, 1998; Vindras et al., 2005). As such, bias only affects the perceived starting position; there is no influence on the perceived target location (no visual bias).

      In contrast, the Transformation bias model proposes that while both the starting and target positions are perceived in visual space, movement is planned in proprioceptive space. Consequently, both positions must be transformed from visual space to proprioceptive coordinates before movement planning (i.e., where is my sensed hand and where do I want it to be). Under this framework, biases can emerge from both the start and target positions. This is how the transformation model leads to different predictions compared to the perceptual models, even if the bias is based on the same measurements.

      We now highlight the differences between the Transformation bias model and the Proprioceptive bias model explicitly in the Results section (Lines 192-200):

      “Note that the Proprioceptive Bias model and the Transformation Bias model tap into the same visuo-proprioceptive error map. The key difference between the two models arises in how this error influences motor planning. For the Proprioceptive Bias model, planning is assumed to occur in visual space. As such, the perceived position of the hand (based on proprioception) is transformed into the visual space. This will introduce a bias in the representation of the start position. In contrast, the Transformation Bias model assumes that the visually-based representations of the start and target positions need to be transformed into proprioceptive space for motor planning. As such, both positions are biased in the transformation process. In addition to differing in their representation of the target, the two models introduce errors at the start position in opposite directions, owing to the direction of the transformation (see Fig. 1g-h).”

      In terms of the motor bias function across the workspace, the peaks are quantitatively derived from the model simulations. The number of peaks depends on how we formalize each model. Importantly, this is a stable feature of each model, regardless of how the model is parameterized. Thus, the number of peaks provides a useful criterion to evaluate different models.

      Figure 1g-h illustrates the intuition of how the models generate distinct peak patterns. We edited the figure caption and now reference this figure when we introduce the bias function for each model.
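
      To make the peak-counting criterion concrete, the sketch below counts local maxima in a bias function sampled around the full circle of movement directions. The sinusoids are toy stand-ins chosen only because they have one, two, and four peaks; they are not the authors' models, and the function name `count_peaks` is hypothetical.

```python
import math


def count_peaks(values):
    """Count local maxima in a periodic sequence (the ends wrap around)."""
    n = len(values)
    peaks = 0
    for i in range(n):
        prev_v = values[(i - 1) % n]
        next_v = values[(i + 1) % n]
        # Strict on the left, non-strict on the right, so a flat
        # two-sample top is counted once rather than twice.
        if values[i] > prev_v and values[i] >= next_v:
            peaks += 1
    return peaks


# Movement directions sampled every 5 degrees around the circle.
directions = [math.radians(d) for d in range(0, 360, 5)]

# Toy bias functions (assumptions for illustration, not fitted models).
one_peaked = [math.sin(t) for t in directions]
two_peaked = [math.sin(2 * t) for t in directions]
four_peaked = [math.sin(4 * t) for t in directions]

for name, curve in [("one", one_peaked), ("two", two_peaked), ("four", four_peaked)]:
    print(name, count_peaks(curve))
```

      Because the count depends only on the shape of the curve and not on its amplitude, it stays the same when a curve is rescaled, which is the sense in which the number of peaks can serve as a parameter-independent signature.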

      (2) Also, the assumption that a visual bias leads to four peaks is not well substantiated as one of the papers on which the assumption was based (Yousif et al., 2023) found a similar pattern in a purely proprioceptive task.

      What we referred to in the original submission as “visual bias” is not an eye-centric bias, nor is it restricted to the visual system. Rather, it may reflect a domain-general distortion in the representation of position within polar space. We called it a visual bias as it was associated with the perceived location of the visual target in the current task. To avoid confusion, we have opted to move to a more general term and now refer to this as “target bias.”

      We clarify the nature of this bias when introducing the model in the Results section (Lines 164-169):

      “Since the task permits free viewing without enforced fixation, we assume that participants shift their gaze to the visual target; as such, an eye-centric bias is unlikely. Nonetheless, prior studies have shown a general spatial distortion that biases perceived target locations toward the diagonal axes (Huttenlocher et al., 2004; Kosovicheva & Whitney, 2017). Interestingly, this bias appears to be domain-general, emerging not only for visual targets but also for proprioceptive ones (Yousif et al., 2023). We incorporated this diagonal-axis spatial distortion into a Target Bias model. This model predicts a four-peaked motor bias pattern (Fig 1f).”

      We also added a paragraph in the Discussion to further elaborate on this model (Lines 502-511):

      “What might be the source of the visual bias in the perceived location of the target? In the perception literature, a prominent account has focused on the role of visual working memory, based on the observation that in delayed response tasks, participants exhibit a bias towards the diagonals when recalling the location of visual stimuli (Huttenlocher et al., 2004; Sheehan & Serences, 2023). Underscoring that the effect is not motoric, this bias is manifest regardless of whether the response is made by an eye movement, pointing movement, or keypress (Kosovicheva & Whitney, 2017). However, this bias is unlikely to be dependent on a visual input, as a similar diagonal bias is observed when the target is specified proprioceptively via the passive displacement of an unseen hand (Yousif et al., 2023). Moreover, as shown in the present study, a diagonal bias is observed even when the target is continuously visible. Thus, we hypothesize that the bias to perceive the target towards the diagonals reflects a more general distortion in spatial representation rather than being a product of visual working memory.”

      (3) Another weakness is that the study looked at biases in movement direction only, not at biases in movement extent. The models also predict biases in movement extent, so it is a missed opportunity to take these into account to distinguish between the models.

      We thank the reviewer for this suggestion. We have now conducted a new experiment to assess angular and extent biases simultaneously (Figure 4a; Exp. 4; N = 30). Using our KINARM system, participants were instructed to make center-out movements that would terminate (rather than shoot past) at the visual target. No visual feedback was provided throughout the experiment.

      The Transformation Bias model predicts a two-peaked error function in both the angular and extent dimensions (Figure 4c). Strikingly, when we fit the model to both dimensions of the data from the new experiment simultaneously, it captures the results qualitatively and quantitatively (Figure 4e). In terms of model comparison, it outperformed alternative models (Figure 4g), particularly when augmented with a visual bias component. Together, these results provide strong evidence that a mismatch between visual and proprioceptive space is a key source of motor bias.

      This experiment is now reported within the revised manuscript (Lines 280-301).
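
      For readers unfamiliar with the two error dimensions, the following sketch shows one way to compute angular and extent errors from a single reach endpoint. The function name and the sign conventions (counterclockwise angular error is positive, overshoot is positive) are assumptions for illustration, not details taken from the manuscript.

```python
import math


def reach_errors(start, target, endpoint):
    """Return (angular_error_deg, extent_error) for one reach.

    Assumed conventions: angular error is the signed angle from the
    start-to-target vector to the start-to-endpoint vector, with
    counterclockwise positive; extent error is the reached distance
    minus the target distance, so overshoot is positive.
    """
    tx, ty = target[0] - start[0], target[1] - start[1]
    ex, ey = endpoint[0] - start[0], endpoint[1] - start[1]
    cross = tx * ey - ty * ex  # proportional to the sine of the angle
    dot = tx * ex + ty * ey    # proportional to the cosine of the angle
    angular_error_deg = math.degrees(math.atan2(cross, dot))
    extent_error = math.hypot(ex, ey) - math.hypot(tx, ty)
    return angular_error_deg, extent_error


if __name__ == "__main__":
    # A reach that is rotated counterclockwise and overshoots the target.
    print(reach_errors((0.0, 0.0), (10.0, 0.0), (10.0, 10.0)))
```

      Computing both quantities for every target direction yields two error functions of direction, which is the form in which the models' angular and extent predictions can be compared with the data.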

      Overall, the authors have done a good job mapping out reaching biases in a wide range of conditions, revealing new patterns in one of the most basic tasks, but unambiguously determining the origin of these biases remains difficult, and the evidence for the proposed origins is incomplete. Nevertheless, the study will likely have a substantial impact on the field, as the approach taken is easily applicable to other experimental conditions. As such, the study can spark future research on the origin of reaching biases.

      We thank the reviewer for these summary comments. We believe that the new experiments and analyses do a better job of identifying the origins of motor biases.

      Reviewer #2 (Public Review):

      Summary:

      This work examines an important question in the planning and control of reaching movements - where do biases in our reaching movements arise and what might this tell us about the planning process? They compare several different computational models to explain the results from a range of experiments including those within the literature. Overall, they highlight that motor biases are primarily caused by errors in the transformation between eye and hand reference frames. One strength of the paper is the large number of participants studied across many experiments. However, one weakness is that most of the experiments follow a very similar planar reaching design - with slicing movements through targets rather than stopping within a target. Moreover, there are concerns with the models and the model fitting. This work provides valuable insight into the biases that govern reaching movements, but the current support is incomplete.

      Strengths:

      The work uses a large number of participants both with studies in the laboratory which can be controlled well and a huge number of participants via online studies. In addition, they use a large number of reaching directions allowing careful comparison across models. Together these allow a clear comparison between models which is much stronger than would usually be performed.

      We thank the reviewer for their encouraging comments.

      Weaknesses:

      Although the topic of the paper is very interesting and potentially important, there are several key issues that currently limit the support for the conclusions. In particular I highlight:

      (1) Almost all studies within the paper use the same basic design: slicing movements through a target with the hand moving on a flat planar surface. First, this means that the authors cannot compare the second component of a bias - the error in the extent of a reach, which is often much larger than the error in reaching direction.

      Reviewer 1 made a similar point, noting that we had missed an opportunity to provide a more thorough assessment of reaching biases. As described above, we conducted a new experiment in which participants made pointing movements and were instructed to terminate the movements at the target. These data allow us to analyze errors in both angular and extent dimensions. The transformation bias model successfully predicts both angular and extent biases, outperforming the other models at both the group and individual levels. We have now included this result as Exp 4 in the manuscript. Please see our response to Reviewer 1, Comment 3 for details.

      Second, there are several studies that have examined biases in three-dimensional reaching movements showing important differences to two-dimensional reaching movements (e.g. Soechting and Flanders 1989). It is unclear how well the authors' computational models could explain the biases that are present in these much more common reaching movements.

      This is an interesting issue to consider. We expect the mechanisms identified in our 2D work will generalize to 3D.

      Soechting and Flanders (1989) quantified 3D biases by measuring errors across multiple 2D planes at varying heights (see Author response image 2 for an example from their paper). When their 3D bias data are projected onto a horizontal 2D space, the direction of the bias across the 2D plane looks relatively consistent across different heights even though the absolute value of the bias varies (Author response image 2). For example, the matched hand position is generally leftward and downward of the target. Therefore, the models we have developed and tested in a specific 2D plane are likely to generalize to other 2D planes at different heights.

      Author response image 2.

      However, we think the biases reported by Soechting and Flanders likely reflect transformation biases rather than motor biases. First, the movements in their study were performed very slowly (3–5 seconds), more similar to our proprioceptive matching tasks and much slower than natural reaching movements (<500 ms). Given the slow speed, we suspect that motor planning in Soechting and Flanders was likely done in a stepwise, incremental manner (closed loop to some degree). Second, the bias pattern reported in Soechting and Flanders, when projected into 2D space, closely mirrors the leftward transformation errors observed in previous visuo-proprioceptive matching tasks (e.g., Wang et al., 2021).

      In terms of the current manuscript, we think that our new experiment (Exp 4, where we measure angular and radial error) provides strong evidence that the transformation bias model generalizes to more naturalistic pointing movements. As such, we expect these principles will generalize were we to examine movements in three dimensions, an extension we plan to test in future work.

      (2) The model fitting section is under-explained and under-detailed currently. This makes it difficult to accurately assess the current model fitting and its strength to support the conclusions. If my understanding of the methods is correct, then I have several concerns. For example, the manuscript states that the transformation bias model is based on studies mapping out the errors that might arise across the whole workspace in 2D. In contrast, the visual bias model appears to be based on a study that presented targets within a circle (but not tested across the whole workspace). If the visual bias had been measured across the workspace (similar to the transformation bias model), would the model and therefore the conclusions be different?

      We have substantially expanded the Methods section to clarify the modeling procedures (detailed below in section “Recommendations for the Authors”). We also provide annotated code to enable others to easily simulate the models.

      Here we address three points relevant to the reviewer’s concern about whether the models were tested on equal footing, and in particular, concern that the transformation bias model was more informed by prior literature than the visual bias model.

      First, our center-out reaching task used target locations that have been employed in both visual and proprioceptive bias studies, offering reasonably comprehensive coverage of the workspace. For example, for a target to the left of the body’s midline, visual biases tend to be directed diagonally (Kosovicheva & Whitney, 2017), while transformation biases are typically leftward and downward (Wang et al., 2021). In this sense, the models were similarly constrained by prior findings.

      Second, while the qualitative shape of each model was guided by prior empirical findings, no previous data were directly used to quantitatively constrain the models. As such, we believe the models were evaluated on equal footing. No model had more information or, best we can tell, an inherent advantage over the others.

      Third, reassuringly, the fitted transformation bias closely matches empirically observed bias maps reported in prior studies (Fig 2h). The strong correspondence provides convergent validity and supports the putative causal link between transformation biases and motor biases.

      (3) There should be other visual bias models theoretically possible that might fit the experimental data better than this one possible model. Such possibilities also exist for the other models.

      Our initial hypothesis, grounded in prior literature, was that motor biases arise from a combination of proprioceptive and visual biases. This led us to thoroughly explore a range of visual models. We now describe these alternatives below, noting that in the paper, we chose to focus on models that seemed the most viable candidates. (Please also see our response to Reviewer 3, Point 2, on another possible source of visual bias, the oblique effect.)

      Quite a few models have described visual biases in perceiving motion direction or object orientation (e.g., Wei & Stocker, 2015; Patten, Mannion & Clifford, 2017). Orientation perception would be biased towards the Cartesian axis, generating a four-peak function. However, these models failed to account for the motor biases observed in our experiments. This is not surprising given that these models were not designed to capture biases related to a static location.

      We also considered a class of eye-centric models where biases for peripheral locations are measured under fixation. A prominent finding here is that the bias is along the radial axis in which participants overshoot targets when they fixate on the start position during the movement (Beurze et al., 2006; Van Pelt & Medendorp, 2008). Again, this is not consistent with the observed motor biases. For example, participants undershoot rightward targets when we measured the distance bias in Exp 4. Importantly, since most of our tasks involved free viewing in natural settings with no fixation requirements, we considered it unlikely that biases arising from peripheral viewing play a major role.

      We note, though, that in our new experiment (Exp 4), participants observed the visual stimuli from a fixed angle in the KinArm setup (see Figure 4a). This setup has been shown to induce depth-related visual biases (Figure 4b, e.g., Volcic et al., 2013; Hibbard & Bradshaw, 2003). For this reason, we implemented a model incorporating this depth bias as part of our analyses of these data. While this model performed significantly worse than the transformation bias model alone, a mixed model that combined the depth bias and transformation bias provided the best overall fit. We now include this result in the main text (Lines 286-294).

      We also note that the “visual bias” we referred to in the original submission is not restricted to the visual system. A similar bias pattern has been observed when the target is presented visually or proprioceptively (Kosovicheva & Whitney, 2017; Yousif, Forrence, & McDougle, 2023). As such, it may reflect a domain-general distortion in the representation of position within polar space. Accordingly, in the revision, we now refer to this in a more general way, using the term “target bias.” We justify this nomenclature when introducing the model in the Results section (Lines 164-169). Please also see Reviewer 1 comment 2.

      We recognize that future work may uncover a better visual model or provide a more fine-grained account of visual biases (or biases from other sources). With our open-source simulation code, such biases can be readily incorporated—either to test them against existing models or to combine them with our current framework to assess their contribution to motor biases. Given our explorations, we expect our core finding will hold: Namely, that a combination of transformation and target biases offers the most parsimonious account, with the bias associated with the transformation process explaining the majority of the observed motor bias in visually guided movements.

      Given the comments from the reviewer, we expanded the Discussion section to address the issue of alternative models of visual bias (lines 522-529):

      “Other forms of visual bias may influence movement. Depth perception biases could contribute to biases in movement extent (Beurze et al., 2006; Van Pelt & Medendorp, 2008). Visual biases towards the principal axes have been reported when participants are asked to report the direction of moving targets or the orientation of an object (Patten et al., 2017; Wei & Stocker, 2015). However, the predicted patterns of reach biases do not match the observed biases in the current experiments. We also considered a class of eye-centric models in which participants overestimate the radial distance to a target while maintaining central fixation (Beurze et al., 2006; Van Pelt & Medendorp, 2008). At odds with this hypothesis, participants undershot rightward targets when we measured the radial bias in Exp 4. The absence of these other distortions of visual space may be accounted for by the fact that we allowed free viewing during the task.”

      (4) Although the authors do mention that the evidence against biomechanical contributions to the bias is fairly weak in the current manuscript, this needs to be further supported. Importantly both proprioceptive models of the bias are purely kinematic and appear to ignore the dynamics completely. One imagines that there is a perceived vector error in Cartesian space whereas the other imagines an error in joint coordinates. These simply result in identical movements which are offset either with a vector or an angle. However, we know that the motor plan is converted into muscle activation patterns which are sent to the muscles, that is, the motor plan is converted into an approximation of joint torques. Joint torques sent to the muscles from a different starting location would not produce an offset in the trajectory as detailed in Figure S1, instead, the movements would curve in complex patterns away from the original plan due to the non-linearity of the musculoskeletal system. In theory, this could also bias some of the other predictions as well. The authors should consider how the biomechanical plant would influence the measured biases.

      We thank the reviewer for encouraging us to pursue this topic and to formalize a biomechanical model. In response, we have implemented a state-of-the-art biomechanical framework, MotorNet (https://elifesciences.org/articles/88591), which simulates a six-muscle, two-skeleton planar arm model using recurrent neural networks (RNNs) to generate control policies (see Figure 6a). This model captures key predictions about movement curvature arising from biomechanical constraints. We view it as a strong candidate for illustrating how motor bias patterns could be shaped by the mechanical properties of the upper limb.

      Interestingly, the biomechanical model did not qualitatively or quantitatively reproduce the pattern of motor biases observed in our data. Specifically, we trained 50 independent agents (RNNs) to perform random point-to-point reaching movements across the workspace used in our task. We used a loss function that minimized the distance between the fingertip and the target over the entire trajectory. When tested on a center-out reaching task, the model produced a four-peaked motor bias pattern (Figure 6b), in contrast to the two-peaked function observed empirically. These results suggest that upper limb biomechanical constraints are unlikely to be a primary driver of motor biases in reaching. This holds true even though the reported bias is read out at 60% of the reaching distance, where biomechanical influences on the curvature of movement are maximal. We have added this analysis to the results (lines 367-373).
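
      To make the readout concrete, the following is a minimal sketch of how an angular bias could be measured at a fixed fraction of the reach distance. It is not the analysis code used in the study: the function name, the "first sample past the criterion" rule, and the toy trajectory are all assumptions made for illustration.

```python
import numpy as np

def angular_error_at_fraction(traj, start, target, fraction=0.6):
    """Signed angular error (deg) at the first trajectory sample whose distance
    from the start reaches `fraction` of the start-to-target distance.
    Hypothetical helper; the readout rule is an assumption."""
    traj = np.asarray(traj, dtype=float)
    start = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    rel = traj - start                              # hand position relative to start
    dist = np.linalg.norm(rel, axis=1)
    reach = np.linalg.norm(target - start)
    idx = int(np.argmax(dist >= fraction * reach))  # first sample past the criterion
    if dist[idx] < fraction * reach:
        raise ValueError("trajectory never reaches the readout distance")
    hand = np.arctan2(rel[idx, 1], rel[idx, 0])
    goal = np.arctan2(target[1] - start[1], target[0] - start[0])
    # Wrap the difference into [-180, 180) degrees.
    return np.degrees((hand - goal + np.pi) % (2 * np.pi) - np.pi)

# Toy example: a straight 8 cm reach rotated 10 deg counterclockwise from the target.
t = np.linspace(0.0, 1.0, 11)[:, None]
direction = np.array([np.cos(np.radians(10.0)), np.sin(np.radians(10.0))])
toy_traj = t * 8.0 * direction
toy_error = angular_error_at_fraction(toy_traj, start=[0.0, 0.0], target=[8.0, 0.0])
```

      For a curved trajectory, the value returned depends on the chosen fraction, which is why the readout point matters when asking whether curvature could drive the measured bias.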

      It may seem counterintuitive that biomechanics plays a limited role in motor planning. This could be due to several factors. First, task demands (such as the need to grasp objects) may lead the biomechanical system to be inherently organized to minimize endpoint errors (Hu et al., 2012; Trumbower et al., 2009). Second, through development and experience, the nervous system may have adapted to these biomechanical influences, detecting and compensating for them over time (Chiel et al., 2009).

      That said, biomechanical constraints may make a larger contribution in other contexts; for example, when movements involve more extreme angles or span larger distances, or in individuals with certain musculoskeletal impairments (e.g., osteoarthritis) where physical limitations are more likely to come into play. We address this issue in the revised discussion.

      “Nonetheless, the current study does not rule out the possibility that biomechanical factors may influence motor biases in other contexts. Biomechanical constraints may have had limited influence in our experiments due to the relatively modest movement amplitudes used and minimal interaction torques involved. Moreover, while we have focused on biases that manifest at the movement endpoint, biomechanical constraints might introduce biases that are manifest in the movement trajectories (Alexander, 1997; Nishii & Taniai, 2009). Future studies are needed to examine the influence of context on reaching biases.”

      Reviewer #3 (Public review):

      The authors make use of a large dataset of reaches from several studies run in their lab to try to identify the source of direction-dependent radial reaching errors. While this has been investigated by numerous labs in the past, this is the first study where the sample is large enough to reliably characterize isometries associated with these radial reaches to identify possible sources of errors.

      (1) The sample size is impressive, but the authors should include confidence intervals and ideally, the distribution of responses across individuals along with average performance across targets. It is unclear whether the observed “averaged function” is consistently found across individuals, or if it is mainly driven by a subset of participants exhibiting large deviations for diagonal movements. Providing individual-level data or response distributions would be valuable for assessing the ubiquity of the observed bias patterns and ruling out the possibility that different subgroups are driving the peaks and troughs. It is possible that the Transformation or some other model (see below) could explain the bias function for a substantial portion of participants, while other participants may have different patterns of biases that can be attributable to alternative sources of error.

      We thank the reviewer for encouraging a closer examination of the individual-level data. We did include standard error when we reported the motor bias function. Given that the error distribution is relatively Gaussian, we opted to not show confidence intervals since they would not provide additional information.

      To examine individual differences, we now report a best-fit model frequency analysis. For Exp 1, we fit each model at the individual level and counted the number of participants that are best predicted by each model. Among the four single source models (Figure 3a), the vast majority of participants are best explained by the transformation bias model (48/56). When incorporating mixture models, the combined transformation + target bias model emerged as the best fit for almost all participants across experiments (50/56). The same pattern holds for Exp 3b, although the frequency analysis is more distributed, likely due to the added noise that comes with online studies.
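
      As an illustration of what such a frequency analysis involves, here is a minimal sketch that counts, for each model, how many participants it fits best. The score matrix, the model names, and the choice of a lower-is-better fit metric are assumptions for the example, not the values or metric used in the study.

```python
import numpy as np

def best_fit_counts(scores, model_names, lower_is_better=True):
    """Count how many participants are best fit by each model.

    scores: array of shape (n_participants, n_models) with one goodness-of-fit
            value per participant and model (hypothetical, e.g. a
            cross-validated error where lower is better).
    """
    scores = np.asarray(scores, dtype=float)
    # Index of the winning model for each participant (one entry per row).
    winners = scores.argmin(axis=1) if lower_is_better else scores.argmax(axis=1)
    counts = np.bincount(winners, minlength=len(model_names))
    return dict(zip(model_names, counts.tolist()))

# Toy example with made-up scores for three participants and two models.
toy_scores = [[10.0, 12.0],
              [8.0, 7.5],
              [9.0, 11.0]]
toy_counts = best_fit_counts(toy_scores, ["transformation", "target"])
```

      With real data, each row would hold one participant's fit values, and the resulting counts correspond to frequencies of the kind reported above (e.g., 48/56).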

      We report this new analysis in the Results (see Fig 3, Fig S2). Note that we opted to show some representative individual fits, selecting individuals whose data were best predicted by different models (Fig S2). Given that the number of peaks characterizes each model (independent of the specific parameter values), the two-peaked function exhibited for most participants indicates that the Transformation bias model holds at the individual level and not just at the group level.
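
      Because the argument rests on the number of peaks in the bias function, a small sketch of how peaks can be counted on a circular set of targets may help. The sinusoids below are stand-ins for two-peaked and four-peaked functions, not fitted model output, and the 24 target directions are an arbitrary choice. Real data would need smoothing first, since this simple rule counts every strict local maximum.

```python
import numpy as np

def count_circular_peaks(bias):
    """Number of strict local maxima in a function sampled evenly around a circle."""
    b = np.asarray(bias, dtype=float)
    prev_b = np.roll(b, 1)    # value at the previous direction (wraps around)
    next_b = np.roll(b, -1)   # value at the next direction (wraps around)
    return int(np.sum((b > prev_b) & (b > next_b)))

angles = np.radians(np.arange(0, 360, 15))   # 24 evenly spaced directions (assumed)
two_peaked = np.cos(2 * angles)              # stand-in for a two-peaked bias function
four_peaked = np.cos(4 * angles)             # stand-in for a four-peaked bias function

n_two = count_circular_peaks(two_peaked)
n_four = count_circular_peaks(four_peaked)
```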

      (2) The different datasets across different experimental settings/target sets consistently show that people show fewer deviations when making cardinal-directed movements compared to movements made along the diagonal when the start position is visible. This reminds me of a phenomenon referred to as the oblique effect: people show greater accuracy for vertical and horizontal stimuli compared to diagonal ones. While the oblique effect has been shown in visual and haptic perceptual tasks (both in the horizontal and vertical planes), there is some evidence that it applies to movement direction. These systematic reach deviations in the current study thus may reflect this epiphenomenon that applies across modalities. That is, estimating the direction of a visual target from a visual start position may be less accurate, and may be more biased toward the horizontal axis, than for targets that are strictly above, below, left, or right of the visual start position. Other movement biases may stem from poorer estimation of diagonal directions and thus reflect more of a perceptual error than a motor one. This would explain why the bias function appears in both the in-lab and on-line studies although the visual targets are at very different locations (different planes, different distances) since the oblique effects arise independent of plane, distance, or size of the stimuli. When the start position is not visible like in the Vindras study, it is possible that this oblique effect is less pronounced; masked by other sources of error that dominate when looking at 2D reach endpoints made from two separate start positions, rather than only directional errors from a single start position. Or perhaps the participants in the Vindras study are too variable and too few (only 10) to detect this rather small direction-dependent bias.

      The potential link between the oblique effect and the observed motor bias is an intriguing idea, one that we had not considered. However, after giving this some thought, we see several arguments against the idea that the oblique effect accounts for the pattern of motor biases.

      First, by the oblique effect, perceptual variability is greater along the diagonal axes compared to the cardinal axes. These differences in perceptual variability have been used to explain biases in visual perception through a Bayesian model under the assumption that the visual system has an expectation that stimuli are more likely to be oriented along the cardinal axes (Wei & Stocker, 2015). Importantly, the model predicts low biases at targets with peak perceptual variability. As such, even though those studies observed that participants showed large variability for stimuli at diagonal orientations, the bias for these stimuli was close to zero. Given we observed a large bias for targets at locations along the diagonal axes, we do not think this visual effect can explain the motor bias function.

      Second, the reviewer suggested that the observed motor bias might be largely explained by visual biases (or what we now refer to as target biases). If this hypothesis is correct, we would anticipate observing a similar bias pattern in tasks that use a similar layout for visual stimuli but do not involve movement. However, this prediction is not supported. For example, Kosovicheva & Whitney (2017) used a position reproduction/judgment task with keypress responses (no reaching). The stimuli were presented in a similar workspace as in our task. Their results showed a four-peaked bias function while our results showed a two-peaked function.

      In summary, we don’t think oblique biases make a significant contribution to our results.

      A bias in estimating visual direction or visual movement vector is a more realistic and relevant source of error than the proposed visual bias model. The Visual Bias model is based on data from a study by Huttenlocher et al where participants “point” to indicate the remembered location of a small target presented on a large circle. The resulting patterns of errors could therefore be due to localizing a remembered visual target, or due to relative or allocentric cues from the clear contour of the display within which the target was presented, or even movements used to indicate the target. This may explain the observed 4-peak bias function or zig-zag pattern of “averaged” errors, although this pattern may not even exist at the individual level, especially given the small sample size. The visual bias source argument does not seem well-supported, as the data used to derive this pattern likely reflects a combination of other sources of errors or factors that may not be applicable to the current study, where the target is continuously visible and relatively large. Also, any visual bias should be explained in coordinates centred on the eye and should vary as a function of the location of visual targets relative to the eyes. Where the visual targets are located relative to the eyes (or at least the head) is not reported.

      Thank you for this question. A few key points to note:

      The visual bias model has also been discussed in studies using a similar setup to our study. Kosovicheva & Whitney (2017) observed a four-peaked function in experiments in which participants report a remembered target position on a circle by either making saccades or using key presses to adjust the position of a dot. However, we agree that this bias may be attenuated in our experiment given that the target is continuously visible. Indeed, the model fitting results suggest the peak of this bias is smaller in our task (~3°) compared to previous work (~10°, Kosovicheva & Whitney, 2017; Yousif, Forrence, & McDougle, 2023).

      We also agree with the reviewer that this “visual bias” is not an eye-centric bias, nor is it restricted to the visual system. A similar bias pattern is observed even if the target is presented proprioceptively (Yousif, Forrence, & McDougle, 2023). As such, this bias may reflect a domain-general distortion in the representation of position within polar space. Accordingly, in the revision, we now refer to this in a more general way, using the term “target bias”, rather than visual bias. We justify this nomenclature when introducing the model in the Results section (Lines 164-169). Please also see Reviewer 1 comment 2 for details.

      Motivated by Reviewer 2, we also examined multiple alternative visual bias models (please refer to our response to Reviewer 2, Point 3).

      The Proprioceptive Bias Model is supposed to reflect errors in the perceived start position. However, in the current study, there is only a single, visible start position, which is not the best design for trying to study the contribution. In fact, my paradigms also use a single, visual start position to minimize the contribution of proprioceptive biases, or at least remove one source of systematic biases. The Vindras study aimed to quantify the effect of start position by using two sets of radial targets from two different, unseen start positions on either side of the body midline. When fitting the 2D reach errors at both the group and individual levels (which showed substantial variability across individuals), the start position predicted most of the 2D errors at the individual level – and substantially more than the target direction. While the authors re-plotted the data to only illustrate angular deviations, they only showed averaged data without confidence intervals across participants. Given the huge variability across their 10 individuals and between the two target sets, it would be more appropriate to plot the performance separately for two target sets and show confidence intervals (or individual data). Likewise, even the VT model predictions should differ across the two target sets since the visual-proprioceptive matching errors from the Wang et al study that the model is based on, are larger for targets on the left side of the body.

      To be clear, in the Transformation bias model, the vector bias at the start position is also an important source of error. The critical difference between the proprioceptive and transformation models is how bias influences motor planning. In the Proprioceptive bias model, movement is planned in visual space. The system perceives the starting hand position in proprioceptive space and transforms this into visual space (Vindras & Viviani, 1998; Vindras et al., 2005). As such, the bias is only relevant in terms of the perceived start position; it does not influence the perceived target location. In contrast, the transformation bias model proposes that while both the starting and target positions are perceived in visual space, movements are planned in proprioceptive space. Consequently, when the start and target positions are visible, both positions must be transformed from visual space to proprioceptive coordinates before movement planning. Thus, bias will influence both the start and target positions. We also note that to set the transformation bias for the start/target position, we referred to studies in which bias is usually referred to as proprioception error measurement. As such, changing the start position has a similar impact on the Transformation and the Proprioceptive Bias models in principle, and would not provide a stronger test to separate them.

      We now highlight the differences between the models in the Results section, making clear that the bias at the start position influences both the Proprioceptive bias and Transformation bias models (Lines 192-200).

      “Note that the Proprioceptive Bias model and the Transformation Bias model tap into the same visuo-proprioceptive error map. The key difference between the two models arises in how this error influences motor planning. For the Proprioceptive Bias model, planning is assumed to occur in visual space. As such, the perceived position of the hand (based on proprioception) is transformed into visual space. This will introduce a bias in the representation of the start position. In contrast, the Transformation Bias model assumes that the visually-based representations of the start and target positions need to be transformed into proprioceptive space for motor planning. As such, both positions are biased in the transformation process. In addition to differing in terms of their representation of the target, the error introduced at the start position is in opposite directions due to the direction of the transformation (see fig 1g-h).”
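
      The contrast between the two planning schemes can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the fitted models from the manuscript: the sign conventions, the shoulder location, the gain, and all function names are hypothetical. The start error in the proprioceptive sketch is expressed directly in visual space, so its sign relative to the transformation field is left to the caller.

```python
import numpy as np

def angle_between(planned, ideal):
    """Signed angle (deg) from the ideal movement vector to the planned one."""
    cross = ideal[0] * planned[1] - ideal[1] * planned[0]
    dot = ideal[0] * planned[0] + ideal[1] * planned[1]
    return np.degrees(np.arctan2(cross, dot))

def proprioceptive_bias(start, target, start_error):
    """Planning in visual space: only the perceived start position is shifted."""
    start, target = np.asarray(start, float), np.asarray(target, float)
    planned = target - (start + np.asarray(start_error, float))
    return angle_between(planned, target - start)

def transformation_bias(start, target, field):
    """Planning in proprioceptive space: start and target are both shifted
    by a position-dependent field before the movement vector is computed."""
    start, target = np.asarray(start, float), np.asarray(target, float)
    planned = (target + field(target)) - (start + field(start))
    return angle_between(planned, target - start)

def uniform_field(pos):
    """Same leftward/downward shift everywhere (hypothetical, in cm)."""
    return np.array([-1.0, -1.0])

SHOULDER = np.array([15.0, -30.0])  # hypothetical right-shoulder position (cm)

def scaled_field(pos, gain=0.05):
    """Leftward/downward shift that grows with distance from the shoulder."""
    return gain * np.linalg.norm(np.asarray(pos, float) - SHOULDER) * np.array([-1.0, -1.0])

start, target = [0.0, 0.0], [8.0, 0.0]
bias_prop = proprioceptive_bias(start, target, start_error=[-1.0, -1.0])
bias_uniform = transformation_bias(start, target, uniform_field)
bias_scaled = transformation_bias(start, target, scaled_field)
```

      In this sketch, a spatially uniform transformation field shifts the start and target equally and therefore produces no angular bias; only the difference between the shifts at the two positions matters. A start-position error in the proprioceptive scheme, by contrast, always rotates the planned vector.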

      In terms of fitting individual data, we have conducted a new experiment, reported as Exp 4 in the revised manuscript (details in our response to Reviewer 1, comment 3). The experiment has a larger sample size (n=30) and importantly, examined error for both movement angle and movement distance. We chose to examine the individual differences in 2-D biases using this sample rather than Vindras’ data as our experiment has greater spatial resolution and more participants. At both the group and individual level, the Transformation bias model is the best single source model, and the Transformation + Target Bias model is the best combined model. These results strongly support the idea that the transformation bias is the main source of the motor bias.

      As for the different initial positions in Vindras et al (2005), the two target sets have very similar patterns of motor biases. As such, we opted to average them to decrease noise. Notably, the transformation model also predicts that altering the start location should have limited impact on motor bias patterns: What matters for the model is the relative difference between the transformation biases at the start and target positions rather than the absolute bias.

      Author response image 3.

      I am also having trouble fully understanding the V-T model and its associated equations, and whether visual-proprioception matching data is a suitable proxy for estimating the visuomotor transformation. I would be interested to first see the individual distributions of errors and a response to my concerns about the Proprioceptive Bias and Visual Bias models.

      We apologize for the lack of clarity on this model. To generate the T+V (now Transformation + Target bias, or TR+TG) model, we assume the system misperceives the target position (Target bias, see Fig S5a) and then transforms the start and misperceived target positions into proprioceptive space (Fig S5b). The system then generates a motor plan in proprioceptive space; this plan will result in the observed motor bias (Fig. S5c). We now include this figure as Fig S5 and hope that it makes the model features salient.
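
      The three steps described above can be written out as a short pipeline. This is a sketch under assumptions rather than the published implementation: the target bias is treated as a purely angular distortion supplied by the caller, the transformation is an arbitrary function of position, and every name here is hypothetical.

```python
import numpy as np

def tr_tg_motor_bias(start, target, target_bias_deg, transform):
    """Angular motor bias (deg) from a target bias followed by a transformation bias.

    target_bias_deg: function mapping target direction (deg) to an angular
                     misperception (deg); assumed form, supplied by the caller.
    transform:       function mapping a 2D position to its shift when moved
                     from visual to proprioceptive space (assumed form).
    """
    start = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)

    # Step 1: misperceive the target direction, keeping its distance fixed.
    vec = target - start
    theta = np.arctan2(vec[1], vec[0])
    theta_perceived = theta + np.radians(target_bias_deg(np.degrees(theta)))
    perceived_target = start + np.linalg.norm(vec) * np.array(
        [np.cos(theta_perceived), np.sin(theta_perceived)])

    # Step 2: transform the start and the misperceived target.
    start_p = start + transform(start)
    target_p = perceived_target + transform(perceived_target)

    # Step 3: plan the movement in proprioceptive space and compare its
    # direction with the ideal start-to-target direction.
    plan = target_p - start_p
    error = np.arctan2(plan[1], plan[0]) - theta
    return np.degrees((error + np.pi) % (2 * np.pi) - np.pi)  # wrap to [-180, 180)

def no_shift(pos):
    return np.zeros(2)                 # no transformation bias

def constant_bias(theta_deg):
    return 3.0                         # 3 deg misperception at every direction

bias_target_only = tr_tg_motor_bias([0.0, 0.0], [8.0, 0.0], constant_bias, no_shift)
bias_none = tr_tg_motor_bias([0.0, 0.0], [8.0, 0.0], lambda th: 0.0, no_shift)
```

      Passing the two bias sources in as functions keeps the pipeline separate from any particular parameterization, so alternative target-bias or transformation maps can be swapped in and compared.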

      Regarding whether the visuo-proprioceptive matching task is a valid proxy for transformation bias, we refer the reviewer to the comments made by Public Reviewer 1, comment 1. We define the transformation bias as the discrepancy between corresponding positions in visual and proprioceptive space. This can be measured using matching tasks in which participants either aligned their unseen hand to a visual target (Wang et al., 2021) or aligned a visual target to their unseen hand (Wilson et al., 2010).

      Nonetheless, when fitting the model to the motor bias data, we did not directly impose the visual-proprioceptive matching data. Instead, we used the shape of the transformation biases as a constraint, while allowing the exact magnitude and direction to be free parameters (e.g., a leftward and downward bias scaled by distance from the right shoulder). Reassuringly, the fitted transformation biases closely matched the magnitudes reported in prior studies (Fig. 2h, 1e), providing strong quantitative support for the hypothesized causal link between transformation and motor biases.

      Recommendations for the authors:

      Overall, the reviewers agreed this is an interesting study with an original and strong approach. Nonetheless, there were three main weaknesses identified. First, is the focus on bias in reach direction and not reach extent. Second, the models were fit to average data and not individual data. Lastly, and most importantly, the model development and assumptions are not well substantiated. Addressing these points would help improve the eLife assessment.

      Reviewer #1 (Recommendations for the authors):

      It is mentioned that the main difference between Experiments 1 and 3 is that in Experiment 3, the workspace was smaller and closer to the shoulder. Was the location of the laptop relative to the participant in Experiment 3 known by the authors? If so, variations in this location across participants can be used to test whether the Transformation bias was indeed larger for participants who had the laptop further from the shoulder.

      Another difference between Experiments 1 and 3 is that in Experiment 1, the display was oriented horizontally, whereas it was vertical in Experiment 3. To what extent can that have led to the different results in these experiments?

      This is an interesting point that we had not considered. Unfortunately, for the online work we do not record the participants’ posture.

      Regarding the influence of display orientation (horizontal vs. vertical), Author response image 4 presents three relevant data points: (1) Vandevoorde and Orban de Xivry (2019), who measured motor biases in-person across nine target positions using a tablet and vertical screen; (2) Our Experiment 1b, conducted online with a vertical setup; (3) Our in-person Experiment 3b, using a horizontal monitor. For consistency, we focus on the baseline conditions with feedback, the only condition reported in Vandevoorde. Motor biases from the two in-person studies were similar despite differing monitor orientations: Both exhibited two-peaked functions with comparable peak locations. We note that the bias attenuation in Vandevoorde may be due to their inclusion of reward-based error signals in addition to cursor feedback. In contrast, compared to the in-person studies, the online study showed reduced bias magnitude with what appears to be a four-peaked function. While more data are needed, these results suggest that the difference in the workspace (more restricted in our online study) may be more relevant than monitor orientation.

      Author response image 4.

      For the joint-based proprioceptive model, the equations used are for an arm moving in a horizontal plane at shoulder height, but the figures suggest the upper arm was more vertical than horizontal. How does that affect the predictions for this model?

      Please also see our response to your public comment 1. When the upper limb (or the lower limb) is not horizontal, it will influence the projection of the upper limb to the 2-D space. Effectively, in the joint-based proprioceptive model, this influences the ratio between L1 and L2 (see Author response image 5b below). However, adding a parameter to vary the L1/L2 ratio would not change the set of motor bias functions that can be produced by the model. Importantly, it will still generate a one-peak function. We simulated 50 motor bias functions across the possible parameter space. As shown by Author response image 5c-d, the peak and the magnitude of the motor bias functions are very similar with and without the L1/L2 term. We characterize the bias function with the peak position and the peak-to-valley distance. Based on those two factors, the distribution of the motor bias functions is very similar (Author response image 5e-f). Moreover, the L1/L2 ratio parameter is not recoverable by model fitting (Author response image 5c), suggesting that it is redundant with other parameters. As such, we only include the basic version of the joint-based proprioceptive model in our model comparisons.

      Author response image 5.

      It was unclear how the models were fit and how the BIC was computed. It is mentioned that the models were fit to average data across participants, but the BIC values were based on all trials for all participants, which does not seem consistent. And the models are deterministic, so how can a log-likelihood be determined? Since there were inter-individual differences, fitting to average data is not desirable. Take for instance the hypothetical case that some participants have a single peak at 90 deg, and others have a single peak at 270 deg. Averaging their data will then lead to a pattern with two peaks, which would be consistent with an entirely different model.

      We thank the reviewer for raising these issues.

      Given the reviewers’ comments, we now report fits at both the group and individual level (see response to reviewer 3 public comment 1). The group-level fitting is for illustration purposes. Model comparison is now based on the individual-level analyses, which show that the results are best explained by the transformation model when comparing single-source models and best explained by the T+V (now TG+TR) model when considering all models. These new results strongly support the transformation model.

      Log-likelihoods were computed assuming normally distributed motor noise around the motor biases predicted by each model.

      We updated the Methods section as follows (lines 841-853):

      “We used the fminsearchbnd function in MATLAB to minimize the sum of loglikelihood (LL) across all trials for each participant. LL were computed assuming normally distributed noise around each participant’s motor biases:

      [11] LL = normpdf(x, b, c)

      where x is the empirical reaching angle, b is the predicted motor bias by the model, c is motor noise, calculated as the standard deviation of (x − b). For model comparison, we calculated the BIC as follows:

      [12] BIC = -2LL+k∗ln(n)

      where k is the number of parameters of the models. Smaller BIC values correspond to better fits. We report the sum of ΔBIC by subtracting the BIC value of the TR+TG model from all other models.

      For illustrative purposes, we fit each model at the group level, pooling data across all participants to predict the group-averaged bias function.”
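
      As a rough illustration of equations [11] and [12], the sketch below computes a summed Gaussian log-likelihood and the corresponding BIC for one participant. It is written in Python rather than MATLAB, the function names are hypothetical, and it assumes that the log of normpdf is summed over trials and that the noise SD uses the n − 1 normalization (MATLAB's default for std). An optimizer would then minimize the negative of this log-likelihood.

```python
import math
import statistics


def gaussian_log_likelihood(observed, predicted):
    """Summed log-likelihood of observed angles under Normal(predicted, sigma).

    sigma is the standard deviation of the residuals (observed - predicted),
    using the n - 1 normalization (an assumption; see the lead-in).
    """
    residuals = [x - b for x, b in zip(observed, predicted)]
    sigma = statistics.stdev(residuals)
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - (r ** 2) / (2.0 * sigma ** 2) for r in residuals)


def bic(log_likelihood, n_params, n_trials):
    """Bayesian information criterion: -2 * LL + k * ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_trials)
```

      With a fixed log-likelihood and number of trials, each additional free parameter raises the BIC by ln(n), which is the penalty that lets models with different numbers of parameters be compared.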

      What was the delay of the visual feedback in Experiment 1?

      The visual delay in our setup was ~30 ms, with the procedure used to estimate this described in detail in Wang et al (2024, Curr. Bio.). We note that in calculating motor biases, we primarily relied on the data from the no-feedback block.

      Minor corrections

      In several places it is mentioned that movements were performed with proximal and distal effectors, but it's unclear where that refers to because all movements were performed with a hand (distal effector).

      By 'proximal and distal effectors,' we were referring to the fact that in the online setup, “reaching movements” are primarily made by finger and/or wrist movements across a trackpad, whereas in the in-person setup, the participants had to use their whole arm to reach about the workspace. To avoid confusion, we now refer to these simply as 'finger' versus 'hand' movements.

      In many figures, Bias is misspelled as Bais.

      Fixed.

      In Figure 3, what is meant by deltaBIC (*1000) etc? Literally, it would mean that the bars show 1,000 times the deltaBIC value, suggesting tiny deltaBIC values, but that's probably not what's meant.

      '×1000' in the original figure indicates the unit scaling, with ΔBIC values ranging from approximately 1000 to 4000. However, given that we now fit the models at the individual level, we have replaced this figure with a new one (Figure 3e) showing the distribution of individual BIC values.

      Reviewer #2 (Recommendations for the authors):

      I have concerns that the authors only examine slicing movements through the target and not movements that stop in the target. Biases create two major errors - errors in direction and errors in magnitude and here the authors have only looked at one of these. Previous work has shown that both can be used to understand the planning processes underlying movement. I assume that all models should also make predictions about the magnitude biases which would also help support or rule out specific models.

      Please see our response to Reviewer 1 public review 3.

      As discussed above, three-dimensional reaching movements also have biases and are not studied in the current manuscript. In such studies, biomechanical factors may play a much larger role.

      Please see our response to your public review.

      It may be that I am unclear on what exactly is done, as the methods and model fitting barely explain the details, but on my reading on the methods I have several major concerns.

      First, it feels that the visual bias model is not as well mapped across space if it only results from one study which is then extrapolated across the workspace. In contrast, the transformation model is actually measured throughout the space to develop the model. I have some concerns about whether this is a fair comparison. There are potentially many other visual bias models that might fit the current experimental results better than the chosen visual bias model.

      Please refer to our response to your public review.

      It is completely unclear to me why a joint-based proprioceptive model would predict curved planned movements and not straight movements (Figure S1). Changes in the shoulder and elbow joint angles could still be controlled to produce a straight movement. On the other hand, as mentioned above, the actual movement is likely much more complex if the physical starting position is offset from the perceived hand.

      Natural movements are often curved, reflecting a drive to minimize energy expenditure or biomechanical constraints (e.g., joint and muscle configuration). This is especially the case when the task emphasizes endpoint precision (Codol et al., 2024), as ours does. Trajectory curvature was also observed in a recent simulation study in which a neural network was trained to control a biomechanical model (2-limb, 6-muscle) with the cost function specified to minimize trajectory error (reach to a target with as straight a movement as possible). Even under these constraints, the movements showed some curvature. To examine whether the endpoint reaching bias somehow reflects the curvature (or bias during reaching), we included the prediction of this new biomechanical model in the paper to show that it does not explain the motor bias we observed.

      To be clear, while we implemented several models (the joint-based proprioceptive model and the new biomechanical model) to examine whether motor biases can be explained by movement curvature, our goal in this paper was to identify the source of the endpoint bias. Our modeling results reveal that a previously underappreciated source of motor bias, a transformation error that arises between visual and proprioceptive space, plays a dominant role in shaping motor bias patterns across a wide range of experiments, including naturalistic reaching contexts where vision and hand are aligned at the start position. While movement curvature might be influenced by selectively manipulating factors that introduce a mismatch between the visual starting position and the actual hand position (as in Sober and Sabes, 2003), we think investigating this question is an avenue for future work.

      The model fitting section is barely described. It is unclear how the data is fit or almost any other aspects of the process. How do the authors ensure that they have found the minimum? How many times was the process repeated for each model fit? How were starting parameters randomized? The main output of the model fitting is BIC comparisons across all subjects. However, there are many other ways to compare the models which should be considered in parallel. For example, how well do the models fit individual subjects using BIC comparisons? Or how often are specific models chosen for individual participants? While across all subjects one model may fit best, it might be that individual subjects show much more variability in which model fits their data. Many details are missing from the methods section. Further support beyond the mean BIC should be provided.

      We fit each model 150 times and, for each iteration, the initial value of each parameter was randomly selected from a uniform distribution. The range for each parameter was hand-tuned for each model, with an eye on making sure the values covered a reasonable range. Please see our response to your first minor comment below for the range of all parameters and how we decided the iteration number for each model.
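
      The multi-start procedure described above can be sketched as follows. This is a minimal Python illustration, not our MATLAB code: the bounded compass search is a simple stand-in for fminsearchbnd, and the function names, step sizes, and tolerances are all hypothetical choices.

```python
import random


def compass_search(objective, start, bounds, step=0.25, tol=1e-6, max_iter=10_000):
    """Simple bounded local search (a stand-in for a bounded simplex method)."""
    point = list(start)
    best = objective(point)
    # Initial step sizes are a fraction of each parameter's allowed range.
    steps = [step * (hi - lo) for lo, hi in bounds]
    for _ in range(max_iter):
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for direction in (1.0, -1.0):
                trial = list(point)
                # Clip the trial value so it never leaves the bounds.
                trial[i] = min(hi, max(lo, point[i] + direction * steps[i]))
                value = objective(trial)
                if value < best:
                    point, best, improved = trial, value, True
        if not improved:
            # No neighbor was better: shrink the steps, stop once they are tiny.
            steps = [s / 2.0 for s in steps]
            if max(s / (hi - lo) for s, (lo, hi) in zip(steps, bounds)) < tol:
                break
    return point, best


def multi_start_fit(objective, bounds, n_starts=150, seed=0):
    """Run the local search from uniformly sampled starts and keep the best."""
    rng = random.Random(seed)
    best_point, best_value = None, float("inf")
    for _ in range(n_starts):
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        point, value = compass_search(objective, start, bounds)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value
```

      Here objective would be the negative log-likelihood of one model for one participant. Repeating the search from many random starts reduces the chance of reporting a local rather than a global minimum, although it cannot guarantee that the global minimum has been found.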

      Given the reviewers’ comments on individual differences, we now fit the models at the individual level and report a frequency analysis, describing the best-fitting model for each participant. In brief, the data for a vast majority of the participants were best explained by the transformation model when comparing single-source models and by the T+V (TR+TG) model when considering all models. Please see response to reviewer 3 public comment 1 for the updated result.
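
      The two summaries mentioned here, the frequency of the best-fitting model across participants and the summed ΔBIC relative to a reference model, can be sketched in Python as follows. The function names and the input layout (one dictionary of BIC values per participant) are assumptions made for illustration.

```python
from collections import Counter


def best_model_counts(bic_by_participant):
    """Count how often each model has the lowest BIC across participants."""
    return Counter(min(bics, key=bics.get) for bics in bic_by_participant)


def summed_delta_bic(bic_by_participant, reference):
    """Sum over participants of BIC(model) - BIC(reference), for every model."""
    totals = {}
    for bics in bic_by_participant:
        for model, value in bics.items():
            totals[model] = totals.get(model, 0.0) + (value - bics[reference])
    return totals


# Toy values for illustration only; these are not results from the study.
toy = [
    {"model_A": 10.0, "model_B": 12.0},
    {"model_A": 11.0, "model_B": 9.0},
    {"model_A": 8.0, "model_B": 15.0},
]
print(best_model_counts(toy))
print(summed_delta_bic(toy, reference="model_A"))
```

      Reporting both summaries is useful because a model can win on summed ΔBIC through a few participants with very large differences while still losing for most individuals.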

      We updated the Methods section, and it reads as follows (lines 841-853):

      “We used the fminsearchbnd function in MATLAB to minimize the sum of loglikelihood (LL) across all trials for each participant. LL were computed assuming normally distributed noise around each participant’s motor biases:

      [11] LL = normpdf(x, b, c)

      where x is the empirical reaching angle, b is the predicted motor bias by the model, c is motor noise, calculated as the standard deviation of x-b.

      For model comparison, we calculated the BIC as follows:

      [12] BIC = -2LL+k∗ln(n)

      where k is the number of parameters of the models. Smaller BIC values correspond to better fits. We report the sum of ΔBIC by subtracting the BIC value of the TR+TG model from all other models.”

      Line 305-307. The authors state that biomechanical issues would not predict qualitative changes in the motor bias function in response to visual manipulation of the start position. However, I question this statement. If the start position is offset visually then any integration of the proprioceptive and visual information to determine the start position would contain a difference from the real hand position. A calculation of the required joint torques from such a position sent through the mechanics of the limb would produce biases. These would occur purely because of the combination of the visual bias and the inherent biomechanical dynamics of the limb.

      We thank the reviewer for this comment. We have removed the statement regarding inferences about the biomechanical model based on visual manipulations of the start position. Additionally, we have incorporated a recently proposed biomechanical model into our model comparisons to expand our exploration of sources of bias. Please refer to our response to your public review for details.

      Measurements are made while the participants hold a stylus in their hand. How can the authors be certain that the biases are due to the movement and not due to small changes in the hand posture holding the stylus during movements in the workspace. It would be better if the stylus was fixed in the hand without being held.

      Below, we have included an image of the device used in Exp 1 for reference. The digital pen was fixed in a vertical orientation. At the start of the experiment, the experimenter ensured that the participant had the proper grip alignment and held the pen at the red-marked region. With these constraints, we see minimal change in posture during the task.

      Author response image 6.

      Minor Comments

      Best fit model parameters are not presented. Estimates of the accuracy of these measures would also be useful.

      In the original submission, we included a Table S1 that presented the best-fit parameters for the TR+TG (Previously T+V) model. Table S1 now shows the parameters for the other models (Exp 1b and 3b, only). We note the parameter values from these non-optimal models are hard to interpret given that core predictions are inconsistent with the data (e.g., number of peaks).

      We assume that by "accuracy of these measures," the reviewers are referring to the reliability of the model fits. To assess this, we conducted a parameter recovery analysis in which we simulated a range of model parameters for each model and then attempted to recover them through fitting. Each model was simulated 50 times, with the parameters randomly sampled from distributions used to define the initial fitting parameters. Here, we only present the results for the combined models (TR+TG, PropV+V, and PropJ+V), as the nested models would be even easier to fit.

      As shown in Fig. S4, all parameters were recovered with high accuracy, indicating strong reliability in parameter estimation. Additionally, we examined the log-likelihood as a function of fitting iterations (Fig. S4d). Based on this curve, we determined that 150 iterations were sufficient given that the log-likelihood values were asymptotic at this point. Moreover, in most cases, the model fitting can recover the simulated model, with minimal confusion across the three models (Fig. S4e).

      What are the (*1000) and (*100) in the Change in BIC y-labels? I assume they indicate that the values should be multiplied by these numbers. If these indicate that the BIC is in the hundreds or thousands it would be better to label the axes clearly, as the interpretation is very different (e.g. a BIC difference of 3 is not significant).

      '×1000' in the original figure indicates the unit scaling, with ΔBIC values ranging from approximately 1000 to 4000. However, given that we now fit the models at the individual level, we have replaced this figure with a new one showing the distribution of individual BIC values.

      Lines 249, 312, and 315, and maybe elsewhere - the degree symbol does not display properly.

      Corrected.

      Line 326. The authors mention that participants are unaware of their change in hand angle in response to clamped feedback. However, there may be a difference between sensing for perception and sensing for action. If the participants are unaware in terms of reporting but aware in terms of acting would this cause problems with the interpretation?

      This is an interesting distinction, one that has been widely discussed in the literature. However, it is not clear how to address this in the present context. We have looked at awareness in different ways in prior work with clamped feedback. In general, even when the hand direction might have deviated by >20°, participants report their perceived hand position after the movement as near the target (Tsay et al, 2020). We also have used post-experiment questionnaires to probe whether they thought their movement direction had changed over the course of the experiment (volitionally or otherwise). Again, participants generally insist they moved straight to the target throughout the experiment. So it seems that they are unaware of any change in action or perception.

      Reaction time data provide additional support that participants are unaware of any change in behavior. The RT function remains flat after the introduction of the clamp, unlike the increases typically observed when participants engage in explicit strategy use (Tsay et al, 2024).

      Figure 1h: The caption suggests this is from the Wang 2021 paper. However, in the text 180-182 it suggests this might be the map from the current results. Can the authors clarify?

      Fig 1e is the data from Wang et al, 2021. We formalized an abstract map based on the spatial constraints observed in Fig 1e, and simulated the error at the start and target position based on this abstraction (Fig 1h). We have revised the text to now read (Lines 182-190):

      “Motor biases may thus arise from a transformation error between these coordinate systems. Studies in which participants match a visual stimulus to their unseen hand or vice-versa provide one way to estimate this error (Jones et al., 2009; Rincon-Gonzalez et al., 2011; van Beers et al., 1998; Wang et al., 12/2020). Two key features stand out in these data: First, the direction of the visuo-proprioceptive mismatch is similar across the workspace: For right-handers using their dominant limb, the hand is positioned leftward and downward from each target. Second, the magnitude increases with distance from the body (Fig 1d). Using these two empirical constraints, we simulated a visual-proprioceptive error map (Fig. 1h) by applying a leftward and downward error vector whose magnitude scaled with the distance from each location to a reference point.”
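
      The error map described in this passage can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the fitted model: the 225° direction (equal leftward and downward components), the linear gain, and the rule that combines the errors at the start and target into an angular bias are all hypothetical choices, as are the function names.

```python
import math


def transformation_error(position, reference, gain, direction_deg=225.0):
    """Error vector at `position`: fixed direction, magnitude = gain * distance.

    direction_deg = 225 points leftward and downward (an assumed value).
    """
    distance = math.dist(position, reference)
    angle = math.radians(direction_deg)
    return (gain * distance * math.cos(angle), gain * distance * math.sin(angle))


def predicted_motor_bias(start, target, reference, gain, direction_deg=225.0):
    """Signed angle (deg) between the shifted and the true start-to-target vectors.

    Assumes both start and target are displaced by their local error vectors;
    positive values are counterclockwise.
    """
    ex_s, ey_s = transformation_error(start, reference, gain, direction_deg)
    ex_t, ey_t = transformation_error(target, reference, gain, direction_deg)
    true_vec = (target[0] - start[0], target[1] - start[1])
    planned = (true_vec[0] + ex_t - ex_s, true_vec[1] + ey_t - ey_s)
    bias = math.degrees(
        math.atan2(planned[1], planned[0]) - math.atan2(true_vec[1], true_vec[0])
    )
    # Wrap to the interval [-180, 180).
    return (bias + 180.0) % 360.0 - 180.0
```

      Under these assumptions the predicted bias is zero whenever the start and target are equally far from the reference point, because the two error vectors then cancel. The bias pattern across target directions therefore depends on where the workspace sits relative to the reference point.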

      Reviewer #3 (Recommendations for the authors):

      The central idea behind the research seems quite promising, and I applaud the efforts put forth. However, I'm not fully convinced that the current model formulations are plausible explanations. While the dataset is impressively large, it does not appear to be optimally designed to address the complex questions the authors aim to tackle. Moreover, the datasets used to formulate the 3 different model predictions are SMALL and exhibit substantial variability across individuals, and based on average (and thus "smoothed") data.

      We hope to have addressed these concerns with the two major changes to the revised manuscript: 1) The new experiment in which we examine biases in both angle and extent and 2) the inclusion in the analyses of fits based on individual data sets.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) Discrepancies with previous findings need clarification, especially regarding the absence of similar behavioral effects in F1. Lack of discussion on the decision to modify paradigms instead of using the same model. Presentation of behavioral data in supplementary materials, with a recommendation to include behavioral quantification in main figures. Absence of quantification for freezing behavior, a crucial measure in fear conditioning.

      We agree, thank you. One of the major revisions we have made to this version of the manuscript is the addition of much more thorough analysis of our F1 behavior. While not captured by the (relatively gross) measure of the approach-avoid index, further analysis has highlighted interesting differences between the F1s of unpaired and paired offspring, and in an odor-specific manner. As these analyses have given rise to many new results and conclusions, we have attempted to adjust the manuscript to reflect the major change that we do, in fact, find effects in F1, if subtle. 

      Classical odor-shock pairing was used in both Dias & Ressler’s and our study to directly expand upon the findings of increase in cell number. This enabled our discovery of biasing of newborn OSNs. For our behavioral readouts, we chose to focus on the ethological behavior of avoidance. From our extensive behavioral analysis (Figures 5 & 6), we successfully identified several behavioral differences in the F1 offspring that had not previously been described.

      Reviewer #2 (Public Review):

      (1) The main weakness is the disconnect between the morphological changes reported and the lack of change in aversion to the odorant in F1 progeny. The authors also do not address the mechanisms underlying the inheritance of the phenotype, which may lie outside of the scope of the present study.

      Thank you for your comments. Our revised manuscript includes both new experiments and new analyses that probe the relationship between a change in cell number and a change in avoidance behavior, and we have revised the manuscript text to address this point directly. In short, we find both in the F0 generation (at extended time points) and in the F1, that an increase in cell number does not always correlate with avoidance behavior. However, we do find nuanced behavioral differences between the offspring of unpaired and paired fathers. Whether the increase in cell number in offspring is necessary to observe the behavioral changes is outside the scope of the current study, but certainly a question we are interested in answering in future work. 

      Reviewer #3 (Public Review):

      (1) In the abstract / summary, the authors raise expectations that are not supported by the data. For example, it is claimed that "increases in F0 were due to biased stem cell receptor choice." While an active field of study that has seen remarkable progress in the past decade, olfactory receptor gene choice and its relevant timing in particular is still unresolved. Here, Liff et al., do not pinpoint at what stage during differentiation the "biased choice" is made. 

      EdU is only taken into stem cells in the S phase, and differences in EdU-labeled M71 or MOR23 OSNs across fear conditioning groups indicate a biasing in subtype identity. We do not make claims regarding the exact stage of OSN maturation at which biasing may occur; rather, we demonstrate that the stem cells that were dividing during EdU administration are more likely to mature into an M71 OSN if a mouse receives paired acetophenone conditioning compared to unpaired or no conditioning (and similarly with MOR23 and lyral). This phenomenon must involve receptor choice, as that is the mechanism by which OSN subtypes form.

      (2) Similarly, the concluding statement that the study provides "insight into the heritability of acquired phenotypes" is somewhat misleading. The experiments do not address the mechanisms underlying heritability. 

      We do not claim to provide direct insight into the mechanisms underlying heritability. Our experiments do provide insight into the heritability of acquired phenotypes, as we corroborate previous studies that this olfactory fear conditioning paradigm induces heritable changes in the nose and in behavior. We also demonstrate odor-specific behavioral differences in the offspring of conditioned fathers, suggesting that the mechanisms underlying the specific behavioral phenotypes may be unique to the conditioning odorant, and not one universal mechanism. These results provide basic knowledge that will accelerate our ability to uncover the mechanisms driving heritable changes.

      (3) The statement that "the percentage of newborn M71 cells is 4-5 times that of MOR23 may simply reflect differences in the birth rates of the two cell populations" should, if true, result in similar differences in the occurrence of mature OSNs with either receptor identity. According to Fig. 1H & J, however, this is not the case. 

      We have removed that statement from the manuscript, as subtype-specific differences in proliferation rates are not the focus of this study and we do not wish to make claims about it based on our EdU experiments. We do not compare our iDISCO cell density counts to EdU co-labeling counts nor ratio counts, as differences between M71 and MOR23 quantification in cleared tissue versus EdU uptake may simply reflect the inherent differences between methodologies. Our claims are solely within M71 cohorts and MOR23 cohorts. 

      (4) An important result is that Liff et al., in contrast to results from other studies, "do not observe the inheritance of odor-evoked aversion to the conditioned odor in the F1 generation." This discrepancy needs to be discussed. 

      This is discussed in the manuscript, and we report behavioral differences revealed by additional analyses. 

      (5) The authors speculate that "the increase in neurons responsive to the conditioned odor could enhance the sensitivity to, or the discrimination of, the paired odor in F0 and F1. This would enable the F1 population to learn that odor predicts shock with fewer training cycles or less odorant when trained with the conditioned odor." This is a fascinating idea that, in fact, could have been readily tested by Liff and coworkers. If this hypothesis were found true, this would substantially enhance the impact of the study for the field.

      We agree that additional F1 behavioral paradigms are a major next step to understand the functional behavioral differences that may emerge from an increase in specific OSN subtype. Due to the nontrivial amount of time and effort it requires to generate F1 offspring (on the order of many months), and because we do not test individual offspring in multiple behavioral assays (such that they are naïve to their father’s conditioning odor), these experiments are outside the scope of this current study. 

      Reviewer #1 (Recommendations For The Authors):

      (1) Considering that the authors are expanding upon the previous findings of Dias and Ressler (2014), it is crucial to clarify the discrepancies in the results between both works in the discussion. While I acknowledge the use of a different experimental design by the authors, if the premise assumes there is a universal mechanism for transgenerational acquired modification it prompts the question: Why don't we observe similar behavioral effects in F1 in the present model? This issue needs extensive discussion in the manuscript to advance the field's understanding of this topic. Additionally, I am also curious about the author's decision to modify the paradigms instead of using exactly the same model to further extend their findings on stem cells, for example. Could you please provide comments on this choice and elaborate on this aspect in the discussion? 

      We agree, thank you. One of the major revisions we have made to this version of the manuscript is the addition of much more thorough analysis of our F1 behavior. While not captured by the (relatively gross) measure of the approach-avoid index, further analysis has highlighted interesting differences between the F1s of unpaired and paired offspring, and in an odor-specific manner. As these analyses have given rise to many new results and conclusions, we have attempted to adjust the manuscript to reflect the major change that we do, in fact, find effects in F1, if subtle. 

      Classical odor-shock pairing was used in both Dias & Ressler’s and our study to directly expand upon the findings of increase in cell number. This enabled our discovery of biasing of newborn OSNs. For our behavioral readouts, we chose to focus on the ethological behavior of avoidance. From our extensive behavioral analysis (Figures 5 & 6), we successfully identified several behavioral differences in the F1 offspring that had not previously been described. We have revised the discussion section to elaborate on these decisions.

      We incorporated the behavioral data into the main figures and added a freezing metric to Figure 5 (F, J, & N). We did do an analysis of time spent freezing in the control vs. conditioned chamber, but since the F0 paired mice spend so little time in the conditioned odor chamber, they also spend most of their time freezing in the control odor chamber. Thus, we felt it was better to show the overall time spent freezing during the trial.

      (2) It is unclear why the authors chose to present all behavioral data to supplementary materials. I strongly recommend not only incorporating the behavioral data into the main figures but also expanding the behavioral quantification. It appears that the author dismissed the potential effects on F1 without a thorough exploration of animals' behaviors. The task contains valuable information that could be further investigated, potentially altering the findings or even the conclusions of the study. Notably, the absence of quantification for freezing behavior is incomprehensive. Freezing is a crucial measure in fear conditioning, and it's surprising that the authors did not mention it throughout the manuscript. I encourage the author to include freezing data in the analysis and other behavioral quantification as follows: a) freezing during odor presentation and ITI for conditioning days. b) freezing during odor preference test in all compartments. c) it is not very clear the design of the Odor preference behavioral testing. Is the odor presented in a discrete manner or the order is constantly presented in the compartment? Could the authors quantify the latency to avoid after the visit in the compartment? d) in the video it is very clear the animals are doing a lot of risk assessment, this could be also analyzed and included as a fear measure.  

      Thanks for the suggestion. We incorporated the behavioral data into the main figures and added a freezing metric to Figure 5 (F, J, & N). We did do an analysis of time spent freezing in the control vs. conditioned chamber, but since the F0 paired mice spend so little time in the conditioned odor chamber, they also spend most of their time freezing in the control odor chamber. Thus, we felt it was better to show the overall time spent freezing during the trial. In the methods section we describe that the odor is continuously bubbled into the chamber throughout the trial, but we have clarified this in the main text as well. As for further behavioral metrics like latencies and risk assessment, initial analyses have not shown anything in the F1 data that we wished to report here. Future work from the lab will investigate this further.

      (3) In the Dias and Ressler paper, a crucial difference exists between the models that could elucidate the absence of transgenerational effects on F1. In their study, the presence of the unconditioned stimulus (US) is consistent across all generations in the startle task. I am curious whether, in the present study, the authors considered pairing the F1 with a US-paired task in a protocol that does not induce fear conditioning (e.g., lower shock intensity or fewer pairings). Could this potentially lead to an increased response in the parental-paired offspring? Did the author consider this approach? I understand how extensive this experiment can be, therefore I'm not directly requesting, although it would be a fantastic achievement if the results are positive. Please consider discussing this fundamental difference in the manuscript. 

      To clarify, the F1 generation is presented with the unconditioned stimulus, just never conditioned with it. In these experiments, we were primarily interested in the F1’s naïve reaction to their father’s conditioning odorant, and whether the presentation of that odor in the absence of a stressor would lead to any fear-like behavioral responses.

      We have considered the experiments you have suggested and have ongoing projects in the lab further investigating F1 effects and whether their father’s experiences affect their ability to learn in conditioning tasks. Because of the amount of time and effort it requires to generate F1 offspring, and because we do not wish to test individual offspring in multiple assays, we do not present any of these experiments in the current manuscript. Ongoing work is looking into whether 1-day (vs. 3-day) conditioning is sufficient in the offspring of paired mice, and we appreciate the suggestion of subthreshold shock intensity. We will also clarify in the discussion that future work will try to answer these questions. 

      (4) If the videos were combined it would be better to appreciate the behavioral differences of paired vs unpaired. 

      Thank you for the suggestion, fixed. Video S1 is now a combination of unpaired and paired example videos. 

      (5) Figure 3E, is there an outlier in the paired group that is driving the difference? Please run an outlier test on the data if this has not been done. If already done, please express the stats. 

      We ran an outlier test using the ROUT method (Q=1%) and did not find any outliers to be removed. We also ran the same test on all other data and removed one mouse from the Acetophenone F1 Paired group in Figure 5 (also described in the Methods section). 

      (6) I understand that using the term "olfactory" twice in the title may seem redundant. However, the authors specifically demonstrate the effects of olfactory fear conditioning. I suggest including "odor-induced" before "fear conditioning" in the title for greater specificity and accuracy. This modification would better reflect the study's focus on olfactory fear conditioning, especially given the authors did not explore fear conditioning broadly (e.g., contextual, and auditory aspects were not examined). 

      Thank you for your feedback. We found using “olfactory” twice cumbersome. We have changed the title to “Fear conditioning biases olfactory sensory neuron expression across generations” to more accurately highlight the intergenerational effect on olfactory sensory neuron expression.

      (7) The last page of the manuscript has a list of videos (8 videos), but only two were presented.

      We have made sure to include all 7 videos (videos 1 and 2 were combined) in this version.  

      Reviewer #2 (Recommendations For The Authors):

      (1) The analyses mentioned on lines 210-220 should be presented. 

      Thank you for the suggestion. We have removed this part of the manuscript as we do not have a large enough n to draw conclusions about cell longevity in this paper. Future studies in the lab will incorporate this analysis.

      Reviewer #3 (Recommendations For The Authors):

      (1) The manuscript contains several supplementary figures and movies that are not referred to in the main text. 

      All supplementary figures and movies are now referred to in the manuscript text.

      (2) In the abstract, the authors state that they "investigated changes in the morphology of the olfactory epithelium." I think that is (technically) not what they did. In fact, the authors do not show any morphometry of the epithelium (e.g., thickness, layers, etc.), but count the density of OSNs that share a specific receptor identity. Along the same lines, the authors state in the abstract that recent work has shown that conditioning is "resulting in increases in olfactory receptor frequencies." However, recent studies did not show increased "receptor frequencies", but changes in cell count. Whether (or not) receptor expression per OSN is also changed remains unknown (would be interesting though). 

      Yes, agreed. We changed “morphology” to “cellular composition.” We also changed any references to “receptor frequencies” to “olfactory sensory neuron frequencies.”

      (3) Reference 20 needs to be updated. 

      Thank you, updated.

      (4) l.52: the distribution of OSNs into (four) zones is a somewhat outdated concept as zonal boundaries are rather blurry. Generally, of course, dorsoventral differences are real. 

      Yes, we agree and changed the verbiage to “region” as opposed to “zone.” We mainly bring this up because it later becomes relevant that both M71 and MOR23 are expressed in the same (antero-dorsal) region and thus can be quantified with the same methodology.

      (5) Fig. 3B & C: the EdU background staining is quite peculiar. Any reason why the epithelium is mostly (with the sustentacular nuclei being a noticeable exception) devoid of background? 

      We use the ThermoFisher Click-iT Plus EdU kit (Invitrogen, C10638), and it has consistently produced a very good signal-to-noise ratio.

      Responses to Editor’s note

      We thank the editor for their constructive suggestions. 

      (1) Should you choose to revise your manuscript, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05. 

      Thank you for the suggestion. We created two supplementary tables with statistical reporting: Table S1 for the main figure statistics, and Table S2 for the supplementary figure statistics.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The study mainly replicates the authors' previously reported results about generalized and trajectory-specific coding of task structure by prefrontal neurons, and stable and changing representations over learning (Muysers et al., 2024, PMID: 38459033; Muysers et al., 2025, PMID: 40057953), although there are useful results about changes in goal-selective and task-phase-selective cells over learning. There are basic shortcomings in the scientific premise of two new points in this manuscript, that of the contribution of pre-existing spatial representations, and the role of replay sequences in the prefrontal cortex, both of which cannot be adequately tested in this experimental design.

      We agree with the reviewer that we have not made sufficiently clear which aspects of our paper add to previous publications. We have now better explained methodological differences.

      Also, we agree that our very general statements on pre-existing spatial representations in the introduction and abstract of the previous manuscript were not properly followed up in the Results section. In the revision, the respective statements are clarified, and we also added an analysis of a further control condition (see response to A). This analysis shows that a subset of task cells, in particular, maintain their firing fields from an early habituation period. It argues that, while the population representation of the task largely develops during learning, there exists a scaffold of a small but significant number of cells that could be interpreted as a schema.

      We also further clarified our view on replay sequences in the prefrontal cortex (see response to B). In particular, we are grateful to the reviewer for the suggestion to also include other reactivation analyses, which led to the new results presented in the new Figure 3.

      [A] The study denotes neurons that show precise spatial firing equivalently irrespective of goal, as generalized task representations, and uses this as a means to testing whether pre-existing spatial representations can contribute to task coding and learning. …. [I]n order to establish generalization for abstract task rules or cognitive flexibility, as motivated in the manuscript, there is a need to show that these neurons "generalize" not just to firing in the same position during learning of a given task… For an adequate test of pre-existing spatial structure, either a comparison task, as in the examples above, is needed, or at least a control task in which animals can run similar trajectories without the task contingencies. An unambiguous conclusion about pre-existing spatial structure is not possible without these controls.

      We thank the reviewer for this suggestion. We may, however, note that the previous manuscript did not make strong claims about pre-existing structures in the Results or Discussion, and schemas were only taken up as a discussion point. We nevertheless agree with the reviewer that assessing the spatial prestructure requires further analysis. To address their point, we analyzed neuronal activity during the habituation phase before the start of task training, when the animals freely explored the same maze without any task contingency (animals explored mostly in the arms of the maze). We compared the place fields of neurons during this habituation period with their task-related activity. Consistent with the small overlap of firing rate maps between the learning and learned phases, this analysis also revealed a small number of cells with significant correlations (up to 20% for task cells; a significant fraction according to a binomial test). The results are shown as a new Figure supplement to Figure 2.
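
As a rough illustration of the kind of binomial test mentioned above, the following sketch asks whether the number of cells with a significant habituation-to-task correlation exceeds what a fixed per-cell false-positive rate would produce by chance. The cell counts and the 5% chance level are hypothetical placeholders, not values taken from the manuscript, and the one-sided formulation is an assumption.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): one-sided test for an excess of hits."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers for illustration only (not from the manuscript).
n_cells = 50          # task cells tracked from habituation into the task
n_significant = 10    # cells with a significant rate-map correlation (20%)
chance_level = 0.05   # assumed per-cell false-positive rate

p_value = binomial_upper_tail(n_significant, n_cells, chance_level)
print(f"P(X >= {n_significant}) = {p_value:.2e}")
```

With these placeholder numbers the observed fraction lies far in the upper tail of the chance distribution, which is the sense in which a small fraction of cells can still be "significant".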

      [B] The scientific premise for the test of replay sequences is motivated using hippocampal activity in internally guided spatial working memory rule tasks [...] and applied here to prefrontal activity in a sensory-cue guided spatial memory task [...]. There are several issues with the conclusion in the manuscript that prefrontal replay sequences are involved in evaluating behavioral outcomes rather than planning future outcomes.

      We agree with the reviewer that preplay in the hippocampus and in the mPFC are distinct. We further emphasized this distinction in the respective paragraph of the Discussion (see response to B1).

      [B. 1] First, odor sampling in odor-guided memory tasks is an active sensory processing state that leads to beta and other oscillations in olfactory regions, hippocampus, prefrontal cortex, and many other downstream networks [...]. This is an active sensory state, not conducive to internal replay sequences, unlike references used in this manuscript to motivate this analysis, which are hippocampal spatial memory studies with internally guided rather than sensory-cue guided decisions, where internal replay is seen during immobility at reward wells. These two states cannot be compared with the expectation of finding similar replay sequences, so it is trivially expected that internal replay sequences will not be seen during odor sampling.

      We agree with the reviewer that the sampling phase cannot be compared with the “preplay” state in the hippocampus. We have rewritten the Results and Discussion sections to clarify this. We disagree, however, that the absence of replay sequences in the mPFC 1P calcium data is trivial, since we actually do see many sequences during sampling (Fig 4E, Fig 4 suppl 2 A). These sequences are just not related to task activity: they may, e.g., reflect activity related to sensing, but they do not contain information about the goal arm.

      [B. 2] Second, sequence replay is not the only signature of reactivation. Many studies have quantified prefrontal replay using template matching and reactivation strength metrics that do not involve sequences [...].  Third, previous studies have explicitly shown that prefrontal activity can be decoded during odor sampling to predict future spatial choices - this uses sensory-driven ensemble activity in prefrontal cortex and not replay, as odor sampling leads to sensory driven processing and recall rather than a reactivation state [..].

      We thank the reviewer for the suggestion to also perform reactivation analysis (Peyrache et al., 2009, 2010). The results are summarized in the new Figure 3 and show that reactivation is indeed stronger during the sampling phase and is goal-arm specific, arguing that sequence analysis extracts information that is (partly) complementary to rate-covariance-based analysis.

      We hope to have convinced the reviewer that, together, the complementary results of reactivation and sequence analysis, as well as the ability to follow these measures over an extended period of time, give unique insights far beyond the previous publications of these data sets. A consistent analysis of population representation, however, required some reanalyses of previous findings, since we could only focus on a limited number of animals and cells for which tracking was possible over such a long period of time.

      Reviewer #2 (Public review):

      Further controls are needed to validate the results.

      We thank the reviewer for their generally supportive statements. The revised manuscript contains a number of controls in several new figure supplements.

      Reviewer #3 (Public review):

      [They] conclude that the frequency of TSs and GSs is limited (I believe because most sequence clusters were non-SI - the authors can verify this and write it in the text?). In the discussion, they say, "In addition to GSs and TSs, we found that most of the recurring sequences are not related to behavior".

      The reviewer is correct: most clusters were not SI (Fig 5 A). We have added this information to the manuscript.

      [...] They conclude "Together with our finding of strong changes in sequence expression after learning (Figure 3E) these findings suggest that a representation of task develops during learning, however, it does not reflect previous network structure." I am not sure what is meant here by the second part of this sentence (after "however ..."). Is it the idea that the replay represents network structure, and the lack of Reward replay in the learning condition means that the network structure must have been changed to get to the learned condition? Please clarify.

      The reviewer is correct in their assertion. We rewrote the sentence to clarify: “Together with our finding of strong changes in sequence expression after learning (now Fig 4E) these findings suggest that a representation of task develops during learning, however, it does not reflect sequence structure during learning and habituation”.

      (1) There are some statements that are not clear, such as at the end of the introduction, where the authors write, "Both findings suggest that the mPFC task code is locally established during learning." What is the reasoning behind the "locally established" statement? Couldn't the learning be happening in other areas and be inherited by the mPFC? Or are the authors assuming that newly appearing sequences within a 500-ms burst period must be due to local plasticity?

      We agree that the wording “local” can be misleading and have rephrased the corresponding sentences.

      (2) The threshold for extracting burst events (0.5 standard deviations, presumably above the mean, but the authors should verify this) seems lower than what one usually sees as a threshold for population burst detection. What fraction of all data is covered by 500 ms periods around each such burst? However, it is potentially a strength of this work that their results are found by using this more permissive threshold.

      Since we work with a slow calcium signal, we cannot use thresholds as strict as those usually employed in electrophysiology. In addition, our sequence detection approach adds a further level of strictness such that we only consider bursts with recurring sequence structure. In response to this reviewer’s question, we have added a quantification of the fraction of all data covered by 500 ms periods in Figure Supplement 1, panels D and E. Indeed, these periods cover a large fraction of the data (20 to 40%, except during sleep and habituation), which is consistent with our interpretation that during the outward phase sequences mainly reflect task field firing.
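
The coverage question raised by the reviewer can be made concrete with a minimal sketch. It assumes that a burst onset is an upward crossing of mean + 0.5 SD of a summed population trace and that each 500 ms window is centred on the onset; the frame rate, the toy trace, and the function name `burst_coverage` are illustrative assumptions, not the procedure used in the manuscript.

```python
from statistics import mean, pstdev

def burst_coverage(population_trace, frame_rate_hz, n_sd=0.5, window_s=0.5):
    """Fraction of frames lying within a window centred on a burst onset.

    A burst onset is taken to be an upward crossing of mean + n_sd * SD.
    """
    threshold = mean(population_trace) + n_sd * pstdev(population_trace)
    half = int(round(window_s * frame_rate_hz / 2))  # half-window in frames
    n = len(population_trace)
    covered = [False] * n
    for t in range(1, n):
        crossed_up = population_trace[t] > threshold >= population_trace[t - 1]
        if crossed_up:
            for u in range(max(0, t - half), min(n, t + half + 1)):
                covered[u] = True
    return sum(covered) / n

# Toy trace sampled at an assumed 20 Hz: flat baseline with two brief bursts.
trace = [0.0] * 100
trace[30] = trace[31] = 1.0
trace[70] = 1.0
print(burst_coverage(trace, frame_rate_hz=20))
```

Lowering `n_sd` admits more onsets and therefore raises the covered fraction, which is why a permissive threshold and a large coverage go together.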

      Reviewer #1 (Recommendations for the authors):

      It is possible that 1-photon recordings do not have the temporal resolution and information about oscillatory activity to enable these kinds of analyses. Therefore, an unambiguous conclusion about the existence and role of prefrontal reactivation is not possible in this experimental and analytical design.

      We indeed cannot extract information encoded in LFP oscillations from the calcium signal. We now mention the relation between LFP oscillations and olfaction-guided behaviors in the Discussion (including the suggested references). However, our finding that sequence-based and covariance-based analyses yield partly complementary results argues that the data do allow conclusions about the existence and role of prefrontal reactivation.

      Reviewer #2 (Recommendations for the authors):

      The results of the Muysers et al. (2025) paper need to be discussed in detail and explain why the cell categorization is different, three groups of spatial cells vs two groups here. Also, explain in what aspect the major findings in this work go beyond what was shown in Figure 4 in that paper.

      The main goal of this paper was to explore sequence/replay-like activity, which is not at all captured in the Muysers et al. 2025 paper. Because of this focus on sequences, we excluded the inward runs (from reward to sampling point) for better interpretability and thus ended up with only two types of cells. Muysers et al. included inward runs and could thereby also assess whether the place field remains the same in the outward and inward runs. We added this clarification in the Results section.

      Regarding the reviewer’s question about Figure 4: Our task cells would largely overlap with the “path-equivalent cells” from Muysers et al. 2025 (albeit not taking into account inward runs). In this sense, their finding that the share of path-equivalent cells increases with learning is consistent with our report of an increasing fraction of task cells in Figure 2C. Our Figure 2 adds that some task cells develop from previous goal cells with fields at the same location (generalizing). Moreover, we use spatial information as a criterion to identify TCs and GCs, showing that a large fraction of cells is and remains spatially unselective. In Muysers et al. 2025, the statistical criterion was applied not to spatial selectivity but to peak height, with fewer neurons failing this test. Moreover, we analyzed only those cells trackable over the whole period. Despite all these methodological differences, the result of an increasing number of task/path-equivalent cells over learning was consistent. The main reason for recategorizing the cells in the present manuscript was to be able to meaningfully link them to sequence activity (Fig. 5E, F).

      It is not clear from the description how the cell type transitions were quantified. Was the last learning day compared to the first learned day? Given that, particularly during learning, there are changes across days in the spatial representations according to Figure 2 of Muysers et al. (2025), this is the meaningful way to make the comparisons. Nevertheless, it is also not clear whether the daily variations within learning and learned conditions differ from the transition day, so without comparing these three conditions, it is hard to make a firm conclusion from examining only changes in the transition days.

      The analysis of cell type transitions was performed by pooling all learning sessions and comparing them with all learned sessions, without taking into account the chronological order of sessions within each category. This approach allowed us to identify broad changes associated with learning state. Figure supplement 1C shows the session intervals per animal. We argue that the large interval between learning and learned sessions justifies this analysis approach.

      Identifying sequences by a clustering method in which sequence patterns of individual events are compared is an interesting idea. Nevertheless, there is a danger, as with any clustering method, that data without clustering tendency could be artificially subdivided into clusters.

      In Figure 4C, we show three example sequence cluster templates (colored) obtained via hierarchical clustering, along with representative member sequences (black) sorted by cluster membership. In response to this reviewer’s comment, we have now included a complete clustering result for one animal, including all sequence clusters and their member sequences. It is provided in Figure 4 supplement 1. This comprehensive visualization serves as an additional control, demonstrating that the clustering approach identifies consistent sequence patterns across the dataset.

      Furthermore, it is possible that some cells at the edge of the cluster boundary may show a more similar sequence tendency to events detected at the overlapping border region of another cluster. Was this controlled for? It would be essential to show that events clustered together all show higher similarity to each other than to events in any other clusters.

      By default, a cluster is rejected if, in the adjacency matrix of the graph constructed from significant motif similarity, the number of within-cluster edges is smaller than the number of edges leaving the cluster. In subsequent cluster merges, the separation increases, since only clusters that show significant similarity are merged. As a visual control, we monitor plots such as those shown in Figure 4 supplement 1. Sequence templates (colored dot clouds) are expected to show no serial correlation when ordered according to any template other than their own. We have added more clarification to the Methods, including a new Figure 6 illustrating the method.
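
The rejection rule described above can be sketched as follows. The sketch assumes a symmetric 0/1 adjacency matrix in which an edge marks a significant similarity between two burst events; the function name `keep_cluster` and the toy graph are hypothetical, and the actual implementation in the cited method may differ in detail.

```python
def keep_cluster(adjacency, labels, cluster_id):
    """Return True if the cluster has at least as many internal edges as
    edges leaving it. `adjacency` is a symmetric 0/1 matrix (list of lists)."""
    members = [i for i, lab in enumerate(labels) if lab == cluster_id]
    member_set = set(members)
    within = 0
    leaving = 0
    for i in members:
        for j, connected in enumerate(adjacency[i]):
            if not connected or i == j:
                continue
            if j in member_set:
                within += 1   # each internal edge is seen from both ends
            else:
                leaving += 1
    within //= 2              # undo the double counting of internal edges
    return within >= leaving

# Toy graph: events 0-2 are mutually similar; event 2 also resembles event 3.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
labels = [0, 0, 0, 1, 1]
print(keep_cluster(adjacency, labels, 0), keep_cluster(adjacency, labels, 1))
```

In the toy graph the first cluster is retained (three internal edges, one leaving), whereas the second is rejected because its only edge points outside the cluster.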

      From the description, it was not clear how the sequence similarity was established between pairs of individual events. The only way I can see it is that the sequence (orders at which cells fire) is established with one event, and the rank order correlation is calculated with this order for the other event. However, in this case, distance A-B is not the same as distance B-A. Not sure how this is handled with the clustering procedure. Secondly, how the number of clusters is established in the hierarchical clustering procedure needs to be explained. Furthermore, from the method description, it is not clear how GS and TS sequences are identified. Can an event be classified as both a TS and GS event at the same time?

      The reviewer is correct in their assertion that we compute all pairwise rank-order correlations (which are then subjected to a statistical test detailed in the original method publication, Chenani et al., 2019). By the nature of the rank-order correlation, the coefficients A-B and B-A are symmetric. This is now more carefully explained in the Methods.
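
The symmetry argument can be checked with a small sketch of a Spearman rank-order correlation between two burst events. It assumes that ranks are computed only over cells active in both events and that there are no tied activation times; the cell identifiers and times are made up for illustration.

```python
def ranks(values):
    """Rank positions (0 = earliest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank

def rank_order_correlation(times_a, times_b):
    """Spearman correlation of activation times for cells active in both events.

    `times_a` and `times_b` map cell id -> activation time within each event.
    Requires at least two shared cells.
    """
    shared = sorted(set(times_a) & set(times_b))
    ra = ranks([times_a[c] for c in shared])
    rb = ranks([times_b[c] for c in shared])
    n = len(shared)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical activation times (seconds from burst onset).
event_a = {"c1": 0.02, "c2": 0.10, "c3": 0.21, "c4": 0.35}
event_b = {"c1": 0.05, "c2": 0.30, "c3": 0.12, "c4": 0.40, "c5": 0.01}
r_ab = rank_order_correlation(event_a, event_b)
r_ba = rank_order_correlation(event_b, event_a)
print(r_ab, r_ba)
```

Because both events are ranked over the same set of shared cells, swapping the arguments only swaps the two rank vectors, and the sum of squared rank differences is unchanged.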

      Several control analyses are needed to show that the sequences detected reflect not random patterns but those that repeat at a higher than random chance. This requires, at the first step, to establish to what degree sequences are consistent within a cluster and to what degree individual events show a sequential firing tendency. And at the next stage, these need to be compared with randomised events in which spike timing of cells is jittered or spike identity is randomised, and show that these events result in poorer sequence tendency and less consistent clusters.

      The controls requested by the reviewer are already implemented in our Method (see original publication of the Method in Chenani et al., 2019). This is now made clearer in the Methods section.

      Firing rate and place-related firing of cells alone could generate sequences even if cells otherwise fire independently from each other. In a similar manner, it was shown before that reactivation of waking cell assemblies could be seen in sleep, in which case firing rate differences across cells belonging to the same assembly could also generate sequential patterns without temporal coordination. Appropriate shuffling procedures need to be performed to exclude such scenarios.

      We are aware that the sequential firing in our data (particularly during the outward phase, when the animal is performing the task) most likely results from the correlations between rate maps and the animal’s trajectory. During the reward, this is less likely. An intrinsic control is that during sampling we do not see these sequences. Given the nature of the calcium signal, a direct connection to firing rate is not possible. However, we argue that our center-of-mass approach to the calcium trace effectively normalizes for firing rate effects. Shuffling dF/F amplitudes (as a proxy for firing rates) would thus have no effect on the center-of-mass sequences. We consider this to be an important methodological difference between sequence analysis with spikes and with calcium signals and have added a related comment to the Methods.
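
The normalization argument can be illustrated with a short sketch, under the assumption that the activation time of a cell within a burst is the amplitude-weighted mean time of its dF/F trace and that an amplitude shuffle amounts to rescaling each cell's trace by a constant factor. The trace values are invented for illustration.

```python
def center_of_mass(trace, times):
    """Amplitude-weighted mean time of a non-negative calcium trace."""
    total = sum(trace)
    return sum(t * x for t, x in zip(times, trace)) / total

times = [0.0, 0.1, 0.2, 0.3, 0.4]       # seconds within the burst window
dff = [0.0, 0.5, 1.0, 0.5, 0.0]         # hypothetical dF/F transient
scaled = [3.0 * x for x in dff]         # same transient, three times larger

print(center_of_mass(dff, times), center_of_mass(scaled, times))
```

The scale factor appears in both the numerator and the denominator and cancels, so the ordering of cells by center of mass does not depend on transient amplitude.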

      The past literature describing mPFC reactivation, replay, and sequences needs to be described, and findings of this work need to be appropriately acknowledged, and those findings compared with this work (starting with this work from 2007 PMID: 18006749). In the current reading, a novice reader of this field might conclude that this is the first work that identified replay and sequences in the mPFC.

      We apologize that the manuscript evokes this impression. This was not our intention; in fact, we have placed strong emphasis on the Kaefer et al. paper in the Discussion. We have now added early references on PFC replay based on electrophysiological recordings to the Discussion section.

      The analysis of Figure 4H is not sufficient to show that only forward sequences occur. If 50% are forward and 50% are reverse, the median is zero. Some of the presented histograms look like Gaussian distributions with SD=1, which would show that those events were not real sequences. It should be tested whether the distributions are significantly different from the expected Gaussian.

      We agree with the reviewer that we did not explicitly test for the significance of individual replays, but only tested for the rightward shift of the median. We have now added these significance tests and p-values (Figure 5) and could indeed show that the fraction of significant backward replays never exceeds that expected by chance, whereas forward replay significantly exceeds chance levels only in the cases where the median had a significant rightward shift (except for non-SI clusters). We would like to thank the reviewer for this suggestion, which we think makes the analysis stronger.

      Overall, the clarity of the text could be improved, and further examples of reactivated sequences should be shown, and the methods should be illustrated in the figures. At the current version, I fear that even readers in this field would give up on reading the current text given an insufficient level of clarity.

      We have included more examples of reactivated sequences (Suppl2 to Figure 5) and made extensive additions to the Methods. In particular, we followed the reviewer’s request for method illustration (new Figure 6).

      Reviewer #3 (Recommendations for the authors):

      My main comment here is for the authors to increase the clarity of the manuscript.[...] For instance, it was difficult to follow what was being done to determine TSs and GSs.

      We have made extensive additions to the Methods section including a new Figure 6 depicting the workflow of the sequence analysis in a schematic manner.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      Strengths:

      The innovation on the task alone is likely to be impactful for the field, extending recent continuous report (CPR) tasks to examine other aspects of perceptual decision-making and allowing more naturalistic readouts. One interesting and novel finding is the observation of dyadic convergence of confidence estimates even when the partner is incidental to the task performance, and that dyads tend to be more risk-seeking (indicating greater confidence) than when playing solo. The paper is well-written and clear.

      We thank reviewer 1 for this encouraging evaluation. Below we address the identified weaknesses and recommendations.

      (1) Do we measure metacognitive confidence?

      One concern with the novel task is whether confidence is disambiguated from a tracking of stimulus strength or coherence. […] But in the context of an RDK task, one simple strategy here is to map eccentricity directly to (subjective) motion coherence - such that the joystick position at any moment in time is a vector with motion direction and strength. This would still be an interesting task - but could be solved without invoking metacognition or the need to estimate confidence in one's motion direction decision. […] what the subjects might be doing is tracking two features of the world - motion strength and direction. This possibility needs to be ruled out if the authors want to claim a mapping between eccentricity and decision confidence […].

      We thank reviewer 1 for pointing out that the joystick tilt responses of our subjects could potentially be driven by stimulus coherence instead of metacognitive decision confidence. Below, we present four arguments to address this point of concern:

      (1.1) Similar physical coherence between high and low confidence states

      Nominal motion coherence is a discrete value, but the random noisiness in the stimulus causes the actual frame-by-frame coherence to be distributed around this nominal value. Because of this, subjects might scale their joystick tilt report according to the coherence fluctuations around the nominal value. To check if this was the case, we use a median split to separate stimulus states into states with large versus small joystick tilt, individually for each nominal coherence. For each stimulus state, we extracted the actual instantaneous (frame-to-frame) motion coherence, which is based on the individual movements of dots in the stimulus patch between two frames, recorded in our data files.

      First, we compared the motion coherence between stimulus states with large versus small joystick tilt. For each stimulus state, we calculated the average instantaneous motion coherence, and analyzed the difference between the medians of the large-tilt and small-tilt distributions for each subject and each coherence level. The resulting histograms show the distribution of differences across all 38 subjects for each nominal coherence, and are, except for the coherence of 22%, not significantly different from zero across subjects (Author response image 1). For the 22% coherence condition, the difference amounts to 0.19% – a very small, non-perceptible difference. Thus, we do not find systematic differences between the average motion coherence in states with high versus low joystick tilt.
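
A minimal sketch of this median-split comparison for one subject and one nominal coherence is given below. The data layout (one pair of mean tilt and mean instantaneous coherence per stimulus state), the function name, and the numbers are assumptions for illustration; states with a tilt exactly at the median are assigned to the small-tilt group here.

```python
from statistics import median

def median_split_difference(states):
    """Median coherence of large-tilt states minus that of small-tilt states.

    `states` is a list of (mean_tilt, mean_instantaneous_coherence) pairs.
    """
    tilt_median = median(tilt for tilt, _ in states)
    large = [coh for tilt, coh in states if tilt > tilt_median]
    small = [coh for tilt, coh in states if tilt <= tilt_median]
    return median(large) - median(small)

# Hypothetical stimulus states at a nominal coherence of 22%.
states = [(0.2, 21.8), (0.9, 22.1), (0.4, 22.3),
          (0.7, 21.9), (0.5, 22.0), (0.8, 22.2)]
difference = median_split_difference(states)
print(round(difference, 3))
```

A positive value means higher average coherence in large-tilt states; collecting one such difference per subject gives the distribution that is then tested against zero.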

      Author response image 1.

      Histograms of within-subject difference between medians of average coherence distributions with large and small joystick tilt for all subjects. Coherence is color-coded (cyan – 0%, magenta – 98%). On top, the title of each panel illustrates the number of significant differences (Ranksum test in each subject) without correction for multiple comparisons (see Author response table 1 below). In the second row of the title, we show the result of the population t-test against zero. Only 22% coherence shows a significant bias. Positive values indicate higher average coherence for large joystick tilt.  

      Author response table 1.

      List of all individual significantly different coherence distributions between high and low tilt states, without correction for multiple comparisons. Median differences do not show a consistent bias (i.e. positive values) that would indicate higher average coherence for the large tilts.

      (1.2) Short-term stimulus fluctuations have no effect

      […] But to fully characterise the task behaviour it also seems important to ask how and whether fluctuations in motion energy (assuming that the RDK frames were recorded) during a steady state phase are affecting continuous reporting of direction and eccentricity, prior to asking how social information is incorporated into subjects' behaviour.

      In addition to the analysis of stimulus coherence and tilt averaged across each stimulus state (1.1), we analyzed the moment-to-moment relationship between instantaneous coherence and ongoing reports of accuracy and tilt. Below, we provide evidence that short-term fluctuations in the instantaneous coherence (i.e. the motion energy of the stimulus) do not result in correlated changes in joystick responses, for either tilt or accuracy. For each continuous stimulus state, we calculated cross-correlation functions between the instantaneous coherence, tilt and accuracy, averaged the cross-correlation across all states of the same nominal coherence, and then averaged across subjects. The resulting average cross-correlation functions are essentially flat. This further supports our interpretation that the joystick reports do not reflect short-term fluctuations of motion energy.

      Author response image 2.

      Cross-correlation between the length of the resultant vector with joystick accuracy (left) and tilt (right). Coherence is color-coded. Shaded background illustrates 95% confidence intervals.
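
      For readers who want to reproduce this type of analysis, a lagged cross-correlation within a single stimulus state can be sketched as below. This is a minimal sketch: the function names, the use of a Pearson coefficient at each lag, and the sign convention (positive lag means the second signal follows the first) are assumptions for illustration and may differ from our actual implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cross_correlation(x, y, max_lag):
    """Return {lag: r}; at a positive lag, x[t] is paired with y[t + lag]."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[: len(y) + lag]
        result[lag] = pearson(xs, ys)
    return result
```

      Averaging such functions across states and then across subjects gives one curve per nominal coherence; a flat curve indicates that the joystick signal does not follow the stimulus fluctuations at any tested lag.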

      (1.3) Joystick tilt changes over time despite stable average stimulus coherence

      If perceptual confidence is derived from evidence integration, we should see changes over time even when the stimulus is stable. Here, we have analyzed the average slope of the joystick tilt as a function of time within each stimulus state for each subject and each coherence, to verify if our participants tilted their joystick more with additional evidence. This is illustrated with a violin plot below (Author response image 3). The linear slopes of the joystick tilt progression over the course of stimulus states are different between coherence levels. High coherence causes more tilt over time, resulting in positive slopes for most subjects. In contrast, low/no coherence results mostly in flat or negative slopes. This tilt progression over time indicates that low coherence results in lower confidence, as subjects do not wager more with weak evidence. In contrast, high coherence causes subjects to exhibit more confidence, indicated by positive slope of the joystick tilt.

      Author response image 3.

      Violin plots showing the fitted slopes of the joystick tilt time course in the last 200 samples (1667 ms) leading up to a next stimulus direction (cf. Figure 2D). Positive values signify an increase in joystick tilt over time. Each dot shows the average slope for one subject. Coherence is color-coded. The dashed line at zero indicates unchanged joystick tilt over the analyzed time window.
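
      The slope estimate for a single stimulus state can be sketched as an ordinary least-squares fit over the final samples of the tilt time course. The function name and defaults are assumptions for this sketch; the 120 Hz sampling rate is inferred from the statement that 200 samples correspond to 1667 ms.

```python
def tilt_slope(tilt, n_samples=200, sample_rate_hz=120.0):
    """Least-squares slope (tilt units per second) over the last n_samples of a state."""
    window = tilt[-n_samples:]
    n = len(window)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    # Time axis in seconds, starting at the first sample of the window.
    t = [i / sample_rate_hz for i in range(n)]
    mt, mv = sum(t) / n, sum(window) / n
    num = sum((a - mt) * (b - mv) for a, b in zip(t, window))
    den = sum((a - mt) ** 2 for a in t)
    return num / den
```

      Averaging these per-state slopes within each subject and coherence level would yield one value per dot in a plot of this kind.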

      (1.4) Cross-correlation between response accuracy and joystick tilt

      Similar to 1.2 above, we have cross-correlated the frame-by-frame changes of joystick accuracy and tilt for each individual stimulus state and each subject. Across subjects, changes in tilt occur later than changes in accuracy, indicating that changes in the quality of the report are followed by changes in the size of the wager. Given that this process is not driven by short-term changes in the motion energy of the stimulus (see 1.2 above), we interpret this as additional evidence for a metacognitive assessment of the quality of the behavioral report (i.e. accuracy) reflected in the size of the wager (our measure for confidence). (See Figure 2E).

      (2) Peri-decision wagering is different to post-decision wagering

      […] One route to doing this would be to ask whether the eccentricity reports show statistical signatures of confidence that have been established for more classical punctate tasks. Here a key move has been to identify qualitative patterns in the frame of reference of choice accuracy - with confidence scaling positively with stimulus strength for correct decisions, and negatively with stimulus strength for incorrect decisions (the so-called X-pattern, for instance Sanders et al. 2016 Neuron […].

      We thank reviewer 1 for the constructive feedback. Our behavioral data do not show similar signatures to the previously reported post-decision confidence expression (Desender et al., 2021; Sanders et al., 2016). The previously described patterns show, first of all, that confidence for incorrect type 1 decisions diverges from that for correct type 1 decisions, declining with stimulus strength (e.g. coherence), in contrast to the increase for correct decisions. In our task, there is a graded accuracy and (putative) confidence expression, but there are no correct or incorrect decisions – instead, there are hits and misses of the reward targets presented at nominal directions. Instead of a decline for misses, we observe an equally positive scaling of confidence with coherence, both for hits and misses (Author response image 4A). This is because in our peri-decision wagering task, the expression of confidence causally determines the binary hit or miss outcome. The outcome in our task is a function of the two-dimensional joystick response: higher tilt (confidence) requires a more accurate response to successfully hit a target. Thus, a subject can display a high (but not high enough) level of accuracy and confidence but still remain unsuccessful. If we instead median-split the confidence reports by high and low accuracy (Author response image 4C), we observe a slight separation, especially for higher coherences, but still no clear difference in slopes.

      We do observe the other two dynamic signatures of confidence (Desender et al., 2021): signature 2 – monotonically increasing accuracy as a function of confidence (Author response image 4), and signature 3 – steeper type 1 psychometric performance (accuracy) for high versus low confidence (Author response image 4D).

      Author response image 4.

      Confidence (i.e., joystick tilt, left column) and accuracy reports (right column) for different stimulus coherence, sorted by discrete outcome (hit versus miss, upper row) and the complementary joystick dimension (lower row, based on median split).

      Author response image 5.

      Accuracy reports correlate positively with confidence reports. For each stimulus state, we averaged the joystick response in the time window between 500 ms (60 samples) after a direction change until the first reward target appearance. If there was no target, we took all samples until the next RDP direction change into account. This corresponds to data snippets averaged in Figure 2D. Thus, for each stimulus state, we extracted a single value for joystick accuracy and for tilt (confidence). Subsequently, we fitted a linear regression to the accuracy-confidence scatter within each subject and within each coherence level. The plot above shows the average linear regression between accuracy and confidence across all subjects (i.e., the slopes and intercepts were averaged across n=38 subjects). Coherence is color-coded.
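
      The averaging of per-subject regression lines described in this caption can be sketched as follows. The function names and the choice of confidence as predictor and accuracy as outcome are assumptions for illustration; the caption itself does not fix which variable is on which axis.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    den = sum((a - mx) ** 2 for a in x)
    if den == 0:
        raise ValueError("x has zero variance; slope is undefined")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / den
    return slope, my - slope * mx


def average_fit(per_subject):
    """Average slope and intercept over subjects.

    per_subject is a list of (confidence_values, accuracy_values) pairs, one pair
    per subject, each holding one value per stimulus state (hypothetical layout).
    """
    fits = [fit_line(conf, acc) for conf, acc in per_subject]
    n = len(fits)
    mean_slope = sum(s for s, _ in fits) / n
    mean_intercept = sum(i for _, i in fits) / n
    return mean_slope, mean_intercept
```

      Running this once per coherence level gives one average line per coherence, which is the quantity a plot like this one displays.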

      (3)  Additional analyses regarding the continuous nature of our data

      I was surprised not to see more analysis of the continuous report data as a function of (lagged) task variables. […]

      Reviewer 1 requested more analyses regarding the continuous nature of our data. We agree that this is a useful addition to our paper, and thank reviewer 1 for this suggestion. To address this point, we revised main Figure 2 and provided additional panels. Panel D illustrates the continuous ramp-up of both accuracy and tilt (confidence) for high coherence levels, suggesting ongoing evidence integration and meta-cognitive assessment. Panel E shows the cross-correlation between frame-by-frame changes in accuracy and tilt (see 1.4 above). Here, we demonstrate that changes in the accuracy precede changes in joystick tilt, characterizing the continuous nature of the perceptual decision-making process.

      (4) Explicit motivation regarding continuous social experiments

      This paper is innovating on a lot of fronts at once - developing a new CPR task for metacognition, and asking exploratory questions about how a social setting influences performance on this novel task. However, the rationale for this combination was not made explicit. Is the social manipulation there to help validate the new task as a measure of confidence as dissociated from other perceptual variables? (see query 1 below). Or is the claim that the social influence can only be properly measured in the naturalistic CPR task, and not in a more established metacognition task?

      Our rationale for the combination of real-time decision making and social settings was twofold:

      i. Primates, including humans, are social species. Naturally, most behavior is centered around a social context and continuously unfolds in real-time. We wanted to showcase a paradigm in which distinct aspects of continuous perceptual decision-making could be assessed over time in individual and social environments.

      ii. Human behavior is susceptible to what others think and do. We wanted to demonstrate that the sheer presence of a co-acting social partner affects continuous decision-making, and quantify the extent and direction of social modulation.

      We agree that the motivation for combining the new task and this specific type of social co-action should be clearer. We have clarified this aspect in the Introduction, lines 92-109. In brief, the continuous, free-flowing nature of the CPR task and real-time availability of social information made this design a very suitable paradigm for assessing unconstrained social influences. We see this study as the first step towards disentangling the neural basis of social modulation in primates. See also the response to reviewer 2, point 2, below.

      (5) Response to minor points

      (5.1)  Clarification on behavioral modulation patterns

      Lines 295-298, isn't it guaranteed to observe these three behavioral patterns (both participants improving, both getting worse, only one improving while the other gets worse) even in random data?

      The reviewer is correct. We now simply illustrate these possibilities in Figure 4B and how these patterns could lead to divergence or convergence between the participants (see also line 282). Unlike random data, our results predominantly demonstrate convergence.

      (5.2) Clarification on AUC distributions

      Lines 703-707, it wasn't clear what the AUC values referred to here (also in Figure 3) - what are the distributions that are being compared? I think part of the confusion here comes from AUC being mentioned earlier in the paper as a measure of metacognitive sensitivity (correct vs. incorrect trial distributions), whereas my impression here is that here AUC is being used to investigate differences in variables (e.g., confidence) between experimental conditions.

      We apologize for the confusion. Indeed, the AUC analysis was used for the two purposes:

      (i) To assess the metacognitive sensitivity (line 175, Supplementary Figure 2).

      (ii) To assess the social modulation of accuracy and confidence (starting at line 232, Figures 3-6). 

      We now introduce the second AUC approach for assessing social modulation, and the underlying distributions of accuracy and confidence derived from each stimulus state, separately in each subject, in line 232.
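
      To make the second use of AUC concrete, the area under the ROC curve for two distributions can be computed as the probability that a value drawn from one condition exceeds a value drawn from the other. The sketch below assumes this pairwise (Mann-Whitney) formulation and uses hypothetical argument names; it is meant as an illustration of the measure, not as a copy of our analysis code.

```python
def auc(solo, dyadic):
    """Probability that a dyadic value exceeds a solo value (ties count as 0.5).

    solo and dyadic hold one value per stimulus state (e.g. accuracy or tilt)
    for a single subject in the two conditions. 0.5 means no modulation;
    values above 0.5 mean higher values in the dyadic condition.
    """
    if not solo or not dyadic:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for a in solo:
        for b in dyadic:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(solo) * len(dyadic))
```

      With this convention, the same function serves the first purpose as well if the two inputs are the confidence values of miss and hit states instead of solo and dyadic states.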

      (5.3) Clarification of potential ceiling effects

      Could the findings of the worse solo player benefitting more than the better solo player (Figure 4c) be partly due to a compressive ceiling effect - e.g., there is less room to move up the psychometric function for the higher-scoring player?

      We thank the reviewer for this insight. First, even better performing participants were not at ceiling most of the time, even at the highest coherence (cf. Figure 2 and Supplementary Figure 3C). To test for the potential ceiling effect in the better solo players, we correlated their social modulation (expressed as AUC as in Figure 4) with their solo performance. There was no significant negative correlation for the accuracy (p > 0.063), but there was a negative correlation for the confidence (r = -0.39, p = 0.0058), indicating that indeed low performing “better players in a dyad” showed more positive social modulation. We note however that this correlation was driven mainly by a few such initially low performing “better” players, who mostly belonged to the dyads where both participants improved in confidence (green dots, Figure 4B), and that even the highest solo average confidence was not at ceiling (<0.95). To conclude, the asymmetric social modulation effect we observe is mainly due to the better players declining (orange and red dots, Figure 4B), rather than due to both players improving but the better player improving less (green dots, Figure 4B).

      Reviewer 2:

      Strengths:

      There are many things to like about this paper. The visual psychophysics has been undertaken with much expertise and care to detail. The reporting is meticulous and the coverage of the recent previous literature is reasonable. The research question is novel.

      We thank reviewer 2 for this positive evaluation. Below we address the identified weaknesses and recommendations.

      (1) Streamlining the text to make the paper easier to read

      The paper is difficult to read. It is very densely written, with little to distinguish between what is a key message and what is an auxiliary side note. The Figures are often packed with sometimes over 10 panels and very long captions that stick to the descriptive details but avoid clarity. There is much that could be shifted to supplementary material for the reader to get to the main points.

      We thank reviewer 2 for the honest assessment that our article was difficult to read and understand, and for providing specific examples of confusion. We substantially improved the clarity:

      We added a Glossary that defines key terms, including Accuracy and Hit rate. 

      We replaced the confusing term “eccentricity” with joystick “tilt”.

      We simplified Figures 3 and 5, moving some panels into supplementary figures.

      We substantially redesigned and simplified our main Figure 4, displaying the data in a more straightforward, less convoluted way, and removing several panels. This change was accompanied by corresponding changes in the text (section starting at line 277).

      More generally, we shortened the Introduction, substantially revised the Results and the figure legends, and streamlined the Discussion.

      (2) Dyadic co-action vs joint dyadic decision making

      A third and very important one is what the word "dyadic" refers to in the paper. The subjects do not make any joint decisions. However, the authors calculate some "dyadic score" to measure if the group has been able to do better than individuals. So the word dyadic sometimes refers to some "nominal" group. In other places, dyadic refers to the social experimental condition. For example, we see in Figure 3c that AUC is compared for solo vs dyadic conditions. This is confusing.

      […] my key criticism is that the paper makes strong points about collective decision-making and compares its own findings with many papers in that field when, in fact, the experiments do not involve any collective decision-making. The subjects are not incentivized to do better as a group either. […]

      The reviewer is correct to highlight these important aspects. We did, in fact, not investigate a situation where two players had to reach a joint decision with interdependent payoff and there was no incentive to collaborate or even incorporate the information provided by the other player. To make the meaning of “dyadic” in our context more explicit, we have clarified the nature of the co-action and independent payoff (e.g. lines 107, 211, 482, 755 - Glossary), and used the term “nominal combined score” (line 224) and “nominal “average accuracy” within a dyad” (line 439).

      Concerning the key point about embedding our findings into the literature on collective decision-making, we would like to clarify our motivation. Outside of the recent study by Pescetelli and Yeung, 2022, we are not aware of any perceptual decision-making studies that investigated co-action without any explicit joint task. So naturally, we were stimulated by the literature on collective decisions, and felt it was appropriate to compare our findings to the principles derived from this exciting field. Besides developing the continuous – in time and in “space” (direction) – peri-decision wagering CPR game, the social co-action context is the main novel contribution of our work. Although it is possible to formulate cooperative or competitive contexts for the CPR, we leveraged the free-flowing continuous nature of the task, which makes it most readily amenable to studying spontaneously emerging social information integration.

      We now more explicitly emphasize that most prior work has been done using the joint decision tasks, in contrast to the co-action we study here, in Introduction and Discussion.

      (3) Addition of relevant literature to Discussion

      […] To see why this matters, look at Lorenz et al PNAS (https://www.pnas.org/doi/10.1073/pnas.1008636108) and the subsequent commentary that followed it from Farrell (https://www.pnas.org/doi/full/10.1073/pnas.1109947108). The original paper argued that social influence caused herding which impaired the wisdom of crowds. Farrell's reanalysis of the paper's own data showed that social influence and herding benefited the individuals at the expense of the crowd demonstrating a form of tradeoff between individual and joint payoff. It is naive to think that by exposing the subjects to social information, we should, naturally, expect them to strive to achieve better performance as a group.

      Another paper that is relevant to the relationship between the better and worse performing members of the dyad is Mahmoodi et al PNAS 2015 (https://www.pnas.org/doi/10.1073/pnas.1421692112). Here too the authors demonstrate that two people interacting with one another do not "bother" figuring out each others' competence and operate under "equality assumption". Thus, the lesser competent member turns out to be overconfident, and the more competent one is underconfident. The relevance of this paper is that it manages to explain patterns very similar to Schneider et al by making a much simpler "equality bias" assumption.

      We thank reviewer 2 for pointing out these highly relevant references, which we have now integrated in the Discussion (lines 430 and 467). Regarding the debate between Lorenz et al. and Farrell, although it concerns a very different type of task (single-shot factual knowledge estimation), it is very illuminating for understanding the differing perspectives on individual vs group benefit. We fully agree that it is naïve to assume that during independent co-action in our highly demanding task participants would strive to achieve better performance as a group – if anything, we expected less normative and more informational, reliability-driven effects as a way to cope with task demands.

      Mahmoodi et al. is a particularly pertinent and elegant study, and the equality bias they demonstrate may indeed underlie the effects we see. We admit that we did not know this paper at the time of our initial writing, but it is encouraging to see the convergence [pun intended] despite task and analysis differences. As highlighted above (2), our novel contributions remain that we observe mutual alignment, or convergence, in real-time without explicitly formulated collective decision task and associated social pressure, and that we separate asymmetric social effects on accuracy and confidence.

      Other reviewer-independent changes:

      Additional information: Angular error in Figure 2

      In panel A of the main Figure 2, we have added the angular error of the solo reports (blue dashed line) to give readers an impression about the average deviation of subjects’ joystick direction from the nominal stimulus direction. We have pointed out that angular error is the basis for accuracy calculation.

      Data alignment

      In the previous version of the manuscript, we presented data with different alignments: accuracy values were aligned to the appearance of the first target in a stimulus state (target-alignment) to avoid the predictive influence of target location within the remaining stimulus state, while the joystick tilt was extracted at the end of each stimulus state (state-alignment) to allow subjects more time to make a deliberate, confidence-guided report (Methods). We realized that this is confusing, as it compares the social modulation of the two response dimensions at different points in time. In the revision, we use state-aligned data in most figures and analyses and clearly indicate which alignment type has been used. We kept the target-alignment for the illustration of the angular error in the solo behavior (Figure 2). This has changed only the reporting of accuracy statistics. None of the results have changed fundamentally, but the social modulation of accuracy became even stronger in state-aligned data.

      In summary, we hope that these revisions have resulted in an easier-to-understand and convincing article, with clear terminology and concise and important takeaway messages.

      We thank both reviewers and the editors again for their time and effort, and look forward to the reevaluation of our work.

      References

      Desender K, Donner TH, Verguts T. 2021. Dynamic expressions of confidence within an evidence accumulation framework. Cognition 207:104522. doi:10.1016/j.cognition.2020.104522

      Pescetelli N, Yeung N. 2022. Benefits of spontaneous confidence alignment between dyad members. Collective Intelligence 1. doi:10.1177/26339137221126915

      Sanders JI, Hangya B, Kepecs A. 2016. Signatures of a Statistical Computation in the Human Sense of Confidence. Neuron 90:499–506. doi:10.1016/j.neuron.2016.03.025

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      This manuscript uses optical coherence tomography (OCT) to visualize tissue microstructures about 1-2 mm under the finger pad skin surface. Their geometric features are tracked and used to generate tissue strains upon skin surface indentation by a series of transparent stimuli both normal and tangential to the surface. Then movements of the stratum corneum and the upper portion of the viable epidermis are evaluated. Based upon this data, across a number of participants and ridges, around 300 in total, the findings report upon particular movements of these tissue microstructures in various loading states. A better understanding of the mechanics of the skin microstructures is important to understand how surface forces propagate toward the locations of mechanoreceptive end organs, which lie near the edge of the epidermis and dermis, from which tactile responses of at least two peripheral afferents originate. Indeed, the microstructures of the skin are likely to be important in shaping how neural afferents respond and enhance their sensitivity, receptive field characteristics, etc. 

      Strengths: 

      The use of OCT in the context of analyzing the movements of skin microstructures is novel. Also novel and powerful is the use of distinct loading cases, e.g., normal, tangential, and stimulus features, e.g., edges, and curves. I am unaware of other empirical visualization studies of this sort. They are state-of-the-art in this field.

      Moreover, in addition to the empirical imaging observations, strain vectors in the tissues are calculated over time. 

      Weaknesses: 

      The interpretation of the results and their framing relative to the overall hypotheses/questions and prior works could be articulated more clearly. In particular, the major findings of the manuscript are in newly describing a central concept regarding "ridge flanks," but such structures are neither anatomically nor mechanistically defined in a clear fashion. For example, "... it appears that the primary components of ridge deformation and, potentially, neural responses are deformations of the ridge flanks and their relative movement, rather than overall bending of the ridges themselves." From an anatomical perspective, I think what the authors mean by "ridge flanks" is a differential in strain from one lateral side of a papillary ridge to the other. But is it unclear what about the continuous layers of tissue would cause such behaviors. Perhaps a sweat duct or some other structure (not visible to OCT) would subdivide the "flanks" of a papillary ridge somehow? If not due to particular anatomy, then is the importance of the "ridge flank" due to a mechanistic phenomenon of some sort? Given that the findings of the manuscript center upon the introduction of this new concept, I think a greater effort should be made to define what exactly are the "ridge flanks." It is clear from the results, especially the sliding case, that there is something important that the manuscript is getting at with this concept. 

      We apologize for the confusion around our use of ‘ridge flanks’. To recap the overall goal briefly, we wanted to measure the deformation of papillary ridges and their associated sub-surface structures to different tactile stimuli. Capturing these deformations and comparing them against different proposed ideas, for example bending (horizontal shear) of the entire ridge versus differential deformations of different sub-parts, constrains neural activation mechanisms, has implications for how well tactile stimuli can be spatially resolved on the skin, and for whether sub-surface deformations can be easily predicted from surface movements alone. Our mesh was dense enough to compare the stratum corneum and the viable epidermis directly, where we expected some differences due to their previously documented mechanical differences, as well as the ridge flanks, which refers to the two (proximal and distal) sides of a single papillary ridge and their associated structure in the SC and VE (as correctly surmised by the reviewer). Differential behaviour across ridge flanks might be seen, because various observations of the surface of the stratum corneum had suggested mechanical differences between the papillary ridges and the grooves dividing them, potentially leading to differential deformations of these two halves depending on which direction they were facing tissue with different mechanical properties.

      We now provide a clearer definition of ridge flanks in Figure 1 and in the main text. Importantly, existing prior research is better connected to our own investigation in the Introduction and we now specifically explain why we investigate ridge flanks.

      The OCT used herein cannot visualize deep and fully into what the manuscript refers to as a "ridge"(note others have previously broken apart this concept apart into "papillary", "intermediate" and "limiting" ridges) near locations of the mechanoreceptive end organs lie at the epidermal-dermal border. Therefore, the OCT must make inferences about the movements of these deeper tissues, but cannot see them directly, and it is the movements of these deeper tissues that are likely driving the intricacies of neural firing. Note the word "ridge" is used often in the manuscript's abstract, introduction, and discussion but the definition in Fig. 1 and elsewhere differs in important ways from prior works of Cauna (expert in anatomy). Therefore, the manuscript should clarify if "ridge" refers to the papillary ridge (visible at the exterior of the skin), intermediate ridge (defined by Cauna as what the authors refer to as the primary ridge), and limiting ridge (defined by Cauna as what the authors refer to as the secondary ridge). What the authors really mean (I think) is some combination of the papillary and intermediate ridge structures, but not the full intermediate ridge. The manuscript acknowledges this in the "Limitations and future work" section, stating that these ridges cannot be resolved. This is important because the manuscript is oriented toward tracking this structure. It sets up the narrative and hypotheses to evaluate the prior works of Cauna, Gerling, Swensson, and others who all directly addressed the movement of this anatomical feature which is key to understanding ultimately how stresses at these locations might move the peripheral end organs (i.e., Merkel cells, Meissner corpuscles). 

      Thank you for these observations. Indeed, our terminology was not consistent. We have now switched to Cauna’s terminology and added additional labels in Figure 1, explaining all mentioned structures in the main text. We have also changed the language in many instances in the main text to make it clearer whether we are referring to individual anatomical ridges (papillary, limiting, etc.) or the whole structure. Additionally, it is now clearer from the start which features are tracked, and we specifically state that intermediate ridges are excluded from our tracking.

      Regarding the intermediate ridge, it indeed plays a big role in Cauna’s lever hypothesis. Given the intermediate ridge is excluded from our analysis, we can neither prove nor disprove this hypothesis in our current work. However, there are many mechanical mysteries to solve regarding the structures directly above, which are the main focus of this paper. We have rewritten the introduction to make these questions clearer. For example, Cauna observed pliability of the papillary ridges in surface experiments. Swensson found differential expression patterns of keratin in epidermis tissue in and above the intermediate ridges, but the direct mechanical consequences that are proposed in their paper concern the behaviour of papillary ridges, rather than relying on a mechanical role of intermediate ridges. Even Cauna’s lever idea implies specific deformation of the stratum corneum, which would be measurable in our study, as the upper handle of the ‘lever’ needs turning. We observed little movement in accordance with this idea, putting the lever mechanism into question. While this does not rule out a mechanical role of the intermediate ridge, these findings constrain its potential mechanisms.

      Reviewer #2 (Public Review): 

      Summary: 

      The authors investigate sub-skin surface deformations to a number of different, relevant tactile stimuli, including pressure and moving stimuli. The results demonstrate and quantify the tension and compression applied from these types of touch to fingerprint ridges, where pressure flattens the ridges. Their study further revealed that on lateral movement, prominent vertical shearing occurred in ridge deformation, with somewhat inconsistent horizontal shear. This also shows how much the deeper skin layers are deformed in touch, meaning the activation of all cutaneous mechanoreceptors, as well as the possibility of other deeper non-cutaneous mechanoreceptors. 

      Strengths: 

      The paper has many strengths. As well as being impactful scientifically, the methods are sound and innovative, producing interesting and detailed results. The results reveal the intricate workings of the skin layers to pressure touch, as well as sliding touch over different conditions. This makes it applicable to many touch situations and provides insights into the differential movements of the skin, and thus the encoding of touch in regards to the function of fingerprints. The work is very clearly written and presented, including how their work relates to the literature and previous hypotheses about the function of fingerprint ridges. The figures are very well-presented and show individual and group data well. The additional supplementary information is informative and the video of the skin tracking demonstrates the experiments well. 

      Weaknesses: 

      There are very few weaknesses in the work; rather, the authors detail the limitations well in the discussion. Therefore, this opens up lots of possibilities for future work.

      We thank the reviewer for these encouraging comments.

      Impact/significance: 

      Overall, the work will likely have a large impact on our understanding of the mechanics of the skin. The detail shown in the study goes beyond current understanding, to add profound insights into how the skin actually deforms and moves on contact and sliding over a surface, respectively. The method could be potentially applied in many other different settings (e.g. to investigate more complex textures, and how skin deformation changes with factors like dryness and aging). This fundamental piece of work could therefore be applied to understand skin changes and how these impact touch perception. It can further be applied to understand skin mechanoreceptor function better and model these. Finally, the importance of fingertip ridges is well-detailed, demonstrating how these play a role in directly shaping our touch perception and how they can shape the interactions we have with surfaces. 

      Reviewer #3 (Public Review): 

      Summary: 

      The publication presents unique in-vivo images of the upper layer of the epidermis of the glabrous skin when a flat object compresses or slides on the fingertip. The images are captured using OCT and are then processed to recover the strain that fingerprints experience during the mechanical stimulation.

      The most important finding is, in my opinion, that fingerprints undergo pure compression/tension without horizontal shear, hinting at the fact that the shear stress caused by the tangential load is transferred to the deeper tissues and ultimately to the mechanoreceptors (SA-I / RA-I). 

      Strengths: 

      Fascinating new insights into the mechanics of glabrous skin. To the best of my knowledge, this is the first experimental evidence of the mechanical deformation of fingerprints when subjected to dynamic mechanical stimulation. The OCT measurement allows an unprecedented measurement of the depth of the skin whereas previous works were limited to tracking the surface deformation. The robust data analysis reveals the continuum mechanics underlying the deformation of the fingerprint ridges.

      Weaknesses: 

      I do not see any major weaknesses. The work is mainly experimental and is rigorously executed. Two points pique my curiosity, however: 

      (1) How do the results presented in this study compare with previous finite element analysis? I am curious to know if the claim that the horizontal shear strain is transferred to the previous layer is also captured by these models. The reason is that the FEA models typically use homogeneous materials and whether or not the behavior in-silico and in-vivo matches would offer an idea of the nature of the stratum corneum. 

      Very few modeling studies have examined combined normal and tangential loading of the fingertip. Additionally, results are often expressed in terms of Von Mises stresses, and not deformation [1,2], making direct comparison challenging. Nevertheless, one multilayered study [3] supports our finding that the largest deformations are found in deeper tissues.

      [1] Shao, F., Childs, T. H. C., Barnes, C. J. & Henson, B. Finite element simulations of static and sliding contact between a human fingertip and textured surfaces. Tribology International 43, 2308–2316 (2010).

      [2] Tang, W. et al. Investigation of mechanical responses to the tactile perception of surfaces with different textures using the finite element method. Advances in Mechanical Engineering 8, (2016).

      [3] Amaied, E., Vargiolu, R., Bergheau, J. M. & Zahouani, H. Aging effect on tactile perception: Experimental and modelling studies. Wear 332–333, 715–724 (2015).

      (2) Was there a specific reason why the authors chose to track only one fingerprint? From the method section, it seems that nothing would have prevented tracking a denser point cloud and reconstructing the strain on a section of the skin rather than just one ridge. With such data, the authors could extend their analysis to interactions between multiple ridges and get a better sense of the behavior of the entire strip of skin.

      We apologise for the confusion regarding this point. While in our illustration and the accompanying videos, we only show a single tracked ridge for clarity, we do indeed track all visible ridges in every frame. As imaging slices were 4 mm wide, often 8-9 ridges were visible concurrently. However, during the sliding experiments the skin was sometimes dragged along with the stimulus, causing some ridges to disappear from view for certain periods and then re-enter the frame. This would make it difficult to expand the analysis to multiple ridges, but in any case, we found neighbouring ridges to behave very consistently within a given trial, so that their mechanical behaviour (relative to the tactile feature, if any) could be averaged in the analysis.

      Reviewer #1 (Recommendations For The Authors): 

      Discussion, line 213, "Thus, the primary mechanism through which the ridge conforms to the object involves the relative movement and shearing of the ridge flanks, rather than relying on the grooves as articulated joints." I don't see this as definitely proven in the imaging and analysis. This could be a hypothesis to come from this work for further evaluation but is a quite strong statement not obviously supported by the evidence.

      We have rephrased this statement as a proposal for further testing:

      “Therefore, we propose that the primary mechanism through which a ridge conforms to an object might involve the relative movement and shearing of the ridge flanks, rather than relying on the grooves as articulated joints.”

      Discussion, line 220, "Our findings strongly indicate that the majority of the surface movement of the skin was absorbed by deeper tissue rather than surface layers of the skin." But since there are no measurements of such tissues, or of collagen bundle tightening, etc., it is not obvious to me how this can be proven as it is not directly observable and was not modeled.

      We have reworded this paragraph to be more cautious and have included potential avenues for future testing of this idea:

      “It is possible that the majority of the surface movement of the skin was absorbed by deeper tissues rather than the surface layers of the skin imaged in the present study. If that is the case, recent modeling work has suggested that tissue deformations are highly dependent on the orientation of collagen fibers in these tissues (Duprez et al., 2024), which might be amenable to tracking in future OCT work to test this idea directly. Additionally, previous work investigating tactile afferent responses to tangential skin movements has reported strong activation of SA-2 receptors, thought to measure skin stretch mainly in deeper tissues (Saal et al., 2025), providing further indirect evidence.”

      Figure 1, A. As noted elsewhere, there are issues with the naming of the anatomy, and there is no definition of the concept of "ridge flanks." Also, it does not indicate the depth point to which OCT can resolve. 

      We have updated and expanded the labels in Figure 1A to clarify the anatomy (along with changes in the text described above). Figure 1C now includes a sentence about the resolvability of features below the mesh:

      “Detail view of a single OCT frame showing ridged skin structure and clear boundary between the stratum corneum and viable epidermis. A mesh covering the stratum corneum and the upper part of the viable epidermis (without the intermediate ridge) is overlaid spanning a single papillary ridge. The border between the viable epidermis and dermis is less clearly delineated, but some deeper features are resolved less well.”

      The concept of a ridge flank is now illustrated in Figure 1B(i) and Figure 1B(iv), and referred to in both the caption and main text. Updated figure caption text:

      “These deformations need not apply to the whole ridge structure but might affect different parts separately, e.g. via shearing in different directions across both ridge flanks as shown on the far right (see darker shading to highlight a single ridge flank).”

      Updated text in the main manuscript:

      “Additionally, if there are indeed mechanical differences between papillary ridges and their neighbouring grooves at the level of the stratum corneum, this might result in differential movements of the two sides of each papillary ridge, here referred to as ridge flanks (see Figure 1B-iv, right, for a potential example).”

      Note that Figure 4B also includes an illustration of this concept.

      Figure 1, B. This mechanical representation does not capture the entirety of the papillary-intermediate ridge unit in question, as set up by the authors in the introduction. Also, in the caption it is not ridge deformation, but upper SC and VE deformation. And the OCT cannot resolve the whole ridge. 

      We have reworded the figure caption:

      “Potential deformations of the tracked ridge structure, including the stratum corneum and the bulk of the viable epidermis, during tactile interactions, with arrows indicating the directions of relative deformation. [...]”

      Importantly, the main manuscript text has been rewritten in the introduction section to clarify our research question and how much of the sub-surface ridge structure is tracked:

      “From a mechanical standpoint, these conflicting interpretations raise the question of how the outermost two skin layers typically deform at the resolution of single papillary ridges, whether by tension, compression, or shear (see examples in Figure 1B). Additionally, such deformations might apply to individual papillary ridges and all their sub-surface structures equally, for example horizontal shearing that bends the papillary ridge in a certain direction, while levering its sub-surface aspects in the opposite direction. Conversely, individual parts of the ridge structure might deform differently. For example, the viable epidermis might deform to a different extent or in different directions due to its lower stiffness and different morphology. Additionally, if there are indeed mechanical differences between papillary ridges and their neighbouring grooves at the level of the stratum corneum, this might result in differential movements of the two sides of each papillary ridge, here referred to as ridge flanks (see Figure 1B-iv, right, for a potential example). To empirically address these questions, we employed Optical Coherence Tomography (OCT) to precisely measure the sub-surface deformation of individual fingerprint ridges in response to a variety of mechanical events. Specifically, we focused on the stratum corneum and the bulk of the viable epidermis (excluding intermediate ridges), which could be robustly resolved and tracked by our setup.”
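
The distinction between tension, compression, and shear drawn above can be made concrete with a small numerical sketch. The code below is an illustration only and is not the authors' analysis pipeline: it assumes three tracked points forming a triangle in the imaging plane (x horizontal, z depth), and the function names, coordinates, and axis convention are all hypothetical. It computes the 2D deformation gradient of the triangle and reads off the components of the displacement gradient.

```python
def deformation_gradient(ref, cur):
    """Return the 2x2 deformation gradient F mapping a reference triangle
    onto its deformed counterpart. Points are (x, z) tuples; the axis
    convention is an assumption made for this sketch."""
    (x0, z0), (x1, z1), (x2, z2) = ref
    (X0, Z0), (X1, Z1), (X2, Z2) = cur
    # Edge vectors of the reference triangle, stored as matrix columns.
    a, b = x1 - x0, x2 - x0
    c, d = z1 - z0, z2 - z0
    det = a * d - b * c
    if det == 0:
        raise ValueError("reference triangle is degenerate")
    # Edge vectors of the deformed triangle.
    A, B = X1 - X0, X2 - X0
    C, D = Z1 - Z0, Z2 - Z0
    # F = (deformed edges) * inverse(reference edges)
    return [
        [(A * d - B * c) / det, (B * a - A * b) / det],
        [(C * d - D * c) / det, (D * a - C * b) / det],
    ]


def displacement_gradient(F):
    """Subtract the identity from F to obtain the displacement gradient."""
    return [[F[0][0] - 1.0, F[0][1]], [F[1][0], F[1][1] - 1.0]]


# Hypothetical example: the element is compressed by 10% in depth while
# its top is displaced sideways relative to its base.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
deformed = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.9)]
H = displacement_gradient(deformation_gradient(reference, deformed))
print("stretch along x:", H[0][0])
print("stretch along z:", H[1][1])
print("x-displacement varying with depth:", H[0][1])
print("z-displacement varying along x:", H[1][0])
```

Positive diagonal terms correspond to tension and negative ones to compression along that axis. The two off-diagonal terms separate sideways displacement that changes with depth from vertical displacement that changes across the ridge; how these map onto the manuscript's horizontal and vertical shear is not specified here and is left as an assumption.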

      Figure 1, C: While it is noted in the caption that the locations of the intermediate and limiting ridges, as well as the collagen bundles, are clearly visible, it is not clear to me, although the caption uses these words. This is especially the case below the orange mesh. From the picture, and because this is not labeled, it leaves it up to my interpretation, it seems like the secondary ridge (limiting) is larger than the primary (intermediate). 

      We have reworded the caption as follows:

      “Detail view of a single OCT frame showing ridged skin structure and clear boundary between the stratum corneum and viable epidermis. A mesh covering the stratum corneum and the upper part of the viable epidermis (without the intermediate ridge) is overlaid spanning a single papillary ridge. The border between the viable epidermis and dermis is less clearly delineated.”

      Indeed, while the intermediate ridge was often visible in the OCT images, its size was rather inconsistent and it could appear as larger or smaller than the limiting ridge, while in histological images it is generally shown as larger (however note that there is somewhat limited data). This difference might be due to imaging artifacts, e.g. limited visibility into the deeper tissues, might reflect individual differences between participants, or could indicate that intermediate ridges are not of a consistent height in the (out-of-plane) direction along a given ridge. We have clarified this in the Limitations section of the Discussion:

      “[...] while we could confidently track landmarks associated with the stratum corneum, we could not reliably identify intermediate ridges in the viable epidermis, though they were visible in some of the frames, limiting the depth of the fitted mesh. We hypothesize that the additional depth of these ridges combined with their slender morphology might have degraded the signal. 3D OCT imaging (see below) might help to resolve these features in future work and settle open questions regarding their precise morphology.”

      Figure 1, D, and E: How do these measurements compare with the literature? They seem reasonable to me based on a cursory review, but there is a need to directly compare, especially since measurements in this context with the OCT are novel and could be valuable. 

      We have clarified this in the main text and added more references to the existing literature:

      “We measured an average ridge width of 0.47 mm across participants (Figure 1D), consistent with previous studies (Moore, 1989; Ohler and Cummins, 1942). Average skin layer thickness was 0.38 mm for the stratum corneum and 0.12 mm for the viable epidermis across our dataset (Figure 1E), again in agreement with previous studies using both in vivo imaging and ex vivo histology (Fruhstorfer et al., 2000; Lintzeri et al., 2022; Maiti et al., 2020).”

      Abstract 4th sentence's structure makes me think that hundreds of individual fingerprint ridges can be tracked at the same time. Perhaps it could be tweaked to clearly indicate that hundreds were tracked between trials between participants. 

      We have changed the sentence to now read:

      “Here, we used optical coherence tomography to image and track sub-surface deformations of hundreds of individual fingerprint ridges across ten participants and four individual contact events at high spatial resolution in vivo.”

      Introduction, 1st sentence, the fingertip per se is not an organ, though the skin is an organ. 

      Changed the wording from “organ” to “structure”.

      Introduction, 1st sentence, "... that convert skin deformations ..." Need to add word skin to be clear. 

      Done.

      Introduction, 3rd paragraph, "Alternately, the grooves may be stiffer or less ...". In this paragraph, and this sentence in particular, Cauna is cited and the words grooves and ridges are used. But this is not adequately explained. Cauna had distinct terminology, where he referred to papillary, intermediate, and limiting ridges, that exist in addition to ready ridges. It is important because the manuscript uses the word "ridges" in a non-specific way. This is done not just here but throughout the manuscript, and is central to the questions which can be addressed with OCT.

      Anatomy has been better defined and more extensively labelled in Figure 1A, including labels for ‘papillary ridges’ and ‘grooves’. We have reworded this paragraph to better explain the concepts and how they relate to the subsequent analyses in the paper:

      “Consequently, the mechanical response of the skin below its immediate surface remains largely unknown, leading to conflicting interpretations in the literature. For instance, it has been proposed that the papillary ridges are stiffer than the neighbouring grooves (Swensson et al., 1998), which might imply that normal loading of the skin might not affect the ridges’ profile appreciably. Conversely, other observations have suggested that the grooves are relatively stiff, allowing the papillary ridges to deform considerably (Cauna, 1954; Johansson and LaMotte, 1983). However, the sub-surface consequences of this putative pliability during object contact or stick-to-slip transitions (see e.g. Delhaye et al., 2016) are unclear: the whole ridge structure might bend as proposed in Cauna’s lever mechanism (Cauna, 1954), but this view has proved controversial (see e.g. Gerling and Thomas, 2008), with direct empirical evidence lacking.”

      Figure 1. Avoid red-green dots for colorblind accessibility. PMMA is not in the caption. 

      We have switched the colors of the mechanoreceptors in panel A to a colorblind-friendly scheme. We now also specify the material of the plates in the figure 1 caption.

      Results, line 102. "... papillary ridge structure...." Is this the ridge to which is being referred? 

      In conjunction with the updated labeling in Figure 1A, we have updated the terminology throughout the paper to be more consistent.

      Results, line 99. "We noted a small increase in the area of the stratum corneum, which was likely an artifact due to the fit of the mesh to the ridge's curvature ..." There is very little discussion of Fig. F's finding related to an increase in area in the SC and decrease in the VE. It makes me question if this finding in this panel is an artifact. With stiff tissue like stratum corneum, how would the area increase?

      This finding could be a measurement artifact or it could be the result of skin from neighbouring regions pushing into the imaged space. We have reworded the brief description in the Results:

      “We noted a small increase in the area of the stratum corneum, which was possibly an artifact due to the imperfect fit of the mesh to the ridge's curvature (but see Discussion for an alternative explanation).”

      Additionally, we have added a short section in the Discussion in the Limitations section:

      “Some of our tactile interactions might have caused skin deformations out-of-plane that were thus not measurable. For example, the slight increase in thickness of the stratum corneum under normal load might be explained as a measurement artifact due to the coarse nature of the mesh fitted, but could alternatively reflect tissue from out-of-plane regions pushing into the imaged space. Indeed, recent surface measurements of the skin's behaviour during initial object contact have reported compression of the skin in the plane parallel to its surface (Doumont et al., 2025), which would result in increasing thickness, assuming that the stratum corneum is incompressible. Future studies could consider creating three-dimensional reconstructions of the fingerprint structure to study such effects.”
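
The incompressibility argument in the passage above can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not a result from the study: it treats the tissue as perfectly incompressible, so the product of the three principal stretches equals one, and the 2% in-plane compression used in the example is a made-up value. The function name is hypothetical.

```python
def thickness_strain(in_plane_strain_x, in_plane_strain_y):
    """Thickness strain of an incompressible layer given the two in-plane
    engineering strains (negative values mean compression).

    Volume conservation requires (1 + ex) * (1 + ey) * (1 + ez) = 1.
    """
    stretch_x = 1.0 + in_plane_strain_x
    stretch_y = 1.0 + in_plane_strain_y
    if stretch_x <= 0.0 or stretch_y <= 0.0:
        raise ValueError("in-plane strains must be greater than -1")
    return 1.0 / (stretch_x * stretch_y) - 1.0


# Hypothetical example: 2% compression along both in-plane directions.
print(thickness_strain(-0.02, -0.02))
```

Under these assumptions, a 2% compression in both in-plane directions corresponds to a thickness increase of about 4.1%, which shows how modest lateral compression could account for a measurable thickening.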

      Figure 3. The colors used in slip and stick are not colorblind accessible. 

      We have changed the background colors in Figure 3A,B,C to a colorblind accessible version.

      Results, line 151, "Thus, most of this shearing must be sustained by deeper tissues." But there are no direct observations as such. Also, in the next sentence, "collagen fiber bundles" are referred to in a non-specific way. This section is highly speculative with no systematic visualization of these structures, and should probably be moved to the discussion. 

      We have reworded this sentence to be more cautious. We have now also highlighted collagen fiber bundles visible in the figure. Systematic analysis of these is beyond the scope of the present study, as these were not tracked, but might be possible in future studies. The reworded sentence reads as follows:

      “Thus, it is possible that shearing is sustained by deeper tissues, an effect that could be tested in future studies by directly tracking the angle and orientation of collagen fiber bundles anchoring the epidermis to deeper tissues (see highlighted examples in Figure 3B).”

      Results, line 161, " Horizontal shear ..." do you mean surface shear, per the Fig. 1 definition? 

      For consistency, we have changed the labels to ‘Horizontal shear’ and ‘Vertical shear’ in Figure 1A(iii) and Figure 1A(iv) as these are the terms used throughout the paper.

      Discussion, line 198, "... flatten even at relatively low forces." This is an interesting point and it would be useful to note how low exactly. 

      We have reworded this sentence to better reflect the findings described earlier:

      “We found that individual ridges tended to flatten considerably at relatively low forces of 0.5 N, with higher forces increasing deformations only moderately.”

      Reviewer #2 (Recommendations For The Authors): 

      Minor comments that could improve the paper even further 

      In the abstract, it may be good to specify that the stimuli were all applied to the finger, this was not an active, self-generated tactile interaction, e.g. change 'in response to a variety of tactile stimuli' to 'in response to a variety of passively-applied tactile stimuli'. 

      Done.

      Comment on the grey/blue colours in the figures. I like the combination of blue/orange for different conditions, but sometimes the blue is very difficult to see against the grey background. Is there any way of making the grey background shading lighter and/or the blue darker/more vivid?

      We have changed the color of the SC mesh to a darker shade of blue, which is more easily distinguished from the grey background. This applies to figures 2B/C, 3D, 4A/B/D/E, and all supplementary figures.

      Methods. Could you please add a little more detail about exactly where the images were taken, e.g. in the exact middle of the fingerpad, at the fingertip? Did you line up the skin fingerprint ridges to be in a plane? It is just to better understand how the stimulus moved against the skin, which itself is rounded, and whether it was at a point where the ridges were relatively linear or curved. 

      We have added the following text in the “Experimental set-up” section of the Methods:

      “The participant's finger was secured in a finger holder, which was positioned in such a way that the flat part of the fingertip distal to the whorl made initial contact with the plate as it was lowered onto the fingertip. The scanner was positioned such that its scan path aligned with the distal-proximal axis of the plate, targeting the centre line of the fingerpad so that the fingerprint ridges were oriented orthogonally to the line scan.”

      and

      “For these experiments, imaging focused on the central flat part of the contact area, such that all fingerprint ridges visible in the imaged region were in contact with the plate throughout the trial.”

      Methods. There is no section about statistics, yet you do use them in the paper. It may be good to add a few details in the methods to outline the package you used to do the statistics, as well as why you chose the tests you carried out. 

      We have added a new Statistics section at the end of the Methods:

      “Statistical tests were run in Python using the scipy.stats package. As distributions were skewed, we used non-parametric analyses throughout the study. Bonferroni corrections were used when multiple comparisons were made.”
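
As an illustration of the workflow described in this statement, the sketch below pairs a non-parametric test with a Bonferroni correction. It is not the authors' code. The response does not say which tests were used, so an exact two-sided sign test written with the standard library stands in for them here; `scipy.stats` offers comparable routines such as `wilcoxon` and `binomtest`. The data and function names are hypothetical.

```python
from math import comb


def sign_test_p(differences):
    """Exact two-sided sign test on paired differences.

    Zero differences are discarded; under the null hypothesis positive and
    negative differences are equally likely.
    """
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for d in nonzero if d > 0)
    k = min(positives, n - positives)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical paired differences for three comparisons.
comparisons = [
    [0.12, 0.08, 0.15, 0.03, 0.09, 0.11, 0.07, 0.10, 0.05, -0.02],
    [0.01, -0.03, 0.02, -0.01, 0.04, -0.02, 0.03, -0.04, 0.01, 0.02],
    [-0.06, -0.09, -0.04, -0.07, -0.05, -0.08, -0.03, -0.02, -0.10, -0.01],
]
raw = [sign_test_p(diffs) for diffs in comparisons]
for p, p_adj in zip(raw, bonferroni(raw)):
    print(f"raw p = {p:.4f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

The Bonferroni step is deliberately conservative: each p-value is scaled by the number of comparisons, so a result stays significant only if it clears the threshold after that scaling.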

      A very minor point. Discussion, line 210: 'In this study...' is vague, which study exactly? It is preferable to be more precise, e.g. 'In the present/current study...'. 

      Fixed.

      Discussion. One point you may want to add is the possibility of looking at other skin regions. For example, would this approach work on the palm, on border glabrous/hairy skin, on various hairy skin sites, and on the foot? The possibilities could be endless if it could be applied anywhere, but it may depend on the technical positioning and skin itself. However, it would be interesting to know. 

      We have added the following text at the end of the Discussion section:

      “Finally, while we focused on the fingertip only, many other skin regions present interesting mechanical challenges waiting to be explored. The general ridged structure observed on the fingertip is common to all glabrous skin, but the local ridge mechanics might still differ: glabrous skin on the foot sole exhibits some morphological differences in order to support large weights that might well influence its mechanical response (Boyle et al., 2019). For example, the morphology of transverse ridges (running orthogonal to and connecting limiting with intermediate ridges) differs across regions on the foot sole (Nagashima and Tsuchida, 2011) and very likely from the hand (Yamada et al., 1996). Our method should be directly applicable to study deformations of these ridges, though three-dimensional observations might be needed to resolve some of the open questions. Hairy skin in contrast differs from glabrous skin in that the stratum corneum is much thinner. It also lacks the clearly organised ridge structure, but exhibits more loosely oriented skin folds instead, which very likely also serve a mechanical function (Leyva-Mendivil et al., 2015) and in principle are amenable to study using OCT.”

      In the last lines of the discussion, you mention the possible effects of skin moisturization. The Tomlinson et al. paper refers to the hydration of the skin with regard to water, which I would say is a slightly different factor. I think you can mention this paper and talk about the water level of the skin/hydration, but also add specifically that moisturization (i.e. by an emollient, humectant, or occlusive substance) is another factor to consider (e.g. effects found by Dione et al, 2023 Sci Rep). Overall, these two points relate to the dryness of the skin and the humidity of surfaces being contacted, therefore you could expand on both. 

      Thank you for the correction! We now mention both skin hydration and moisturization separately in this section.

  3. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. [m1] Anya Kamenetz. Facebook's own data is not as conclusive as you think about teens and mental health. NPR, October 2021. URL: https://www.npr.org/2021/10/06/1043138622/facebook-instagram-teens-mental-health (visited on 2023-12-08).
       [m2] Anya Kamenetz. Selfies, Filters, and Snapchat Dysmorphia: How Photo-Editing Harms Body Image. Psychology Today, February 2020. URL: https://www.psychologytoday.com/us/articles/202002/selfies-filters-and-snapchat-dysmorphia-how-photo-editing-harms-body-image (visited on 2023-12-08).
       [m3] Terry Gross. Director Bo Burnham On Growing Up With Anxiety — And An Audience. NPR, July 2018. URL: https://www.npr.org/2018/07/18/630069876/director-bo-burnham-on-growing-up-with-anxiety-and-an-audience (visited on 2023-12-08).
       [m4] Sarah McQuate. 'I don't even remember what I read': People enter a 'dissociative state' when using social media. ScienceDaily, May 2022. URL: https://www.sciencedaily.com/releases/2022/05/220523135018.htm (visited on 2023-12-08).
       [m5] Robinson Meyer. Everything We Know About Facebook’s Secret Mood-Manipulation Experiment. The Atlantic, June 2014. URL: https://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/ (visited on 2023-12-08).
       [m6] Digital detox. November 2023. Page Version ID: 1187412856. URL: https://en.wikipedia.org/w/index.php?title=Digital_detox&oldid=1187412856 (visited on 2023-12-08).
       [m7] Lauren Collee. The Great Offline. Real Life, December 2021. URL: https://reallifemag.com/the-great-offline/ (visited on 2023-12-08).
       [m8] Merriam-Webster. On ‘Doomsurfing’ and ‘Doomscrolling’. 2023. URL: https://www.merriam-webster.com/wordplay/doomsurfing-doomscrolling-words-were-watching (visited on 2023-12-08).
       [m9] Ethan Jacobs [@ethanjacobslaw]. OK doomscrolling is bad but have you SEEN the quality of the doom this week? January 2021. URL: https://twitter.com/ethanjacobslaw/status/1347434641540538368 (visited on 2023-12-08).
       [m10] 24-hour news cycle. November 2023. Page Version ID: 1184581615. URL: https://en.wikipedia.org/w/index.php?title=24-hour_news_cycle&oldid=1184581615 (visited on 2023-12-08).
       [m11] Trauma Dumping. August 2021. URL: https://knowyourmeme.com/memes/trauma-dumping (visited on 2023-12-08).
       [m12] Pamela B. Rutledge. How to Overcome Social Media Trauma Dumping. Psychology Today, September 2021. URL: https://www.psychologytoday.com/us/blog/positively-media/202109/how-overcome-social-media-trauma-dumping (visited on 2023-12-08).
       [m13] Factitious disorder imposed on self. November 2023. Page Version ID: 1184183450. URL: https://en.wikipedia.org/w/index.php?title=Factitious_disorder_imposed_on_self&oldid=1184183450 (visited on 2023-12-08).
       [m14] Róisín Lanigan. The Internet Has a Cancer-Faking Problem. The Atlantic, May 2019. URL: https://www.theatlantic.com/health/archive/2019/05/faking-cancer-online/588334/ (visited on 2023-12-08).
       [m15] Jules Montague. Münchausen by internet: the sickness bloggers who fake it online. The Guardian, April 2015. URL: https://www.theguardian.com/society/2015/apr/29/jules-gibson-munchausen-by-internet-sickness-bloggers-fake-it-whole-pantry (visited on 2023-12-08).
       [m16] What is self-harm? URL: https://www.mind.org.uk/information-support/types-of-mental-health-problems/self-harm/about-self-harm/ (visited on 2023-12-08).
       [m17] Juli Fraga. When Teens Cyberbully Themselves. NPR, April 2018. URL: https://www.npr.org/sections/health-shots/2018/04/21/604073315/when-teens-cyberbully-themselves (visited on 2023-12-08).
       [m18] ContraPoints. Contrapoints. URL: https://www.youtube.com/c/ContraPoints (visited on 2023-12-08).
       [m19] Incel. December 2023. Page Version ID: 1188569777. URL: https://en.wikipedia.org/w/index.php?title=Incel&oldid=1188569777 (visited on 2023-12-08).
       [m20] Chad. March 2012. URL: https://knowyourmeme.com/memes/chad (visited on 2023-12-08).
       [m21] Incel. December 2023. Page Version ID: 1188569777. URL: https://en.wikipedia.org/w/index.php?title=Incel&oldid=1188569777#Mass_murders_and_violence (visited on 2023-12-08).
       [m22] Rhitu Chatterjee. The new 988 mental health hotline is live. Here's what to know. NPR, July 2022. URL: https://www.npr.org/sections/health-shots/2022/07/15/1111316589/988-suicide-hotline-number (visited on 2023-12-08).
       [m23] Amanda Baughan. Make Peace with Social Media. Medium, May 2022. URL: https://amandabaughan.medium.com/make-peace-with-social-media-113877582006 (visited on 2023-12-08).
       [m24] Yim Register. Yim Register. URL: http://students.washington.edu/yreg/ (visited on 2023-12-08).
       [m25] MLEducation and YimRegister. Art/socialmediatips at main MLEducation/Art. 2021. URL: MLEducation/Art (visited on 2023-12-08).
       [m26] Casey Fiesler. What I Learned About the Internet From The Baby-Sitters Club. Slate, February 2017. URL: https://slate.com/technology/2017/02/what-i-learned-about-the-internet-from-the-baby-sitters-club.html (visited on 2023-12-08).
       [m27] Emily St. James. Trans Twitter and the beauty of online anonymity. Vox, September 2020. URL: https://www.vox.com/culture/21432987/trans-twitter-reddit-online-anonymity (visited on 2023-12-08).
       [m28] Jen Tribbet. Social Media Has Become A Place To Talk About Mental Illness. But Is That Helpful? NPR, November 2019. URL: https://www.npr.org/2019/11/13/779015105/social-media-has-become-a-place-to-talk-about-mental-illness-but-is-that-helpful (visited on 2023-12-08).
       [m29] Raisedbynarcissists: for the children of abusive parents. 2023. URL: https://www.reddit.com/r/raisedbynarcissists/?rdt=50656 (visited on 2023-12-08).
       [m30] Benjamin Goggin. Inside Facebook's suicide algorithm: Here's how the company uses artificial intelligence to predict your mental state from your posts. Business Insider, January 2019. URL: https://www.businessinsider.com/facebook-is-using-ai-to-try-to-predict-if-youre-suicidal-2018-12 (visited on 2023-12-08).
       [m31] Unalive. March 2022. URL: https://knowyourmeme.com/memes/unalive (visited on 2023-12-08).
       [m32] Christina Farr. Apple and UCLA kick off a three-year depression study. CNBC, August 2020. URL: https://www.cnbc.com/2020/08/04/apple-ucla-to-study-depression.html (visited on 2023-12-08).
       [m33] Kate Crawford. Time to regulate AI that interprets human emotions. Nature, 592(7853):167–167, April 2021. URL: https://www.nature.com/articles/d41586-021-00868-5 (visited on 2023-12-08), doi:10.1038/d41586-021-00868-5.
       [m34] Cheryl Teh. 'Every smile you fake' — an AI emotion-recognition system can assess how 'happy' China's workers are in the office. Insider, June 2021. URL: https://www.insider.com/ai-emotion-recognition-system-tracks-how-happy-chinas-workers-are-2021-6 (visited on 2023-12-08).
       [m35] C. L. Lynch. Invisible Abuse: ABA and the things only autistic people can see. NeuroClastic, March 2019. URL: https://neuroclastic.com/invisible-abuse-aba-and-the-things-only-autistic-people-can-see/ (visited on 2023-12-08).
       [m36] Gary Shkedy, Dalia Shkedy, and Aileen H. Sandoval-Norton. Long-term ABA Therapy Is Abusive: A Response to Gorycki, Ruppel, and Zane. Adv Neurodev Disord, 5(2):126–134, June 2021. URL: https://doi.org/10.1007/s41252-021-00201-1 (visited on 2023-12-08), doi:10.1007/s41252-021-00201-1.
       [m37] Neurodiversity. November 2023. Page Version ID: 1187185735. URL: https://en.wikipedia.org/w/index.php?title=Neurodiversity&oldid=1187185735 (visited on 2023-12-08).
       [m38] C. L. Lynch. “Autism is a Spectrum” Doesn’t Mean What You Think. NeuroClastic, May 2019. URL: https://neuroclastic.com/its-a-spectrum-doesnt-mean-what-you-think/ (visited on 2023-12-08).
       [m39] Alannah Oleson. Beyond “Average” Users: Building Inclusive Design Skills with the CIDER Technique. Bits and Behavior, October 2022. URL: https://medium.com/bits-and-behavior/beyond-average-users-building-inclusive-design-skills-with-the-cider-technique-413969544e6d (visited on 2023-12-08).

      I found [m23] “Make Peace with Social Media” by Amanda Baughan (2022) really interesting because it challenges the idea that social media is automatically bad for mental health. Instead of calling it an addiction, Baughan suggests treating it more like a relationship — one that you can manage, improve, and set boundaries for. I think this approach is a lot healthier than the “digital detox” mindset, which feels unrealistic for people who rely on social media for community or work.

      Her perspective connects to the “Healing your social media” section in the chapter, especially the idea of replacing “I should” with “I enjoy.” It made me realize that guilt-based thinking about screen time doesn’t help — awareness and intention do. Personally, this made me reflect on how I use social media to learn and connect with people who share my goals, rather than just scroll out of habit.

    1. Although we may think we have an understanding of what personality is, professional psychologists always seek to move beyond what people think they know in order to determine what is actually real or at least as close to real as we can come.

      I would argue you can never really know anyone 😵‍💫

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This study provides a valuable contribution to understanding how negative affect influences food-choice decision making in bulimia nervosa, using a mechanistic approach with a drift diffusion model (DDM) to examine the weighting of tastiness and healthiness attributes. The solid evidence is supported by a robust crossover design and rigorous statistical methods, although concerns about low trial counts, possible overfitting, and the absence of temporally aligned binge-eating measures limit the strength of causal claims. Addressing modeling transparency, sample size limitations, and the specificity of mood induction effects would enhance the study's impact and generalizability to broader populations.

      We thank the Editor and Reviewers for their summary of the strengths of our study, and for their thoughtful review and feedback on our manuscript. We apologize for the confusion in how we described the multiple steps performed to ensure that the hierarchical model reported in the main text was the best fit for the data but was not overfitted. Regarding “model transparency,” as described in our response to Reviewer 1 below, we have now more clearly explained (with references) that the use of hierarchical estimation procedures allows for information sharing across participants, which improves the reliability and stability of parameter estimates—even when the number of trials per individual is small. We have clarified for the less familiar reader how our Bayesian model selection criterion penalizes models with more parameters (e.g., more complex models).

      Details about model diagnostics, recoverability, and posterior predictive checks are all provided in the Supplementary Materials. We have clarified how these steps ensure that the parameters we estimate are identifiable and interpretable, while confirming that the model can reproduce key patterns in the data, ultimately supporting the validity of the winning model. Additionally, we have provided all scripts for estimating the models by linking to our public Github repository. Furthermore, we have edited language throughout to eliminate any implication of causal claims and acknowledged the limitation of the small sample size. Given these efforts, we are concerned that the current wording about “modeling transparency” in the public eLife Assessment may inadvertently misrepresent the modeling practices in our paper. Would it be possible to revise or remove that particular phrase to better reflect the steps we have taken? We believe this would help avoid confusion for readers.

      We have also taken additional steps to ensure that we have used “appropriate and validated methodology in line with current state-of-the-art," and we have added references to recent papers supporting our approaches.

      All changes in the revised text are marked in blue.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Using a computational modeling approach based on the drift diffusion model (DDM) introduced by Ratcliff and McKoon in 2008, the article by Shevlin and colleagues investigates whether there are differences between neutral and negative emotional states in:

      (1) The timings of the integration in food choices of the perceived healthiness and tastiness of food options between individuals with bulimia nervosa (BN) and healthy participants.

      (2) The weighting of the perceived healthiness and tastiness of these options.

      Strengths:

      By looking at the mechanistic part of the decision process, the approach has the potential to improve the understanding of pathological food choices. The article is based on secondary research data.

      Weaknesses:

      I have two major concerns and a major improvement point.

      The major concerns deal with the reliability of the results of the DDM (first two sections of the Results, pages 6 and 7), which are central to the manuscript, and the consistency of the results with regards to the identification of mechanisms related to binge eating in BN patients (i.e. last section of the results, page 7).

      (1) Ratcliff and McKoon in 2008 used tasks involving around 1000 trials per participant. The Chen et al. experiment the authors refer to involves around 400 trials per participant. On the other hand, Shevlin and colleagues ask each participant to make two sets of 42 choices with two times fewer participants than in the Chen et al. experiment. Shevlin and colleagues also fit a DDM with additional parameters (e.g. a drift rate that varies according to subjective rating of the options) as compared to the initial version of Ratcliff and McKoon. With regards to the number of parameters estimated in the DDM within each group of participants and each emotional condition, the 5- to 10-fold ratio in the number of trials between the Shevlin and colleagues' experiment and the experiments they refer to (Ratcliff and McKoon, 2008; Chen et al. 2022) raises serious concerns about a potential overfitting of the data by the DDM. This point is not highlighted in the Discussion. Robustness and sensitivity analyses are critical in this case.

      We thank the Reviewer for their thoughtful critique. We agree that a limited number of trials can impede reliable estimation, which we acknowledge in the Discussion section. However, we used a hierarchical estimation approach which leverages group information to constrain individual-level estimates. This use of group-level parameters to inform individual-level estimates reduces overfitting and noise that can arise when trial counts are low, and the regularization inherent in hierarchical fitting prevents extreme parameter estimates that could arise from noisy or limited data (Rouder & Lu, 2005). As a result, hierarchical estimation has been repeatedly shown to work well in settings with low trial counts, including as few as 40 trials per condition (Lerche et al., 2017; Ratcliff & Childers, 2015; Wiecki et al., 2013). In addition, previous applications of the time-varying DDM to food choice task data have included experiments with as few as 60 trials per condition (Maier et al., 2020). We have added references to these more recent approaches and specifically note their advantages for the modeling of tasks with fewer trials. Finally, our successful parameter recovery described in the Supplementary Materials supports the robustness of the estimation procedure and the reliability of our results.
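
      The shrinkage idea described above can be illustrated with a deliberately simplified sketch. It assumes a normal-normal model with known variances, which is far simpler than a hierarchical Bayesian DDM; the function name `shrink` and all numeric values are hypothetical and chosen only for illustration. The point is that, with few trials, an individual estimate is pulled more strongly toward the group mean.

```python
def shrink(individual_mean, n_trials, trial_var, group_mean, group_var):
    """Posterior mean of one participant's parameter under a normal-normal model.

    Assumption (illustrative only): trial-level noise variance and group-level
    variance are known. Real hierarchical DDM fits estimate these jointly.
    """
    data_precision = n_trials / trial_var      # information from this participant
    prior_precision = 1.0 / group_var          # information shared by the group
    weight = data_precision / (data_precision + prior_precision)
    estimate = weight * individual_mean + (1.0 - weight) * group_mean
    return estimate, weight


# Hypothetical numbers: the same noisy individual mean, with 42 vs. 1000 trials.
few_est, few_w = shrink(2.0, 42, 4.0, 1.0, 0.25)
many_est, many_w = shrink(2.0, 1000, 4.0, 1.0, 0.25)
print(few_w, many_w)      # weight on the individual's own data grows with trials
print(few_est, many_est)  # the low-trial estimate sits closer to the group mean
```

      In this toy setting the regularization is explicit: the fewer the trials, the smaller the weight on the participant's own data, so extreme individual estimates are damped rather than fitted.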

      The authors compare different DDMs to show that the DDM they used to report statistical results in the main text is the best according to the WAIC criterion. This may be viewed as a robustness analysis. However, the other DDM models (i.e. M0, M1, M2 in the supplementary materials) they used to make the comparison have fewer parameters to estimate than the one they used in the main text. Fits are usually expected to follow the rule that the more parameters there are to estimate in a model, the better it fits the data. Additionally, a quick plot of the data in supplementary table S12 (i.e. WAIC as a function of the number of parameters varying by food type in the model - i.e. 0 for M0, 2 for M1, 1 for M2 and 3 for M3) suggests that models M1 and potentially M2 may also be suitable: there is a break in the improvement of WAIC between model M0 and the three other models. I would thus suggest checking how the results reported in the main text differ when using models M1 and M2 instead of M3 (for the taste and health weights when comparing M3 with M1, for τS when comparing M3 with M2). If the differences are important, the results currently reported in the main text are not very reliable.

      We thank the Reviewer for highlighting that it would be helpful to explicitly note that we specifically selected WAIC as one of two methods to assess model fit because it penalizes for model complexity. We now explicitly state that, in addition to being more robust than other metrics like AIC or BIC when comparing hierarchical Bayesian models like those in the current study, model fit metrics like WAIC penalize for model complexity based on the number of parameters (Watanabe, 2010). Therefore, more complex models (i.e., those with more parameters) do not automatically have lower WAIC. Additionally, we now more clearly note that our second method to assess model fit, posterior predictive checks, demonstrates that only model M3 can reproduce key behavioral patterns present in the empirical data. As described in the Supplementary Materials, M1 and M2 miss key patterns in the data. In summary, we used best practices to assess model fit and reliability (Wilson & Collins, 2019): results from the WAIC comparison (which penalizes models with more parameters) and results from posterior predictive checks align in showing that M3 provided the best fit to our data. We have added a sentence to the manuscript to state this explicitly.
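
      To make the complexity penalty concrete, here is a minimal sketch of how WAIC is commonly computed from pointwise log-likelihoods evaluated at posterior draws. It is not the authors' code: the function name `waic` and the input layout are assumptions. The fit term (lppd) rewards predictive accuracy, while the penalty term sums the posterior variance of each observation's log-likelihood, which tends to grow as a model becomes more flexible.

```python
import math
from statistics import variance


def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik[s][i] is the log-likelihood of observation i under posterior
    draw s (hypothetical layout; at least two draws are required).
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density (fit)
    p_waic = 0.0  # effective number of parameters (complexity penalty)
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        m = max(column)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(v - m) for v in column) / n_draws)
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Toy input: two posterior draws, two observations.
score, fit, penalty = waic([[-1.0, -2.0], [-1.2, -1.8]])
print(score, fit, penalty)
```

      Because the penalty is subtracted from the fit term before scaling, a model with more parameters obtains a lower WAIC only if its gain in predictive fit outweighs the added penalty.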

      (2) The second main concern deals with the association reported between the DDM parameters and binge eating episodes (i.e. last paragraph of the results section, page 7). The authors claim that the DDM parameters "predict" binge eating episodes (in the Abstract among other places) while the binge eating frequency does not seem to have been collected prospectively. Besides this methodological issue, the interpretation of this association is exaggerated: during the task, BN patients did not make binge-related food choices in the negative emotional state. Therefore, it is impossible to draw clear conclusions about binge eating, as other explanations seem equally plausible. For example, the results the authors report with the DDM may be a marker of a strategy of the patients to cope with food tastiness in order to make restrictive-like food choices. A comparison of the authors' results with restrictive AN patients would be of interest. Moreover, correlating results of a nearly instantaneous behavior (i.e. a couple of minutes to perform the task with the 42 food choices) with an observation made over several months (i.e. binge eating frequency collected over three months) is questionable: the negative emotional state of patients varies across the day without systematically leading patients to engage in a binge eating episode in such states.

      I would suggest in such an experiment to collect the binge craving elicited by each food and the overall binge craving of patients immediately before and after the task. Correlating the DDM results with these ratings would provide more compelling results. Without these data, I would suggest removing the last paragraph of the Results.

      We thank the Reviewer for these interesting and important suggestions, and we agree that claims about causal connections between our decision parameters and symptom severity metrics would be inappropriate. Per the Reviewer’s suggestions, we have eliminated the use of the word “predict” to describe the tested association with symptom metrics. We also agree that more time-locked associations with craving ratings and near-instantaneous behavior would be useful, and we have added this as an important direction for future research in the discussion. However, associating task-based behavior with validated self-report measures that assess symptom severity over long periods of time that precede the task visit (e.g., over the past 2 weeks in depression, over the past month in eating disorders) is common practice in computational psychiatry, psychiatric neuroimaging, and clinical cognitive neuroscience (Hauser et al., 2022; Huys et al., 2021; Wise et al., 2023), and this approach has been used several times specifically with food choice tasks (Dalton et al., 2020; Steinglass et al., 2015). We have revised the language throughout the manuscript to clarify: the results suggest that individuals whose task behavior is more reactive to negative affect tend to be the most symptomatic, but the results do not allow us to determine whether this reactivity causes the symptoms.

      In response to this Reviewer’s important point about negative affect not always producing loss-of-control eating in individuals with BN, we now explicitly note that while several studies employing ecological momentary assessments (EMA) have repeatedly shown that increases in negative affect significantly increase the likelihood of subsequent loss-of-control eating (Alpers & Tuschen-Caffier, 2001; Berg et al., 2013; Haedt-Matt & Keel, 2011; Hilbert & Tuschen-Caffier, 2007; Smyth et al., 2007), not all loss-of-control eating occurs in the context of negative affect. We further note that future studies should integrate food choice task data collected before and after affect inductions with measures capturing the specific frequency of loss-of-control eating episodes that occur during states of high negative affect.

      (3) My major improvement point is to tone down as much as possible any claim of a link with binge eating across the entire manuscript and to focus more on the restrictive behavior of BN patients in between binge eating episodes (see my second major concern about the methods). Additionally, since this article is a secondary research paper and since some of the authors have already used the task with AN patients, if possible I would run the same analyses with AN patients to test whether there are differences between AN (provided they were of the restrictive subtype) and BN.

      We appreciate the Reviewer’s very helpful suggestions. We have adjusted our language linking loss-of-control eating frequency with decision parameters, and we have added sentences focusing on the implications for the restrictive behavior of patients with BN between binge eating episodes. In the Supplementary Materials, we have added an analysis of the restraint subscale of the EDE-Q and confirmed no relationship with parameters of interest. While we agree additional analyses with AN patients would be of interest, this is outside the scope of the paper. Our team has collected data from individuals with AN using this task, but not with any affect induction or measure of affect. Therefore, we have added this important direction for future research to the discussion.

      Reviewer #2 (Public review):

      Summary:

      Binge eating is often preceded by heightened negative affect, but the specific processes underlying this link are not well understood. The purpose of this manuscript was to examine whether affect state (neutral or negative mood) impacts food choice decision-making processes that may increase the likelihood of binge eating in individuals with bulimia nervosa (BN). The researchers used a randomized crossover design in women with BN (n=25) and controls (n=21), in which participants underwent a negative or neutral mood induction prior to completing a food-choice task. The researchers found that despite no differences in food choices in the negative and neutral conditions, women with BN demonstrated a stronger bias toward considering the 'tastiness' before the 'healthiness' of the food after the negative mood induction.

      Strengths:

      The topic is important and clinically relevant and methods are sound. The use of computational modeling to understand nuances in decision-making processes and how that might relate to eating disorder symptom severity is a strength of the study.

      Weaknesses:

      The sample size was relatively small and may have been underpowered to find differences in outcomes (i.e., food choice behaviors). Participants were all women with BN, which limits the generalizability of findings to the larger population of individuals who engage in binge eating. It is likely that the negative affect manipulation was weak and may not have been potent enough to change behavior. Moreover, it is unclear how long the negative affect persisted during the actual task. It is possible that any increases in negative affect would have dissipated by the time participants were engaged in the decision-making task.

      We thank the Reviewer for their comments on the strengths of the paper, and for highlighting these important considerations regarding the sample demographics and the negative affect induction. As in the original paper that focused only on ultimate food choice behaviors, we now specifically acknowledge that the study was only powered to detect small to medium group differences in the effect of negative emotion on these final choice behaviors.

      Regarding the sample demographics, we agree that the study’s inclusion of only female participants is a limitation. Although the original decision for this sampling strategy was informed by data suggesting that bulimia nervosa is roughly six times more prevalent among females than males (Udo & Grilo, 2018), we now note in the discussion that our female-only sample limits the generalizability of the findings.

      We also agree with the Reviewer’s noted limitations of the negative mood induction, and based on the Reviewer’s suggestions, we have expanded our original description of these limitations in the Discussion. Specifically, we now note that although the task was completed immediately after the affect induction, the study did not include intermittent mood assessments throughout the choice task, so it is unclear how long the negative affect persisted during the actual task.

      Reviewer #3 (Public review):

      Summary:

      The study uses the food choice task, a well-established method in eating disorder research, particularly in anorexia nervosa. However, it introduces a novel analytical approach - the diffusion decision model - to deconstruct food choices and assess the influence of negative affect on how and when tastiness and healthiness are considered in decision-making among individuals with bulimia nervosa and healthy controls.

      Strengths:

      The introduction provides a comprehensive review of the literature, and the study design appears robust. It incorporates separate sessions for neutral and negative affect conditions and counterbalances tastiness and healthiness ratings. The statistical methods are rigorous, employing multiple testing corrections.

      A key finding - that negative affect induction biases individuals with bulimia nervosa toward prioritizing tastiness over healthiness - offers an intriguing perspective on how negative affect may drive binge eating behaviors.

      Weaknesses:

      A notable limitation is the absence of a sample size calculation, which, combined with the relatively small sample, may have contributed to null findings. Additionally, while the affect induction method is validated, it is less effective than alternatives such as image or film-based stimuli (Dana et al., 2020), potentially influencing the results.

      We agree that the limited sample size and specific affect induction method may have contributed to the null model-agnostic behavioral findings. Based on this Reviewer’s and Reviewer 2’s comments, we have added these factors to our acknowledgements of limitations in the discussion.

      Another concern is the lack of clarity regarding which specific negative emotions were elicited. This is crucial, as research suggests that certain emotions, such as guilt, are more strongly linked to binge eating than others. Furthermore, recent studies indicate that negative affect can lead to both restriction and binge eating, depending on factors like negative urgency and craving (Leenaerts et al., 2023; Wonderlich et al., 2024). The study does not address this, though it could explain why, despite the observed bias toward tastiness, negative affect did not significantly impact food choices.

      We thank the Reviewer for raising these important points and possibilities. In the Supplementary Materials, we have added an additional analysis of the specific POMS subscales that comprise the total negative affect calculation that was reported in the original paper (Gianini et al., 2019). We also report total negative affect scores from the POMS in the main text. Ultimately, we found that, across both groups, the negative affect induction increased responses related to anger, confusion, depression, and tension while reducing vigor.

      We agree with the Reviewer that factors like negative urgency and cravings are relevant here. The study did not collect any measures of craving, and in response to Reviewer 1 and this Reviewer, we now note in the discussion that replication studies including momentary craving assessments will be important. While we do not have any measurements of cravings, we did measure negative urgency. The original paper (Gianini et al., 2019) did not find that negative urgency was related to restrictive food choices. We have now repeated those analyses, and we also were unable to find any meaningful patterns related to negative urgency. Nonetheless, we have added an analysis of negative urgency scores and decision parameters to the Supplementary Materials.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please improve the description of the computational methods: the fit of the DDM, the difference between the models used in the DDM, and the difference between the DDM model and the models used in the linear mixed models (the word "model" ends up being confusing, as it may refer either to the DDM or to the statistical analysis of the DDM parameters).

      We thank the Reviewer for highlighting the unclear language. We have updated the main text to clarify when the term “model” refers to the DDM itself versus the regression models assessing DDM parameters. As described above, we have clarified that both tests of model fit (WAIC and posterior predictive checks) suggest that Model 3 was the best fit to the data. We have also clarified the differences between the tested models in the Supplementary Materials.

      Please avoid reporting estimates of main effects in statistical models when an interaction is included: the estimates of the main effects may be heavily biased by the interaction term (this can be checked by re-running the model without the interaction term).

      We sincerely appreciate the Reviewer’s comment regarding the interpretation of main effects in the presence of significant interaction terms. In the revised manuscript, we no longer discuss significant main effects and instead focus on interpreting the interaction terms.

      Additionally, to help unpack interaction effects, we now include exploratory simple effects analyses in the supplementary materials. Simple effects analyses allow us to examine the effects of one independent variable at specific values of other independent variables (Aiken et al., 1991; Brambor et al., 2006; Jaccard & Turrisi, 2003; Winer et al., 1991).
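
      As a hedged illustration of what a simple effects (simple slopes) analysis computes, consider a regression of the form y = b0 + b_x·X + b_z·Z + b_xz·X·Z. The effect of X at a chosen value z of Z is b_x + b_xz·z, and its standard error follows from the coefficient covariance matrix. The sketch below uses hypothetical coefficient values and a hypothetical function name; it is not taken from the authors' analysis scripts.

```python
import math


def simple_slope(b_x, b_xz, var_x, var_xz, cov_x_xz, z):
    """Effect of X at moderator value z, with its standard error.

    b_x, b_xz       : estimated coefficients for X and the X*Z interaction
    var_x, var_xz   : their sampling variances
    cov_x_xz        : their sampling covariance
    (All inputs are illustrative placeholders.)
    """
    slope = b_x + b_xz * z
    se = math.sqrt(var_x + (z ** 2) * var_xz + 2.0 * z * cov_x_xz)
    return slope, se


# With a dummy-coded moderator (0 = one group, 1 = the other), the
# coefficient b_x alone is only the effect of X in the group coded 0.
print(simple_slope(0.5, -0.3, 0.04, 0.09, -0.01, z=0.0))
print(simple_slope(0.5, -0.3, 0.04, 0.09, -0.01, z=1.0))
```

      This also shows why a lower-order coefficient in a model with an interaction should not be read as an overall main effect: it is the conditional effect at the moderator's reference value.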

      Supplementary tables S5 and S6 are excessive: there is no third-level interaction (supplementary tables S3 and S4) to justify a split between BN and healthy participants. Please perform a descending regression instead. Accordingly, the results reported in the second paragraph of page 7 should be entirely rewritten.

      We agree with the Reviewer’s suggestion that these tables are unnecessary. We have updated them to include details about simple effects analyses described above. We have revised the main text to reflect these changes.

      Words such as "predictive", which indicate a causal link, are used in several places in the manuscript, including the supplementary materials, while the experimental design does not allow such claims. This should be rephrased.

      We agree with the Reviewer that the term “predicted” in the main text improperly suggested a causal relationship between symptom severity and DDM parameters that our methods cannot evaluate. We have updated the main text with more appropriate language. However, our use of the term “predicted” in the Supplementary Materials refers to predicting the probability of a choice based on trial-level features, which is a standard use of the term in the computational cognitive modeling literature (Piray et al., 2019; Wilson & Collins, 2019; Zhang et al., 2020).

      The word "evaluated" appears twice in line 42 of the supplementary materials. Same with "in" at line 50.

      Thank you very much for highlighting this. We have removed the repeated words.

      Reviewer #2 (Recommendations for the authors):

      (1) I think it would be helpful if the authors noted in the Methods how long the food-choice task took. Prior research has suggested that in-lab mood inductions are very short-lasting (e.g., max 7 minutes) and it is likely that the task itself may have impacted the mood states of participants. Expanding on this in the Discussion/limitations seems important.

      The Reviewer raises an important point regarding the duration of our affect manipulation. Since we did not measure mood during or after the Food Choice Task, we cannot determine how long these effects persisted. We have added this limitation to the discussion section, noting that the absence of continuous affect measures following mood induction is a widespread limitation in the field.

      (2) Personally, I was a bit confused about what data the researchers were using to extrapolate information on whether or not participants were considering healthiness or tastiness. How was this operationalized? Is this an assumption being made based on how quickly someone chose a low-fat vs. high-fat food?

      We thank this Reviewer for highlighting that our models’ complexity warrants a more thorough explanation.

      Since we collected tastiness and healthiness attribute ratings during the first phase of the Food Choice Task, we can use those values to determine how these attribute values influence decision-making. Independently, foods were classified as low-fat or high-fat based on their objective properties (i.e., the percentage of calories from fat). However, the primary information we used to compute model parameters was participants’ attribute ratings, choices, and response times.

      In these models, the drift rate parameter captures the speed and direction of evidence accumulation. As the unsigned magnitude of the drift rate increases, the decision-maker is making up their mind more quickly. Once the evidence accumulates to a response boundary, the option associated with that boundary is selected. A positive drift rate means they are moving toward choosing one option (i.e., upper boundary), and a negative drift rate means they are moving toward choosing the other (i.e., lower boundary). In these decisions, decision-makers often consider multiple attributes, such as perceived healthiness and tastiness. Each of these attributes can influence the evidence accumulation process with different strengths, or weights.
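
      The accumulation process described above can be sketched as a small simulation. This is a minimal illustration, not the authors' model: it assumes symmetric boundaries at ±`boundary`, an unbiased starting point at zero, no non-decision time, and a drift that is a weighted sum of attribute differences between the two options. All names and parameter values are hypothetical.

```python
import math
import random


def weighted_drift(w_taste, w_health, taste_diff, health_diff):
    """Drift rate as a weighted sum of attribute differences (assumed form)."""
    return w_taste * taste_diff + w_health * health_diff


def simulate_trial(drift, boundary=1.0, noise_sd=1.0, dt=0.001,
                   max_time=5.0, rng=None):
    """Euler simulation of one diffusion trial.

    Returns (choice, decision_time); choice is "upper", "lower", or None
    if no boundary is reached before max_time.
    """
    rng = rng or random.Random()
    evidence, t = 0.0, 0.0
    while t < max_time:
        # Deterministic drift plus Gaussian noise scaled by sqrt(dt).
        evidence += drift * dt + noise_sd * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if evidence >= boundary:
            return "upper", t
        if evidence <= -boundary:
            return "lower", t
    return None, t


# Hypothetical trial: one option is rated tastier but less healthy.
v = weighted_drift(w_taste=1.0, w_health=0.5, taste_diff=2.0, health_diff=-1.0)
print(v, simulate_trial(v, rng=random.Random(0)))
```

      A larger unsigned drift reaches a boundary sooner on average, and the sign of the drift determines which boundary is favored, matching the verbal description.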

      In addition, decision-makers do not consider all attributes at the same time. Inspired by earlier work on multi-attribute decision-making (Maier et al., 2020; Sullivan & Huettel, 2021), our modeling approach computes a parameter (i.e., relative attribute onset) which captures the time delay between when each attribute starts influencing the evidence accumulation process. This parameter gives us a way to estimate when decision-makers are considering different attributes. It also bears on how much influence each attribute has: an attribute that starts late has less time to influence the decision. These models use a piecewise drift rate function to describe how evidence changes over time within a trial: sometimes the decision-maker only considers taste, sometimes only health, and other times both. Importantly, models with a relative attribute onset parameter can produce key behavioral patterns observed in mouse-tracking studies that models without this parameter are unable to replicate (Maier et al., 2020).
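
      A piecewise drift rate of the kind described here could be written as follows. This is a sketch under stated assumptions: the sign convention for `tau` (positive means health enters after taste, negative means taste enters after health) and the additive weighted form of the drift are choices made for illustration and may differ from the parameterization used in the manuscript.

```python
def piecewise_drift(t, tau, w_taste, w_health, taste_diff, health_diff):
    """Drift at time t (seconds since evidence accumulation began).

    tau > 0: health starts contributing tau seconds after taste.
    tau < 0: taste starts contributing |tau| seconds after health.
    tau = 0: both attributes contribute from the start.
    (Sign convention is an assumption for this sketch.)
    """
    taste_on = t >= max(0.0, -tau)
    health_on = t >= max(0.0, tau)
    drift = 0.0
    if taste_on:
        drift += w_taste * taste_diff
    if health_on:
        drift += w_health * health_diff
    return drift


# Hypothetical values: taste enters first, health joins 0.2 s later.
early = piecewise_drift(0.1, 0.2, 1.0, 0.5, 2.0, -1.0)  # taste only
late = piecewise_drift(0.3, 0.2, 1.0, 0.5, 2.0, -1.0)   # taste and health
print(early, late)
```

      Under this convention, a later onset leaves an attribute less time to move the accumulated evidence before a boundary is reached, which is how onset timing and attribute influence become linked.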

      In summary, the computational model describes decision-makers’ behaviors (what they would choose, and how fast they would choose) using different potential values of the drift weights and relative start time parameters. We then used Bayesian estimation methods to compare the model's predictions to the actual data. By examining how reaction times and choices change depending on the attribute values of the presented options, the model allows us to infer when each attribute is considered, and how strongly it influences the final choice.

      We have clarified this in the main text.

      Reviewer #3 (Recommendations for the authors):

      I wonder whether there were any measures concerning negative affect before and after the mood induction? This would make it clearer whether there was a significant change before and after. If different emotions were assessed, which emotion showed the strongest change?

      We thank the Reviewer for flagging this point. We realize that the main text did not make it clear that mood was assessed before and after the mood induction using the POMS (McNair et al., 1989). While these analyses were conducted and the results were reported in the original manuscript (Gianini et al., 2019), we now report them in the main text for completeness. Additionally, we added more details about how specific emotions changed by analyzing the subscales of the POMS in the Supplementary Materials. As mentioned above, we found that, across both groups, the negative affect induction increased responses related to anger, confusion, depression, and tension while reducing vigor.

      Thank you again for your consideration and for the reviewers’ comments and suggestions. We believe their incorporation has significantly strengthened the paper. In addition, thank you for the opportunity to publish our work in eLife. We look forward to hearing your response.

      References

      Aiken, L. S., West, S. G., & Reno, R. R. (1991). Multiple regression: Testing and interpreting interactions. Sage Publications, Inc.

      Alpers, G. W., & Tuschen-Caffier, B. (2001). Negative feelings and the desire to eat in bulimia nervosa. Eating Behaviors, 2(4), 339–352. https://doi.org/10.1016/S1471-0153(01)00040-X

      Berg, K. C., Crosby, R. D., Cao, L., Peterson, C. B., Engel, S. G., Mitchell, J. E., & Wonderlich, S. A. (2013). Facets of negative affect prior to and following binge-only, purge-only, and binge/purge events in women with bulimia nervosa. Journal of Abnormal Psychology, 122(1), 111–118. https://doi.org/10.1037/a0029703

      Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82. https://doi.org/10.1093/pan/mpi014

      Dalton, B., Foerde, K., Bartholdy, S., McClelland, J., Kekic, M., Grycuk, L., Campbell, I. C., Schmidt, U., & Steinglass, J. E. (2020). The effect of repetitive transcranial magnetic stimulation on food choice-related self-control in patients with severe, enduring anorexia nervosa. International Journal of Eating Disorders, 53(8), 1326–1336. https://doi.org/10.1002/eat.23267

      Gianini, L., Foerde, K., Walsh, B. T., Riegel, M., Broft, A., & Steinglass, J. E. (2019). Negative affect, dietary restriction, and food choice in bulimia nervosa. Eating Behaviors, 33, 49–54. https://doi.org/10.1016/j.eatbeh.2019.03.003

      Haedt-Matt, A. A., & Keel, P. K. (2011). Revisiting the affect regulation model of binge eating: A meta-analysis of studies using ecological momentary assessment. Psychological Bulletin, 137(4), 660–681. https://doi.org/10.1037/a0023660

      Hauser, T. U., Skvortsova, V., Choudhury, M. D., & Koutsouleris, N. (2022). The promise of a model-based psychiatry: Building computational models of mental ill health. The Lancet Digital Health, 4(11), e816–e828. https://doi.org/10.1016/S2589-7500(22)00152-2

      Hilbert, A., & Tuschen-Caffier, B. (2007). Maintenance of binge eating through negative mood: A naturalistic comparison of binge eating disorder and bulimia nervosa. International Journal of Eating Disorders, 40(6), 521–530. https://doi.org/10.1002/eat.20401

      Huys, Q. J. M., Browning, M., Paulus, M. P., & Frank, M. J. (2021). Advances in the computational understanding of mental illness. Neuropsychopharmacology, 46(1), 3–19. https://doi.org/10.1038/s41386-020-0746-4

      Jaccard, J., & Turrisi, R. (2003). Interaction effects in multiple regression (2nd ed.). Sage Publications, Inc.

      Lerche, V., Voss, A., & Nagler, M. (2017). How many trials are required for parameter estimation in diffusion modeling? A comparison of different optimization criteria. Behavior Research Methods, 49(2), 513–537. https://doi.org/10.3758/s13428-016-0740-2

      Maier, S. U., Raja Beharelle, A., Polanía, R., Ruff, C. C., & Hare, T. A. (2020). Dissociable mechanisms govern when and how strongly reward attributes affect decisions. Nature Human Behaviour, 4(9), Article 9. https://doi.org/10.1038/s41562-020-0893-y

      McNair, D., Lorr, M., & Droppleman, L. (1989). Profile of mood states (POMS).

      Piray, P., Dezfouli, A., Heskes, T., Frank, M. J., & Daw, N. D. (2019). Hierarchical Bayesian inference for concurrent model fitting and comparison for group studies. PLOS Computational Biology, 15(6), e1007043. https://doi.org/10.1371/journal.pcbi.1007043

      Ratcliff, R., & Childers, R. (2015). Individual differences and fitting methods for the two-choice diffusion model of decision making. Decision, 2(4), 237–279. https://doi.org/10.1037/dec0000030

      Rouder, J. N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12(4), 573–604. https://doi.org/10.3758/BF03196750

      Smyth, J. M., Wonderlich, S. A., Heron, K. E., Sliwinski, M. J., Crosby, R. D., Mitchell, J. E., & Engel, S. G. (2007). Daily and momentary mood and stress are associated with binge eating and vomiting in bulimia nervosa patients in the natural environment. Journal of Consulting and Clinical Psychology, 75(4), 629–638. https://doi.org/10.1037/0022-006X.75.4.629

      Steinglass, J., Foerde, K., Kostro, K., Shohamy, D., & Walsh, B. T. (2015). Restrictive food intake as a choice—A paradigm for study. International Journal of Eating Disorders, 48(1), 59–66. https://doi.org/10.1002/eat.22345

      Sullivan, N., & Huettel, S. A. (2021). Healthful choices depend on the latency and rate of information accumulation. Nature Human Behaviour, 5(12), Article 12. https://doi.org/10.1038/s41562-021-01154-0

      Udo, T., & Grilo, C. M. (2018). Prevalence and Correlates of DSM-5–Defined Eating Disorders in a Nationally Representative Sample of U.S. Adults. Biological Psychiatry, 84(5), 345–354. https://doi.org/10.1016/j.biopsych.2018.03.014

      Watanabe, S. (2010). Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, 11, 3571–3594.

      Wiecki, T. V., Sofer, I., & Frank, M. J. (2013). HDDM: Hierarchical Bayesian estimation of the drift-diffusion model in Python. Frontiers in Neuroinformatics, 7. https://doi.org/10.3389/fninf.2013.00014

      Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. https://doi.org/10.7554/eLife.49547

      Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed). McGraw-Hill.

      Wise, T., Robinson, O. J., & Gillan, C. M. (2023). Identifying Transdiagnostic Mechanisms in Mental Health Using Computational Factor Modeling. Biological Psychiatry, 93(8), 690–703. https://doi.org/10.1016/j.biopsych.2022.09.034

      Zhang, L., Lengersdorff, L., Mikus, N., Gläscher, J., & Lamm, C. (2020). Using reinforcement learning models in social neuroscience: Frameworks, pitfalls and suggestions of best practices. Social Cognitive and Affective Neuroscience, 15(6), 695–707. https://doi.org/10.1093/scan/nsaa089

    1. Reviewer #1 (Public review):

      This is a well-designed and very interesting study examining the impact of imprecise feedback on outcomes on decision-making. I think this is an important addition to the literature and the results here, which provide a computational account of several decision-making biases, are insightful and interesting.

      I do not believe I have substantive concerns related to the actual results presented; my concerns are more related to the framing of some of the work. My main concern is regarding the assertion that the results prove that non-normative and non-Bayesian learning is taking place. I agree with the authors that their results demonstrate that people will make decisions in ways that demonstrate deviations from what would be optimal for maximizing reward in their task under a strict application of Bayes rule. I also agree that they have built reinforcement learning models which do a good job of accounting for the observed behavior. However, the Bayesian models included are rather simple- per the author descriptions, applications of Bayes' rule with either fixed or learned credibility for the feedback agents. In contrast, several versions of the RL models are used, each modified to account for different possible biases. However more complex Bayes-based models exist, notably active inference but even the hierarchical gaussian filter. These formalisms are able to accommodate more complex behavior, such as affect and habits, which might make them more competitive with RL models. I think it is entirely fair to say that these results demonstrate deviations from an idealized and strict Bayesian context; however, the equivalence here of Bayesian and normative is I think misleading or at least requires better justification/explanation. This is because a great deal of work has been done to show that Bayes optimal models can generate behavior or other outcomes that are clearly not optimal to an observer within a given context (consider hallucinations for example) but which make sense in the context of how the model is constructed as well as the priors and desired states the model is given.

      As such, I would recommend that the language be adjusted to carefully define what is meant by normative and Bayesian and to recognize that work that is clearly Bayesian could potentially still be competitive with RL models if implemented to model this task. An even better approach would be to directly use one of these more complex modelling approaches, such as active inference, as the comparator to the RL models, though I would understand if the authors would want this to be a subject for future work.

      Abstract:

      The abstract is lacking in some detail about the experiments done, but this may be a limitation of the required word count? If word count is not an issue, I would recommend adding details of the experiments done and the results. One comment is that there is an appeal to normative learning patterns, but this suggests that learning patterns have a fixed optimal nature, which may not be true in cases where the purpose of the learning (e.g. to confirm the feeling of safety of being in an in-group) may not be about learning accurately to maximize reward. This can be accommodated in a Bayesian framework by modelling priors and desired outcomes. As such the central premise that biased learning is inherently non-normative or non-Bayesian I think would require more justification. This is true in the introduction as well.

      Introduction:

      As noted above, the conceptualization of Bayesian learning as equivalent to normative learning, I think, requires further justification. Bayesian belief updating can be biased and non-optimal from an observer perspective, while being optimal within the agent doing the updating if the priors/desired outcomes are set up to advantage these "non-optimal" modes of decision making.

      Results:

      I wonder why the agent was presented before the choice - since the agent is only relevant to the feedback after the choice is made. I wonder if that might have induced any false association between the agent identity and the choice itself. This is by no means a critical point but would be interesting to get the authors' thoughts.

      The finding that positive feedback increases learning is one that has been shown before and depends on valence, as the authors note. They expanded their reinforcement learning model to include valence; but they did not modify the Bayesian model in a similar manner. This lack of a valence or recency effect might also explain the failure of the Bayesian models in the preceding section where the contrast effect is discussed. It is not unreasonable to imagine that if humans do employ Bayesian reasoning that this reasoning system has had parameters tuned based on the real world, where recency of information does matter; affect has also been shown to be incorporable into Bayesian information processing (see the work by Hesp on affective charge and the large body of work by Ryan Smith). It may be that the Bayesian models chosen here require further complexity to capture the situation, just like some of the biases required updates to the RL models. This complexity, rather than being arbitrary, may be well justified by decision-making in the real world.

      The methods mention several symptom scales; it would be interesting to have the results of these and any interesting correlations noted. It is possible that some of the individual variability here could be related to these symptoms, which could introduce precision parameter changes in a Bayesian context and things like reward sensitivity changes in an RL context.

      Discussion:

      (For discussion, not a specific comment on this paper): One wonders also about participant beliefs about the experiment or the intent of the experimenters. I have often had participants tell me they were trying to "figure out" a task or find patterns even when this was not part of the experiment. This is not specific to this paper, but it may be relevant in the future to try and model participant beliefs about the experiment especially in the context of disinformation, when they might be primed to try and "figure things out".

      As a general comment, in the active inference literature, there has been discussion of state-dependent actions, or "habits", which are learned in order to help agents more rapidly make decisions, based on previous learning. It is also possible that what is being observed is that these habits are at play, and that they represent the cognitive biases. This is likely especially true given, as the authors note, the high cognitive load of the task. It is true that this would mean that full-force Bayesian inference is not being used in each trial, or in each experience an agent might have in the world, but this is likely adaptive on the longer timescale of things, considering resource requirements. I think in this case you could argue that we have a departure from "normative" learning, but that is not necessarily a departure from any possible Bayesian framework, since these biases could potentially be modified by the agent or eschewed in favor of more expensive full-on Bayesian learning when warranted. Indeed, in their discussion on the strategy of amplifying credible news sources to drown out low-credibility sources, the authors hint at the possibility of longer-term strategies that may produce optimal outcomes in some contexts, but which were not necessarily appropriate to this task. As such, the performance on this task, and the consideration of true departure from Bayesian processing, should be considered in this wider context. Another thing to consider is that Bayesian inference is occurring, but that priors present going in produce the biases, or these biases arise from another source, for example factoring in epistemic value over rewards when the actual reward is not large. This again would be covered under an active inference approach, depending on how the priors are tuned. Indeed, given the benefit of social cohesion in an evolutionary perspective, some of these "biases" may be the result of adaptation. For example, it might be better to amplify people's good qualities and minimize their bad qualities in order to make it easier to interact with them; this entails a cost (in this case, not adequately learning from feedback and potentially losing out sometimes), but may fulfill a greater imperative (improved cooperation on things that matter). Given the right priors/desired states, this could still be a Bayes-optimal inference at a social level and as such may be ingrained as a habit which requires effort to break at the individual level during a task such as this.

      The authors note that this task does not relate to "emotional engagement" or "deep, identity-related, issues". While I agree that this is likely mostly true, it is also possible that just being told one is being lied to might elicit an emotional response that could bias responses, even if this is a weak response.

      Comments on first revisions:

      In their updated version the authors have made some edits to address my concerns regarding the framing of the 'normative' Bayesian model, clarifying that they utilized a simple Bayesian model which is intended to adhere in an idealized manner to the intended task structure, though further simulations would have been ideal.

      The authors, however, did not take my recommendation to explore the symptoms in the symptom scales they collected as being a potential source of variability. They note that these were for hypothesis generation and were exploratory, fair enough, but this study is not small and there should have been sufficient sample size for a very reasonable analysis looking at symptom scores.

      However, overall the toned-down claims and clarifications of intent are adequate responses to my previous review.

      Comments on second revisions:

      While I believe an exploration of symptom scores would have been a valuable addition, this is not required for the purpose of the paper, and as such, I have no further comments.

    2. Author response:

      The following is the authors’ response to the previous reviews

      eLife Assessment

      This study provides an important extension of credibility-based learning research with a well-controlled paradigm by showing how feedback reliability can distort reward-learning biases in a disinformation-like bandit task. The strength of evidence is convincing for the core effects reported (greater learning from credible feedback; robust computational accounts, parameter recovery) but incomplete for the specific claims about heightened positivity bias at low credibility, which depend on a single dataset, metric choices (absolute vs relative), and potential perseveration or cueing confounds. Limitations concerning external validity and task-induced cognitive load, and the use of relatively simple Bayesian comparators, suggest that incorporating richer active-inference/HGF benchmarks and designs that dissociate positivity bias from choice history would further strengthen this paper.

      We thank the editors and reviewers for a careful assessment.

      In response, we have toned down our claims regarding heightened positivity biases, explicitly stating that the findings are equivocal and depend on the scale (i.e., metric) and study (whereas previously we stated our hypothesis was supported). We have also clarified which aspects of the findings extend beyond perseveration. We believe the evidence now presented provides convincing support for this more nuanced claim.

      We wish to emphasize that dissociating positivity bias from perseveration is a challenge not just for our work, but for the entire field of behavioral reinforcement learning. In fact, in a recent preprint (Learning asymmetry or perseveration? A critical re-evaluation and solution to a pervasive confound, Vidal-Perez et al., 2025; https://osf.io/preprints/psyarxiv/xdse5_v1) we argue that, to date, all studies claiming evidence for positivity bias beyond perseveration have suffered from flaws, and that there are currently no robust, behavioral, model-agnostic signatures that dissociate effects of positivity bias from perseveration. While this remains a limitation, we would stress that, relative to the state of the art in the field, our work goes beyond what has previously been reported. We believe this should also be reflected in the assessment of our work.

      We elaborate more on these issues in our responses to R3 below.

      Public Reviews:

      Reviewer #1 (Public review):

      Comments on revisions:

      In their updated version the authors have made some edits to address my concerns regarding the framing of the 'normative' Bayesian model, clarifying that they utilized a simple Bayesian model which is intended to adhere in an idealized manner to the intended task structure, though further simulations would have been ideal.

      The authors, however, did not take my recommendation to explore the symptoms in the symptom scales they collected as being a potential source of variability. They note that these were for hypothesis generation and were exploratory, fair enough, but this study is not small and there should have been sufficient sample size for a very reasonable analysis looking at symptom scores.

      However, overall the toned down claims and clarifications of intent are adequate responses to my previous review.

      We thank the reviewer. We remain convinced that testing targeted hypotheses with better-powered designs is the most effective way to examine how our findings relate to symptom scales, something we hope to pursue in future studies.

      Reviewer #2 (Public review):

      This important paper studies the problem of learning from feedback given by sources of varying credibility. The convincing combination of experiment and computational modeling helps to pin down properties of learning, while opening unresolved questions for future research.

      Summary:

      This paper studies the problem of learning from feedback given by sources of varying credibility. Two bandit-style experiments are conducted in which feedback is provided with uncertainty, but from known sources. Bayesian benchmarks are provided to assess normative facets of learning, and alternative credit assignment models are fit for comparison. Some aspects of normativity appear, in addition to possible deviations such as asymmetric updating from positive and negative outcomes.

      Strengths:

      The paper tackles an important topic, with a relatively clean cognitive perspective. The construction of the experiment enables the use of computational modeling. This helps to pinpoint quantitatively the properties of learning and formally evaluate their impact and importance. The analyses are generally sensible, and advanced parameter recovery analyses (including cross-fitting procedure) provide confidence in the model estimation and comparison. The authors have very thoroughly revised the paper in response to previous comments.

      Weaknesses:

      The authors acknowledge the potential for cognitive load and the interleaved task structure to play a meaningful role in the results, though leave this for future work. This is entirely reasonable, but remains a limitation in our ability to generalize the results. Broadly, some of the results obtain in cases where the extent of generalization is not always addressed and remains uncertain.

      We thank the reviewer once more for a thoughtful assessment of our work.

      Reviewer #3 (Public review):

      Summary

      This paper investigates how disinformation affects reward learning processes in the context of a two-armed bandit task, where feedback is provided by agents with varying reliability (with lying probability explicitly instructed). They find that people learn more from credible sources, but also deviate systematically from optimal Bayesian learning: They learned from uninformative random feedback, learned more from positive feedback, and updated too quickly from fully credible feedback (especially following low-credibility feedback). Overall, this study highlights how misinformation could distort basic reward learning processes, without appeal to higher-order social constructs like identity.

      Strengths

      • The experimental design is simple and well-controlled; in particular, it isolates basic learning processes by abstracting away from social context

      • Modeling and statistics meet or exceed standards of rigor

      • Limitations are acknowledged where appropriate, especially those regarding external validity

      • The comparison model, Bayes with biased credibility estimates, is strong; deviations are much more compelling than e.g. a purely optimal model

      • The conclusions are of substantial interest from both a theoretical and applied perspective

      Weaknesses

      The authors have addressed most of my concerns with the initial submission. However, in my view, evidence for the conclusion that less credible feedback yields a stronger positivity bias remains weak. This is due to two issues.

      Absolute or relative positivity bias?

      The conclusion of greater positivity bias for lower credible feedback (Fig 5) hinges on the specific way in which positivity bias is defined. Specifically, we only see the effect when normalizing the difference in sensitivity to positive vs. negative feedback by the sum. I appreciate that the authors present both and add the caveat whenever they mention the conclusion. However, without an argument that the relative definition is more appropriate, the fact of the matter is that the evidence is equivocal.

      We thank the reviewer for an insightful engagement with our manuscript. The reviewer’s comments on the subtle interplay between perseveration and learning asymmetries were so thought-provoking that they have inspired a new article that delves deeply into how gradual choice-perseveration can lead to spurious conclusions about learning asymmetries in Reinforcement Learning (Learning asymmetry or perseveration? A critical re-evaluation and solution to a pervasive confound, Vidal-Perez et al., 2025; https://osf.io/preprints/psyarxiv/xdse5_v1).

      To the point: we agree with the reviewer that the evidence for this hypothesis is equivocal, and we took on board the suggestion to tone down our interpretation of the findings. We now state explicitly, both in the Results section (“Positivity bias in learning and credibility”) and in the Discussion, that the results provide equivocal support for our hypothesis:

      RESULTS

      “However, we found evidence for agent-based modulation of positivity bias when this bias was measured in relative terms. Here we calculated, for each participant and agent, a relative Valence Bias Index (rVBI) as the difference between the Credit Assignment for positive feedback (CA+) and negative feedback (CA-), relative to the overall magnitude of CA (i.e., |CA+| + |CA-|) (Fig. 5c). Using a mixed effects model, we regressed rVBIs on their associated credibility (see Methods), revealing a relative positivity bias for all credibility levels [overall rVBI (b=0.32, F(1,609)=68.16), 50% credibility (b=0.39, t(609)=8.00), 75% credibility (b=0.41, F(1,609)=73.48) and 100% credibility (b=0.17, F(1,609)=12.62), all p’s<0.001]. Critically, the rVBI varied depending on the credibility of feedback (F(2,609)=14.83, p<0.001), such that the rVBI for the 3-star agent was lower than that for both the 1-star (b=-0.22, t(609)=-4.41, p<0.001) and 2-star agent (b=-0.24, F(1,609)=24.74, p<0.001). Feedback with 50% and 75% credibility yielded similar rVBI values (b=0.028, t(609)=0.56, p=0.57). Finally, a positivity bias could not stem from a Bayesian strategy as both Bayesian models predicted a negativity bias (Fig. 5b-c; Fig. S8; and SI 3.1.1.3 Table S11-S12, 3.2.1.1, and 3.2.1.2). Taken together, this provides equivocal support for our initial hypothesis, depending on the measurement scale used to assess the effect (absolute or relative).”

      “Previous research has suggested that positivity bias may spuriously arise from pure choice-perseveration (i.e., a tendency to repeat previous choices regardless of outcome) (49–51). While our models included a perseveration component, this control may not be perfect. Therefore, in additional control analyses, we generated (using ex-post simulations based on best fitting parameters) synthetic datasets using models including choice-perseveration but devoid of feedback-valence bias, and fitted them with our credibility-valence model (see SI 3.6.1). These analyses confirmed that a pure perseveration account can masquerade as an apparent positivity bias and even predict the qualitative pattern of results related to credibility (i.e., a higher relative positivity bias for low-credibility feedback). Critically, however, this account consistently predicted a reduced magnitude of credibility-effect on relative positivity bias as compared to the one we observed in participants, suggesting some of the relative amplification of positivity bias goes above and beyond a contribution from perseveration.”

      DISCUSSION

      “Previous reinforcement learning studies report greater credit-assignment based on positive compared to negative feedback, albeit only in the context of veridical feedback (43,44,63). Here, we investigated whether a positivity bias is amplified for information of low credibility, but our findings are equivocal and vary as a function of scaling (absolute or relative) and study. We observe selective absolute amplification of a positivity bias for information of low and intermediate credibility in the discovery study alone. In contrast, we find a relative (to the overall extent of CA) amplification of confirmation bias in both studies. Importantly, the magnitude of these amplification effects cannot be reproduced in ex-post simulations of a model incorporating simple choice perseveration without an explicit positivity bias, suggesting that at least part of the amplification reflects a genuine increase in positivity bias.”

      There is also a good reason to think that the absolute definition is more appropriate. As expected, participants learn more from credible feedback. Thus, normalizing by average learning (as in the relative definition) amounts to dividing the absolute difference by increasingly large numbers for more credible feedback. If there is a fixed absolute positivity bias (or something that looks like it), the relative bias will necessarily be lower for more credible feedback. In fact, the authors' own results demonstrate this phenomenon (see below). A reduction in relative bias thus provides weak evidence for the claim.

      We agree with the reviewer that absolute and relative measures can yield conflicting impressions. To some extent, this is precisely why we report both (i.e., if the two would necessarily agree, reporting both would be redundant). However, we are unconvinced that one measure is inherently more appropriate than the other. In our view, both are valid as long as they are interpreted carefully and in the right context. To illustrate, consider salary changes, which can be expressed on either an absolute or a relative scale. If Bob’s £100 salary increases to £120 and Alice’s £1000 salary increases to £1050, then Bob’s raise is absolutely smaller but relatively larger. Is one measure more appropriate than the other? Economists would argue not; rather, the choice of scale depends on the question at hand.

      In the same spirit, we have aimed to be as clear and transparent as possible in stating that 1) in the main study, there is no effect in the absolute sense, and 2) framing positivity bias in relative terms is akin to expressing it as a percentage change.
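
      The difference between the two scales can be made concrete with a small sketch. The rVBI function follows the definition quoted in the Results excerpt above, (CA+ - CA-) / (|CA+| + |CA-|); everything else is an assumption for illustration, including the function names and the CA values, which are invented and are not estimates from either study. The sketch holds the absolute bias fixed while overall credit assignment grows, which is the scenario raised in the reviewer's comment.

```python
def absolute_vbi(ca_pos, ca_neg):
    """Absolute valence bias: difference in credit assignment."""
    return ca_pos - ca_neg


def relative_vbi(ca_pos, ca_neg):
    """Relative valence bias index: difference scaled by overall CA magnitude."""
    denom = abs(ca_pos) + abs(ca_neg)
    if denom == 0:
        raise ValueError("rVBI is undefined when both CA values are zero")
    return (ca_pos - ca_neg) / denom


def relative_change(old, new):
    """Proportional change, used for the salary analogy."""
    return (new - old) / old


if __name__ == "__main__":
    fixed_absolute_bias = 0.2  # hypothetical, identical at every level
    for ca_neg in (0.1, 0.4, 0.9):  # hypothetical CA-, growing with credibility
        ca_pos = ca_neg + fixed_absolute_bias
        print(ca_neg, absolute_vbi(ca_pos, ca_neg), relative_vbi(ca_pos, ca_neg))

    # Salary analogy from the text: Bob 100 -> 120, Alice 1000 -> 1050.
    print(120 - 100, relative_change(100, 120))
    print(1050 - 1000, relative_change(1000, 1050))
```

      With these invented numbers the absolute bias is constant while the relative index shrinks as overall credit assignment grows, so a drop in the relative index alone cannot distinguish a shrinking bias from a growing denominator. The salary lines show the same dissociation: Bob's raise is smaller in absolute terms but larger in proportional terms.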

      It is interesting that the discovery study shows evidence of a drop in absolute bias. However, for me, this just raises questions. Why is there a difference? Was one just a fluke? If so, which one?

      We are unsure why we did not find an absolute amplification effect within the main studies. However, we do not think the results from the preliminary study were just a ‘fluke’. We have recently conducted two new studies (in preparation for publication), where we have been able to replicate the finding of increased positivity bias for lower-credibility sources in both absolute and relative terms. We agree that the current results leave unresolved questions, and we hope to follow up on these in the near future.

      Positivity bias or perseveration?

      Positivity bias and perseveration will both predict a stronger relationship between positive (vs. negative) feedback and future choice. They can thus be confused for each other when inferred from choice data. This potentially calls into question all the results on positivity bias.

      The authors clearly identify this concern in the text and go to considerable lengths to rule it out. However, the new results (in revision 1) show that a perseveration-only model can in fact account for the qualitative pattern in the human data (the CA parameters). This contradicts the current conclusion:

      Critically, however, these analyses also confirmed that perseveration cannot account for our main finding of increased positivity bias, relative to the overall extent of CA, for low-credibility feedback.

      Figure 24c shows that the credibility-CA model does in fact show stronger positivity bias for less credible feedback. The model distribution for credibility 1 is visibly lower than for credibilities 0.5 and 0.75.

      The authors need to be clear that it is the magnitude of the effect that the perseveration-only model cannot account for. Furthermore, they should additionally clarify that this is true only for models fit to data; it is possible that the credibility-CA model could capture the full size of the effect with different parameters (which could fit best if the model was implemented slightly differently).

      The authors could make the new analyses somewhat stronger by using parameters optimized to capture just the pattern in CA parameters (for example by MSE). This would show that the models are in principle incapable of capturing the effect. However, this would be a marginal improvement because the conclusion would still rest on a quantitative difference that depends on specific modeling assumptions.

      We thank the reviewer for raising this important point. We agree our original wording could have been more carefully formulated and are grateful for this opportunity to refine this. The reviewer is correct that a model with only perseveration can qualitatively reproduce the pattern of increased relative positivity bias for less credible feedback in the main study (but not in the discovery study), and our previous text did not acknowledge this. As stated in the previous section, we have revised the manuscript (in the Results, Discussion, and SI) to ensure we address this in full. Our revised text now makes it explicit that while a pure perseveration account predicts the qualitative pattern, it does not predict the magnitude of the effects we observe in our data.

      RESULTS

      “Previous research has suggested that positivity bias may spuriously arise from pure choice-perseveration (i.e., a tendency to repeat previous choices regardless of outcome) (49–51). While our models included a perseveration-component, we acknowledge this control is not perfect. Therefore, in additional control analyses, we generated (using ex-post simulations based on best fitting parameters) synthetic datasets using models including choice-perseveration, but devoid of feedback-valence bias, and fitted these with our credibility-valence model (see SI 3.6.1). These analyses confirmed that a pure perseveration account can masquerade as an apparent positivity bias, and even predict the qualitative pattern of results related to credibility (i.e., a higher relative positivity bias for low-credibility feedback). Critically, however, this account consistently predicted a reduced magnitude of credibility-effect on relative positivity bias as compared to the one we observed in participants, suggesting at least some of the relative amplification of positivity bias goes above and beyond contributions from perseveration.”

      DISCUSSION

      “Previous reinforcement learning studies report greater credit-assignment based on positive compared to negative feedback, albeit only in the context of veridical feedback (43,44,63). Here, we investigated whether a positivity bias is amplified for information of low credibility, but our findings on this matter were equivocal and varied as a function of scaling (absolute or relative) and study. We observe selective absolute amplification of the positivity bias for information of low and intermediate credibility in the discovery study only. In contrast, we find a relative (to the overall extent of CA) amplification of confirmation bias in both studies. Importantly, the magnitude of these amplification effects cannot be reproduced in ex-post simulations of a model incorporating simple choice perseveration without an explicit positivity bias, suggesting that at least part of the amplification reflects a genuine increase in positivity bias.”

      SI (3.6.1)

      “Interestingly, a pure perseveration account predicted an amplification of the relative positivity bias under low (compared to full) credibility (with the two rightmost histograms in Fig. S24d falling in the positive range). However, the magnitude of this effect was significantly smaller than the empirical effect (as the bulk of these same histograms lies below the green points). Moreover, this account predicted a negative amplification (i.e., attenuation) of an absolute positivity bias, which was again significantly smaller than the empirical effect (see corresponding histograms in S24b). This pattern raises an intriguing possibility that perseveration may, at least partially, mask a true amplification of absolute positivity bias.”

      Furthermore, our revisions now make it explicit that these analyses are based on ex-post simulations using the models' best-fitting parameters. We do not argue that this pattern cannot be captured by other parameters crafted specifically for that purpose. However, we believe that ex-post fitting is best practice for checking whether a model can produce an effect of interest (see for example The Importance of Falsification in Computational Cognitive Modeling, Palminteri et al., 2017; https://www.sciencedirect.com/science/article/pii/S1364661317300542?via%3Dihub). On this basis, we agree with the reviewer that the benefit of the suggested additional analyses would be minimal.

      New simulations clearly demonstrate the confound in relative bias

      Figure 24 also speaks to the relative vs. absolute question. The model without positivity bias shows a slightly stronger absolute "positivity bias" for the most credible feedback, but a weaker relative bias. This is exactly in line with the logic laid out above. In standard bandit tasks, perseveration can be quite well-captured by a fixed absolute positivity bias, which is roughly what we see in the simulations (I'm not sure what to make of the slight increase; perhaps a useful lead for the authors). However, when we divide by average credit assignment, we now see a reduction. This clearly demonstrates that a reduction in relative bias can emerge without any true differences in positivity bias.

      This relates back to the earlier point about scaling. However, we wish to clarify that this is not a confound in the usual sense, i.e., an external variable that varies systematically with the independent variable (credibility) and influences the dependent variable (positivity bias), thereby undermining causal inference. Rather, we consider it a scaling issue: measuring absolute versus relative changes in the same variable can yield conflicting impressions.
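
      The scaling point can be illustrated numerically. In the sketch below, the credit-assignment (CA) values are made-up numbers chosen only for illustration, not estimates from either study, and the relative index is assumed to be the absolute bias divided by the mean magnitude of the two CA parameters (the exact normalisation used in the manuscript may differ).

```python
def absolute_bias(ca_pos: float, ca_neg: float) -> float:
    """Absolute valence bias: CA for positive minus CA for negative feedback."""
    return ca_pos - ca_neg


def relative_bias(ca_pos: float, ca_neg: float) -> float:
    """Absolute bias divided by the mean CA magnitude (assumed normalisation)."""
    mean_ca = (abs(ca_pos) + abs(ca_neg)) / 2
    return absolute_bias(ca_pos, ca_neg) / mean_ca


# Hypothetical (CA+, CA-) pairs per credibility level: the absolute gap is
# held at 0.2 while overall credit assignment grows with credibility.
hypothetical_ca = {0.5: (0.4, 0.2), 0.75: (0.7, 0.5), 1.0: (1.1, 0.9)}

abs_by_cred = {c: absolute_bias(*ca) for c, ca in hypothetical_ca.items()}
rel_by_cred = {c: relative_bias(*ca) for c, ca in hypothetical_ca.items()}
```

      With a constant absolute gap, the relative index falls as credibility (and hence overall CA) rises, so the two scales can give different impressions of the same underlying parameters.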

      Given everything above, I think it is unlikely that the present data can provide even "solid" evidence for the claim that positivity bias is greater with less credible feedback. This confound could be quickly ruled out, however, by a study in which feedback is sometimes provided in the absence of a choice. This would empirically isolate positivity bias from choice-related effects, including perseveration.

      We trust our responses make clear we have tempered our claims and stated explicitly where a conclusion is equivocal. We believe we have convincing evidence for a nuanced claim regarding how credibility affects positivity bias.

      We are grateful for the reviewer’s suggestion of a study design to empirically isolate positivity bias from choice-related effects. We have considered this carefully, but do not believe the issue is as straightforward as suggested. As we understand it, the suggestion assumes that positivity bias should persist when people process feedback in the absence of choice (where perseverative tendencies would not be elicited). While this is possible, there is existing work that indicates otherwise. In particular, Chambon et al. (2020, Nature Human Behavior) compared learning following free versus forced choices and found that learning asymmetries, including a positivity bias, were selectively evident in free-choice trials but not in forced-choice trials. This implies that a positivity bias is intricately tied to the act of choosing, rather than a general learning artifact that emerges independently of choice context. This is further supported by arguments that the positivity bias in reinforcement learning is better understood as a form of confirmation bias, whereby feedback confirming a choice is weighted more heavily (Palminteri et al., 2017, Plos Comp. Bio.). In other words, it is unclear whether one should expect positivity/confirmation bias to emerge when feedback is provided in the absence of choice.

      That said, we agree fully with a need to have task designs that better dissociate positivity bias from perseveration. We now acknowledge in our Discussion that such designs can benefit future studies on this topic:

      Future studies could also benefit from using designs that are better suited for dissociating learning asymmetries from gradual perseveration (51).

      We hope to be able to pursue this direction in the future.

      Recommendations for the Authors:

      I greatly appreciate the care with which you responded to my comments. I'm sorry that I can't improve my overall evaluation, given the seriousness of the concerns in the public review (which the new results have unfortunately bolstered more than assuaged). If it were me, I would definitely collect more data because both issues could very likely be strongly addressed with slight modifications of the current task.

      Alternatively, you could just dramatically de-emphasize the claim that positivity bias is higher for less credible feedback. I will be sad because it was my favorite result, but you have many other strong results, and I would still label the paper "important" without this one.

      We thank the reviewer for an exceptionally thorough and insightful engagement with our manuscript. Your meticulous attention to detail, and sharp conceptual critiques, have been invaluable, and our paper is immeasurably stronger and more rigorous as a direct result of this input. Indeed, the referee’s comments inspired us to prepare a new article that delves deeply into the confound of dissociating between gradual choice-perseveration and learning asymmetries in RL (Learning asymmetry or perseveration? A critical re-evaluation and solution to a pervasive confound, Vidal-Perez et al., 2025; https://osf.io/preprints/psyarxiv/xdse5_v1).

      Specifically, in this new paper we address the point that dissociating positivity bias from perseveration is a challenge not just for our work, but for the entire field of behavioral reinforcement learning. In fact, we argue that all studies claiming evidence for positivity bias, over and above an effect of perseveration, are subject to flaws, including being biased to find evidence for positivity/confirmation bias. Furthermore, we agree with the reviewer’s wish to see model-agnostic support and note there are currently no robust, behavioral, model-agnostic signatures implicating positivity bias over and above an effect of perseveration. While this remains an acknowledged limitation within our current work, we trust the reviewer will agree that relative to other efforts in the field, our current work pushes the boundary and takes several important steps beyond what has previously been done in this area.

      Below are some minor notes, mostly on the new content (hopefully easy); please don't put much time into addressing these!

      Main text

      where individuals preferably learn from . Perhaps "preferentially"?

      The text has been modified to accommodate the reviewer’s comment:

      “Additionally, in both experiments, participants exhibited increased learning from trustworthy information when it was preceded by non-credible information and an amplified normalized positivity bias for non-credible sources, where individuals preferentially learn from positive compared to negative feedback (relative to the overall extent of learning).”

      One interpretation of this model is as a "sophisticated" logistic ... the CA parameters take the role of "regression coefficients"

      Consider removing "sophisticated" and also the quotations around "regression coefficients". This came across as unprofessional to me.

      The text has been modified to accommodate the reviewer’s comment:

      “The probability of choosing a bandit (say A over B) in this family of models is a logistic function of the contrast in choice-propensities between these two bandits. One interpretation of this model is as a logistic regression, where the CA parameters take the role of regression coefficients corresponding to the change in log odds of repeating the just-taken action in future trials based on the feedback (+/- CA for positive or negative feedback, respectively; the model also includes gradual perseveration, which allows for constant log-odds changes that are not affected by choice feedback).”

      These models operate as our instructed-credibility and free-credibility Bayesian models, but also incorporate perseveration values, updated in each trial as in our CA models (Eqs. 3 and 5).

      Is Eq 3 supposed to be Eq 4 here? I don't see how Eq 3 is relevant. Relatedly, please use a variable other than P for perseveration because P(chosen) reads as "probability chosen", and you actually use P in the latter sense in, e.g., Eq 11.

      The text has been modified to accommodate the reviewer’s comment. P values have been changed to Pers and P(bandit) has been replaced by Prob(bandit). “All models also included gradual perseveration for each bandit. In each trial the perseveration values (Pers) were updated according to

      Where PERS is a free parameter representing the Pers-value change for the chosen bandit, and fP (∈ [0,1]) is the free parameter denoting the forgetting rate applied to the Pers value. Additionally, the Pers-values of all the non-chosen bandits (i.e., again, the unchosen bandit of the current pair, and all the bandits from the not-shown pairs) were forgotten as follows:

      We modelled choices using a softmax decision rule, representing the probability of the participant to choose a given bandit over the alternative:
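
      A minimal sketch of the update and choice rules described in words above, offered under stated assumptions rather than as the manuscript's equations. It assumes that the chosen bandit's Pers value is first decayed by fP and then incremented by PERS, that every other bandit's Pers value is only decayed by fP, and that the two-option softmax operates on the sum of a choice propensity (here called `q`) and `pers`. The function names, the dictionary layout, and the additive combination are all hypothetical.

```python
import math


def update_pers(pers: dict, chosen: str, pers_gain: float, f_p: float) -> dict:
    """Return updated Pers values after one trial.

    Assumed form: every value is forgotten at rate f_p; the chosen bandit
    additionally gains pers_gain (the free parameter PERS in the text).
    """
    updated = {}
    for bandit, value in pers.items():
        decayed = (1.0 - f_p) * value
        updated[bandit] = decayed + pers_gain if bandit == chosen else decayed
    return updated


def prob_choose(q: dict, pers: dict, bandit: str, alternative: str) -> float:
    """Two-option softmax, i.e. a logistic function of the value difference."""
    diff = (q[bandit] + pers[bandit]) - (q[alternative] + pers[alternative])
    return 1.0 / (1.0 + math.exp(-diff))


# Illustrative values only (not fitted parameters).
pers_before = {"A": 0.0, "B": 0.5, "C": 1.0}
pers_after = update_pers(pers_before, chosen="A", pers_gain=0.3, f_p=0.2)

q_values = {"A": 0.0, "B": 0.0, "C": 0.0}
p_a_over_b = prob_choose(q_values, pers_after, "A", "B")
p_b_over_a = prob_choose(q_values, pers_after, "B", "A")
```

      Under these assumptions the difference in summed values is the log odds of choosing one bandit over the other, which is what makes the logistic-regression reading of the CA parameters possible.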

      SI

      Figure 24 and Figure 26: in the x tick labels, consider using e.g. "0.5 vs 1" rather than "0.5-1". I initially read this as a bin range.

      We thank the reviewer for pointing this out. Our intention was to denote a direct subtraction (i.e., the effect for 0.5 credibility minus the effect for 1.0 credibility). We were concerned that not noting the subtraction might confuse readers about the direction of the plotted effect. We have clarified this in the figure legends:

      “Figure 24: Predicted positivity bias results for participants and for simulations of the Credibility-CA model (including perseveration, but no valence-bias component). a, Valence bias results measured in absolute terms (by regressing the ML CA parameters on their associated valence and credibility). b, Difference in positivity bias (measured in absolute terms) across credibility levels. On the x-axis, the hyphen (-) represents subtraction, such that a label of '0.5-1' indicates the difference in the measurement between the 0.5 and 1.0 credibility conditions. Such differences are again based on the same mixed-effects model as plot a. The inflation of aVBI for lower-credibility agents is larger than the one predicted by a pure perseveration account. c, Valence bias results measured in relative terms (by regressing the rVBIs on their associated credibility). Participants present a higher rVBI than what would be predicted by a perseveration account (except for the completely credible agent). d, Difference in rVBI across credibility levels. Such differences are again based on the same mixed-effects model as plot c. The inflation of rVBI for lower-credibility agents is larger than the one predicted by a pure perseveration account. Histograms depict the distribution of coefficients from 101 simulated group-level datasets generated by the Credibility-CA model and fitted with the Credibility-Valence CA model. Gray circles represent the mean coefficient from these simulations, while black/green circles show the actual regression coefficients from participant behaviour (green for significant effects in participants, black for non-significant). Significance markers (* p<.05, ** p<.01) indicate that fewer than 5% or 1% of simulated datasets, respectively, predicted an effect as strong as or stronger than that observed in participants, and in the same direction as the participant effect.”
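
      The significance markers described in this legend amount to a simulation-based tail proportion. A minimal sketch follows, assuming the simulated coefficients are available as a plain list; the numbers are invented for illustration and are not the study's values, and the function names are hypothetical.

```python
def simulated_tail_proportion(observed: float, simulated: list) -> float:
    """Fraction of simulated coefficients at least as strong as the observed
    coefficient and in the same direction."""
    if observed == 0:
        raise ValueError("direction is undefined for an observed effect of zero")
    sign = 1 if observed > 0 else -1
    # Multiplying by the sign maps 'same direction and as strong or stronger'
    # onto a single comparison for positive and negative observed effects.
    as_strong = [s for s in simulated if s * sign >= abs(observed)]
    return len(as_strong) / len(simulated)


def significance_marker(proportion: float) -> str:
    """Map the tail proportion onto the markers used in the legend."""
    if proportion < 0.01:
        return "**"
    if proportion < 0.05:
        return "*"
    return ""


# 101 invented coefficients standing in for the simulated group-level datasets.
simulated_coefs = [i / 1000 for i in range(101)]  # 0.000, 0.001, ..., 0.100

p_moderate = simulated_tail_proportion(0.0975, simulated_coefs)
p_strong = simulated_tail_proportion(0.2, simulated_coefs)
marker_moderate = significance_marker(p_moderate)
marker_strong = significance_marker(p_strong)
```

      Because the proportion is computed over a finite set of simulated datasets, its resolution is limited to steps of one over the number of simulations.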

      However, importantly, these simulations did not predict a change in the level of positivity bias as a function of feedback credibility

      You're confirming the null hypothesis here; running more simulations would likely yield a significant effect. The simulation shows a pretty clear pattern of increasing positivity bias with higher credibility. Crucially, this is the opposite of what people show. Please adjust the language accordingly.

      The text has been modified to accommodate the reviewer’s comment.

      “However, importantly, these simulations did not reveal a significant change in the level of positivity bias as a function of feedback credibility, neither at an absolute level (F(3,412)=1.43,p=0.24), nor at a relative level (F(3,412)=2.06,p=0.13) (Fig. S25a-c). Numerically, the trend was towards an increasing (rather than decreasing) positivity bias as a function of credibility.”

      More importantly, the inflation in positivity bias for lower credibility feedback is substantially higher in participants than what would be predicted by a pure perseveration account, a finding that holds true for both absolute (Fig. S24b) and relative (Fig. S24d) measures.

      A statistical test would be nice here, e.g. a regression like rVBI ~ credibility_1 * is_model. Alternatively, clearly state what to look for in the figure, where it is pretty clear when you know exactly what you're looking for.

      The text has been modified to make sure that the figure is easier to interpret (we pointed out to readers what they should look at):

      “Interestingly, a pure perseveration account predicted an amplification of the relative positivity bias under low (compared to full) credibility (with the two rightmost histograms in Fig. S24c falling in the positive range). However, the magnitude of this effect was significantly smaller than the empirical effect (as the bulk of these same histograms lies below the green points). Moreover, this account predicted a negative amplification (i.e., attenuation) of an absolute positivity bias, which was again significantly smaller than the empirical effect (see corresponding histograms in S24b). This pattern raises an intriguing possibility that perseveration may partially mask a true amplification of absolute positivity bias.”

    1. Author response:

      General Statements

      We thank the reviewers for providing us the opportunity to revise our manuscript titled “Identifying regulators of associative learning using a protein-labelling approach in C. elegans.” We appreciate the insightful feedback that we received to improve this work. In response, we have extensively revised the manuscript with the following changes: we have (1) clarified the criteria used for selecting candidate genes for behavioural testing, presenting additional data from ‘strong’ hits identified in multiple biological replicates (now testing 26 candidates, previously 17), (2) expanded our discussion of the functional relevance of validated hits, including providing new tissue-specific and neuron class-specific analyses, and (3) improved the presentation of our data, including visualising networks identified in the ‘learning proteome’, to better highlight the significance of our findings. We also substantially revised the text to indicate our attempts to address limitations related to background noise in the proteomic data and outlined potential refinements for future studies. All revisions are clearly marked in the manuscript in red font. A detailed, point-by-point response to each comment is provided below.

      Point-by-point description of the revisions:

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary:

      Rahmani et al. utilize the TurboID method to characterize the global proteome changes in the worm's nervous system induced by a salt-based associative learning paradigm. Altogether, Rahmani et al. uncover 706 proteins that are tagged by the TurboID method specifically in samples extracted from worms that underwent the memory-inducing protocol. Next, the authors conduct a gene enrichment analysis that implicates specific molecular pathways in salt-associative learning, such as MAP-kinase and cAMP-mediated pathways. The authors then screen a representative group of the hits from the proteome analysis. The authors find that mutants of candidate genes from the MAP-kinase pathway, namely dlk-1 and uev-3, do not affect the performance in the learning paradigm. Instead, multiple acetylcholine signaling mutants significantly affected the performance in the associative memory assay, e.g., acc-1, acc-3, gar-1, and lgc-46. Finally, the authors demonstrate that the acetylcholine signaling mutants did not exhibit a phenotype in similar but different conditioning paradigms, such as aversive salt-conditioning or appetitive odor conditioning, suggesting their effect is specific to appetitive salt conditioning.

      Major comments:

      (1) The statistical approach and analysis of the behavior assay:

      The authors use a 2-way ANOVA test which assumes normal distribution of the data. However, the chemotaxis index used in the study is bounded between -1 and 1, which prevents values near the boundaries from being normally distributed.

      Since most of the control data in this assay in this study is very close to 1, it strongly suggests that the CI data is not normally distributed and therefore 2-way ANOVA is expected to give skewed results.

      I am aware this is a common mistake and I also anticipate that most conclusions will still hold also under a more fitting statistical test.

      We appreciate the point raised by Reviewer 1 and understand the importance of performing the correct statistical tests.

      The statistical tests used in this study were chosen since parametric tests, particularly ANOVA tests to assess differences between multiple groups, are commonly used to assess behaviour in the C. elegans learning and memory field. Below is a summary of the tests used by studies that perform similar behavioural tests cited in this work, as examples:

      Author response table 1.

      A summary for the statistical tests performed by similar studies for chemotaxis assay data. References (listed in the leftmost column) were observed to (A) use parametric tests only or (B) performed either a parametric or non-parametric test on each chemotaxis assay dataset depending on whether the data passed a normality test. Listings for ANOVA tests are in bold to demonstrate their common use in the C. elegans learning and memory field.

      We note Reviewer 1's concern that this may stem from a common mistake. As stated, two-way ANOVA generally relies on normally distributed data. We used GraphPad Prism to perform the Shapiro-Wilk normality test on our chemotaxis assay data, as it is generally appropriate for sample sizes < 50 (α = 0.05), and found that most data pass this test, including groups with skewed indices. For example, this is the data for Figure S8C:

      Author response table 2.

      Shapiro-Wilk normality test results for chemotaxis assay data in Figure S8C. Chemotaxis assay data was generated to assess salt associative learning capacity for wild-type (WT) versus lgc-46(-) mutant C. elegans. Three experimental groups were prepared for each C. elegans strain (naïve, high-salt control, and trained). From top-to-bottom, the data below displays the ‘W’ value, ‘P value’, a binary yes/no for whether the data passes the Shapiro-Wilk normality test, and a ‘P value summary’ (ns = nonsignificant). W values measure the similarity between a normal distribution and the chemotaxis assay data. Data is considered normal in the Shapiro-Wilk normality test when a W value is near 1.0 and the null hypothesis is not rejected (i.e., P value > 0.05).

      The manuscript now includes the use of the Shapiro-Wilk normality test to assess chemotaxis assay data before using two-way ANOVA on page 51.

      Nevertheless an appropriate statistical analysis should be performed. Since I assume the authors would wish to take into consideration both the different conditions and biological repeats, I can suggest two options:

      - Using a Generalized linear mixed model, one can do with R software.

      - Using a custom bootstrapping approach.

      We thank Reviewer 1 for suggesting these two options. We carefully considered both approaches and consulted with the in-house statistician at our institution (Dr Pawel Skuza, Flinders University) for expert advice to guide our decision. In summary:

      (1) Generalised linear mixed models: Generalised linear mixed models (GLMMs) are generally most appropriate for nested/hierarchical data. However, our chemotaxis assay data does not exhibit such nesting. Each biological replicate (N) consists of three technical replicates, which are averaged to yield a single chemotaxis index per N. Our statistical comparisons are based solely on these averaged values across experimental groups, making GLMMs less applicable in this context.

      (2) Bootstrapping: Based on advice from our statistician, while bootstrapping can be a powerful tool, its effectiveness is limited when applied to datasets with a low number of biological replicates (N). Bootstrapping relies on resampling existing data to simulate additional observations, which may artificially inflate statistical power and potentially suggest significance where the biological effect size is minimal or not meaningful. Increasing the number of biological replicates to accommodate bootstrapping could introduce additional variability and compromise the interpretability of the results.
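
      To make the data structure and the bootstrap concern concrete, here is a hedged sketch using invented chemotaxis indices (not data from this study). It averages three technical replicates into one value per biological replicate, then draws percentile-bootstrap resamples of the group mean. The function names, the number of resamples, and the seed are assumptions made for illustration.

```python
import math
import random
import statistics


def per_replicate_index(technical: list) -> list:
    """Average the technical replicates within each biological replicate."""
    return [statistics.mean(values) for values in technical]


def bootstrap_mean_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a small sample."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented chemotaxis indices: 5 biological replicates x 3 technical replicates.
hypothetical_technical = [
    [0.90, 0.80, 0.85],
    [0.70, 0.75, 0.80],
    [0.95, 0.90, 0.85],
    [0.60, 0.70, 0.65],
    [0.80, 0.85, 0.90],
]

indices = per_replicate_index(hypothetical_technical)
interval = bootstrap_mean_interval(indices)

# Number of distinct resamples (multisets of size N drawn from N values).
n = len(indices)
distinct_resamples = math.comb(2 * n - 1, n)
```

      With N = 5 there are only 126 distinct resamples, and every resampled mean is confined to the range of the five observed values, which is one way to see why resampling adds little information at this sample size.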

      The total number of assays, especially controls, varies quite a bit between the tested mutants. For example compare the acc-1 experiment in Figure 4.A., and gap-1 or rho-1 in Figure S4.A and D. It is hard to know the exact N of the controls, but I assume that for example, lowering the wild type control of acc-1 to equivalent to gap-1 would have made it non significant. Perhaps the best approach would be to conduct a power analysis, to know what N should be acquired for all samples.

      We carefully considered performing a power analysis; however, this is typically performed under the assumption that N = 1 represents a single individual. In this study, N = 1 is one biological replicate that includes hundreds of worms, which is why power analysis is not typically employed in our field for this type of behavioural test.

      Considering these factors, we have opted to continue using a two-way ANOVA for our statistical analysis. This choice aligns with recent publications that employ similar experimental designs and data structures. Crucially, we have verified that our data meet the assumptions of normality, addressing key concerns regarding the suitability of parametric testing. We believe this approach is sufficiently rigorous to support our main conclusions. This rationale is now outlined on page 51.

      To be fully transparent, our aim is to present differences between wild-type and mutant strains that are clearly visible in the graphical data, such that the choice of statistical test does not become a limiting factor in interpreting biological relevance. We hope this rationale is understandable, and we sincerely appreciate the reviewer’s comment and the opportunity to clarify our analytical approach.

      We hope that Reviewer 1 will appreciate these considerations as sufficient justification to retain the statistical tests used in the original manuscript. Nevertheless, to constructively address this comment, we have performed the following revisions:

      (1) Consistent number of biological replicates: We performed additional biological replicates of the learning assay to confirm the behavioural phenotypes for the key candidates described (KIN-2, F46H5.3, ACC-1, ACC-3, LGC-46). We chose N = 5 since most studies cited in this paper that perform similar behavioural tests do the same (see Author response table 3 below).

      Author response table 3.

      A summary of sample sizes generated by similar studies for chemotaxis assay data. References (listed in the leftmost column) were observed to use the sample sizes (N) below, corresponding to biological replicates of chemotaxis assay data. N values are in bold when the study uses N ≤ 5.

      (2) Grouped presentation of behavioural data: We now present all behavioural data by grouping genotypes tested within the same biological replicate, including wild-type controls, rather than combining genotypes tested separately. This ensures that each graph displays data from genotypes sharing the same N, also an important consideration for performing parametric tests. Accordingly, we re-performed statistical analyses using this reduced N for relevant graphs. As anticipated, this rendered some comparisons non-significant. All statistical comparisons are clearly indicated on each graph.

      (3) Improved clarity of figure legends: We revised figure legends for Figures 5, 6, S7, S8, & S9 to make clear how many biological replicates have been performed for each genotype by adding N numbers for each genotype in all figures.

      The authors use the phrasing "a non-significant trend", I find such claims uninterpretable and should be avoided. Examples: Page 16. Line 7 and Page 18, line 16.

      This is an important point. While we were not able to find the specific phrasing "a non-significant trend" from this comment in the original manuscript, we acknowledge that referring to a phenotype as both a trend and non-significant may confuse readers, which was originally stated in the manuscript in two locations.

      The main text has been revised on pages 27 & 28 when describing comparisons between trained groups between two C. elegans lines, by removing mentions of trends and retaining descriptions of non-significance.

      (2) Neuron-specific analysis and rescue of mutants:

      Throughout the study the authors avoid focusing on specific neurons. This is understandable as the authors aim at a systems biology approach, however, in my view this limits the impact of the study. I am aware that the proteome changes analyzed in this study were extracted from a pan neuronally expressed TurboID. Yet, neuron-specific changes may nevertheless be found. For example, running the protein lists from Table S2, in the Gene enrichment tool of wormbase, I found, across several biological replicates, enrichment for the NSM, CAN and RIG neurons. A more careful analysis may uncover specific neurons that take part in this associative memory paradigm. In addition, analysis of the overlap in expression of the final gene list in different neurons, comparing them, looking for overlap and connectivity, would also help to direct towards specific circuits.

      This is an important and useful suggestion. We appreciate the benefit in exploring the data from this study from a neuron class-specific lens, in addition to the systems-level analyses already presented.

      The WormBase gene enrichment tool is indeed valuable for broad transcriptomic analyses (the findings from utilising this tool are now on page 16); however, its use of Anatomy Ontology (AO) terms also contains annotations from more abundant non-neuronal tissues in the worm. To strengthen our analysis and complement the WormBase tool, we also used the CeNGEN database as suggested by Reviewer 3 Major Comment 1 (Taylor et al., 2021), which uses single cell RNA-Seq data to profile gene expression across the C. elegans nervous system. We input our learning proteome data into CeNGEN as a systemic analysis, identifying neurons highly represented by the learning proteome (on pages 16-20). To do this, we specifically compared genes/proteins from high-salt control worms and trained worms to identify potential neurons that may be involved in this learning paradigm. Briefly, we found:

      - WormBase gene enrichment tool: Enrichment for anatomy terms corresponding to specific interneurons (ADA, RIS, RIG), ventral nerve cord neurons, pharyngeal neurons (M1, M2, M5, I4), PVD sensory neurons, DD motor neurons, serotonergic NSM neurons, and CAN.

      - CeNGEN analysis: Representation of neurons previously implicated in associative learning (e.g., AVK interneurons, RIS interneurons, salt-sensing neuron ASEL, CEP & ADE dopaminergic neurons, and AIB interneurons), as well as neurons not previously studied in this context (pharyngeal neurons I3 & I6, polymodal neuron IL1, motor neuron DA9, and interneuron DVC). Methods are detailed on pages 50 & 51.

      These data are summarised in the revised manuscript as Table S7 & Figure 4.

      To further address the reviewer’s suggestion, we examined the overlap in expression patterns of the validated learning-associated genes acc-1, acc-3, lgc-46, kin-2, and F46H5.3 across the neuron classes above, using the CeNGEN database. This was done to explore potential neuron classes in which these regulators may act to regulate learning. This analysis revealed both shared and distinct expression profiles, suggesting potential functional connectivity or co-regulation among subsets of neurons. To summarise, we found:

      - All five learning regulators are expressed in RIM interneurons and DB motor neurons.

      - KIN-2 and F46H5.3 share the same neuron expression profile and are present in many neurons, so they may play a general function within the nervous system to facilitate learning.

      - ACC-3 is expressed in three sensory neuron classes (ASE, CEP, & IL1).

      - In contrast, ACC-1 and LGC-46 are expressed in neuron classes (in brackets) implicated in gustatory or olfactory learning paradigms (AIB, AVK, NSM, RIG, & RIS) (Beets et al., 2012, Fadda et al., 2020, Wang et al., 2025, Zhou et al., 2023, Sato et al., 021), neurons important for backward or forward locomotion (AVE, DA, DB, & VB) (Chalfie et al., 1985), and neuron classes whose function is yet to be detailed in the literature (ADA, I4, M1, M2, & M5).

      These neurons form a potential neural circuit that may underlie this form of behavioural plasticity, which we now describe in the main text on pages 16-20 & 34-35 and summarise in Figure 4.
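
      The overlap comparison described above amounts to a set intersection across neuron classes. The following is a minimal sketch of that idea, not the analysis code used in the study: the gene and neuron names are placeholders, and the expression sets are invented purely for illustration rather than taken from CeNGEN.

```python
# Illustrative sketch only: placeholder genes and neuron classes, not CeNGEN data.
from functools import reduce

# Hypothetical mapping: gene -> set of neuron classes in which it is expressed.
expression = {
    "gene_a": {"N1", "N2", "N3"},
    "gene_b": {"N1", "N2", "N4"},
    "gene_c": {"N1", "N2", "N5", "N6"},
}


def shared_neurons(expr):
    """Return the neuron classes in which every gene in `expr` is expressed."""
    return reduce(set.intersection, expr.values())


def unique_neurons(expr, gene):
    """Return the neuron classes in which `gene` is the only listed gene expressed."""
    others = set().union(*(s for g, s in expr.items() if g != gene))
    return expr[gene] - others


shared = shared_neurons(expression)
unique_c = unique_neurons(expression, "gene_c")
```

      With real expression calls, the shared set would correspond to neuron classes common to all regulators, and the per-gene unique sets to the distinct profiles.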

      OPTIONAL: A rescue of the phenotype of the mutants by re-expression of the gene is missing, this makes sure to avoid false-positive results coming from background mutations. For example, a pan neuronal or endogenous promoter rescue would help the authors to substantiate their claims, this can be done for the most promising genes. The ideal experiment would be a neuron-specific rescue but this can be saved for future works.

      We appreciate this suggestion and recognise its potential to strengthen our manuscript. In response, we made many attempts to generate pan-neuronal and endogenous promoter re-expression lines. However, we faced several technical issues in transgenic line generation, including poor survival following microinjection likely due to protein overexpression toxicity (e.g., C30G12.6, F46H5.3), and reduced animal viability for chemotaxis assays, potentially linked to transgene-related reproductive defects (e.g., ACC-1). As we have previously successfully generated dozens of transgenic lines in past work (e.g. Chew et al., Neuron 2018; Chew et al., Phil Trans B 2018; Gadenne/Chew et al., Life Science Alliance 2022), we believe the failure to produce most of these lines is not likely due to technical limitations. For transparency, these observations have been included in the discussion section of the manuscript on pages 39 & 40 as considerations for future troubleshooting.

      Fortunately, we were able to generate a pan-neuronal promoter line for KIN-2 that has been tested and included in the revised manuscript. This new data is shown in Figure 5B and described on pages 23 & 24. Briefly, this shows that pan-neuronal expression of KIN-2 from the ce179 mutant allele is sufficient to reproduce the enhanced learning phenotype observed in kin-2(ce179) animals, confirming the role of KIN-2 in gustatory learning.

      To address the potential involvement of background mutations (also indicated by Reviewer 4 under ‘cross-commenting’), we have also performed experiments with backcrossed versions of several mutants. These experiments aimed to confirm that salt associative learning phenotypes are due to the expected mutation. Namely, we assessed kin-2(ce179) mutants that had been backcrossed previously by another laboratory, as well as C30G12.6(-) and F46H5.3(-) animals backcrossed in this study. Although not all backcrossed mutants retained their original phenotype (i.e., C30G12.6) (Figure 6D, a newly added figure), we found that backcrossed versions of KIN-2 and F46H5.3 both robustly showed enhanced learning (Figures 5A & 6B).

      This is described in the text on pages 23-26.

      Minor comments:

      (1) Lack of clarity regarding the validation of the biotin tagging of the proteome.

      The authors show in Figure 1 that they validated that the combination of the transgene and biotin allows them to find more biotin-tagged proteins. However, there is significant biotin background also in control samples, as is common for this method. The authors mention they validated biotin tagging of all their experiments, but it was unclear in the text whether they validated it in comparison to no-biotin controls, and checked for the fold change difference.

      This is an important point: We validated our biotin tagging method prior to mass spectrometry by comparing ‘no biotin’ and ‘biotin’ groups. This is shown in Figure S1 in the revised manuscript, which includes a western blot comparing untreated and biotin treated animals that are non-transgenic or expressing TurboID. As expected, by comparing biotinylated protein signal for untreated and treated lanes within each line, biotin treatment increased the signal 1.30-fold for non-transgenic and 1.70-fold for TurboID C. elegans. This is described on page 8 of the revised manuscript.
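
      For readers who want the arithmetic spelled out, the fold-change above is the ratio of biotin-treated to untreated lane signal within each line. The sketch below is illustrative only: the lane intensities are hypothetical values chosen to reproduce the reported 1.30-fold and 1.70-fold increases, and the function name is our own.

```python
# Illustrative sketch: lane intensities are hypothetical (arbitrary units),
# chosen only so that the ratios match the fold-changes quoted in the text.

def fold_change(treated_signal, untreated_signal):
    """Ratio of biotin-treated to untreated whole-lane signal."""
    if untreated_signal <= 0:
        raise ValueError("untreated signal must be positive")
    return treated_signal / untreated_signal


# line -> (untreated lane signal, biotin-treated lane signal)
lanes = {
    "non-transgenic": (100.0, 130.0),
    "TurboID": (100.0, 170.0),
}

fold_changes = {
    line: round(fold_change(treated, untreated), 2)
    for line, (untreated, treated) in lanes.items()
}
```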

      To clarify, for mass spectrometry experiments, we tested a no-TurboID (non-transgenic) control, but did not perform a no-biotin control. We included the following four groups: (1) No-TurboID ‘control’ (2) No-TurboID ‘trained’, (3) pan-neuronal TurboID ‘control’ and (4) pan-neuronal TurboID ‘trained’, where trained versus control refers to whether ‘no salt’ was used as the conditioned stimulus or not, respectively (illustrated in Figure 1A). Due to the complexity of the learning assay (which involves multiple washes and handling steps, including a critical step where biotin is added during the conditioning period), and the need to collect sufficient numbers of worms for protein extraction (>3,000 worms per experimental group), adding ‘no-biotin’ controls would have doubled the number of experimental groups, which we considered unfeasible for practical reasons. This is explained on pages 8 & 9 of the revised manuscript.

      Also, it was unclear which exact samples were tested per replicate. In Page 9, Lines 17-18: "For all replicates, we determined that biotinylated proteins could be observed ...", But in Page 8, Line 24 : "We then isolated proteins from ... worms per group for both 'control' and 'trained' groups,... some of which were probed via western blotting to confirm the presence of biotinylated proteins".

      Could the authors specify which samples were verified and clarify how?

      Thank you for pointing out these unclear statements: We have clarified the experimental groups used for mass spectrometry experiments as detailed in the response above on pages 8 & 9. In addition, western blots corresponding to each biological replicate of mass spectrometry data are described in the main text on page 10 and have been added to the revised manuscript (as Figure S3). These western blots compare biotinylation signal for proteins extracted from (1) No-TurboID ‘control’, (2) No-TurboID ‘trained’, (3) pan-neuronal TurboID ‘control’ and (4) pan-neuronal TurboID ‘trained’. These blots function to confirm that there were biotinylated proteins in TurboID samples, before enrichment by streptavidin-mediated pull-down for mass spectrometry.

      OPTIONAL: include the fold changes of biotinylated proteins of all the ones that were tested. Similar to Figure 1.C.

      This is an excellent suggestion. As recommended by the reviewer, we have included fold-changes for biotinylated protein levels between high-salt control and trained groups (on pages 9 & 10 for replicate #1 and in Table S2 for replicates #2-5). This was done by measuring protein levels in whole lanes for each experimental group per biological replicate within western blots (Figure 1C for replicate #1 and Figure S3 for replicates #2-5) of protein samples generated for mass spectrometry (N = 5).

      (2) Figure 2 does not add much to the reader, it can be summarized in the text, as the fraction of proteins enriched for specific cellular compartments.

      I would suggest moving Figure 2 (originally written as figure 3) to the text, or transferring it to the supplementary material.

      As noted in the cross-comment response to Reviewer 4, there were typos in the original figure references; we have corrected them above. Essentially, this comment is referring to Figure 2.

      We appreciate this feedback from Reviewer 1. We agree that the original Figure 2 functions as a visual summary from analysis of the learning proteome at the subcellular compartment level. However, it also serves to highlight the following:

      - Representation for neuron-specific GO terms is relatively low, but even this small percentage represents entire protein-protein networks that are biologically meaningful, but that are difficult to adequately describe in the main text.

      - TurboID was expressed in neurons so this figure supports the relevance of the identified proteome to biological learning mechanisms.

      - Many of these candidates could not be assessed by learning assay using single mutants since related mutations are lethal or substantially affect locomotion. These networks therefore highlight the benefit in using strategies like TurboID to study learning.

      We have chosen to retain this figure, moving it to the supplementary material as Figure S4 in the revised manuscript, as suggested.

      OPTIONAL- I would suggest the authors to mark in a pathway summary figure similar to Figure 3 (originally written as Figure 4) the results from the behavior assay of the genetic screen. This would allow the reader to better get the bigger picture and to connect to the systemic approach taken in Figures 2 and 3.

      We think this is a fantastic suggestion and thank Reviewer 1 for this input. In the revised manuscript, we have added Figure 7, which summarises the tested candidates that displayed an effect on learning, mapped onto potential molecular pathways derived from networks in the learning proteome. This figure provides a visual framework linking the behavioural outcomes to the network context. This is described in the main text on pages 32-33.

      (3) Typo in Figure 3: the circle of PPM1: The blue right circle half is bigger than the left one.

      We thank the Reviewer for noticing this; the node size for PPM-1.A has been corrected in what is now Figure 2 in the revised work.

      (4) Lack of clarity in the discussion. On Page 24, Line 14 of the discussion, the authors raise this question: "why are the proteins we identified not general learning regulators?" The phrasing and logic of the argumentation of the possible answers were hard to follow. Can you clarify?

      We appreciate this feedback in terms of unclarity, as we strive to explain the data as clearly and transparently as possible. Our goal in this paragraph was to discuss why some candidates were seen to only affect salt associative learning, as opposed to showing effects in multiple learning paradigms (i.e., which we were defining as a ‘general learning regulator’). We have adjusted the wording in several places in this paragraph now on pages 36 & 37 to address this comment. We hope the rephrased paragraph provides sufficient rationalisation for the discussion regarding our selection strategy used to isolate our protein list of potential learning regulators, and its potential limitations.

      Cross-Commenting

      Firstly, we would like to express our appreciation for the opportunity for reviewers to cross-comment on feedback from other reviewers. We believe this is an excellent feature of the peer review process, and we are grateful to the reviewers for their thoughtful engagement and collaborative input.

      I would like to thank Reviewer #4 for the great cross comment summary, I find it accurate and helpful.

      I also would like to thank Reviewer #4 for spotting the typos in my minor comments, their page and figure numbers are the correct ones.

      We have corrected these typos in the relevant comments, and have responded to them accordingly.

      Small comment on common point 1 - My feeling is that it is challenging to do quantitative mass spectrometry, especially with TurboID. In general, the nature of MS data is that it hints towards a direction but a follow-up validation work is required in order to assess it. For example, I am not surprised that the fraction of repeats a hit appeared in does not predict well whether this hit would be validated behaviorally. Given these limitations, I find the authors' approach reasonable.

      We thank Reviewer 1 for this positive and thoughtful feedback. We also appreciate Reviewer 4’s comment regarding quantitative mass spectrometry and have addressed this in detail below (see response to Reviewer 4). However, we agree with Reviewer 1 that there are practical challenges to performing quantitative mass spectrometry with TurboID, primarily due to the enrichment for biotinylated proteins that is a key feature of the sample preparation process.

      Importantly, we whole-heartedly agree with Reviewer 1’s statement that “In general, the nature of MS data is that it hints towards a direction but a follow-up validation work is required in order to assess it”. This is the core of our approach; however, we appreciate that there are limitations to a qualitative ‘absent/present’ approach. We have addressed some of these limitations by clarifying the criteria used for selecting candidate genes, based additionally on the presence of the candidate in multiple biological replicates (categorised as ‘strong’ hits). Based on this method, we were able to validate the role of several novel learning regulators (Figures 5, 6, & S7). We sincerely hope that this manuscript can function as a direction for future research, as suggested by this Reviewer.

      I also would like to highlight this major comment from reviewer 4:

      "In Experimental Procedures, authors state that they excluded data in which naive or control groups showed average CI < 0.6499, and/or trained groups showed average CI < -0.0499 or > .5499 for N2 (page 36, lines 5-7). "

      This threshold seems arbitrary to me too, and it requires the clarifications requested by reviewer 4.

      As detailed in our response to Reviewer 4, Major Comment 2, data were excluded only in rare cases, specifically when N2 worms failed to show strong salt attraction prior to training, or when trained N2 worms did not exhibit the expected behavioural difference compared to untrained controls – this can largely be attributed to clear contamination or over-population issues, which are visible prior to assessing CTX plates and counting chemotaxis indices.

      These criteria were initially established to provide an objective threshold for excluding biological replicates, particularly when planning to assay a large number of genetic mutants. However, after extensive testing across many replicates, we found that N2 worms (that were not starved, or not contaminated) consistently displayed the expected phenotype, rendering these thresholds unnecessary. We acknowledge that emphasizing these criteria may have been misleading, and have therefore removed them from page 50 in the revised manuscript to avoid confusion and ensure clarity.

      Reviewer #1 (Significance):

      This study does a great job to effectively utilize the TurboID technique to identify new pathways implicated in salt-associative learning in C. elegans. This technique was used in C. elegans before, but not in this context. The salt-associative memory induced proteome list is a valuable resource that will help future studies on associative memory in worms. Some of the implicated molecular pathways were found before to be involved in memory in worms like cAMP, as correctly referenced in the manuscript. The implication of the acetylcholine pathway is novel for C. elegans, to the best of my knowledge. The finding that the uncovered genes are specifically required for salt associative memory and not for other memory assays is also interesting.

      However overall I find the impact of this study limited. The premise of this work is to use the Turbo-ID method to conduct a systems analysis of the proteomic changes. The work starts by conducting network analysis and gene enrichment which fit a systemic approach. However, since the authors find that ~30% of the tested hits affect the phenotype, and since only 17/706 proteins were assessed, it is challenging to draw conclusive broad systemic claims.

      Alternatively, the authors could have focused on the positive hits, and understand them better, find the specific circuits where these genes act. This could have increased the impact of the work. Since neither of these two options are satisfied, I view this work as solid, but not wide in its impact and therefore estimate the audience of this study would be more specialized.

      My expertise is in C. elegans behavior, genetics, and neuronal activity, programming and machine learning.

      We thank the Reviewer for these comments and appreciate the recognition of the value of the proteomic dataset and the identification of novel molecular pathways, including the acetylcholine pathway, as well as the specificity of the uncovered genes to salt-associative memory. Regarding the reviewer’s concern about the overall impact and scope of the study, we respectfully offer the following clarification. Our aim was to establish a systems-level approach for investigating learning-related proteomic changes using TurboID, and we acknowledge that only a subset of the identified proteins was experimentally tested (now 26/706 proteins in the revised manuscript). Although only five of the tested single gene mutants showed a robust learning phenotype in the revised work (after backcrossing, more stringent candidate selection, and improved statistical analysis in addressing reviewer comments), our proteomic data provides us with a unique opportunity to define these candidates within protein-protein networks (as illustrated in Figure 7). Importantly, our functional testing focused on single-gene mutants, which may not reveal phenotypes for genes that act redundantly (now mentioned on pages 28-30). This limitation is inherent to many genetic screens and highlights the value of our proteomic dataset, which enables the identification of broader protein-protein interaction networks and molecular pathways potentially involved in learning.

      To support this systems-level perspective, we have added Figure 7, which visually integrates the tested candidates into molecular pathways derived from the learning proteome for learning regulators KIN-2 and F46H5.3. We also emphasise more explicitly in the text (on pages 32-33) the value of our approach by highlighting the functional protein networks that can be derived from our proteomics dataset.

      We fully acknowledge that the use of TurboID across all neurons limits the resolution needed to pinpoint individual neuron contributions, and understand the benefit in further experiments to explore specific circuits. Many circuits required for salt sensing and salt-based learning are highly explored in the literature and defined explicitly (see Rahmani & Chew, 2021), so our intention was to complement the existing literature by exploring the protein-protein networks involved in learning, rather than on neuron-neuron connectivity. However, we recognise the benefit in integrating circuit-level analyses, given that our proteomic data suggests hundreds of candidates potentially involved in learning. While validating each of these candidates is beyond the scope of the current study, we have taken steps to suggest candidate neurons/circuits by incorporating tissue enrichment analyses and single-cell transcriptomic data (Table S7 & Figure 4). These additions highlight neuron classes of interest and suggest possible circuits relevant to learning.

      We hope this clarification helps convey the intended scope and contribution of our study. We also believe that the revisions made in response to Reviewer 1’s feedback have strengthened the manuscript and enhanced its significance within the field.

      Reviewer #2 (Evidence, reproducibility and clarity):

      Summary:

      In this study by Rahmani and colleagues, the authors sought to define the "learning proteome" for a gustatory associative learning paradigm in C. elegans. Using a cytoplasmic TurboID expressed under the control of a pan-neuronal promoter, the authors labeled proteins during the training portion of the paradigm, followed by proteomics analysis. This approach revealed hundreds of proteins potentially involved in learning, which the authors describe using gene ontology and pathways analysis. The authors performed functional characterization of some of these genes for their requirement in learning using the same paradigm. They also compared the requirement for these genes across various learning paradigms, and found that most hits they characterized appear to be specifically required for the training paradigm used for generating the "learning proteome".

      Major Comments:

      (1) The definition of a "hit" from the TurboID approach does not appear stringent enough. According to the manuscript, a hit was defined as one unique peptide detected in a single biological replicate (out of 5), which could give rise to false positives. In figure S2, it is clear that there is relatively little overlap between replicates with regards to proteins detected, and while perhaps unintentional, presenting a single unique peptide appears to be an attempt to inflate the number of hits. Defining hits as present in more than one sample would be more rigorous. Changing the definition of hits would only require the time to re-list genes and change data presented in the manuscript accordingly.

      We thank Reviewer 2 for this valuable comment, and the following related suggestion. We agree with the statement that “Defining hits as present in more than one sample would be more rigorous”. Therefore, to address this comment, we have now separated candidates into two categories in Table 2 in the revised manuscript: ‘strong’ (present in 3 or more biological replicates) and ‘weak’ candidates (present in 2 or fewer biological replicates). However, we think these weaker candidates should still be included in the manuscript, considering we did observe relationships between these proteins and learning. For example, ACC-1, which influences salt associative learning in C. elegans, was detected in one replicate of mass spectrometry as a potential learning regulator (Figure S8A). We describe this classification in the main text on pages 21-22.
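
      The ‘strong’/‘weak’ classification above can be sketched as a count of the biological replicates in which each candidate was detected. This is a minimal illustration rather than the pipeline used for the manuscript: the function names are our own, and the assignment of proteins to individual replicates is invented; only the total counts (F46H5.3 in 4 replicates, C30G12.6 in 3, ACC-1 in 1) follow the numbers quoted in this response.

```python
# Illustrative sketch: which replicate each protein appears in is hypothetical;
# only the per-protein totals mirror the counts quoted in the response.
from collections import Counter


def replicate_counts(replicate_hits):
    """Count the number of biological replicates in which each protein appears."""
    counts = Counter()
    for hits in replicate_hits:
        counts.update(set(hits))  # count each protein at most once per replicate
    return counts


def classify(counts, strong_min=3):
    """Label candidates 'strong' (>= strong_min replicates) or 'weak' (fewer)."""
    return {
        protein: ("strong" if n >= strong_min else "weak")
        for protein, n in counts.items()
    }


# Five hypothetical replicates of proteins unique to TurboID 'trained' samples.
replicates = [
    {"F46H5.3", "C30G12.6", "ACC-1"},
    {"F46H5.3", "C30G12.6"},
    {"F46H5.3"},
    {"F46H5.3", "C30G12.6"},
    set(),
]

counts = replicate_counts(replicates)
labels = classify(counts)
```

      Keeping the threshold as a parameter makes it straightforward to see how the candidate list changes under a stricter or looser definition of a hit.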

      We also agree with Reviewer 2 that the overlap between individual candidate hits is low between biological replicates; the inclusion of Figure S2 in the original manuscript serves to highlight this limitation. However, it is also important to consider that there is notable overlap for whole molecular pathways between biological replicates of mass spectrometry data as shown in Figure 2 in the revised manuscript (this consideration is now mentioned on pages 13-14). We have included Figure 3 to illustrate representation for two metabolic processes across several biological replicates normally indispensable to animal health, as an example to provide additional visual aid for the overlap between replicates of mass spectrometry. We provide this figure (described on pages 13 & 15) to demonstrate the strength of our approach in that it can detect candidates not easily assessable by conventional forward or reverse genetic screens.

      We also appreciate the opportunity to explain our approach. The criterion of “at least one unique peptide” was chosen based on previous work that we adapted for this manuscript (Prikas et al., 2020). It was not intended to inflate the number of hits but rather to ensure sensitivity in detecting low-abundance neuronal proteins. We have clarified this in our Methods (page 46).

      (2) The "hits" that the authors chose to functionally characterize do not seem like strong candidate hits based on the proteomics data that they generated. Indeed, most of the hits are present in a single, or at most 2, biological replicate. It is unclear as to why the strongest hits were not characterized, which if mutant strains are publicly available, would not be a difficult experiment to perform.

      We thank the reviewer for this important suggestion. To address this, we have described two molecular pathways with multiple components that appear in more than one biological replicate of mass spectrometry data in Figure 3 (main text on page 13). In addition, we have included Figures 6 & S7 where 9 additional single mutants corresponding to candidates in three or more biological replicates of mass spectrometry were tested for salt associative learning. Briefly, we found the following (number of replicates that a protein was unique to TurboID trained animals is in brackets):

      - Novel arginine kinase F46H5.3 (4 replicates) displays an effect in both salt associative learning and salt aversive learning in the same direction (Figures 6A, 6B, & S9A, pages 31-32 & 37-38).

      - Worms with a mutation for armadillo-domain protein C30G12.6 (3 replicates) only displayed an enhanced learning phenotype when non-backcrossed, not backcrossed. This suggests the enhanced learning phenotype was caused by a background mutation (Figure 6, pages 24-25).

      - We did not observe an effect on salt associative learning when assessing mutations for the ciliogenesis protein IFT-139 (5 replicates), guanyl nucleotide factors AEX-3 or TAG-52 (3 replicates), p38/MAPK pathway interactor FSN-1 (3 replicates), IGCAM/RIG-4 (3 replicates), and acetylcholine components ACR-2 (4 replicates) and ELP-1 (3 replicates) (Figure S7, on pages 27-30). However, we note throughout the section in which these candidates are described that only single gene mutants were tested, meaning that genes that function in redundant or compensatory pathways may not exhibit a detectable phenotype.

      Because of the lack of strong evidence that these are indeed proteins regulated in the context of learning based on proteomics, including evidence of changes in the proteins (by imaging expression changes of fluorescent reporters or a biochemical approach), would increase confidence that these hits are genuine.

      We thank Reviewer 2 for this suggestion – we agree that it would have been ideal to have additional evidence suggesting that changes in candidate protein levels are associated directly with learning. Ideally, we would have explored this aspect further; however, as outlined in response to Reviewer 1 Major Comment 2 (OPTIONAL), this was not feasible within the scope of the current study due to several practical challenges. Specifically, we attempted to generate pan-neuronal and endogenous promoter rescue lines for several candidates, but encountered significant challenges, including poor survival post-microinjection (likely due to protein overexpression toxicity) and reduced viability for behavioural assays, potentially linked to transgene-related reproductive defects. This information is now described on pages 39 & 40 of the revised work.

      To address these limitations, we performed additional behavioural experiments where possible. We successfully generated a pan-neuronal promoter line for kin-2, which was tested and included in the revised manuscript (Figure 5B, pages 30 & 31). In addition, to confirm that observed learning phenotypes were due to the expected mutations and not background effects, we conducted experiments using backcrossed versions of several mutant lines as suggested by Reviewer 4 Cross Comment 3 (Figure 6, pages 23-24 & 24-26). Briefly, this shows that pan-neuronal expression of KIN-2 from the ce179 mutant allele is sufficient to reproduce the enhanced learning phenotype observed in backcrossed kin-2(ce179) animals, providing additional evidence that the identified hits are required for learning. We also confirmed that F46H5.3 modulates salt associative learning, given both non-backcrossed and backcrossed F46H5.3(-) mutants display a learning enhancement phenotype. The revised text now describes this data on the page numbers mentioned above.

      Minor Comments:

      (1) The authors highlight that the proteins they discover seem to function uniquely in their gustatory associative paradigm, but this is not completely accurate. kin-2, which they characterize in figure 4, is required for positive butanone association (the authors even say as much in the manuscript) in Stein and Murphy, 2014.

      We appreciate this correction and thank the Reviewer for pointing this out. We have amended the wording appropriately on page 31 to clarify our meaning.

      “Although kin-2(ce179) mutants were not shown to impact salt aversive learning, they have been reported previously to display impaired intermediate-term memory (but intact learning and short-term memory) for butanone appetitive learning (Stein and Murphy, 2014).”

      Reviewer #2 (Significance):

      General Assessment:

      The approach used in this study is interesting and has the potential to further our knowledge about the molecular mechanisms of associative behaviors. Strengths of the study include the design with carefully thought out controls, and the premise of combining their proteomics with behavioral analysis to better understand the biological significance of their proteomics findings. However, the criteria for defining hits and prioritization of hits for behavioral characterizations were major weaknesses of the paper.

      Advance:

      There have been multiple transcriptomic studies in the worm looking at gene expression changes in the context of behavioral training (Lakhina et al., 2015, Freytag 2017). This study complements and extends those studies, by examining how the proteome changes in a different training paradigm. The approach here could be employed for multiple different training paradigms, presenting a new technical advance for the field.

      Audience:

      This paper would be of interest to the broader field of behavioral and molecular neuroscience. Though it uses an invertebrate system, many findings in the worm regarding learning and memory translate to higher organisms.

      I am an expert in molecular and behavioral neuroscience in both vertebrate and invertebrate models, with experience in genetics and genomics approaches.

      We appreciate Reviewer 2’s thoughtful assessment and constructive feedback. In response to concerns regarding definition and prioritisation of hits, we have revised our approach as detailed above to place more consideration on ‘strong’ hits present in multiple biological replicates. We have also added new behavioural data for additional mutants that fall into this category (Figures 6 & S7). We hope these revisions strengthen our study and enhance its relevance to the behavioural/molecular neuroscience community.

      Reviewer #3 (Evidence, reproducibility and clarity):

      Summary:

      In the manuscript titled "Identifying regulators of associative learning using a protein-labelling approach in C. elegans" the authors attempted to generate a snapshot of the proteomic changes that happen in the C. elegans nervous system during learning and memory formation. They employed the TurboID-based protein labeling method to identify the proteins that are uniquely found in samples that underwent training to associate no-salt with food, and consequently exhibited lower attraction to high salt in a chemotaxis assay. Using this system they obtained a list of target proteins that included proteins represented in molecular pathways previously implicated in associative learning. The authors then further validated some of the hits from the assay by testing single gene mutants for effects on learning and memory formation.

      Major Comments:

      In the discussion section, the authors comment on the sources of "background noise" in their data and ways to improve the specificity. They provide some analysis on this aspect in Supplementary figure S2. However, a better visualization of non-specificity in the sample could be a GO analysis of tissue-specificity, presented as a pie chart as in Figure 2A. Non-neuronal proteins such as MYO-2 or MYO-3 repeatedly show up on the "TurboID trained" lists in several biological replicates (Tables S2 and S3). If a major fraction of the proteins remaining after subtraction of control lists are non-specific, that increases the likelihood that the "hits" observed are by chance. This analysis should be presented in one of the main figures as it is essential for the reader to gauge the reliability of the experiment.

      We agree with this assessment and thank Reviewer 3 for this constructive suggestion. In response, we have now incorporated a comprehensive tissue-specific analysis of the learning proteome in the revised manuscript. Using the single neuron RNA-Seq database CeNGEN, we identified the proportion of neuronal vs non-neuronal proteins from each biological replicate of mass spectrometry data. Specifically, we present Table 1 on page 17 (which we originally intended to include in the manuscript, but inadvertently left out), which shows that 87-95% (i.e. a large majority) of proteins identified across replicates corresponded to genes detected in neurons, supporting that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 is now described in the main text of the revised work on page 16.
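
      As a rough illustration of the per-replicate calculation described above, the proportion of neuronal proteins can be computed as a simple set overlap. This is only a sketch under stated assumptions: the function name, the placeholder reference set, and the example hit lists are hypothetical and are not taken from CeNGEN or from the manuscript, and the actual analysis may have filtered the data differently.

```python
# Hypothetical sketch: fraction of identified proteins whose genes
# appear in a reference set of neuron-detected genes.
# All gene lists below are illustrative placeholders, not real data.

def neuronal_fraction(identified_genes, neuron_detected_genes):
    """Return the fraction of identified genes found in the neuronal reference set."""
    identified = set(identified_genes)
    if not identified:
        raise ValueError("no identified genes supplied")
    return len(identified & set(neuron_detected_genes)) / len(identified)


# Placeholder standing in for "genes detected in at least one neuron class".
neuron_reference = {"kin-2", "acc-1", "acc-3", "lgc-46", "glna-3"}

# Placeholder per-replicate hit lists (one list per biological replicate).
replicates = {
    "replicate_1": ["kin-2", "acc-1", "myo-3", "lgc-46"],
    "replicate_2": ["acc-3", "glna-3", "kin-2", "myo-2", "acc-1"],
}

for name, genes in replicates.items():
    fraction = neuronal_fraction(genes, neuron_reference)
    print(f"{name}: {fraction:.0%} of identified proteins map to neuron-detected genes")
```

      Reporting the fraction separately for each replicate, rather than pooling all hits, mirrors the per-replicate range quoted above.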

      In addition, we performed neuron-specific analyses using both the WormBase gene enrichment tool and the CeNGEN single-cell transcriptomic database, which we describe in detail in our response to Reviewer 1 Major Comment 2. To summarise, these analyses revealed enrichment of several neuron classes, including those previously implicated in associative learning (e.g., ASEL, AIB, RIS, AVK) as well as neurons not previously studied in this context (e.g., IL1, DA9, DVC) (summarised in Table S7). By examining expression overlap across neuron types, we identified shared and distinct profiles that suggest potential functional connectivity and candidate circuits underlying behavioural plasticity (Figure 4). Taken together, these data show that the proteins identified in our dataset are (1) neuronal and (2) expressed in neurons that are known to be required for learning. Methods are detailed on pages 50-51.

      Other than the above, the authors have provided sufficient details in their experimental and analysis procedures. They have performed appropriate controls, and their data have sufficient biological and technical replicates for statistical analysis.

      We appreciate this positive feedback and thank the Reviewer for acknowledging the clarity of our experimental and analysis procedures.

      Minor Comments:

      There is an error in the first paragraph of the discussion, in the sentences discussing the learning effects in gar-1 mutant worms. The sentences in lines 12-16 on page 22 says that gar-1 mutants have improved salt-associative learning and defective salt-aversive learning, while in fact the data and figures state the opposite.

      We appreciate the Reviewer noting this discrepancy. As clarified in our response to Reviewer 1, Major Comment 1 above, we reanalysed the behavioural data to ensure consistency across genotypes by comparing only those tested within the same biological replicates (thus having the same N for all genotypes). Upon this reanalysis, we found that the previously reported phenotype for gar-1 mutants in salt-associative learning was not statistically different from wildtype controls. Therefore, we have removed references to GAR-1 from the manuscript.

      Reviewer #3 (Significance):

      Strengths and limitations:

      This study used neuron-specific TurboID expression with transient biotin exposure to capture a temporally restricted snapshot of the C. elegans nervous system proteome during salt-associative learning. This is an elegant method to identify proteins temporally specific to a certain condition. However, there are several limitations in the way the experiments and analyses were performed which affect the reliability of the data. As the authors themselves have noted in the discussion, background noise is a major issue and several steps could be taken to reduce the noise at the experimental or analysis steps (use of integrated C. elegans lines to ensure uniformity of samples, flow cytometry to isolate neurons, quantitative mass spec to detect fold change vs. strict presence/absence).

      Advance:

      Several studies have demonstrated the use of proximity labeling to map the interactome by using a bait protein fusion. In fact, expressing TurboID not fused to a bait protein is often used as a negative control in proximity labeling experiments. However, this study demonstrates the use of free TurboID molecules to acquire a global snapshot of the proteome under a given condition.

      Audience:

      Even with the significant limitations, this study is specifically of interest to researchers interested in understanding learning and memory formation. Broadly, the methods used in this study could be modified to gain insights into the proteomic profiles at other transient developmental stages. The reviewer's field of expertise: Cell biology of C. elegans neurons.

      We thank the reviewer for their thoughtful evaluation of our work. We appreciate the recognition of the novelty and potential of using neuron-specific TurboID to capture a temporally restricted snapshot of the C. elegans nervous system proteome during learning. We agree that this approach offers a unique opportunity to identify proteins associated with specific behavioural states in future studies.

      We also appreciate the reviewer’s comments regarding limitations in experimental and analytical design. In revising the manuscript, we have taken several steps to address these concerns and improve the clarity, rigour, and interpretability of our data. Specifically:

      - We now provide a frequency-based representation of proteomic hits (Table 2), which helps clarify how candidate proteins were selected and highlights differences between trained and control groups.

      - We have added neuron-specific enrichment analyses using both WormBase and CeNGEN databases (Table S7 & Figure 4), which help identify candidate neurons and potential circuits involved in learning (methods on pages 50-51).

      - We have clarified the rationale for using qualitative proteomics in the context of TurboID, in addition to acknowledging the challenges of integrating quantitative mass spectrometry with biotin-based enrichment (page 39). Additional methods for improving sample purity, such as using integrated lines or FACS-enrichment of neurons, could further refine this approach in future studies. For transparency, we did attempt to integrate the TurboID transgenic line to improve the strength and consistency of biotinylation signals. However, despite four rounds of backcrossing, this line exhibited unexpected phenotypes, including a failure to respond reliably to the established training protocol. As a result, we were unable to include it in the current study. Nonetheless, we believe our current approach provides a valuable proof-of-concept and lays the groundwork for future refinement.

      By addressing the major concerns of peer reviewers, we believe our study makes a significant and impactful contribution by demonstrating the feasibility of using TurboID to capture learning-induced proteomic changes in the nervous system. The identification of novel learning-related mutants, including those involved in acetylcholine signalling and cAMP pathways, provides new directions for future research into the molecular and circuit-level mechanisms of behavioural plasticity.

      Reviewer #4 (Evidence, reproducibility and clarity):

      Summary:

      In this manuscript, the authors used a learning paradigm in C. elegans: when worms are fed on a saltless plate, their chemotaxis to salt is greatly reduced. To identify learning-related proteins, the authors employed nervous system-specific proteome analysis to compare whole proteins in neurons between high-salt-fed animals and saltless-fed animals. The authors identified "learning-specific genes" which are observed only after saltless feeding. They categorized these proteins by GO analyses and pathway analyses, and went a step further to test mutants in selected genes identified by the proteome analysis. They find several mutants that are defective or hyper-proficient for learning, including acc-1/3 and lgc-46 acetylcholine receptors, gar-1 acetylcholine receptor GPCR, glna-3 glutaminase involved in glutamate biosynthesis, and kin-2, a cAMP pathway gene. These mutants were not previously reported to have abnormalities in the learning paradigm.

      Major comments:

      (1) There are problems in the data processing and presentation of the proteomics data in the current manuscript which deteriorates the utility of the data. First, as the authors discuss (page 24, lines 5-12), the current approach does not consider amount of the peptides. Authors state that their current approach is "conservative", because some of the proteins may be present in both control and learned samples but in different amounts. This reviewer has a concern in the opposite way: some of the identified proteins may be pseudo-positive artifacts caused by the analytical noise. The problem is that authors included peptides that are "present" in "TurboID, trained" sample but "absent" in the "Non-Tg, trained" and "TurboID, control" samples in any one of the biological replicates, to identify "learning proteome" (706 proteins, page 8, last line - page 9, line 8; page 32, line 21-22). The word "present" implies that they included even peptides whose amounts are just above the detection threshold, which is subject to random noise caused by the detector or during sample collection and preparation processes. This consideration is partly supported by the fact that only a small fraction of the proteins are common between biological replicates (honestly and respectably shown in Figure S2). Because of this problem, there is no statistical estimate of the identity in "learning proteome" in the current manuscript. Therefore, the presentation style in Tables S2 and S3 are not very useful for readers, especially because authors already subtracted proteins identified in Non-Tg samples, which must also suffer from stochastic noise. I suggest either quantifying the MS/MS signal, or if authors need to stick to the "present"/"absent" description of the MS/MS data, use the number of appearances in biological replicates of each protein as estimate of the quantity of each protein. For example, found in 2 replicates in "TurboID, learned" and in 0 replicates in "Non-Tg, trained". 
One can apply statistics to these counts. This said, I would like to stress that proteins related to acquisition of memory may be very rare, especially because learning-related changes likely occur in a small subset of neurons. Therefore, 1 time vs 0 time may be still important, as well as something like 5 times vs 1 time. In summary, quantitative description of the proteomics results is desired.

      We thank the reviewer for these valuable comments and suggestions.

      We acknowledge that quantitative proteomics would provide beneficial information; however, as also indicated by Reviewer 1 (in cross-comment), it is practically challenging to perform with TurboID. We have included discussion of potential future experiments involving quantitative mass spectrometry, as well as a comprehensive discussion of some of the limitations of our approach as summarised by this Reviewer, in the Discussion section (page 39). However, we note that our qualitative approach also provides beneficial knowledge, such as the identification of functional protein networks acting within biological pathways previously implicated in learning (Figure 2), and novel learning regulators ACC-1/3, LGC-46, and F46H5.3.

      We agree with the assessment that the frequency of occurrence for each candidate we test per biological replicate is useful to disclose in the manuscript as a proxy for quantification. This was also highlighted by Reviewer 2 (Major Comment 1). As detailed above in response to R2, we have now separated candidates into two categories: ‘strong’ (present in 3 or more biological replicates) and ‘weak’ candidates (present in 2 or fewer biological replicates). We have also added behavioural data after testing 9 of these strong candidates in Figures 6 & S7.

      We have also added Table 2 to the revised manuscript, which summarises the frequency-based representation of the proteomics results, as suggested. This is described on pages 22-23.

      Briefly, this shows the range of candidates further explored using single mutant testing. Specifically, this data showed that many of the tested candidates were more frequently detected in trained worms compared to high-salt controls. This includes both strong and weak candidates, providing a clearer view of how proteomic frequency informed our selection for functional testing.
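
      The frequency-based categorisation described above can be sketched in a few lines. This is a minimal illustration, assuming that the replicate count used for the 'strong'/'weak' label is the number of trained replicates in which a protein was detected; the gene labels and counts are hypothetical placeholders and do not reproduce Table 2.

```python
# Hypothetical sketch of the 'strong' vs 'weak' candidate labels.
# Assumption: the label is based on the number of TurboID-trained
# biological replicates in which a protein was detected.

def classify_candidate(n_replicates_detected, strong_threshold=3):
    """Label a candidate 'strong' (>= threshold replicates) or 'weak' (fewer)."""
    if n_replicates_detected < 0:
        raise ValueError("replicate count cannot be negative")
    return "strong" if n_replicates_detected >= strong_threshold else "weak"


# Placeholder counts: gene -> (replicates detected in trained, in high-salt control).
detection_counts = {
    "gene-A": (4, 1),
    "gene-B": (2, 0),
    "gene-C": (3, 3),
}

for gene, (trained, control) in detection_counts.items():
    label = classify_candidate(trained)
    more_frequent = trained > control  # detected more often in trained than control
    print(f"{gene}: {label} candidate; more frequent in trained = {more_frequent}")
```

      Keeping the trained-versus-control comparison separate from the 'strong'/'weak' label makes it explicit that a candidate can be frequently detected without being enriched in trained animals.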

      (2) There is another problem in the treatment of the behavioural data. In Experimental Procedures, authors state that they excluded data in which naive or control groups showed average CI < 0.6499, and/or trained groups showed average CI < -0.0499 or > 0.5499 for N2 (page 36, lines 5-7). How were these values determined? One common example for judging a data point as an outlier is > mean + 1.5, 2 or 3 SD, or < mean - 1.5, 2 or 3 SD. Are these values any of these standards, or determined through other methods? If these values were determined simply by authors' decision, it could potentially introduce a bias and in the worst cases lead to incorrect conclusions. A related question is, authors state "trained animals showed a lower CI (~0.3)" where in the referred Figure 1B, the corresponding data shows averages close to 0. Why is the inconsistency? The assay that authors use is close to those described in the previous literature (Kunitomo et al., http://dx.doi.org/10.1038/ncomms3210). In this previous paper, it was described that animals conditioned under no salt with food show negative CI and are attracted to the low salt concentration area. Quantitative analysis of behavioural patterns showed migration bias towards lower salt concentrations (negative chemotaxis). Essentially the same concept was reported by Luo et al. (http://dx.doi.org/10.1016/j.neuron.2014.05.010). The experimental procedure employed in the current work is very similar with those by the Japanese group, with a notable difference: the chemotaxis assay plate included 50mM NaCl in Kunitomo et al, while authors used chemotaxis plate without added NaCl (p35, line 18). The latter is expected to cause shallow gradient towards the low-salt area, which may be the reason for the weak negative CI in the trained animals. In any case, the value of CI itself is not a problem, and authors' current assay is valid. 
The only concern of mine is the potential of author-introduced cognitive bias, possibly affecting, for example, whether a certain mutant has a significant defect or not. What happens if the cut-offs of -0.0499 and 0.5499 are omitted and all data were included in the analyses? What are the average CIs of N2 in all performed experiments for each of naive, control and trained groups?

      Thank you for pointing this out. As mentioned by both Reviewer 1 and Reviewer 4, the original manuscript states the following: “Data was excluded for salt associative learning experiments when wild-type N2 displayed (1) an average CI ≤ 0.6499 for naïve or control groups and/or (2) an average CI either < -0.0499 or >0.5499 for trained groups.”

      To clarify, we only excluded experiments in rare cases where N2 worms did not display robust high salt attraction before training, or where trained N2 did not display the expected behavioural difference compared to untrained or high-salt control N2. These anomalies were typically attributable to clear contamination or starvation issues that could clearly be observed prior to counting chemotaxis indices on CTX plates.

      We established these exclusion criteria in advance of conducting multiple learning assays to ensure an objective threshold for identifying and excluding assays affected by these rare but observable issues. However, these criteria were later found to be unnecessary, as N2 worms robustly displayed the expected untrained and trained phenotypes for salt associative learning when not compromised by starvation or contamination.

      We understand that the original criteria may have appeared to introduce arbitrary bias in data selection. To address this concern, we have removed these criteria from page 50 of the revised manuscript.
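
      For reference, the fixed cut-offs quoted above (since removed from the manuscript) and the SD-based outlier rule mentioned by the Reviewer can be compared in a short sketch. The function names and the choice of k are illustrative assumptions; this is not the code used in the study.

```python
from statistics import mean, stdev

# Sketch of the two exclusion approaches discussed above.
# Threshold values are those quoted from the original manuscript text;
# everything else (names, k, data layout) is a hypothetical illustration.

def excluded_by_fixed_cutoffs(group, avg_ci):
    """Apply the originally stated N2 exclusion rule to one experiment."""
    if group in ("naive", "control"):
        return avg_ci <= 0.6499
    if group == "trained":
        return avg_ci < -0.0499 or avg_ci > 0.5499
    raise ValueError(f"unknown group: {group}")


def outliers_by_sd(values, k=2.0):
    """Return values lying more than k sample standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) > k * spread]
```

      Running both rules on the full set of wild-type chemotaxis indices would show directly whether the fixed cut-offs removed any experiments that a data-driven rule would have retained, which is the sensitivity check the Reviewer asked about.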

      Minor comments:

      (1) Related to Major comment 1, the success of the neuron-specific TurboID procedure was not evaluated. Authors obtained both TurboID and Non-Tg proteome data. Do they see enrichment of neuron-specific proteins? This can be easily tested, for example by using the list of neuron-specific genes by Kaletsky et al. (http://dx.doi.org/10.1038/nature16483 or http://dx.doi.org/10.1371/journal.pgen.1007559), or referring to the CeNGEN data.

      We thank the Reviewer for this helpful suggestion, which was echoed by Reviewer 3 (Major Comment 1). As indicated in the response to R3 above, the revised manuscript now includes Table 1 as a tissue-specific analysis of the learning proteome, using the single neuron RNA-Seq database CeNGEN to identify the proportion of neuronal proteins from each biological replicate of mass spectrometry data. Generally, we observed that 87-95% of proteins corresponded to genes from the CeNGEN database that had been detected in neurons, providing evidence that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 is now described in the main text of the revised work on pages 16 & 17.

      (2) The behavioural paradigm needs to be described accurately. Page 5, line 16-17, "C. elegans normally have a mild attraction towards higher salt concentration": in fact, C. elegans raised on NGM plates, which include approximately 50mM of NaCl, is attracted to around 50mM of NaCl (Kunitomo et al., Luo et al.) but not 100-200 mM.

      We thank the Reviewer for pointing this out. We agree that clarification is necessary. The revised text reads as follows on page 5: “C. elegans are typically grown in the presence of salt (usually ~ 50 mM) and display an attraction toward this concentration when assayed for chemotaxis behaviour on a salt gradient (Kunitomo et al., 2013, Luo et al., 2014).

      Training/conditioning with ‘no salt + food’ partially attenuates this attraction (group referred to ‘trained’).”

      Authors call this assay "salt associative learning", which refers to the fact that worms associate salt concentration (CS) and either presence or absence of food (appetitive or aversive US) during conditioning (Kunitomo et al., Luo et al., Nagashima et al.) but they are looking at only association with presence of food, and for proteome analysis they only change the CS (NaCl concentration, as discussed in Discussion, p24, lines 4-5). It is better to attempt to avoid confusion to the readers in general.

      Thank you Reviewer 4 for highlighting this clarity issue. We clarify our definition of “salt associative learning” for the purpose of this study in the revised manuscript on page 6 with the following text:

      “Similar behavioural paradigms involving pairings between salt/no salt and food/no food have been previously described in the literature (Nagashima et al. 2019). Here, learning experiments were performed by conditioning worms with either ‘no salt + food’ (referred to as ‘salt associative learning’) or ‘salt + no food’ (called ‘salt aversive learning’).”

      (3) page 32, line 23: the wording "excluding" is obscure and misleading because the elo-6 gene was included in the analysis.

      We thank the Reviewer for pointing out this misleading wording, which was unintentional. We have now removed it from the text (on page 21).

      (4) Typo at page 24, line 18: "that ACC-1" -> "than ACC-1".

      This has been corrected (on page 37).

      (5) Reference. In "LEO, T. H. T. et al.", given names and surnames are flipped for all authors. Also, the paper has been formally published (http://dx.doi.org/10.1016/j.cub.2023.07.041).

      We appreciate the Reviewer drawing our attention to this – the reference has been corrected and updated.

      I would like to express my modest cross comments on the reviews:

      (1) Many of the reviewers comment on the shortage in the quantitative nature of the proteome analysis, so it seems to be a consensus.

      Thank you Reviewer 4 for this feedback. We appreciate the benefit in performing quantitative mass spectrometry, in that it provides an additional way to parse molecular mechanisms in a biological process (e.g., fold-changes in protein expression induced by learning). However, we note that quantitative mass spectrometry is challenging to integrate with TurboID due to the requirement to enrich for biotinylated peptides during sample processing (we now mention this on page 39). Nevertheless, it would be exciting to see this approach performed in a future study.

      To address the limitations of our original qualitative approach and enhance the clarity and utility of our dataset, we have made the following revisions in the manuscript:

      (1) Candidate selection criteria: We now clearly define how candidates were selected for functional testing, based on their frequency across biological replicates. Specifically, “strong candidates” were detected in three or more replicates, while “weak candidates” appeared in two or fewer.

      (2) Frequency-based representation (Table 2): We appreciate the suggestion by Reviewer 4 (Major Comment 1) to quantify differences between high-salt control and trained groups. We now provide the frequency-based representation of the candidates tested in this study within our proteomics data in Table 2. These data showed that many of the tested candidates were more frequently detected in trained worms compared to high-salt controls. This includes both strong and weak candidates.

      We hope these additions help clarify our approach and demonstrate the value of the dataset, even within the constraints of qualitative proteomics.
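
      One way to attach a statistic to such replicate counts, along the lines Reviewer 4 suggested, is a one-sided Fisher's exact test on detection counts. The sketch below is illustrative only: the replicate number (5 per group) and the example counts are assumptions rather than values from the study, and this test was not necessarily applied in the manuscript.

```python
from math import comb

# Hypothetical sketch: one-sided Fisher's exact test on detection counts.
# Assumption for the examples: 5 biological replicates per group.

def fisher_one_sided(trained_hits, trained_n, control_hits, control_n):
    """P-value for observing at least `trained_hits` detections in the trained
    group, with the total number of detections across both groups held fixed."""
    total_hits = trained_hits + control_hits
    total_n = trained_n + control_n
    denominator = comb(total_n, total_hits)
    upper = min(trained_n, total_hits)
    numerator = 0
    for k in range(trained_hits, upper + 1):
        # k detections fall in trained replicates, the remainder in controls;
        # comb() returns 0 when the remainder exceeds control_n.
        numerator += comb(trained_n, k) * comb(control_n, total_hits - k)
    return numerator / denominator


# Illustrative comparisons with 5 replicates per group (assumed).
for trained_hits, control_hits in [(1, 0), (2, 0), (5, 0)]:
    p = fisher_one_sided(trained_hits, 5, control_hits, 5)
    print(f"trained {trained_hits}/5 vs control {control_hits}/5: p = {p:.4f}")
```

      With so few replicates, only the most extreme counts reach conventional significance, which is consistent with the Reviewer's caution that a "1 time vs 0 time" difference may still be biologically meaningful even when it is not statistically supported.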

      (2) Also, the tissue- or cell-specificity of the identified proteins was commonly discussed. In reviewer #3's first Major comment, the appearance of non-neuronal proteins in the list was pointed out, which corroborates my (reviewer #4's) question on the successful identification of neuronal proteins by this method. On the other hand, reviewer #1 pointed out neuron subset-specific proteins in the list. Obviously, these issues need to be systematically described by the authors.

      We agree with Reviewer 4 that these analyses provide a critical angle of analysis that is not explored in the original manuscript.

      Tissue analysis (Reviewer 3 Major Comment 1): We used the single neuron RNA-Seq database CeNGEN to show that 87-95% (i.e. a large majority) of proteins identified across replicates corresponded to genes detected in neurons. These findings support that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 provides this information and is now described in the main text of the revised work on page 16.

      Neuron class analyses (Reviewer 1 Major Comment 2): In response, we have used the suggested WormBase gene enrichment tool and CeNGEN. We specifically input proteins from the learning proteome into WormBase, after filtering for proteins unique to TurboID trained animals. For CeNGEN, we compared genes/proteins from control worms and trained worms to identify potential neurons that may be involved in this learning paradigm.

      Briefly, these analyses highlight a range of neuron classes known to function in learning (e.g., RIS interneurons), cells that affect behaviour but have not been explored in learning (e.g., IL1 polymodal neurons), and neurons whose functions are unknown (e.g., pharyngeal neuron I3). Corresponding text for this new analysis has been added on pages 16-20, with a new table and figure added to illustrate these findings (Table S7 & Figure 4). Methods are detailed on pages 50-51.

      (3) Given reviewer #1's OPTIONAL Major comment, as an expert of behavioral assays in C. elegans, I would like to comment based on my experience that mutants received from Caenorhabditis Genetics Center or other labs often lose the phenotype after outcrossing by the wild type, indicating that a side mutation was responsible for the observed behavioral phenotype. Therefore, outcrossing may be helpful and easier than rescue experiments, though the latter are of course more accurate.

      Thank you for this suggestion. To address the potential involvement of background mutations, we have done experiments with backcrossed versions of mutants tested where possible, as shown in Figure 6. We found that F46H5.3(-) mutants maintained enhanced learning capacity after backcrossing with wild type, compared to their non-backcrossed mutant line. This was in contrast to C30G12.6(-) animals which lost their enhanced learning phenotype following backcrossing using wild type worms. This is described in the text on pages 24-26.

      (4) Just let me clarify the first Minor comment by reviewer #2. Authors described that the kin-2 mutant has abnormality in "salt associative learning" and "salt aversive learning", according to authors' terminology. In this comment by reviewer #2, "gustatory associative learning" probably refers to both of these assays.

      Reviewer 4 is correct. We have amended the wording appropriately on page 31 to clarify our meaning to address Reviewer 2’s comment.

      “Although kin-2(ce179) mutants were not shown to impact salt aversive learning, they have been reported previously to display impaired intermediate-term memory (but intact learning and short-term memory) for butanone appetitive learning (Stein and Murphy, 2014).”

      (5) There seem to be several typos in reviewer #1's Minor comments.

      "In Page 9, Lines 17-18" -> "Page 8, Lines 17-18".

      "Page 8, Line 24" -> "Page 7, Line 24".

      "I would suggest to remove figure 3" -> "I would suggest to remove figure 2"

      "summary figure similar to Figure 4" -> "summary figure similar to Figure 3"

      "In the discussion Page 24, Line 14" -> "In the discussion Page 23, Line 14"

      (I note that because a top page was inserted in the "merged" file but not in art file for review, there is a shift between authors' page numbers and pdf page numbers in the former.) It would be nice if reviewer #1 can confirm on these because I might be wrong.

      We appreciate Reviewer 4 noting this, and can confirm that these are the correct references (as indicated by Reviewer 1 in their cross-comments).

      Reviewer #4 (Significance):

      (1) Total neural proteome analysis has not been conducted before for learning-induced changes, though transcriptome analysis has been performed for odor learning (Lakhina et al., http://dx.doi.org/10.1016/j.neuron.2014.12.029). This guarantees the novelty of this manuscript, because for some genes, protein levels may change even though mRNA levels remain the same. We note an example in which a proteome analysis utilizing TurboID, though not the comparison between trained/control, has led to finding of learning related proteins (Hiroki et al., http://dx.doi.org/10.1038/s41467-022-30279-7). As described in the Major comments 1) in the previous section, improvement of data presentation will be necessary to substantiate this novelty.

      We appreciate this thoughtful feedback. We agree that while the neuronal transcriptome has been explored in Lakhina et al., 2015 for C. elegans in the context of memory, our study represents the first to examine learning-induced changes in the total neuronal proteome. We particularly agree with the statement that “for some genes, protein levels may change even though mRNA levels remain the same”. This is essential rationale that we now discuss on page 42.

      Additionally, we acknowledge the relevance of the study by Hiroki et al., 2022, which used TurboID to identify learning-related proteins, though not in a trained versus control comparison. Our work builds on this by directly comparing trained and control conditions, thereby offering new insights into the proteomic landscape of learning. This is now clarified on page 36.

      To substantiate the novelty and significance of our approach, we have revised the data presentation throughout the manuscript, including clearer candidate selection criteria, frequency-based representation of proteomic hits (Table 2), and neuron-specific enrichment analyses (Table S7 & Figure 4). We hope these improvements help convey the unique contribution of our study to the field.

      (2) The authors found six mutants with abnormalities in salt learning (Fig. 4). These genes have not previously been described as having this abnormality, providing novel knowledge to readers, especially those who work on C. elegans behavioural plasticity. In particular, the involvement of acetylcholine neurotransmission has not been addressed before. Although the site of action (neurons involved) has not been tested in this manuscript, it will open an avenue to further determine the way in which acetylcholine receptors, the cAMP pathway, etc. influence the learning process.

      Thank you Reviewer 4, for this encouraging feedback. To further strengthen the study and expand its relevance, we have tested additional mutants in response to Reviewer 3’s comments, as shown in Figures 6 & S7. These results provide even more candidate genes and pathways for future exploration, enhancing the significance and impact of our study.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3 (Public review):

      The central issue for evaluating the overfilling hypothesis is the identity of the mechanism that causes the very potent (>80% when inter pulse is 20 ms), but very quickly reverting (< 50 ms) paired pulse depression (Fig 1G, I). To summarize: the logic for overfilling at local cortical L2/3 synapses depends critically on the premise that probability of release (pv) for docked and fully primed vesicles is already close to 100%. If so, the reasoning goes, the only way to account for the potent short-term enhancement seen when stimulation is extended beyond 2 pulses would be by concluding that the readily releasable pool overfills. However, the conclusion that pv is close to 100% depends on the premise that the quickly reverting depression is caused by exocytosis dependent depletion of release sites, and the evidence for this is not strong in my opinion. Caution is especially reasonable given that similarly quickly reverting depression at Schaffer collateral synapses, which are morphologically similar, was previously shown to NOT depend on exocytosis (Dobrunz and Stevens 1997). Note that the authors of the 1997 study speculated that Ca2+-channel inactivation might be the cause, but did not rule out a wide variety of other types of mechanisms that have been discovered since, including the transient vesicle undocking/re-docking (and subsequent re-priming) reported by Kusick et al (2020), which seems to have the correct timing.

      Thank you for raising an alternative possibility besides Ca<sup>2+</sup> channel inactivation. Kusick et al. (2020) showed that the transient destabilization of the docked vesicle pool recovers within 14 ms after stimulation. This rapid recovery implies that post-stimulation undocking events might be largely resolved before the 20 ms inter-stimulus interval (ISI) used in our paired-pulse ratio (PPR) experiments, arguing against the possibility that post-AP undocking/re-docking events significantly influence PPR measured at 20 ms ISI. Furthermore, Vevea et al. (2021) showed that post-stimulus undocking is facilitated in synaptotagmin-7 (Syt7) knockout synapses. In our study, Syt7 knockdown did not affect PPR at 20 ms ISI, suggesting that the undocking process described in Kusick et al. may not be a major contributor to the paired-pulse depression (PPD) observed at the 20 ms interval. Taken together, the undocking/re-docking dynamics reported by Kusick et al. appear too rapid to affect PPR at 20 ms ISI, and our Syt7 knockdown data further argue against a significant role of this process; it is therefore unlikely that transient vesicle undocking primarily underlies the strong PPD at 20 ms ISI in our experiments.

      In an earlier round of review, I suggested raising extracellular Ca<sup>2+</sup>, to see if this would increase synaptic strength. This is a strong test of the authors' model because there is essentially no room for an increase in synaptic strength. The authors have now done experiments along these lines, but the result is not clear cut. On one hand, the new results suggest an increase in synaptic strength that is not compatible with the authors' model; technically the increase does not reach statistical significance, but, likely, this is only because the data set is small and the variation between experiments is large. Moreover, a more granular analysis of the individual experiments seems to raise more serious problems, even supporting the depletion-independent counter hypothesis to some extent. On the other hand, the increase in synaptic strength that is seen in the newly added experiments does seem to be less at local L2/3 cortical synapses compared to other types of synapses, measured by other groups, which goes in the general direction of supporting the critical premise that pv is unusually high at L2/3 cortical synapses. Overall, I am left wishing that the new data set were larger, and that reversal experiments had been included as explained in the specific points below.

      Specific Points:

      (1) One of the standard methods for distinguishing between depletion-dependent and depletion-independent depression mechanisms is by analyzing failures during paired pulses of minimal stimulation. The current study includes experiments along these lines showing that pv would have to be extremely close to 1 when Ca<sup>2+</sup> is 1.25 mM to preserve the authors' model (Section "High double failure rate ..."). Lower values for pv are not compatible with their model because the k<sub>1</sub> parameter already had to be pushed a bit beyond boundaries established by other types of experiments.

      It should be noted that we did not arbitrarily push the k<sub>1</sub> parameter beyond these boundaries, but estimated the range of k<sub>1</sub> from the fast time constant for recovery from paired-pulse depression, as shown in Fig. 3-S2-Ab.

      The authors now report a mean increase in synaptic strength of 23% after raising Ca to 2.5 mM. The mean increase is not quite statistically significant, but this is likely because of the small sample size. I extracted a 95% confidence interval of [-4%, +60%] from their numbers, with a 92% probability that the mean value of the increase in the full population is > 5%. I used the 5% value as the greatest increase that the model could bear because 5% implies pv < 0.9 using the equation from Dodge and Rahamimoff referenced in the rebuttal. My conclusion from this is that the mean result, rather than supporting the model, actually undermines it to some extent. It would have likely taken 1 or 2 more experiments to get above the 95% confidence threshold for statistical significance, but this is ultimately an arbitrary cut off.

      Our key claim in Fig. 3-S3 is not the statistical non-significance of the EPSC changes, but the small magnitude of the change (1.23-fold). This increase is far smaller than the 3.24-fold increase predicted by the fourth-power relationship (D&R equation; Dodge & Rahamimoff, 1967), which would hold only if the fusion probability of docked vesicles (p<sub>v</sub>) is not saturated. We do not believe that adding new experiments would raise the magnitude of the EPSC change to the level that the Dodge & Rahamimoff equation predicts, even if a larger n yielded statistical significance. In other words, even a small but statistically significant EPSC change would still contradict what we expect from low-p<sub>v</sub> synapses. Our main point is the extent of the EPSC increase induced by high external [Ca<sup>2+</sup>], not a p-value. For this reason, we find it hard to accept the Reviewer’s request for a larger sample size aimed at a lower p-value.

      Although we agree with the Reviewer’s assertion that our data may indicate a 92% probability that high Ca<sup>2+</sup> increases EPSCs by more than 5%, we do not agree with the Reviewer’s interpretation that the EPSC increase necessarily implies an increase in p<sub>v</sub>. We are sorry that we could not clearly follow the Reviewer’s inference that a 5% increase in EPSCs implies p<sub>v</sub> < 0.9. Please note that release probability (p<sub>r</sub>) is the product of p<sub>v</sub> and the occupancy of docked vesicles in an active zone (p<sub>occ</sub>). We presume that this inference rests on the premise that p<sub>occ</sub> is constant irrespective of external [Ca<sup>2+</sup>]. Contrary to this premise, Figure 2c in Kusick et al. (2020) showed that the number of docked SVs increased by ca. 20% upon increasing external [Ca<sup>2+</sup>] to 2 mM. Moreover, Figure 7F in Lin et al. (2025) demonstrated that the number of TS vesicles, equivalent to p<sub>occ</sub>, increased by 23% at high external [Ca<sup>2+</sup>]. These increases in p<sub>occ</sub> are similar in magnitude to the high external Ca<sup>2+</sup>-induced increase in EPSC that we observed (1.23-fold). Of course, it is possible that increases in both p<sub>occ</sub> and p<sub>v</sub> contributed to the high [Ca<sup>2+</sup>]<sub>o</sub>-induced increase in EPSC. The low PPR and the failure rate analysis, however, suggest that p<sub>v</sub> is already saturated under baseline conditions of 1.3 mM [Ca<sup>2+</sup>]<sub>o</sub>, and thus it is more likely that an increase in p<sub>occ</sub> is primarily responsible for the 1.23-fold increase. Moreover, the 1.23-fold increase does not match the prediction of the D&R equation, which would be valid at synapses with low p<sub>v</sub>. Therefore, interpreting our observation (1.23-fold increase) as a slight increase in p<sub>occ</sub> is consistent with recent papers (Kusick et al., 2020; Lin et al., 2025) as well as with our other results supporting the baseline saturation of p<sub>v</sub>, as shown in Figure 2 and the associated supplement figures (Fig. 2-S1 and Fig. 2-S2).
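
The decomposition of release probability into p<sub>v</sub> and p<sub>occ</sub> used in this response can be illustrated numerically. The following is a minimal sketch, not the authors' analysis: the function names and the baseline occupancy of 0.5 are assumptions chosen only for illustration, and the EPSC is assumed to scale linearly with release probability. Only the 1.23-fold change in occupancy is taken from the text.

```python
def release_probability(p_v: float, p_occ: float) -> float:
    """Release probability as the product of fusion probability and site occupancy."""
    return p_v * p_occ


def epsc_fold_change(p_v_base, p_occ_base, p_v_high, p_occ_high):
    """Fold change in EPSC, assuming EPSC scales linearly with release probability."""
    return release_probability(p_v_high, p_occ_high) / release_probability(p_v_base, p_occ_base)


# Hypothetical baseline occupancy; only the 1.23-fold change comes from the text.
P_OCC_BASE = 0.5
P_OCC_HIGH = P_OCC_BASE * 1.23

# Saturated fusion probability (p_v = 1 in both conditions):
# the EPSC change then equals the change in occupancy alone.
saturated_fold = epsc_fold_change(1.0, P_OCC_BASE, 1.0, P_OCC_HIGH)
print(f"fold change with saturated p_v: {saturated_fold:.2f}")
```

Under these assumptions the result does not depend on the baseline occupancy chosen, because with p<sub>v</sub> fixed at 1 the baseline value cancels in the ratio.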

      (2) The variation between experiments seems to be even more problematic, at least as currently reported. The plot in Figure 3-figure supplement 3 (left) suggests that the variation reflects true variation between synapses, not measurement error.

      Note that previous studies also reported substantial variance in the number of docked or TS vesicles at baseline and in its fold change under high external Ca<sup>2+</sup> (Lin et al., 2025; Kusick et al., 2020). Our study did not focus on this heterogeneity but on the mean dynamics of short-term plasticity at L2/3 recurrent synapses. With that caveat, the short-term plasticity of these synapses is best explained by assuming that the vesicular fusion probability (p<sub>v</sub>) is close to unity and that release probability is regulated by p<sub>occ</sub>. In other words, even though p<sub>v</sub> is close to unity, synaptic strength can increase at high external [Ca<sup>2+</sup>] if the baseline occupancy of release sites (p<sub>occ</sub>) is low and p<sub>occ</sub> is increased by high [Ca<sup>2+</sup>]. Lin et al. (2025) showed that high external [Ca<sup>2+</sup>] increases the number of TS vesicles (equivalent to p<sub>occ</sub>) by 23% at the calyx synapses. Unlike at our synapses, the baseline p<sub>v</sub> (denoted as p<sub>fusion</sub> in Lin et al., 2025) of the calyx synapse is not saturated (= 0.22) at 1.5 mM external [Ca<sup>2+</sup>], and thus the calyx synapses displayed a 2.36-fold increase in EPSC at 2 mM external [Ca<sup>2+</sup>], to which increases in both p<sub>occ</sub> and p<sub>v</sub> (from 0.22 to 0.42) contributed. Therefore, the small increase in EPSC (= 23%) supports the view that p<sub>v</sub> is already saturated at L2/3 recurrent synapses.
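
As a rough consistency check on the calyx numbers quoted above, the two reported changes can be multiplied together. This sketch assumes that the EPSC scales with the product of p<sub>v</sub> and p<sub>occ</sub> and that the two changes are independent; the function name is hypothetical, and the inputs are the values cited from Lin et al. (2025) in the text.

```python
def combined_fold_change(p_occ_fold: float, p_v_base: float, p_v_high: float) -> float:
    """Fold change in p_r = p_v * p_occ when both factors change."""
    return p_occ_fold * (p_v_high / p_v_base)


# Values quoted in the text for the calyx synapse (Lin et al., 2025):
# occupancy rises 1.23-fold and fusion probability rises from 0.22 to 0.42.
calyx_fold = combined_fold_change(p_occ_fold=1.23, p_v_base=0.22, p_v_high=0.42)
print(f"predicted calyx fold change: {calyx_fold:.2f}")
```

The product comes to roughly 2.35, close to the 2.36-fold EPSC increase cited for the calyx synapse.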

      And yet, synaptic strength increased almost 2-fold in 2 of the 8 experiments, which back extrapolates to pv < 0.2.

      We are sorry that we could not understand the first comment in this paragraph. Could you explain in detail why a two-fold increase implies p<sub>v</sub> < 0.2?

      If all of the depression is caused by depletion as assumed, these individuals would exhibit paired pulse facilitation, not depression. And yet, from what I can tell, the individuals depressed, possibly as much as the synapses with low sensitivity to Ca<sup>2+</sup>, arguing against the critical premise that depression equals depletion, and even arguing - to some extent - for the counter hypothesis that a component of the depression is caused by a mechanism that is independent of depletion.

      Regarding the first statement in this paragraph, we assume that ‘the depression’ means paired-pulse depression (PPD). If so, we cannot understand why depletion-dependent PPD should lead to PPF. If the paired-pulse interval is too short for docked vesicles to be replenished, the vesicle depletion induced by the first pulse would result in PPD. We are very sorry that we could not follow the Reviewer’s subsequent inference, because we could not understand the first statement.

      I would strongly recommend adding an additional plot that documents the relationship between the amount of increase in synaptic strength after increasing extracellular Ca<sup>2+</sup> and the paired pulse ratio as this seems central.

      We found no clear correlation of EPSC<sub>1</sub> with PPR changes (ΔPPR) as shown in the figure below.

      Author response image 1.

      Plot of PPR changes as a function of EPSC<sub>1</sub>.

      (3) Decrease in PPR. The authors recognize that the decrease in the paired-pulse ratio after increasing Ca<sup>2+</sup> seems problematic for the overfilling hypothesis by stating: "Although a reduction in PPR is often interpreted as an increase in pv, under conditions where pv is already high, it more likely reflects a slight increase in p<sub>occ</sub> or in the number of TS vesicles, consistent with the previous estimates (Lin et al., 2025)."

      We admit that there is a logical jump in our statement you mentioned here. We appreciate your comment. We re-wrote that part in the revised manuscript (line 285) as follows:

      “Recent morphological and functional studies revealed that elevation of [Ca<sup>2+</sup>]<sub>o</sub> induces an increase in the number of TS or docked vesicles to a similar extent as our observation (Kusick et al., 2020; Lin et al., 2025), raising a possibility that an increase in p<sub>occ</sub> is responsible for the 1.23-fold increase in EPSC at high [Ca<sup>2+</sup>]<sub>o</sub>. A slight but significant reduction in PPR was observed under high [Ca<sup>2+</sup>]<sub>o</sub> too. An increase in p<sub>occ</sub> is thought to be associated with that in the baseline vesicle refilling rate. While PPR is always reduced by an increase in p<sub>v</sub>, the effects of refilling rate to PPR is complicated. For example, PPR can be reduced by both a decrease (Figure 2—figure supplement 1) and an increase (Lin et al., 2025) in the refilling rate induced by EGTA-AM and PDBu, respectively. Thus, the slight reduction in PPR is not contradictory to the possible contribution of p<sub>occ</sub> to the high [Ca<sup>2+</sup>]<sub>o</sub> effects.”

      I looked quickly, but did not immediately find an explanation in Lin et al 2025 involving an increase in pocc or number of TS vesicles, much less a reason to prefer this over the standard explanation that reduced PPR indicates an increase in pv.

      Fig. 7F of Lin et al. (2025) shows a 1.23-fold increase in the number of TS vesicles at high external [Ca<sup>2+</sup>]. Fig. 7E of the same paper shows a roughly two-fold increase in p<sub>fusion</sub> (equivalent to p<sub>v</sub> in our study) at high external [Ca<sup>2+</sup>] (from 0.22 to 0.42). Because p<sub>occ</sub> is the occupancy of TS vesicles in a limited number of slots in an active zone, the fold change in the number of TS vesicles should be similar to that of p<sub>occ</sub>.

      The authors should explain why the most straightforward interpretation is not the correct one in this particular case to avoid the appearance of cherry picking explanations to fit the hypothesis.

      The results of Lin et al. (2025) indicate that high external [Ca<sup>2+</sup>] induces a milder increase in p<sub>occ</sub> (1.23-fold) than in p<sub>v</sub> (about 1.9-fold, from 0.22 to 0.42) at the calyx synapses. Because the increase in p<sub>occ</sub> is much smaller than that in p<sub>v</sub>, and multiple lines of evidence in our study indicate that the baseline p<sub>v</sub> is already saturated, we raised the possibility that an increase in p<sub>occ</sub> primarily accounts for the unexpectedly small increase in EPSC at 2.5 mM [Ca<sup>2+</sup>]<sub>o</sub>. As mentioned above, our interpretation is also consistent with the EM study of Kusick et al. (2020). Nevertheless, the reduction of PPR at 2.5 mM Ca<sup>2+</sup> seems to support an increase in p<sub>v</sub>, arguing against this possibility. On the other hand, because p<sub>occ</sub> = k<sub>1</sub>/(k<sub>1</sub>+b<sub>1</sub>) under the simple vesicle refilling model (Fig. 3-S2Aa), a change in p<sub>occ</sub> should be associated with changes in k<sub>1</sub> and/or b<sub>1</sub>. While PPR is always reduced by an increase in p<sub>v</sub>, the effect of the refilling rate on PPR is more complicated. For example, although EGTA-AM would not increase p<sub>v</sub>, it reduced PPR, probably by reducing the refilling rate (Fig. 2-S1). Conversely, PDBu is thought to increase k<sub>1</sub> because it induces a two-fold increase in p<sub>occ</sub> (Fig. 7L of Lin et al., 2025). Such a marked increase in p<sub>occ</sub>, rather than in p<sub>v</sub>, seems to be responsible for the marked PDBu-induced reduction of PPR (Fig. 7I of Lin et al., 2025), because PDBu induced only a slight increase in p<sub>v</sub> (Fig. 7K of Lin et al., 2025). Therefore, the slight reduction of PPR is not contradictory to our interpretation that an increase in p<sub>occ</sub> might be responsible for the slight increase in EPSC induced by high [Ca<sup>2+</sup>]<sub>o</sub>.
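
The occupancy relation cited in this response, p<sub>occ</sub> = k<sub>1</sub>/(k<sub>1</sub>+b<sub>1</sub>), can be sketched as follows. This is an illustrative example only: the function name and the rate constants are hypothetical values in arbitrary units, not estimates from the manuscript or from Lin et al. (2025).

```python
def steady_state_occupancy(k1: float, b1: float) -> float:
    """Steady-state occupancy of a two-state release site: p_occ = k1 / (k1 + b1)."""
    if k1 < 0 or b1 < 0 or (k1 + b1) == 0:
        raise ValueError("rates must be non-negative and not both zero")
    return k1 / (k1 + b1)


# Hypothetical rate constants (arbitrary units), chosen only for illustration.
baseline = steady_state_occupancy(k1=1.0, b1=2.0)       # one third of sites occupied
faster_refill = steady_state_occupancy(k1=4.0, b1=2.0)  # two thirds of sites occupied

print(f"baseline p_occ: {baseline:.3f}")
print(f"p_occ with larger k1: {faster_refill:.3f}")
print(f"fold change in p_occ: {faster_refill / baseline:.2f}")
```

With these made-up rates, a four-fold increase in k<sub>1</sub> doubles the occupancy, which shows that p<sub>occ</sub> rises with k<sub>1</sub> but less than proportionally as the sites approach full occupancy.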

      (4) The authors concede in the rebuttal that mean pv must be < 0.7, but I couldn't find any mention of this within the manuscript itself, nor any explanation for how the new estimate could be compatible with the value of > 0.99 in the section about failures.

      We have never stated, in the rebuttal or elsewhere, that the mean p<sub>v</sub> must be < 0.7. On the contrary, both our manuscript and our previous rebuttals have consistently argued that the baseline p<sub>v</sub> is already saturated, based on our observations of low PPR, tight coupling, a high double failure rate, and the minimal effect of external Ca<sup>2+</sup> elevation.

      (5) Although not the main point, comparisons to synapses in other brain regions reported in other studies might not be accurate without directly matching experiments.

      Please understand that it is not trivial to establish optimal experimental settings for studying other synapses with the same methods employed in this study. We think that this should be done in a separate study. Furthermore, we have already shown in the manuscript that action potentials (APs) evoked by oChIEF activation occur in a physiologically natural manner, and that the STP induced by these oChIEF-evoked APs is indistinguishable from the STP elicited by APs evoked by dual-patch electrical stimulation. Therefore, we believe that our use of optogenetic stimulation did not introduce any artificial bias in measuring STP.

      As it is, 2 of 8 synapses got weaker instead of stronger, hinting at possible rundown, but this cannot be assessed because reversibility was not evaluated. In addition, comparing axons with and without channel rhodopsins might be problematic because the channel rhodopsins might widen action potentials.

      We continuously monitored series resistance and baseline EPSC amplitude throughout the experiments. The figure below shows the mean time course of EPSCs at the two [Ca<sup>2+</sup>]<sub>o</sub>. As it shows, we observed no tendency for run-down of EPSCs during the experiments; any recordings that did show run-down were discarded from analysis. In addition, please note that there is substantial variance in the number of docked vesicles at both baseline and high external Ca<sup>2+</sup> (Lin et al., 2025; Kusick et al., 2020), as well as in the short-term dynamics of EPSCs at our synapses.

      Author response image 2.

      Time course of normalized amplitudes of the first EPSCs during paired-pulse stimulation at 20 ms ISI in control and in the elevated external Ca<sup>2+</sup> (n = 8).

      (6) Perhaps authors could double check with Schotten et al about whether PDBu does/does not decrease the latency between osmotic shock and transmitter release. This might be an interesting discrepancy, but my understanding is that Schotten et al didn't acquire information about latency because of how the experiments were designed.

      Schotten et al. (2015) directly compared experimental and simulation data for hypertonicity-induced vesicle release. They showed that the latency shortens markedly as the tonicity increases (Fig. 2-S2), but this tonicity-dependent shortening was not reproduced by reducing the activation energy barrier for fusion (ΔEa) in their simulations (Fig. 2-S1). The authors therefore suggested that an unknown compensatory mechanism counteracting the osmotic perturbation might be responsible for the tonicity-dependent changes in latency. Importantly, their modeling demonstrated that reducing ΔEa, which would correspond to increasing p<sub>v</sub>, results in larger peak amplitudes and a shorter time-to-peak, but does not shorten the latency. Therefore, there is currently no direct support for the notion that PDBu or similar manipulations shorten latency via an increase in p<sub>v</sub>.

      (7) The authors state: "These data are difficult to reconcile with a model in which facilitation is mediated by Ca2+-dependent increases in pv." However, I believe that discarding the premise that depression is always caused by depletion would open up wide range of viable possibilities.

      We hope that the Reviewer understands why we concluded that the baseline p<sub>v</sub> is saturated at our synapses. First, strong paired-pulse depression (PPD) cannot be attributed to Ca<sup>2+</sup> channel inactivation, because Ca<sup>2+</sup> influx at the axon terminal remained constant during 40 Hz train stimulation (Fig. 2-S2). Moreover, even if Ca<sup>2+</sup> channel inactivation were responsible for the strong PPD, it could not explain the delayed facilitation that emerges on subsequent pulses (the third EPSC and later) in the 40 Hz train stimulation (Fig. 1-4), because Ca<sup>2+</sup> channel inactivation accumulates gradually during train stimulation, as directly shown by Wykes et al. (2007) in chromaffin cells. Second, the strong PPD and the very fast recovery from PPD indicate a very fast refilling rate constant (k<sub>1</sub>). Under this high k<sub>1</sub>, the failure rates were best explained by p<sub>v</sub> close to unity. Third, the EPSC increase induced by high external Ca<sup>2+</sup> was much smaller than that at other synapses, such as calyx synapses, at which p<sub>v</sub> is not saturated (Lin et al., 2025), and was instead similar to the increases in p<sub>occ</sub> estimated at calyx synapses or in the EM study (Kusick et al., 2020; Lin et al., 2025).

      Reference

      Wykes et al. (2007). Differential regulation of endogenous N-and P/Q-type Ca<sup>2+</sup> channel inactivation by Ca<sup>2+</sup>/calmodulin impacts on their ability to support exocytosis in chromaffin cells. Journal of Neuroscience, 27(19), 5236-5248.

      Reviewer #3 (Recommendations for the authors):

      I continue to think that measuring changes in synaptic strength when raising extracellular Ca<sup>2+</sup> is a good experiment for evaluating the overfilling hypothesis. Future experiments would be better if the authors would include reversibility criteria to rule out rundown, etc. Also, comparisons to other types of synapses would be stronger if the same experimenter did the experiments at both types of synapses.

      We observed no systematic tendency for run-down of EPSCs during these experiments (Author response image 2). Furthermore, the observed variability is well within the variance expected in the number of docked vesicles at both baseline and high external Ca<sup>2+</sup> (Lin et al., 2025; Kusick et al., 2020), and reflects biological variability rather than experimental artifact. Therefore, we believe that additional reversibility experiments are not warranted. However, we are open to further discussion if the Reviewer has specific methodological concerns not resolved by our present data.

      Regarding the second issue, as mentioned above, we think that studying other synapse types should be done in a separate study.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations for the authors):

      (1) The onus of making the revisions understandable to the reviewers lies with the authors. In its current form, how the authors have approached the review is hard to follow, in my opinion. Although the authors have taken a lot of effort in answering the questions posed by reviewers, parallel changes in the manuscript are not clearly mentioned. In many cases, the authors have acknowledged the criticism in response to the reviewer, but have not changed their narrative, particularly in the results section.

      We fully acknowledge your concern regarding the narrative linking EB-induced GluCl expression to JH biosynthesis and fecundity enhancement, particularly the need to address alternative interpretations of the data. Below, we outline the specific revisions made to address your feedback and ensure the manuscript’s narrative aligns more precisely with the experimental evidence:

      (1) Revised Wording in the Results Section

      To avoid overinterpretation of causality, we have modified the language in key sections of the Results (e.g., Figure 5 and related text):

      Original phrasing:

      “These results suggest that EB activates GluCl which induces JH biosynthesis and release, which in turn stimulates reproduction in BPH (Figure 5J).”

      Revised phrasing:

      “We also examined whether silencing Gluclα impacts the AstA/AstAR signaling pathway in female adults. Knock-down of Gluclα in female adults was found to have no impact on the expression of AT, AstA, AstB, AstCC, AstAR, and AstBR. However, the expression of AstCCC and AstCR was significantly upregulated in dsGluclα-injected insects (Figure 5-figure supplement 2A-H). Further studies are required to delineate the direct or indirect mechanisms underlying this effect of Gluclα-knockdown.” (line 643-649). And we have removed Figure 5J in the revised manuscript.

      (2) Expanded Discussion of Alternative Mechanisms

      In the Discussion section, we have incorporated a dedicated paragraph to explore alternative pathways and compensatory mechanisms:

      Key additions:

      “This EB action on GluClα expression is likely indirect, and we do not consider EB as transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.” (line 837-845).

      (2) In the response to reviewers, the authors have mentioned line numbers in the main text where changes were made. But very frequently, those lines do not refer to the changes or mention just a subsection of changes done. As an example please see point 1 of Specific Points below. The problem is throughout the document making it very difficult to follow the revision and contributing to the point mentioned above.

      Thank you for highlighting this critical oversight. We sincerely apologize for the inconsistency in referencing line numbers and incomplete descriptions of revisions, which undoubtedly hindered your ability to track changes effectively. We have eliminated all vague or incomplete line number references from the response letter. Instead, revisions are now explicitly tied to specific sections, figures, or paragraphs.

      (3) The authors need to infer the performed experiments rationally without over interpretation. Currently, many of the claims that the authors are making are unsubstantiated. As a result of the first review process, the authors have acknowledged the discrepancies, but they have failed to alter their interpretations accordingly.

      We fully agree that overinterpretation of data undermines scientific rigor. In response to your feedback, we have systematically revised the manuscript to align claims strictly with experimental evidence and to eliminate unsubstantiated assertions. We sincerely apologize for the earlier overinterpretations and appreciate your insistence on precision. The revised manuscript now rigorously distinguishes between observations (e.g., EB-GluCl-JH correlations) and hypotheses (e.g., GluCl’s mechanistic role). By tempering causal language and integrating competing explanations, we aimed to present a more accurate and defensible narrative.

      SPECIFIC POINTS (to each question initially raised and their rebuttals)

      (1a) "Actually, there are many studies showing that insects treated with insecticides can increase the expression of target genes". Please note that what is asked for is evidence that the ligand itself induces the expression of its receptor. Of course, insecticide treatment will result in changed expression of targets. Of all the evidence furnished in the rebuttal, only Peng et al. 2017 fits the above definition. Even in this case, the accepted mode of action of chlorantraniliprole is by inducing structural change in the ryanodine receptor. The observed induction of the ryanodine receptor by chlorantraniliprole can best be described as a secondary effect. None of the other references really addresses the point asked for.

      We appreciate the reviewer’s suggestions for improving the manuscript. First, we have added further studies to the manuscript in the following passage: “There are several studies showing that insects treated with insecticides display increases in the expression of target genes. For example, the relative expression level of the ryanodine receptor gene of the rice stem borer, Chilo suppressalis was increased 10-fold after treatment with chlorantraniliprole, an insecticide which targets the ryanodine receptor (Peng et al., 2017). In Drosophila, starvation (and low insulin) elevates the transcription level of the receptors of the neuropeptides short neuropeptide F and tachykinin (Ko et al., 2015; Root et al., 2011). In BPH, reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid (Zhang et al., 2015). Knockdown of the α8 gene by RNA interference decreased the sensitivity of N. lugens to imidacloprid (Zhang et al., 2015). Hence, the expression of receptor genes may be regulated by diverse factors, including insecticide exposure.” We have inserted this text in lines 846-857 to elaborate on these possibilities.

      Second, we would like to reiterate our position: we have merely described this phenomenon, specifically that EB treatment increases GluClα expression. “This EB action on GluClα expression is likely indirect, and we do not consider EB as transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.” We have inserted text in lines 837-845 to elaborate on these possibilities.

      Once again, we sincerely appreciate this discussion, which has provided us with a deeper understanding of this phenomenon.

      b. The authors in their rebuttal accept that they do not consider EB to be a transcriptional regulator of Gluclα and that the induction of Gluclα as a result of EB can best be considered a secondary effect. But that is not reflected in the manuscript, particularly in the result section. The current state of writing implies that EB up-regulation of Gluclα is an important event that contributes majorly to the hypothesis. So much so that they have retained the schematic diagram (Fig. 5J) where EB -> Gluclα is drawn. Even the heading of the subsection says "EB-enhanced fecundity in BPHs is dependent on its molecular target protein, the Gluclα channel". As mentioned in the general points, it is not enough to have a good rebuttal written to the reviewer; the parent manuscript needs to reflect the changes asked for.

      Thank you for your comments. We have carefully addressed your suggestions and made corresponding revisions to the manuscript.

      We fully acknowledge the reviewer's valid concern. In this revised manuscript, we now state: “However, we do not propose that EB is a direct transcriptional regulator of Gluclα, since EB and other avermectins are known to alter the channel conformation and thus their function (Wolstenholme, 2012; Wu et al., 2017). Thus, it is likely that the observed increase in Gluclα transcript is a secondary effect downstream of EB signaling.” (Lines 625-629). We agree that the original presentation in the manuscript, particularly within the Results section, did not adequately reflect this nuance and could be misinterpreted as suggesting a direct regulatory role for EB on Gluclα transcription.

      Regarding Fig. 5J, we have removed the figure and all mentions of Fig. 5J and its legend in the revised manuscript.

      c. "We have inserted text on lines 738 - 757 to explain these possibilities." Not a single line in the section mentioned above discussed the topic in hand. This is serious undermining of the review process or carelessness to the extreme level.

      In the Results section, we have now added descriptions “Taken together, these results reveal that EB exposure is associated with an increase in JH titer and that this elevated JH signaling contributes to enhanced fecundity in BPH.” (line 375-377).

      For the figures, we have removed Fig. 4N and all mentions of Fig. 4N and its legend in the revised manuscript.

      Lastly, regarding the issue of locating specific lines, we deeply regret any inconvenience caused. Due to the track changes mode used during revisions, line numbers may have shifted, resulting in incorrect references. We sincerely apologize for this and have now corrected the line numbers.

      (2) The section written in the rebuttal should be included in the Discussion as well, explaining why the authors think a nymphal treatment with JH may work in increasing fecundity of the adults. Also, the authors accept that EB’s effect on JH titer is indirect. The text of the manuscript, the Results section and the figures should reflect that. It is NOT ok to accept that EB impacts JH titer indirectly in a rebuttal letter while still continuing to portray a direct effect of EB on JH titer. In terms of diagrams, authors cannot put a -> sign unless the effect is direct. This is an accepted norm in biological publications.

      We appreciate the reviewer’s valuable suggestions here. We have now carefully revised the manuscript to address all concerns, particularly regarding the mechanism linking nymphal EB exposure to adult fecundity and the indirect nature of EB’s effect on JH titers. Below are our point-by-point responses and corresponding manuscript changes. Revised text is clearly marked in the resubmitted manuscript.

      (1) Clarifying the mechanism linking nymphal EB treatment to adult fecundity:

      Reviewer concern: Explain why nymphal EB treatment increases adult fecundity despite undetectable EB residues in adults.

      Response & Actions Taken:

      We agree this requires explicit discussion. We now propose that nymphal EB exposure triggers developmental reprogramming (e.g., metabolic/epigenetic changes) that persist into adulthood, indirectly enhancing JH synthesis and fecundity. This is supported by two key findings:

      (1) No detectable EB residues in adults after nymphal treatment (new Figure 1–figure supplement 1C).

      (2) Increased adult weight and nutrient reserves (Figure 1–figure supplement 3E,F), suggesting altered resource allocation.

      Added to Discussion (Lines 793–803): Notably, after exposing fourth-instar BPH nymphs to EB, no EB residues were detected in the subsequent adult stage. This finding indicates that the EB-induced increase in adult fecundity is initiated during the nymphal stage and manifests in adulthood, a mechanism distinct from the direct enhancement of fecundity observed when EB is applied to adults. We propose that sublethal EB exposure during critical nymphal stages may reprogram metabolic or endocrine pathways, potentially via insulin/JH crosstalk. For instance, increased nutrient storage (e.g., proteins, sugars; Figure 2–figure supplement 2) could enhance insulin signaling, which in turn promotes JH biosynthesis in adults (Ling and Raikhel, 2021; Mirth et al., 2014; Sheng et al., 2011). Future studies should test whether EB alters insulin-like peptide expression or signaling during development.

      (3) Emphasizing EB’s indirect effect on JH titers:

      Reviewer concern: The manuscript overstated EB’s direct effect on JH. Arrows in figures implied causality where only correlation exists.

      Response & Actions Taken:

      We fully agree. EB’s effect on JH is indirect and multifactorial (via AstA/AstAR suppression, GluCl modulation, and metabolic changes). We have:

      Removed oversimplified schematics (original Figures 3N, 4N, 5J).

      Revised all causal language (e.g., "EB increases JH" → "EB exposure is associated with increased circulating JH III") (Line 739).

      Clarified in Results/Discussion that EB-induced JH changes are likely secondary to neuroendocrine disruption.

      Key revisions:

      Results (Lines 375–377):

      "Taken together, these results reveal that EB exposure is associated with an increase in JH titer and that JH signaling contributes to enhanced fecundity in BPH."

      Discussion (Lines 837–845):

      This EB action on GluClα expression is likely indirect, and we do not consider EB as transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.

      a. Lines 281-285 as mentioned, does not carry the relevant information.

      Thank you for your careful review of our manuscript. We sincerely apologize for the confusion regarding line references in our previous response. Due to extensive revisions and tracked changes during the revision process, the line numbers shifted, resulting in incorrect citations for Lines 281–285. The correct location for the added results (EB-induced increase in mature eggs in adult ovaries) is now in lines 253-258: “We furthermore observed that EB treatment of female adults also increases the number of mature eggs in the ovary (Figure 2-figure supplement 1).”

      b. Lines 351-356 as mentioned, does not carry the relevant information. Lines 281-285 as mentioned, does not carry the relevant information.

      Thank you for your careful review of our manuscript. We sincerely apologize for the confusion regarding line references in our previous response. The correct location for the added results is now in lines 366-371: “We also investigated the effects of EB treatment on the JH titer of female adults. The data indicate that the JH titer was also significantly increased in the EB-treated female adults compared with controls (Figure 3-figure supplement 3A). However, again the steroid 20-hydroxyecdysone, was not significantly different between EB-treated BPH and controls (Figure 3-figure supplement 3B).”

      c. Lines 378-379 as mentioned, does not carry the relevant information. Lines 387-390 as mentioned, does not carry the relevant information.

      We sincerely apologize for the confusion regarding line references in our previous response.

      The correct location for the added results is now in lines 393-394: We furthermore found that EB treatment in female adults increases JHAMT expression (Figure 3-figure supplement 3C).

      The other correct location for the added results is now in lines 405-408: We found that Kr-h1 was significantly upregulated in the adults of EB-treated BPH at the 5M, 5L nymph and 4 to 5 DAE stages (4.7-fold to 27.2-fold) when 4th instar nymphs or female adults were treated with EB (Figure 3H and Figure 3-figure supplement 3D).

      (3) The writing quality is still extremely poor. It does not meet any publication standard, let alone elife.

      We fully understand your concerns and frustrations, and we sincerely apologize for the deficiencies in our writing quality, which did not meet the high standards expected by you and the journal. We fully accept your criticism regarding the writing quality and have rigorously revised the manuscript according to your suggestions.

      (4) I am confused whether Figure 2B was redone or just edited. Otherwise this seems acceptable to me.

      Regarding Fig. 2B, we have edited the text on the y-axis. The previous wording included the term “retention,” which may have caused misunderstanding for both the readers and yourself, leading to the perception of contradiction. We have now revised this wording to ensure accurate comprehension.

      (5) The rebuttal is accepted. However, still some of the lines mentioned does not hold relevant information.

      This error has been corrected.

      The correct location for the added results is now in lines 255-258 and lines 279-282: “Hence, although EB does not affect the normal egg developmental stages (see description in next section), our results suggest that EB treatment promotes oogenesis and, as a result, the insects both produce more eggs in the ovary and lay a larger number of eggs.” and “However, considering that the number of eggs laid by EB treated females was larger than in control females (Figure 1 and Figure 1-figure supplement 1), our data indicate that EB treatment of BPH can promote both oogenesis and oviposition.”

      (6) Thank you for the clarification. Although now discussed extensively in discussion section, the nuances of indirect effect and minimal change in expression should also be reflected in the result section text. This is to ensure that readers have clear idea about content of the paper.

      Corrected. To ensure readers gain a clear understanding of our data, we have briefly presented these discussions in the Results section. Please see lines 397-402: The levels of Met mRNA slightly increased in EB-treated BPH at the 5M and 5L instar nymph and 1 to 5 DAE adult stages compared to controls (1.7-fold to 2.9-fold) (Figure 3G). However, it should be mentioned that JH action does not result in an increase of Met. Thus, it is possible that other factors (indirect effects) induced by EB treatment cause the increase in the mRNA expression level of Met.

      (7) As per the authors’ interpretation, it becomes critical to quantitate the amount of EB present at the adult stages after a 4th instar exposure to it. Only this experiment will unambiguously prove the authors’ claim. Also, since they have done adult insect exposure to EB, such experiments should be systematically performed for as many sections as possible. Don't just focus on the few instances where reviewers have pointed out the issue.

      Thank you for raising this critical point. To address this concern, we have conducted new supplementary experiments. The new experimental results demonstrate that residual levels of emamectin benzoate (EB) in adult-stage brown planthoppers (BPH) were below the instrument detection limit following treatment of 4th instar nymphs with EB. Line 172-184: “To determine whether EB administered during the fourth-instar larval stage persists as residues in the adult stage, we used HPLC-MS/MS to quantify the amount of EB present at the adult stage after exposing 4th-instar nymphs to this compound. However, we found no detectable EB residues in the adult stage following fourth-instar nymphal treatment (Figure 1-figure supplement 1C). This suggests that the mechanism underlying the increased fecundity of female adults induced by EB treatment of nymphs may differ from that caused by direct EB treatment of female adults. Combined with our previous observation that EB treatment significantly increased the body weight of adult females (Figure 1—figure supplement 3E and F), a possible explanation for this phenomenon is that EB may enhance food intake in BPH, potentially leading to elevated production of insulin-like peptides and thus increased growth. Increased insulin signaling could potentially also stimulate juvenile hormone (JH) biosynthesis during the adult stage (Badisco et al., 2013).”

      (8) Thank you for the revision. Lines 725-735 as mentioned, does not carry the relevant information. However, since the authors have decided to remove this systematically from the manuscript, discussion on this may not be required.

      Thank you for identifying the limited relevance of the content in Lines 725–735 of the original manuscript. As recommended, we have removed this section in the revised version to improve logical coherence and maintain focus on the core findings.

      (9) Normally, dsRNA would last for some time in the insect system and would down-regulate any further induction of target genes by EB. I suggest the authors to measure the level of the target genes by qPCR in KD insects before and after EB treatment to clear the confusion and unambiguously demonstrate the results. Please Note- such quantifications should be done for all the KD+EB experiments. Additionally, citing few papers where such a rescue effect has been demonstrated in closely related insect will help in building confidence.

      We appreciate the reviewer’s suggestion to clarify the interaction between RNAi-mediated gene knockdown (KD) and EB treatment. To address this, we performed additional experiments measuring Kr-h1 expression via qPCR in dsKr-h1-injected insects before and after EB exposure.

      The results (now Figure 3–figure supplement 4) show that:

      (1) EB did not rescue *Kr-h1* suppression at 24h post-treatment (*p* > 0.05).

      (2) Partial recovery of fecundity occurred later (Figure 3M), likely due to:

      a) Degradation of dsRNA over time, reducing KD efficacy (Liu et al., 2010).

      b) Indirect effects of EB (e.g., hormonal/metabolic reprogramming) compensating for residual Kr-h1 suppression.

      Please see line 441-453: “Next, we investigated whether EB treatment could rescue the dsRNA-mediated gene silencing effect. To address this, we selected the Kr-h1 gene and analyzed its expression levels after EB treatment. Our results showed that Kr-h1 expression was suppressed by ~70% at 72 h post-dsRNA injection. However, EB treatment did not significantly rescue Kr-h1 expression in gene knock down insects (*p* > 0.05) at 24h post-EB treatment (Figure 3-figure supplement 4). While dsRNA-mediated Kr-h1 suppression was robust initially, its efficacy may decline during prolonged experiments. This aligns with reports in BPH, where effects of RNAi gradually diminish beyond 7 days post-injection (Liu et al., 2010a). The late-phase fecundity increase might reflect partial Kr-h1 recovery due to RNAi degradation, allowing residual EB to weakly stimulate reproduction. In addition, the physiological impact of EB (e.g., neurotoxicity, hormonal modulation) could manifest via compensatory feedback loops or metabolic remodeling.”

      (10) Not a very convincing argument. Besides without a scale bar, it is hard for the reviewers to judge the size of the organism. Whole body measurements of JH synthesis enzymes will remain as a quite a drawback for the paper.

      In response to your suggestion, we have also included images with scale bars (see Author response image 1). The images show that the head region is difficult to separate from the brown thoracic sclerite region. Furthermore, the anatomical position of the corpora allata in brown planthoppers has never been reported, making dissection uncertain and highly challenging. To address this, we are now attempting to use Drosophila as a model to investigate how EB regulates JH synthesis and reproduction.

      Author response image 1. This illustration provides a visual representation of the brown planthopper (BPH), a major rice pest.

      (11) "The phenomenon reported was specific to BPH and not found in other insects. This limits the implications of the study". This argument still holds. Combined with extreme species specificity, the general effect that EB causes brings into question the molecular specificity that the authors claim about the mode of action.

      We acknowledge that the specificity of the phenomenon to BPH may limit its broader implications, but we would like to emphasize that this study provides important insights into the unique biological mechanisms in BPH, a pest of significant agricultural importance. The molecular specificity we described in the manuscript is based on rigorous experimental evidence. We believe that it contributes valuable knowledge for understanding the interaction between external factors such as EB and BPH, as well as the resurgence of pests. We hope that this study will inspire further research into the mechanisms underlying similar phenomena in other insects, thereby broadening our understanding of insect biology. Since EB also has an effect on fecundity in Drosophila, albeit opposite to that in BPHs (Fig. 1 suppl. 2), it seems likely that EB actions may be of more general interest in insect reproduction.

      (12) The authors have added a few lines in the discussion but it does not change the overall design of the experiments. In this scenario, they should infer the performed experiments rationally without over interpretation. Currently, many of the claims that the authors are making are unsubstantiated. As a result of the first review process, the authors have acknowledged the discrepancies, but they have failed to alter their interpretations accordingly.

      We appreciate your concern regarding the experimental design and the need for rational inference without overinterpretation. In response, we would like to clarify that our discussion is based on the experimental data we have collected. We acknowledge that our study focuses on BPH and the specific effects of EB, and while we agree that broader generalizations require further research, we believe the new findings we present are valid and contribute to the understanding of this specific system.

      We also acknowledge the discrepancies you mentioned and have carefully considered your suggestions. In this revised version, we believe our interpretations are reasonable and consistent with the data, and we have adjusted our discussion to better reflect the scope of our findings. We hope that these revisions address your concerns. Thank you again for your constructive feedback.

      ADDITIONAL POINTS

      (1) Only one experiment was performed with Abamectin. No titration for the dosage were done for this compound, or at least not provided in the manuscript. Inclusion of this result will confuse readers. While removing this result does not impact the manuscript at all. My suggestion would be to remove this result.

      We acknowledge that the abamectin experiment lacks dose-titration details and that its standalone presentation could lead to confusion. However, we respectfully request to retain these results for the following reasons:

      Class-Specific Mechanism Validation:

      Abamectin and emamectin benzoate (EB) are both macrocyclic lactones targeting glutamate-gated chloride channels (GluCls). The observed similarity in their effects on BPH fecundity (e.g., Figure 1—figure supplement 1B) supports the hypothesis that GluCl modulation, rather than compound-specific off-target effects, drives the reproductive enhancement. This consistency strengthens the mechanistic argument central to our study.

      (2) The section "The impact of EB treatment on BPH reproductive fitness" is poorly described. This needs elaboration. A line or two should be included to describe why the parameters chosen to decide reproductive fitness were selected in the first place. I see that the definition of brachypterism has undergone a change from the first version of the manuscript. Can you provide an explanation for that? Also, there is no rationale behind inclusion of statements on insulin at this stage. The authors have not investigated insulin. Including that here will confuse readers. This can be added in the discussion though.

      Thank you for your suggestion. We have added an explanation regarding the primary consideration of evaluating reproductive fitness. In the interaction between sublethal doses of insecticides and pests, reproductive fitness is a key factor, as it accurately reflects the potential impact of insecticides on pest control in the field. Among the reproductive fitness parameters, factors such as female Nilaparvata lugens body weight, lifespan, and brachypterous ratio (as short-winged N. lugens exhibit higher oviposition rates than long-winged individuals) are critical determinants of reproductive success. Therefore, we comprehensively assessed the effects of EB on these parameters to elucidate the primary mechanism by which EB influences reproduction. We sincerely appreciate your constructive feedback.

      (3) "EB promotes ovarian maturation in BPH" this entire section needs to be rewritten and attention should be paid to the sequence of experiments described.

      Thank you for your suggestion. Based on your recommendation, we have rewritten this section (lines 267–275) and adjusted the sequence of experimental descriptions to improve the structural clarity of this part.

      (4) Figure 3N is outright wrong and should be removed or revised.

      In accordance with your recommendation, we have removed the figure.

      (5) When you are measuring hormonal titers, it is important to mention explicitly whether you are measuring hemolymph titer or whole body.

      We believe we have explicitly stated in the Methods section (line 1013) that we measured whole-body hormone titers. However, we have now also added this information to the figure legends.

      (6) "EB induces JH biosynthesis through the peptidergic AstA/AstAR signaling pathway": this section needs attention at multiple points. Please check.

      We acknowledge that direct evidence for EB-AstA/AstAR interaction is limited and have framed these findings as a hypothesis for future validation.

      References

      Liu, S., Ding, Z., Zhang, C., Yang, B., Liu, Z., 2010. Gene knockdown by intro-thoracic injection of double-stranded RNA in the brown planthopper, Nilaparvata lugens. Insect Biochem. Mol. Biol. 40, 666-671

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer 1:

      We thank Reviewer 1 for the discussion on the possible causes of ERPs and their relevance for the interpretation of changes in aperiodic activity. We have changed the relevant paragraph to read as follows: For example, ERPs may reflect changes in periodic activity, such as phase resets (Makeig et al., 2002), or baseline shifts (Nikulin et al., 2007). ERPs may also capture aperiodic activity, either in the form of evoked transients triggered by an event (Shah et al., 2004) or induced changes in the ongoing background signal. This has important implications: evoked transients can alter the broadband spectrum without implying shifts in ongoing background activity, whereas induced aperiodic changes may signal different neural mechanisms, such as shifts in the excitation-inhibition balance (Gao et al., 2017).

      Reviewer 1 argued that a time point-by-time point comparison between ERPs and aperiodic parameters may not be the most appropriate approach, since aperiodic time series have lower temporal resolution than ERPs. The reviewer suggested comparing their topographies instead. We had already done this in the first version of the paper (see Fig. S7: https://elifesciences.org/reviewedpreprints/101071v1#s10). However, in the second version, we opted to use linear mixed models for each channel-time point in order to maintain consistency with the other analyses in the paper (e.g. the comparison between FOOOF parameters and baseline-corrected power).

      Nevertheless, we repeated the topographic correlations as in the first version, and the results are shown below. Correlations were computed for each time point, subject and condition, and then averaged across these dimensions for visualisation. The pattern differs from that of the linear mixed-model results (see Fig. S14), with notable correlations appearing after ~0.5 s for the exponent and after ~1.0 s for the offset. Still, the correlations remain low, suggesting that aperiodic parameters and ERPs encode different information (at least in this dataset).

      Author response image 1.
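The topographic correlation described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' analysis code: the nested-list layout `[subject][condition][time][channel]`, the function names `pearson` and `mean_topographic_correlation`, and the toy values are all assumptions. For each subject, condition and time point, the ERP topography is correlated across channels with the topography of one aperiodic parameter, and the coefficients are then averaged over subjects and conditions.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A flat topography has zero variance, so the correlation is undefined.
    return num / den if den > 0 else float("nan")


def mean_topographic_correlation(erp, aperiodic):
    """Average spatial correlation for each time point.

    Both inputs are indexed as [subject][condition][time][channel]
    (a hypothetical layout chosen for this sketch).
    """
    n_times = len(erp[0][0])
    result = []
    for t in range(n_times):
        # One coefficient per subject and condition, computed across channels.
        rs = [
            pearson(erp[s][c][t], aperiodic[s][c][t])
            for s in range(len(erp))
            for c in range(len(erp[s]))
        ]
        result.append(fmean(rs))
    return result


# Toy data: 2 subjects x 1 condition x 2 time points x 4 channels.
erp = [
    [[[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]],
    [[[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]],
]
exponent = [
    [[[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]],
    [[[0.2, 0.4, 0.6, 0.8], [0.1, 0.2, 0.3, 0.4]]],
]
r_by_time = mean_topographic_correlation(erp, exponent)
```

In the toy data the two topographies match at the first time point and are reversed at the second, so the averaged coefficient moves from strongly positive to strongly negative.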

      Additionally, to control for the effect of smearing, we performed the same linear mixed model analysis as in Fig. S14 on low-pass filtered ERPs (10 Hz cut-off), and the results were largely similar to those in Fig. S14.

      Reviewer 1 discussed two possible explanations for the observed correlations between baseline-corrected power and FOOOF parameters (Figure 4): “The correlation between the exponent and low-frequency activity could be of either direction: low frequency power changes could reflect 1/f shifts, or exponent estimates might be biased by undetected delta/theta activity. I think that one other piece of evidence /…/ to intuitively highlight why the latter is more likely is the /…/ decrease at high ("transbeta") frequencies, which suggests a rotational shift /../.” We agree with the interpretation that low-frequency power changes in our data primarily reflect 1/f shifts. However, we are uncertain about the reviewer’s statement that the “latter” explanation (i.e., bias in exponent estimates due to delta/theta activity) is more likely. Given the context, we believe the reviewer may have intended to say the “former” explanation is more likely.

      We agree with the reviewers' observation that rhythmicity, as estimated using the pACF, can be independent of power (Myrov et al., 2024, Fig. 1). However, it seems that in real (non-simulated) datasets, the pACF and power spectral density (PSD) are often moderately correlated (e.g. Myrov et al., 2024, Fig. 5).

      Reviewer 1 asked whether we had examined aperiodic changes in the data before and after subtracting the response-locked ERPs. We did not carry out this additional analysis because, as the reviewer suggests, it would have been excessive: the current version of the paper already contains more than 60 figures. As mentioned in the manuscript, we acknowledge the possibility that response-locked ERPs contribute to the second aperiodic component. However, given the weak correlation between reaction times and aperiodic activity, the presence of both components throughout the entire epoch (in at least the first and third datasets) and the distinct differences between the ERPs and the aperiodic activity in the different conditions (see Fig. 8 vs. Fig. S13), we cannot conclusively determine whether the second aperiodic component is directly related to motor responses. Finally, we agree with the reviewer that the distribution of the response-locked ERP more closely resembles the frontocentral (earlier) aperiodic component than the later post-response component. We have amended the relevant paragraph in the Discussion to include these observations: “While it is possible that response-related ERPs contributed to the second aperiodic component, several observations suggest otherwise: both aperiodic components were present throughout the entire epoch, differences between conditions diverged between ERPs and aperiodic activity (compare Figure 8 and Figure S16), and the associations with reaction times were weak. Moreover, the distribution of the response-locked ERP qualitatively resembled the earlier frontocentral aperiodic component more than the later post-response component. Taken together, these findings suggest that ERPs and aperiodic activity capture distinct aspects of neural processing, rather than reflecting the same underlying phenomenon.”

      We agree with Reviewer 1 that our introduction of aperiodic activity was abrupt, and that the term 'aperiodic exponent' required definition. We have now defined it as the spectral steepness in log–log space (i.e. the slope), and have added a brief explanatory sentence to the introduction.

      Reviewer 1 noted that the phrase 'task-related changes in overall power' could be misinterpreted as referring to total (broadband) power, and recommended that we specify a frequency range. We agree, so we have replaced 'overall power' with 'spectral power within a defined frequency range'.

      We agree with Reviewer 1 that the way we worded things in the Discussion section regarding alpha activity and inhibitory processes was awkward and could easily be misread. We have rephrased the sentences and added a brief explanation to avoid implying a direct link between alpha attenuation and neural inhibition.

      Furthermore, based on the reviewer’s suggestion, we added a brief comment in the Discussion section (Theoretical and methodological implications) on theoretical perspectives regarding the interaction between age and aperiodic activity.

      Reviewer 1 suggested including condition as a fixed effect in order to examine whether the relationship between FOOOF parameters and baseline-corrected power is modulated by condition. Specifically, the reviewer proposed changing our model from

      baseline_corrected_power ~ 1 + fooof_parameter + (1|modality) + (1|nback) + (1|stimulus) + (1|subject)

      to

      baseline_corrected_power ~ 1 + fooof_parameter + modality*nback *stimulus + (1|subject)

      While we appreciate this suggestion, we believe that including design variables as fixed effects would confound the interpretation of (marginal) R² as a measure of the association between FOOOF parameters and baseline-corrected power. Our primary question in this analysis was about the fundamental relationship between these measures, not how experimental conditions moderate this relationship.
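The concern about marginal R² can be made explicit. In one common definition for a random-intercept model (usually attributed to Nakagawa and Schielzeth; whether the analysis used exactly this estimator is an assumption here), marginal R² is the share of total variance explained by the fixed effects alone:

```latex
R^2_{\mathrm{marginal}} =
  \frac{\sigma^2_{f}}
       {\sigma^2_{f} + \sum_{k} \sigma^2_{\alpha_k} + \sigma^2_{\varepsilon}}
```

Here $\sigma^2_{f}$ is the variance of the fixed-effect predictions, $\sigma^2_{\alpha_k}$ is the variance of the $k$-th random intercept (modality, n-back, stimulus and subject in the original model), and $\sigma^2_{\varepsilon}$ is the residual variance. If modality, n-back and stimulus were entered as fixed effects, their variance would be counted in $\sigma^2_{f}$, so marginal R² would no longer isolate the contribution of the FOOOF parameter.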

      To address the reviewer's concern regarding condition-specific effects, we conducted separate analyses for each condition using a simpler model:

      baseline_corrected_power ~ 1 + fooof_parameter + (1|subject)

      The results (now included in the Supplement, Fig. S4–S6) show generally smaller effect sizes compared to our original random-effects model, with notable differences between conditions. The 2-back conditions, particularly the non-target trials, exhibited the weakest associations. Despite these differences, the overall patterns remained consistent with our original findings: exponent and offset exhibited positive associations at low frequencies (delta, theta) and negative associations at higher frequencies (beta, low gamma), while periodic activity correlated substantially with baseline-corrected power in the alpha, beta, and gamma ranges.

      However, this condition-specific approach has important limitations. With only 47 subjects per condition, the statistical power is insufficient for stable correlation estimates (Schönbrodt & Perugini, 2013; https://doi.org/10.1016/j.jrp.2013.05.009). This likely explains why the effects are smaller and less stable than in our original model, which uses the full dataset's power while appropriately accounting for condition-related variance through random effects. Since these additional analyses do not alter our primary conclusions, we have included them in the Supplement for completeness and made a minor change in the Discussion section.

      Reviewer 1 asked which channels the lines in Figure 9 are based on. As stated in the Methods section, “We fitted models in a mass univariate manner, that is for each channel, frequency (where applicable), and time point separately. /…/ For the purposes of visualisation, p-values were averaged across channels (for heatmaps or lines) or across time (for topographies).” Therefore, the lines and heatmaps apply to all channels.

      Reviewer 2:

      We would like to thank reviewer 2 for their detailed explanation of the expected behaviour of the specparam algorithm. We have added the following explanation to the Methods section:

      Importantly, as noted by the reviewer, this behaviour reflects an explicit design choice of the algorithm: to avoid overfitting ambiguous peaks at the edges of the spectrum, FOOOF excludes peaks that are too close to the boundaries. This exclusion is controlled by the _bw_std_edge parameter, which defines the distance that a peak must be from the edge in order to be retained (in units of standard deviation; set to 1.0 by default). Therefore, although the algorithm is functioning as intended, users should be careful when interpreting aperiodic parameters in datasets where low-frequency oscillatory activity might be expected.
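
      The edge rule described above can be sketched as follows. This is a simplified illustration of our reading of the rule, not the library's own code: the function name `keep_peak`, the fitting range, and the example peaks are hypothetical, and we assume the distance is measured in units of the candidate peak's own fitted Gaussian standard deviation.

```python
def keep_peak(center_freq, gauss_std, freq_range, bw_std_edge=1.0):
    """Return True if a candidate peak is far enough from both spectrum edges.

    A peak is retained only when its centre frequency lies more than
    `bw_std_edge` standard deviations away from each edge of the fitted range.
    """
    f_min, f_max = freq_range
    min_distance = gauss_std * bw_std_edge
    return (abs(center_freq - f_min) > min_distance
            and abs(center_freq - f_max) > min_distance)


fit_range = (2.0, 40.0)  # hypothetical fitting range in Hz

# A broad 3 Hz peak sits only 1 Hz above the lower edge, so it is dropped.
print(keep_peak(3.0, 1.5, fit_range))
# A 6 Hz peak with the same width is 4 Hz from the edge, so it is kept.
print(keep_peak(6.0, 1.5, fit_range))
```

      In this sketch a low-frequency peak can be discarded purely because of where the fitting range starts, in which case its power would be left for the aperiodic fit to absorb.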

      In line with the reviewer’s suggestion we have added a version of specparam to the paper.

      We thank reviewer 2 for pointing out two studies that used a time-resolved approach to spectral parameterisation. We have updated the text accordingly:

      Although a similar approach has been used to track temporal dynamics in sleep and resting state (e.g., Wilson et al., 2022; Ameen et al., 2024), as well as in task-based contexts (e.g., Barrie et al., 1996; Preston et al., 2025), its specific application to working memory paradigms remains underexplored.

      Reviewer 3:

      Reviewer 3 notes that the revised manuscript feels less intriguing than the original version. While we understand this concern, we believe this difference arises from a misalignment in expectations regarding the scope and purpose of our study. We think the reviewer is interpreting our work as focusing on whether theta activity is elicited in a paradigm that reliably produces theta oscillations. In contrast, our study is framed around a working memory task in which, based on prior literature, we expected to observe theta activity but instead found an absence of theta spectral peaks in almost all participants. Note that the absence of theta is already noteworthy in itself, given that theta oscillations are believed to play a crucial role in working memory.

      Importantly, Van Engen et al. (2024) have recently reported similar findings:

      “While we did not observe load-dependent aperiodic changes over the frontal midline, we did reveal the possibility that previous frontal midline theta results that do not correct for aperiodic activity likely do not reflect theta oscillations. /…/ While our results do not invalidate previous research into extracranial theta oscillations in relation to WM, they challenge popular and widely held beliefs regarding the mechanistic role for theta oscillations to group or segregate channels of information”.

      From this perspective, we maintain that the following statements are still justified:

      “substantial portion of the changes often attributed to theta oscillations in working memory tasks may be influenced by shifts in the spectral slope of aperiodic activity”

      "Note that although no prominent oscillatory peak in the theta range was observed at the group level, and some of this activity could potentially fall within the delta range, similar low-frequency patterns have often been referred to as 'theta' in previous work, even in the absence of a clear spectral peak"

      These formulations are intended to emphasize existing interpretations of changes in low-frequency power as theta oscillations in related research.

      Next, Reviewer 3 pointed out that “spectral reflection (peak?) in spectral power plot does not imply that an event is repeating (i.e. oscillatory).” We agree with the reviewer that not every spectral peak implies a true oscillation. To address this, we complemented the power analyses with a measure of rhythmicity (phase autocorrelation function, pACF) after the first round of reviews, and the pACF results were largely similar to those for periodic activity. These results suggest that, in our case, periodic activity is indeed largely oscillatory.

      However, we do agree with the reviewer that the term “oscillatory” is not interchangeable with “periodic”. To address this, we reviewed the paper for all appearances of “oscillations”, “oscillatory” and related terms, and replaced them with “power”, “spectral” or “periodic activity” where appropriate (all changes are marked in red in the latest version of the manuscript).

      Examples of corrections:

      “Changes in aperiodic activity appear as low-frequency oscillations in baseline-corrected time-frequency plots” → low-frequency power

      “The periodic component includes only the parameterised oscillatory peak” → spectral peak

      “FOOOF decomposition may miss low-frequency oscillations near the edges of the spectrum” → low-frequency peaks

      We disagree with the reviewer’s assertion that the subtitle “Aperiodic parameters are largely independent of oscillatory activity” is misleading for a methods-oriented paper. The full subtitle is “Rhythmicity analysis reveals aperiodic parameters are largely independent of oscillatory activity”. Since rhythmicity is a phase-based measure that requires repeating dynamics and is therefore indicative of oscillations, we believe this phrasing is technically accurate.

      Finally, we would like to emphasise our contribution once again. Our analyses of rhythmicity, spectrally parameterised power, and baseline-corrected power offer different perspectives on the data. Each of these analyses may lead to different interpretations, but performing all of them on the same data provides a more comprehensive insight into what is actually going on in the data.

      Our findings demonstrate that conclusions drawn from a single analytical approach may be incomplete or misleading. For example, as we discuss in the paper, many studies examine theta-gamma coupling in scalp EEG during n-back tasks without first establishing whether theta activity genuinely oscillates (e.g. Rajji et al., 2016). The absence of true theta oscillations would undermine the validity of such analyses. Our multifaceted approach provides researchers with a systematic framework for validating oscillatory assumptions before proceeding with more complex analyses.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review)

      Summary:

      This manuscript addresses the question of whether spontaneous activity contributes to the clustering of retinogeniculate synapses before eye opening. The authors re-analyze a previously published dataset to answer the question. The authors conclude that synaptic clustering is eye-specific and activity dependent during the first postnatal week. While there is useful information in this manuscript, I don't see how the data meaningfully supports the claims made about clustering.

      In adult retinogeniculate connections, functional specificity is supported by select pairings of retinal ganglion cells and thalamocortical cells forming dozens of synaptic connections in subcellular microcircuits called glomeruli. In this manuscript, the authors measure whether the frequency of nearby synapses is higher in the observed data than in a model where synapses are randomly distributed throughout the volume. Any real anatomical data will deviate from such a model. The interesting biological question is not whether a developmental state deviates from random. The interesting question is how much of the adult clustering occurs before eye opening. In trying to decode the analysis in this manuscript, I can't tell if the answer is 99% or 0.001%.

      We thank the reviewer for their helpful critique through both rounds of review. We have refocused the manuscript on paired eye-specific measurements of active zone addition and spatial relationships among active zones at each age. All effect sizes and power values for each comparison are now reported in Table S2. These measures allow readers to gauge biological significance more transparently.

      Strengths:

      The source dataset is high resolution data showing the colocalization of multiple synaptic proteins across development. Added to this data is labeling that distinguishes axons from the right eye from axons from the left eye. The first order analysis of this data showing changes in synapse density and in the occurrence of multi-active zone synapses is useful information about the development of an important model system.

      Weaknesses:

      I don't think the analysis of clustering within this dataset improves our understanding of how the system works. It is possible that the result is clear to the authors based on looking at the images. As a reader trying to interpret the analysis, I ran into the following problems:

      • It is not possible to estimate biologically meaningful effect sizes from the data provided. Spontaneous activity in the first postnatal week could be responsible for 99% or 0.001% of RGC synapse clustering.

      • The sample size is too small for the kinds of comparisons being made. The authors point out that many STORM studies use an n of 1 while the authors have n = 3 for each of their six experimental groups. However, the critical bit is what kinds of questions you are trying to answer with a given sample size. This study depends on determining whether the differences between groups are due to age, genotype, or individual variation. This study also makes multiple comparisons of many different noisy parameters that test the same or similar hypothesis. In this context, it is unlikely that n = 3 sufficiently controls for individual variation.

      We have revised the manuscript to focus on eye-specific differences, which are paired measurements collected at each age. We have measured effect sizes and performed power tests for all comparisons presented in the manuscript. These measurements are shown for every figure in a new supplemental table S2.

      • There is no clear biological interpretation of the core measure of the publication, the normalized clustering index. The normalized clustering index starts with counting the fraction of single active zone synapses within various distances to the edge of synapses. This frequency is compared to a randomization model in which the positions of synapses are randomized throughout a volume. The authors found the biggest deviation between the observed and randomized proximity frequency using a distance threshold of 1.5 μm. They consider the deviation from the random model to be a sign of clustering. However, two RGC synapses 1.5 μm apart have a good chance of coming from the same RGC axon. At this scale, real observations will, therefore, always look more clustered than a model where synapses are randomly placed in a volume. If you randomly place synapses on an axon, they will be much closer together than if you randomly place synapses within a volume. The authors normalize their clustering measure by dividing by the frequency of clustering in the normalized model. That makes the measure of clustering an ambiguous mix of synapse clustering, axon morphology, and synaptic density.

      We have removed the “normalized clustering index”. “Clustered” inputs are now defined strictly as those that have a neighboring single active-zone (sAZ) synapse within 1.5 μm. For each type of input (sAZ and mAZ) we show 1) the ratio of clustered to isolated inputs for both eyes, and 2) the number of neighboring sAZs (Figure 4).
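
      The revised definition can be illustrated with a minimal sketch. The code below is not our analysis pipeline: the function names, the toy coordinates, and the use of centroid-to-centroid Euclidean distance are assumptions made for the example, and eye of origin is ignored for brevity.

```python
import math

CLUSTER_RADIUS_UM = 1.5  # distance threshold from the definition above


def count_saz_neighbors(inputs, radius=CLUSTER_RADIUS_UM):
    """For each input, count the sAZ synapses within `radius` (excluding itself).

    `inputs` is a list of (kind, (x, y, z)) tuples with kind "sAZ" or "mAZ"
    and coordinates in micrometres (assumed to be centroids).
    """
    counts = []
    for i, (_, pos_i) in enumerate(inputs):
        n_neighbors = 0
        for j, (kind_j, pos_j) in enumerate(inputs):
            if i != j and kind_j == "sAZ" and math.dist(pos_i, pos_j) <= radius:
                n_neighbors += 1
        counts.append(n_neighbors)
    return counts


def clustered_to_isolated_ratio(inputs, kind, radius=CLUSTER_RADIUS_UM):
    """Ratio of clustered to isolated inputs of one type ("sAZ" or "mAZ")."""
    counts = count_saz_neighbors(inputs, radius)
    clustered = sum(1 for (k, _), c in zip(inputs, counts) if k == kind and c > 0)
    isolated = sum(1 for (k, _), c in zip(inputs, counts) if k == kind and c == 0)
    return clustered / isolated if isolated else float("inf")


# Toy coordinates, invented purely for illustration.
toy_inputs = [
    ("mAZ", (0.0, 0.0, 0.0)),
    ("sAZ", (1.0, 0.0, 0.0)),
    ("sAZ", (0.0, 1.0, 0.0)),
    ("sAZ", (10.0, 10.0, 10.0)),
    ("mAZ", (20.0, 0.0, 0.0)),
]
print(count_saz_neighbors(toy_inputs))
print(clustered_to_isolated_ratio(toy_inputs, "mAZ"))
print(clustered_to_isolated_ratio(toy_inputs, "sAZ"))
```

      Because an input is classified from a single distance threshold, each value can be traced back to individual synapse positions without reference to a randomized model.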

      We agree with the reviewer that many synapses are likely made nearby along the same axon from an individual RGC. In this scenario, sAZ synapses that are nearby a neighboring mAZ input may be part of the same nascent bouton. And, sAZ synapses nearby other sAZ neighbors may ultimately mature into a mAZ input. At the same time, inputs from one RGC may form nearby other inputs from neighboring RGCs. We discuss these motifs and potential mechanisms of cell-autonomous and non-autonomous development (Lines 300-308).

      • Other measures are also very derived. For instance, one argument is based on determining that the cumulative distribution of the distance of dominant-eye multi-active zone synapses with nearby single-active zone synapses from dominant-eye multi-active zone synapses is statistically different from the cumulative distribution of the distance of dominant-eye multi-active zones without nearby single-active zone synapses from dominant-eye multi-active zones. Multiple permutations of this measure are compared.

      We have simplified the presentation to show all measured path lengths for every input. This allows the reader to see each of the inputs and their relative distances. We present these data for like-eye type interactions at P4 and P8 (Figures 5 and S5).   

      • There are major biological differences between groups that are difficult to control for. Between P2, P4, and P8, there are changes in cell morphology and synaptic density. There are also large differences in synapse density between wild type and KO mice. It is difficult to be confident that these differences are not responsible for the relatively subtle changes in clustering indices.

      • Many claims are based on complicated comparisons between groups rather than the predominating effects within the data. It is noted that: "In KO mice, dominant eye projections showed increased clustering around mAZ synapses compared to sAZ synapses, suggesting partial maintenance of synaptic clustering despite retinal wave defects". In contrast, I did not notice any discussion of the fact that the most striking trend in those measures is that the clustering index decreases from P2 to P8.

      Related to the points above, we have revised the manuscript to focus on eye-specific release site addition and spatial relationships. For clarity, we have removed the clustering index and instead present ratios of clustered and isolated inputs, the number of sAZ synapses near each input type, and distance between like-eye mAZ inputs (Figure 4).      

      • Statistics are improperly applied. In my first review I tried to push the authors to calculate confidence intervals for two reasons. First, I believed the reader should be able to answer questions such as whether 99% or 0.01% of RGC synaptic clustering occurred in the first postnatal week. Second, I wanted the authors to deal with the fact that n=3 is underpowered for many of the questions they were asking. While many confidence intervals can now be found leading up to a claim, it is difficult to find claims that are directly supported by the correct confidence interval. Many claims are still incorrectly based on which combinations of comparisons produced statistically significant differences and which combinations did not.

      We have substantially revised the manuscript to focus on within-group paired effects between eye-of-origin. We performed power tests for all statistical presentations and effect sizes and powers are presented for every figure in a new supplemental table S2. To simplify the manuscript and make it easier to read, we report confidence interval measurements in a separate supplemental table S3.

      Reviewer #2 (Public review):

      Summary:

      This study provides a valuable data set showing changes in the spatial organization of synaptic proteins at the retinogeniculate connection during a developmental period of active axonal and synaptic remodeling. The data collected by STORM microscopy is state-of-the-art in terms of the high-resolution view of the presynaptic components of a plastic synapse. The revision has addressed many, but not all, of the initial concerns about the authors' interpretation of their data. However, with the revisions, the manuscript has become very dense and difficult to follow.

      We greatly appreciate the reviewer’s thoughtful comments through two rounds of review. To improve the clarity of the manuscript, we have substantially revised the work to streamline the narrative, clearly define terminology, and simplify data presentations, allowing readers to more directly interpret results and their implications.

      Strengths:

      The data presented is of good quality and provides an unprecedented view at high resolution of the presynaptic components of the retinogeniculate synapse during active developmental remodeling. This approach offers an advance to the previous mouse EM studies of this synapse because the CTB label allows identification of the eye from which the presynaptic terminal arises.

      Weaknesses:

      From these data the authors conclude that eye-specific increases in mAZ synapse density occur over retinogeniculate refinement, that sAZ synapses cluster close to mAZ synapses over age, and that this process depends on spontaneous activity and proximity to eye-specific mAZ synapses. While the interpretation of this data set is much more grounded in this revised submission, some of the authors' conclusions/statements still lack convincing supporting evidence.

      This includes:

      (1) The conclusion that multi-active zone synapses are loci for synaptic clustering. This statement, or similar ones (e.g., line 407), suggests that mAZ synapses actively, or in some indirect way, influence the clustering of sAZ synapses. There is no evidence for this. Clustering of retinal synapses is in part due to the fact that retinal inputs synapse on the proximal dendrites. With increased synaptogenesis, there will be increased density of retinal terminals that are closely localized. And with development, perhaps sAZ synapses mature into mAZ synapses. This scenario could also explain a large part of this data set.

      We thank the reviewer for their comment. We have removed the ambiguous phrasing and clarified the manuscript to explicitly discuss alternative interpretations consistent with the results (Lines 300-308). This includes a discussion of sAZ synapse maturation into mAZ inputs (Lines 294-296).

      (2) The conclusion that "clustering depends on spontaneous retinal activity" could be misleading to the reader given that the authors acknowledge that their data is most consistent with a failure of synaptogenesis in the mutant mice (in the rebuttal). Additionally, clustering does occur in CTB+ projections around mAZ synapses.

      We have removed the highlighted phrase and revised the manuscript to focus on differences in release site addition between eye-of-origin. We clarified our discussion of activity-dependent changes to state that synapses fail to form in the mutant and synaptic clustering was reduced (Lines 324-330).

      (3) Line 403: "Since mAZ synapses are expected to have a higher release probability, they likely play an important role in driving plasticity mechanisms reliant on neurotransmission." What evidence do the authors have that mAZ are expected to have higher release probability?

      We thank the reviewer for their careful reading. Because they have several active zones, mAZ synapses are expected to have a higher number of release sites (N), which could be independent of release probability at any individual active zone (Pr). We have removed the reference to release probability. Instead, we maintain focus on active zone number.

      Reviewer #3 (Public review):

      This study is a follow-up to a recent study of synaptic development based on a powerful data set that combines anterograde labeling, immunofluorescence labeling of synaptic proteins, and STORM imaging (Cell Reports, 2023). Specifically, they use anti-Vglut2 label to determine the size of the presynaptic structure (which they describe as the vesicle pool size), anti-Bassoon to label active zones with the resolution to count them, and anti-Homer to identify postsynaptic densities. Their previous study compared the detailed synaptic structure across the development of synapses made with contra-projecting vs. ipsi-projecting RGCs and compared this developmental profile with a mouse model with reduced retinal waves. In this study, they produce a new detailed analysis on the same data set in which they classify synapses into "multi-active zone" vs. "single-active zone" synapses and assess the number and spacing of these synapses. The authors use measurements to make conclusions about the role of retinal waves in the generation of same-eye synaptic clusters, providing key insight into how neural activity drives synapse maturation.

      Strengths:

      This is a fantastic data set for describing the structural details of synapse development in a part of the brain undergoing activity-dependent synaptic rearrangements. The fact that they can differentiate eye of origin is what makes this data set unique over previous structural work. The addition of example images from EM data set provides confidence in their categorization scheme.

      Weaknesses:

      Though the descriptions of synaptic clusters are important and represent a significant advance, the authors' conclusions regarding the biological processes driving these clusters are not testable by such a small sample. This limitation is expected given the massive effort that goes into generating this data set. Of course the authors are free to speculate, but many of the conclusions of the paper are not statistically supported.

      We thank the reviewer for their helpful comments throughout the revision process. We have substantially modified the manuscript to reframe the work around release site addition during eye-specific competition. Power tests and effect size measurements are presented for every figure in a new supplemental table S2.

      Reviewer #2 (Recommendations for the authors):

      (1) Authors should discuss that it is not clear what the relationship is between sAZ and mAZ, and sAZ could turn into a mAZ. This is not unreasonable that the number of AZ/bouton increases with development given that in the adult rodent retinogeniculate bouton, there is an average of 27 active zones (Budisantoso et al, 2012).

      We thank the reviewer for their helpful suggestion. We have added a discussion of the relationship between sAZ and mAZ inputs and the point that sAZ synapses may mature into mAZ synapses (Lines 294-296). We now reference the work of Budisantoso et al., J. Neurosci. 2012.   

      (2) The authors should clarify how the statistics are calculated for the normalized clustering index (figure 3B, C). For ratios of values each with variance, the variance is summed when calculating SEM.

      For clarity, we have removed the normalized clustering index analysis. We have simplified the work to present a clear definition of clustered and unclustered inputs, where clustering is defined by the presence of a nearby neighboring synapse within 1.5 μm. We present the ratio of clustered and unclustered inputs for each input type and eye-of-origin. We also show the number of sAZ synapses nearby each clustered input (Figure 4).

      (3) The authors have significantly clarified the terminology that they use in the text. This is much appreciated. However, it would be helpful to the naïve reader if they could define their use of the word "synapse" as referring to individual active zones/release sites or to terminals/boutons. For example:

      Line 378: "Prior electron microscopy studies in the mouse found limited evidence of convergent synaptic clustering from neighboring RGCs at postnatal day 8 (10, 13), suggesting that the mAZ synapses seen in STORM images are single retinogeniculate terminals. The lack of synaptic convergence in prior EM reconstructions at P8 implies that early clustering around mAZ synapses may result from local output clustering within individual RGC arbors.":

      What do the authors mean by "convergent synaptic clustering": do they mean clustering of release sites from different RGC inputs? And what does "local output clustering" mean?

      We thank the reviewer for their suggestion to use clear terminology. We have revised the manuscript to define our use of the term “synapse” as a single active zone/release site (Lines 134-136). We refer to mAZ boutons in STORM data as “inputs”. We have revised the discussion of prior EM studies (Lines 130-132) and clarified all discussions of synaptic clustering throughout the work.

      (4) While the authors argue that the retina-specific β2-nAChR mice exhibit disrupted retinal waves and defects in eye specific segregation, the authors are studying issues of active zone density which may depend on mechanisms depending on the postsynaptic neuron. This should be acknowledged.

      We have updated the text to discuss the fact that postsynaptic mechanisms are also critical for the refinement of eye-specific synapses (Lines 332-340). We have added several additional references to the manuscript accordingly.

      Reviewer #3 (Recommendations for the authors):

      The authors have addressed many of my original concerns. The additional description of criteria for categorizing synapses, showing all the data points, gives the reader a stronger sense of where the numbers in the quantification come from. Replacing the "complex/simple" distinction with the "multi/single active zone" and the other clarifying text was effective. The addition of the EM data was also a very nice example to help interpret STORM images. It does appear there was no quantification on this EM data set and perhaps just a few example images were taken as "proof of principle". If, by chance, the authors have more EM images to make a data set of them that allows for some quantification, that would be great to add.

      We thank the reviewer for their helpful comments on the manuscript through both rounds of review. The EM data we collected were 2D images of a subset of physical sections at postnatal day 8. Most dAPEX2(+) profiles had a single active zone, but a definitive identification would require 3D imaging so that each terminal can be assessed in its entirety for release sites that might be missed in a single cross section. Similarly, multi-active zone boutons are positively identified in 2D images, but definitive measurements of AZ number would require 3D information. We analyzed our 2D EM images and present a plot of dAPEX2(+) profile size versus active zone number below. These measures are positively correlated (r = 0.74), with larger profiles containing more active zones.

      Author response image 1.

      Unfortunately, we are not currently equipped to perform volumetric EM imaging at our home institution and are concerned that analysis of 2D data may be inconclusive. For these reasons, we are opting to maintain a qualitative presentation of our current EM results and we look forward to collaborating with other experts to achieve volumetric EM reconstructions in the future.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) Summary:

      The authors note that it is challenging to perform diffusion MRI tractography consistently in both humans and macaques, particularly when deep subcortical structures are involved. The scientific advance described in this paper is effectively an update to the tracts that the XTRACT software supports. The claims of robustness are based on a very small selection of subjects from a very atypical dMRI acquisition (n=50 from HCP-Adult) and an even smaller selection of subjects from a more typical study (n=10 from ON-Harmony).

      Strengths:

      The changes to XTRACT are soundly motivated in theory (based on anatomical tracer studies) and practice (changes in seeding/masking for tractography), and I think the value added by these changes to XTRACT should be shared with the field. While other bundle segmentation software typically includes these types of changes in release notes, I think papers are more appropriate.

      We would like to thank the reviewer for their assessment and we appreciate the comments for improving our manuscript. We have added new results, sampling from a larger cohort with a typical dMRI protocol (N=50 from UK Biobank), as well as showcasing examples from individual subject reconstructions (Supplementary figures S6, S7). We also demonstrate comparisons against another approach that has been proposed for extracting parts of the cortico-striatal bundle in a bundle segmentation fashion, as the reviewer suggests (see comment and Author response image 1 below). 

      We would also like to take the opportunity to summarise the novelty of our contributions, as detailed in the Introduction, which we believe extend beyond a mere software update; this is a byproduct of this work rather than the aim.

      i) We devise for the first time standard-space protocols for 21 challenging cortico-subcortical bundles for both human and macaque and we interrogate them in a comprehensive manner.

      ii) We demonstrate robustness of these protocols using criteria grounded on neuroanatomy, showing that tractography reconstructions follow topographical principles known from tracers both in WM and GM and for both species. We also show that these protocols capture individual variability as assessed by respecting family structure in data from the HCP twins.

      iii) We use high-resolution dMRI data (HCP and post-mortem macaque) to showcase feasibility of these reconstructions, and we show that reconstructions are also plausible with more conventional data, such as the ones from the UK Biobank.

      iv) We further showcase robustness and the value of cross-species mapping by using these tractography reconstructions to predict known homologous grey matter (GM) regions across the two species, both in cortex and subcortex, on the basis of similarity of grey matter areal connection patterns to the set of proposed white matter bundles.

      Weaknesses

      (2) The demonstration of the new tracts does not include a large number of carefully selected scans and is only compared to the prior methods in XTRACT. The small n and limited statistical comparisons are insufficient to claim that they are better than an alternative. Qualitatively, this method looks sound.

      We appreciate the suggestion of a larger sample size and have performed the same analysis using 50 randomly drawn UK Biobank subjects instead of ON-Harmony, matching the N=50 randomly drawn HCP subjects (detailed explanation in the comment below, Main text Figure 4A; Supplementary Figure S4). We also generated results using the full set of N=339 HCP unrelated subjects (Supplementary Figure S5 compares 10, 50 and 339 unrelated HCP subjects). We provide further details in the relevant point (3) below.

      With regard to comparisons to other methods, there are not many analogous approaches that we can compare against. To our knowledge there are no previous cross-species, standard-space tractography protocols for the tracts we considered in this study (including Muratoff, amygdalofugal, different parts of extreme and external capsules, along with their neighbouring tracts). We therefore i) directly compared against independent neuroanatomical knowledge and patterns (Figures 2, 3, 5), ii) confirmed that patterns against data quality and individual variability that the new tracts demonstrate are similar to patterns observed for the more established cortical tracts (Figure 4), iii) indirectly assessed efficacy by performing a demanding task, such as homologue identification on the basis of the tracts we reconstruct (Figures 6, 7).

      We need to point out that our approach is not “bundle segmentation”, in the sense of “data-driven” approaches that cluster streamlines into bundles following full-brain tractography. The latter is different in spirit and assigns a label to each generated streamline; as full-brain tractography is challenging (Maier-Hein, Nature Comms 2017), we instead follow the approach of imposing anatomical constraints to mitigate some of these challenges, as suggested in (Maier-Hein, 2017).

      Nevertheless, we used TractSeg (one of the few alternatives that considers corticostriatal bundles) to perform some comparisons. The Author response image below shows average path distributions across 10 HCP subjects for a few bundles that we also reconstruct in our paper (no temporal part of striatal bundle is generated by TractSeg). We can observe that the output for each tract is highly overlapping across subjects, indicating that there is not much individual variability captured. We also see the reduced specificity in the connectivity end-points of the bundles.

      Author response image 1.

      Comparison between 10-subject average for example subcortical tracts using TractSeg and XTRACT. We chose example bundles shared between our set and TractSeg. Per subject, TractSeg produces a binary mask rather than a path distribution per tract. Furthermore, the mask is highly overlapping across subjects. Where direct correspondence was not possible, we found the closest matching tract. Specifically, we used ST_PREF for StB<sub>f</sub>, and merged ST_PREC with ST_POSTC to match StB<sub>m</sub>. There was no correspondence for the temporal part of StB.

      We subsequently performed the twinness test using both TractSeg and XTRACT (Author response image 2), as a way to assess whether aspects of individual variability can be captured. Due to heritability of brain organisation features, we anticipate that monozygotic twins have more similar tract reconstructions compared to dizygotic twins and subsequently non-twin siblings. This pattern is reproduced using our proposed approach, but not using TractSeg, which provides a rather flat pattern.

      Author response image 2.

      Violin plots of the mean pairwise Pearson’s correlations across tracts between 72 monozygotic (MZ) twin pairs, 72 dizygotic (DZ) twin pairs, 72 non-twin sibling pairs, and 72 unrelated subject pairs from the Human Connectome Project, using TractSeg (left) and XTRACT (right). About 12 cortico-subcortical tracts were considered, as closely matched as possible between the two approaches. For TractSeg we considered: 'CA', 'FX', 'ST_FO', 'ST_M1S1' (merged 'ST_PREC' and 'ST_POSTC' to approximate the sensorimotor part of our striatal bundle), 'ST_OCC', 'ST_PAR', 'ST_PREF', 'ST_PREM', 'T_M1S1' (merged 'T_PREC' and 'T_POSTC' to approximate the sensorimotor part of our striatal bundle), 'T_PREF', 'T_PREM', 'UF'. For XTRACT we considered: 'ac', 'fx', 'StB<sub>f</sub>', 'StB<sub>m</sub>', 'StB<sub>p</sub>', 'StB<sub>t</sub>', 'EmC<sub>f</sub>', 'EmC<sub>p</sub>', 'EmC<sub>t</sub>', 'MB', 'amf', 'uf'. Showing the mean (μ) and standard deviation (σ) for each group. There were no significant differences between groups using TractSeg.

      Taken together, these results indicate, as a minimum, that the two approaches have potentially different aims. Their different behaviour can be desirable and beneficial for different applications (for instance WM ROI segmentation vs connectivity analysis), but it makes like-for-like comparisons challenging.

      (3) “Subject selection at each stage is unclear in this manuscript. On page 5 the data are described as "Using dMRI data from the macaque (𝑁 = 6) and human brain (𝑁 = 50)". Were the 50 HCP subjects selected to cover a range of noise levels or subject head motion? Figure 4 describes 72 pairs for each of monozygotic, dizygotic, non-twin siblings, and unrelated pairs - are these treated separately? Similarly, NH had 10 subjects, but each was scanned 5 times. How was this represented in the sample construction?”

      We appreciate the suggestions and we agree that some of the choices in terms of group sizes may have been confusing. The short answer is that we did not perform any subject selection; subjects were randomly drawn from what we had available. The 72 twin pairs are simply the maximum number of monozygotic twin pairs available in the HCP cohort, so we used 72 pairs in all categories to match this number in these specific tests. The N=6 animals are good quality post-mortem dMRI data that have been acquired in the past and that we cannot easily expand. For the rest of the points, we have now made the following changes:

      We have replaced our comparison to the ON-Harmony dataset (10 subjects) with a comparison to 50 unrelated UK Biobank subjects (to match the 50 unrelated HCP subject cohort used throughout). Updated results can be seen in Figure 4A and Supplementary Figure S4. This allows a comparison of tractography reconstruction between high quality and more conventional quality data for the same N.

      We looked at QC metrics to ensure our chosen cohorts were representative of the full cohorts we had available. The N=50 unrelated HCP cohort and N=50 unrelated UK Biobank cohort we used in the study captured well the range of the full N=339 unrelated HCP cohort and N=7192 UK Biobank cohort in terms of absolute/relative motion (Author response image 3A and 3B respectively). A similar pattern was observed in terms of SNR and CNR ranges (Author response image 4).

      We generated tractography reconstructions for single subjects, corresponding to the 10th percentile (P<sub>10</sub>), median (P<sub>50</sub>) and 90th percentile (P<sub>90</sub>) of the distributions with respect to similarity to the cohort average maps. These are now shown in Supplementary Figures S6, S7. We also checked the QC metrics for these single subjects and confirmed that average absolute subject motion was highest for the P<sub>10</sub>, followed by the P<sub>50</sub> and lowest for the P<sub>90</sub> subject, capturing a range of within cohort data quality.

      We generated reconstructions for an even larger HCP cohort (all 339 unrelated HCP subjects) and these look very similar to the N=50 reconstructions (Supplementary Figure S5).
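
      As a minimal sketch of the single-subject selection described above (ranking subjects by similarity to the cohort-average map, then picking the 10th percentile, median and 90th percentile), the following assumes that one similarity score per subject has already been computed. The subject IDs, the scores and the nearest-rank percentile rule are illustrative assumptions and are not taken from the study.

```python
def nearest_rank(sorted_items, pct):
    """Return the item at the pct-th percentile using the nearest-rank rule.

    The rank is ceil(pct / 100 * n), computed with integer arithmetic,
    and clamped so that it is at least 1.
    """
    n = len(sorted_items)
    rank = max(1, (pct * n + 99) // 100)
    return sorted_items[rank - 1]


# Hypothetical per-subject similarity to the cohort-average map
# (for example, a mean correlation across tracts).
similarity = {
    "s01": 0.62, "s02": 0.71, "s03": 0.55, "s04": 0.80, "s05": 0.77,
    "s06": 0.68, "s07": 0.83, "s08": 0.74, "s09": 0.59, "s10": 0.79,
}

# Rank subjects from least to most similar to the cohort average.
ranked = sorted(similarity, key=similarity.get)

# Representative subjects at the 10th, 50th and 90th percentiles.
picks = {pct: nearest_rank(ranked, pct) for pct in (10, 50, 90)}
print(picks)
```

      With real data, the scores would come from comparing each subject's tract maps against the group average; other percentile conventions (for example, interpolation) would pick slightly different subjects in small cohorts.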

      Author response image 3.

      Subsets chosen from the HCP and UKB reflect similar range of average motion (relative and absolute) to the corresponding full cohorts. (A) Absolute and relative motion comparison between N=50 and N=339 unrelated HCP subjects. (B) Absolute and relative motion comparison between N=50 and N=7192 super-healthy UKB subjects.  

      Author response image 4.

      Average SNR and CNR values show similar range between the N=50 UKB subset and the full UK Biobank cohort of N=7192.

      (4) In the paper, the authors state "the mean agreement between HCP and NH reconstructions was lower for the new tracts, compared to the original protocols (𝑝 < 10^−10). This was due to occasionally reconstructing a sparser path distribution, i.e., slightly higher false negative rate," - how can we know this is a false negative rate without knowing the ground truth?

      We are sorry for the confusing terminology, which we have now corrected. Indeed, we cannot call it a false negative; what we meant is that reconstructions from lower resolution data for these bundles were in general sparser than the ones from the high-resolution data, potentially missing parts of the tract. We have now revised the text accordingly.

      Reviewer #2 Public Review:

      (5) Summary:

      In this article, Assimopoulos et al. expand the FSL-XTRACT software to include new protocols for identifying cortical-subcortical tracts with diffusion MRI, with a focus on tracts connecting to the amygdala and striatum. They show that the amygdalofugal pathway and divisions of the striatal bundle/external capsule can be successfully reconstructed in both macaques and humans while preserving large-scale topographic features previously defined in tract tracing studies. The authors set out to create an automated subcortical tractography protocol, and they accomplished this for a subset of specific subcortical connections for users of the FSL ecosystem.

      Strengths:

      A main strength of the current study is the translation of established anatomical knowledge to a tractography protocol for delineating cortical-subcortical tracts that are difficult to reconstruct. Diffusion MRI-based tractography is highly prone to false positives; thus, constraining tractography outputs by known anatomical priors is important. Key additional strengths include 1) the creation of a protocol that can be applied to both macaque and human data; 2) demonstration that the protocol can be applied to high quality data (3 shells, > 250 directions, 1.25 mm isotropic, 55 minutes) and lower quality data (2 shells, 100 directions, 2 mm isotropic, 6.5 minutes); and 3) validation that the anatomy of cortical-subcortical tracts derived from the new method is more similar in monozygotic twins than in siblings and unrelated individuals.

      We thank the Reviewer for the globally positive evaluation of this work and the pertinent comments that have helped us to improve the paper.

      Weaknesses

      (6) Although this work validates the general organizational location and topographic organization of tractography-derived cortical-subcortical tracts against prior tract tracing studies (a clear strength), the validation is purely visual and thus only qualitative. Furthermore, it is difficult to assess how the current XTRACT method may compare to currently available tractography approaches to delineating similar cortical-subcortical connections. Finally, it appears that the cortical-subcortical tractography protocols developed here can only be used via FSL-XTRACT (yet not with other dMRI software), somewhat limiting the overall accessibility of the method.

      We agree that a more quantitative comparison against gold standard tracing data would be ideal. However, there are practical challenges that prohibit such a comparison at this stage: i) Access to data. There are no quantifiable, openly shared, large scale/whole brain tracing data available. The Markov study provided the only openly available weighted connectivity matrices measured by tracers in macaques (Markov, Cereb Cortex 2014), which are only cortico-cortical and do not provide the white matter routes; they only quantify the relative contrast in connection terminals. ii) 2D microscopy vs 3D tractography. The vast majority of tracing data one can find in neuroanatomy labs is on 2D microscopy slices with restricted field of view, which is also the case for the data we had access to for this study. This significantly complicates like-for-like comparisons against 3D whole-brain tractography reconstructions. iii) Quantifiability is tricky even in the case of gold standard axonal tracing, as it depends on nuisance factors, e.g. injection site, injection size, injection uniformity and coverage, which confound the gold-standard measurements, but are not relevant for tractography. For these reasons, a number of high-profile NIH BRAIN CONNECTS Centres (for instance https://connects.mgh.harvard.edu/, https://mesoscaleconnectivity.org/) are resourced to address these challenges at scale in the coming years and provide the tools to the community to perform such quantitative comparisons in the future.

      In terms of comparison with other approaches, we have performed new tests and detail a response to a similar comment (2) from Reviewer 1.

      Finally, our protocols have been FSL-tested, but have nothing that is FSL specific. We cannot speak of performance when used with other tools, but there is nothing that prohibits translation of these standard space protocols to other tools. In fact, the whole idea behind XTRACT was to generate an approach open to external contributions for bundle-specific delineation protocols, both for humans and for non-human species. A number of XTRACT extensions have been published over the last 5 years for other NHP species (Roumazeilles et al. (2020); Bryant et al. (2020); Wang et al. (2025)) and similar approaches have been used in commercial packages (Boshkovski et al, 2106, ISMRM 2022).

      Recommendations To the Authors:

      (7) Superiority of the FSL-XTRACT approach to delineating cortical-subcortical tracts. The Introduction of the article describes how "Tractography protocols for white matter bundles that reach deeper subcortical regions, for instance the striatum or the amygdala, are more difficult to standardize" due to the size, proximity, complexity, and bottlenecks associated with cortical-subcortical tracts. It would be helpful for the authors to better describe how the analytic approach adopted here overcomes these various challenges. What does the present approach do differently than prior efforts to examine cortical-subcortical connectivity?

      There have not been many prior efforts to standardise cortico-subcortical connectivity reconstructions, as we overview in the Introduction. As outlined in (Schilling et al. (2020), https://doi.org/10.1007/s00429-020-02129-z), tractography reconstructions can be highly accurate if we guide them using constraints that dictate where pathways are supposed to go and where they should not go. This is the philosophy behind XTRACT and all the proposed protocols, which provide neuroanatomical constraints across different bundles. At the same time these constraints are relatively coarse so that they are species-generalisable. We have clarified that in Discussion. The approach we took was to first identify anatomical constraints from neuroanatomy literature for each tract of interest independently, derive and test these protocols in the macaque, and then optimise in an iterative fashion until the protocols generalise well to humans and until, when considering groups of bundles, the generated reconstructions can follow topographical principles known from tract tracing literature. This process took years in order to perform these iterations as meticulously as we could. We have modified the first sections in Methods to reflect this better (3rd paragraph of 1st Methods section), as well as modified the third and second to last paragraphs of the Introduction (“We propose an approach that addresses these challenges…”).

      (8) Relatedly, it is difficult to fully evaluate the utility of the current approach to dissecting cortical-subcortical tracts without a qualitative or quantitative comparison to approaches that already exist in the field. Can the authors show that (or clarify how) the FSL-XTRACT approach is similar to - or superior to - currently available methods for defining cortical-striatal and amygdalofugal tracts (e.g., methods they cite in the Introduction)?

      From the limited similar approaches that exist, we did perform some comparisons against TractSeg; please see Reply to Comment 2 from Reviewer 1. We have also expanded the relevant text in the Introduction to clarify the differences:

      “…However, these either utilise labour-intensive single-subject protocols (22,26), are not designed to be generalisable across species (42, 43), or are based mostly on geometrically-driven parcellations that do not necessarily preserve topographical principles of connections (40). We propose an approach that addresses these challenges and is automated, standardised, generalisable across two species and includes a larger set of cortico-subcortical bundles than considered before, yielding tractography reconstructions that are driven by neuroanatomical constraints.”

      (9) Future applications of the tractography protocol:

      It would be helpful for the authors to describe the contexts in which the automated tractography approach developed here can (and cannot) be applied in future studies. Are future applications limited to diffusion data that has been processed with FSL's BEDPOSTX and PROBTRACKX? Can FSL-XTRACT take in diffusion data modelled in other software (e.g., with CSD in mrtrix or with GQI in DSI Studio)? Can the seed/stop/target/exclusion ROIs be applied to whole-brain tractography generated in other software? Integration with other software suites would increase the accessibility of the new tract dissection protocols.

      We have added some text in the Discussion to clarify this point. Our protocols have been FSL-tested, but have nothing that is FSL specific. We cannot speak of performance of other tools, but there is nothing that prohibits translation of these standard space protocols to other tools. As described before, the protocols are recipes with anatomical constraints including regions the corresponding white matter pathways connect to and regions they do not, constructed with cross-species generalisability in mind. In fact, a number of other packages (even commercial) have adopted the XTRACT protocols with success in the past, so we do not see anything in principle that prohibits these new protocols from being similarly adopted.

      We cannot comment on the protocols’ relevance for segmenting whole-brain tractograms, as these can induce more false positives than tractography reconstructions from smaller seed regions and may require stricter exclusions.

      (10) It was great to see confirmation that the XTRACT approach can be successfully applied in both high-quality diffusion data from the HCP and in the ON-Harmony data. Given the somewhat degraded performance in the lower quality dataset (e.g., Figure 4A), can the authors speak to the minimum data requirements needed to dissect these new cortical-subcortical tracts? Will the approach work on single-shell, low b data? Is there a minimum voxel resolution needed? Which tracts are expected to perform best and worst in lower-quality data?

      Thank you for these comments. We have not tried lower (spatial and angular) resolution data, but given the proximity of the tracts considered, as well as the small size of some bundles, we would not recommend lower resolution than that of the UK Biobank protocol. In general, we would consider the UK Biobank protocol (2mm, 2 shells) as the minimum, and any modern clinical scanner can achieve this in 6-8 minutes. We hence evaluated performance from high quality HCP to lower quality UK Biobank data, covering a considerable range (scan time from 55 minutes down to 6 minutes).

      In terms of which tract reconstructions were more reproducible for UK Biobank data, the tracts with lowest correlations across subjects (Figure 4) were the anterior commissure (AC) and the temporal part of the Extreme Capsule (EmC<sub>t</sub>), while the highest correlations were for the Muratoff Bundle (MB) and the temporal part of the Striatal Bundle (StB<sub>t</sub>). Interestingly, for the HCP data, the temporal part of the Extreme Capsule (EmC<sub>t</sub>) and the Muratoff Bundle were also the tracts with the lowest/highest correlations, respectively. Hence, certain tract reconstructions were consistently more variable than others across subjects, which may hint at these tracts also being more challenging to reconstruct. We have now clarified these aspects in the corresponding Results section.

      (11) Anatomical validation of the new cortical-subcortical tracts

      I really appreciated the use of prior tract tracing findings to anatomically validate the cortical-subcortical tractography outputs for both the cortical-striatal and amygdalofugal tracts. It struck me, however, that the anatomical validation was purely qualitative, focused on the relative positioning or the topographical organization of major connections. The anatomical validation would be strengthened if profiles of connectivity between cortical regions and specific subcortical nuclei or subcortical subdivisions could be quantitatively compared, if at all possible. Can the differential connectivity shown visually for the putamen in Figure 3 be quantified for the tract tracing data and the tractography outputs? Does the amygdalofugal bundle show differential/preferential connectivity across amygdala nuclei in tract tracing data, and is this seen in tractography?

      We appreciate the comment; please see Reply to your comment 6 above. In addition to the challenges described there, we do not have access to terminal fields other than in the striatum and these are 2D, so we make a qualitative comparison of the relevant connectivity contrasts. We expect that a number of currently ongoing high-profile BRAIN CONNECTS Centres (such as the LINC and the CMC) will be addressing such challenges in the coming years and will provide the tools and data to the community to perform such quantitative comparisons at scale.

      (12) I believe that all visualizations of the macaque and human tractography showed group-averaged maps. What do these tracts look like at the individual level? Understanding individual-level performance and anatomical variation is important, given the Discussion paragraph on using this method to guide neuromodulation.

      We now demonstrate some representative examples of individual subject reconstructions in Supplementary Figures S6, S7, ranking subjects by the average agreement of individual tract reconstructions to the mean and depicting the 10th percentile, median and 90th percentile of these subjects. We have also shown more results in Author response images 1-2, generated by TractSeg, to indicate how a different bundle segmentation approach would handle individual variability compared to our approach.

      (13) Connectivity-based comparisons across species:

      Figures 5 and 6 of the manuscript show that, as compared to using only cortico-cortical XTRACT tracts, using the full set of XTRACT tracts (with new cortical-subcortical tracts) allows for more specific mapping of homologous subcortical and cortical regions across humans and macaques. Is it possible that this result is driven by the fact that the "connectivity blueprints" for the subcortex did not use an intermediary GM x WM matrix to identify connection patterns, whereas the connectivity blueprints for the cortex did? I was surprised that a whole brain GM x WM connectivity matrix was used in the cortical connectivity mapping procedure, given known problems with false positives etc., when doing whole brain tractography - especially after such anatomical detail was considered when deriving the original tracts. Perhaps the intermediary step lowers connectivity specificity and accuracy overall (as per Figure 9), accounting for the poorer performance for cortico-cortical tracts?

      The point is well-taken; however, it cannot drive the results in Figures 5 and 6. Before explaining this further, let us clarify the rationale of using the GMxWM connectivity matrix, which we have published quite extensively in the past for cortico-cortical connections (Mars, eLife 2018; Warrington, Neuroimage 2020; Roumazeilles, PLoS Biology 2020; Warrington, Science Advances 2022; Bryant, J Neuroscience 2025).

      Having established the bodies of the tracts using the XTRACT protocols, we use this intermediate step of multiplying with a GM x WM connectivity matrix to estimate the grey matter projections of the tracts. The most obvious approach of tracking towards the grey matter (i.e. simply finding where tracts intersect GM) has the problem that one moves through bottlenecks in the cortical gyrus, after which fibres fan out. Most tractography algorithms have problems resolving this fanning. We therefore take the opposite approach of tracking from the grey matter surface towards the white matter (GMxWM connectivity matrix), thus following the direction in which the fibres are expected to merge, rather than to fan out. We then multiply the GMxWM tractogram with that of the body of the tract to identify the grey matter endpoints of the tract. This avoids some of the major problems associated with tracking towards the surface. In fact, using this approach improves connectivity specificity towards the cortex, rather than the opposite. We provide some indicative results here for a few tracts:

      Author response image 5.

      Connectivity profiles for example cortico-cortical tracts with and without using the intermediary GMxWM matrix. Tracts considered are the Superior Longitudinal Fasciculus 1 (SLF<sub>1</sub>), Superior Longitudinal Fasciculus 2 (SLF<sub>2</sub>), the Frontal Aslant (FA) and the Inferior Fronto-Occipital Fasciculus (IFO). We see that the surface connectivity patterns without using the GMxWM intermediary matrix are more diffuse (effect of “fanning out” gyral bias), with reduced specificity, compared to when using the GMxWM matrix.
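
      The intermediate multiplication step described above can be illustrated with a toy sketch. It assumes a GM x WM matrix (rows are grey matter locations, columns are white matter voxels) and a WM x tract matrix holding each tract's path distribution; all values, matrix sizes and variable names are hypothetical, and the final row normalisation is an assumption made here for readability rather than a statement of the exact published pipeline.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (rows x inner) and b (inner x cols)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def row_normalise(m):
    """Scale each row to sum to 1 (rows that sum to 0 are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


# gm_by_wm[i][k]: toy count of streamlines seeded at GM location i
# that visit WM voxel k (tracking from grey matter towards white matter).
gm_by_wm = [
    [5, 1, 0],
    [0, 2, 4],
]

# wm_by_tract[k][t]: toy path distribution of tract t at WM voxel k.
wm_by_tract = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

# GM x tract matrix: how strongly each GM location connects to each tract.
blueprint = row_normalise(matmul(gm_by_wm, wm_by_tract))
print(blueprint)
```

      In this toy case the first grey matter location is dominated by the first tract and the second location by the second tract, which is the kind of endpoint specificity the multiplication is meant to recover.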

      Tracking to/from subcortical nuclei does not have the same tractography challenges as tracking towards the cortex and in fact we found that using the intermediary GMxWM matrix is less favourable for subcortex (Figure 9), which is why we opted for not using it. 

      Regardless of how cortical and subcortical connectivity patterns are obtained, the results in Figures 5 and 6 utilise only cortical connectivity patterns. Hence, no matter what tracts are considered (cortico-cortical or cortico-subcortical) to build the connectivity patterns, these results have always been obtained using the intermediate step of multiplying with the GMxWM connectivity matrix. It is not the case that cortical features are obtained with the intermediate step and subcortical features without; all of them have the intermediate step applied, as the connectivity patterns comprise cortical endpoints. Figure 9 is only applicable for subcortical endpoints, which play no role in the comparisons shown in Figures 5 and 6. We hope this clarifies this point.

      (14) Methodological clarifications:

      The Methods describe how anatomical masks used in tractography were delineated in standard macaque space and then translated to humans using "correspondingly defined landmarks". Can the authors elaborate as to how this translation from macaques to humans was accomplished?

      For a given tract, our process for building a protocol involved looking into the wider anatomical literature, including the standard white matter atlas of Schmahmann and Pandya (2006) and numerous anatomy papers that are referenced in the protocol description, to determine the expected path the tract was meant to take in white matter and which cortical and subcortical regions are connected. This helped us define constraints and subsequently the corresponding masks. The masks were created through the combination of hand-drawn ROIs and standard space atlases. We firstly started with the macaque where tracer literature is more abundant, but, importantly, our protocol definitions have been designed such that the same protocol can be applied to the human and macaque brain. All choices were made with this aspect in mind, hence corresponding landmarks between the two brains were considered in the mask definition (for instance “the putamen”, “a sub-commissural white matter mask”, the “whole frontal pole” etc, as described in the protocol descriptions).

      The protocols have not been created by a single expert but have been collated from multiple experts (co-authors SA, SW, DF, KB, SH, SS drove this aspect) and the final definitions have been agreed upon by the authors. 

      (15) The article heavily utilizes spatial path distribution maps/normalized path distributions, yet does not describe precisely what these are and how they were generated. Can the authors provide more detail, along with the rationale for using these with Pearson's correlations to compare tracts across subjects (as opposed to, e.g., overlap sensitivity/specificity or the Jaccard coefficient)?

      We have now clarified in text how these plots are generated, particularly when compared using correlation values. We tried Jaccard indices on binarized masks of the tracts and these gave similar trends to the correlations reported in Figure 4 (i.e. higher similarities within than across cohorts). We however feel that correlations are better than Jaccard indices, as the latter assume binary masks and so focus on spatial overlap, ignoring the actual values of the path distributions. We hence kept correlations in the paper.
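
      To illustrate the distinction drawn above, the following sketch compares two toy path distributions that occupy the same voxels but weight them differently. The voxel values and the binarisation threshold are hypothetical and chosen only to make the contrast visible; they are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of voxel values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def jaccard(x, y, threshold):
    """Jaccard index of the two maps after binarising at `threshold`."""
    bx = [v > threshold for v in x]
    by = [v > threshold for v in y]
    inter = sum(a and b for a, b in zip(bx, by))
    union = sum(a or b for a, b in zip(bx, by))
    return inter / union if union else 0.0


# Toy normalised path distributions over six voxels (hypothetical values).
subject_a = [0.9, 0.7, 0.4, 0.1, 0.0, 0.0]
subject_b = [0.1, 0.2, 0.4, 0.9, 0.0, 0.0]  # same support, different weights

overlap = jaccard(subject_a, subject_b, threshold=0.05)
correlation = pearson(subject_a, subject_b)
print(overlap, correlation)
```

      Here the binarised masks overlap perfectly, so the Jaccard index cannot distinguish the two maps, whereas the correlation is low because the values within the shared support differ.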

      Reviewing Editor Comments

      “The reviewers had broadly convergent comments and were enthusiastic about the work. As further detailed by Reviewer 3 (see below), if the authors choose to pursue revisions, there are several elements that have the potential to enhance impact.”

      Thank you, we have replied accordingly and aimed to address most of the comments of the Reviewers.   

      “Comparison to existing methods. How does this approach compare to other approaches cited by the authors?”

      Please see replies to Comment 2 of Reviewer 1 and Comment 7 of Reviewer 2. Briefly, we have now generated new results and clarified aspects in the text. 

      “Minimum data requirements. How broadly can this approach be used across scan variation? How does this impact data from individual participants? Displaying individual participants may help, in addition to group maps.”

      Please see replies to Comment 10 of Reviewer 2 on minimum data requirements and individual participants, as well as to Comment 3 of Reviewer 1 on the actual groups considered. Briefly, we have generated new figures and regenerated results using UK Biobank data.

      “Software. What are the software requirements? Is the approach interoperable with other methods?”

      Please see Reply to Comment 9 of Reviewer 2. Our protocols can be used to guide tractography using other types of data, as they comprise guiding ROIs for a given tract. So, although we have not tested them beyond FSL-XTRACT, we believe they can be useful with other tractography packages as well, as there is nothing FSL-specific in these anatomically-informed recipes.

      “Comparisons with tract tracing. To the degree possible, quantitative comparisons with tract tracing data would bolster confidence in the method.”

      Please see Replies to Comments 6 and 11 of Reviewer 2. Briefly, we appreciate the comment and it is something we would love to do, but there are no data readily available that would allow such quantitative comparison in a meaningful way. This is a known challenge in the tractography field, which is why NIH has invested in two 5-year Centres to address it. Our approach will provide a solid starting point for optimising and comparing further cortico-subcortical tractography reconstructions against microscopy and tracers in the same animal and at scale.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this study, Gu et al. employed novel viral strategies, combined with in vivo two-photon imaging, to map the tone response properties of two groups of cortical neurons in A1. The thalamocortical recipient (TR neurons) and the corticothalamic (CT neurons). They observed a clear tonotopic gradient among TR neurons but not in CT neurons. Moreover, CT neurons exhibited high heterogeneity of their frequency tuning and broader bandwidth, suggesting increased synaptic integration in these neurons. By parsing out different projecting-specific neurons within A1, this study provides insight into how neurons with different connectivity can exhibit different frequency response-related topographic organization.

      Strengths:

      This study reveals the importance of studying neurons with projection specificity rather than layer specificity since neurons within the same layer have very diverse molecular, morphological, physiological, and connectional features. By utilizing a newly developed rabies virus CSN-N2c GCaMP-expressing vector, the authors can label and image specifically the neurons (CT neurons) in A1 that project to the MGB. To compare, they used an anterograde trans-synaptic tracing strategy to label and image neurons in A1 that receive input from MGB (TR neurons).

      Weaknesses:

      Perhaps as cited in the introduction, it is well known that tonotopic gradient is well preserved across all layers within A1, but I feel if the authors want to highlight the specificity of their virus tracing strategy and the populations that they imaged in L2/3 (TR neurons) and L6 (CT neurons), they should perform control groups where they image general excitatory neurons in the two depths and compare to TR and CT neurons, respectively. This will show that it's not their imaging/analysis or behavioral paradigms that are different from other labs. 

      We thank the reviewer for these constructive suggestions. As recommended, we have performed control experiments that imaged the general excitatory neurons in superficial layers (shown below), and the results showed a clear tonotopic gradient, which was consistent with previous findings (Bandyopadhyay et al., 2010; Romero et al., 2020; Rothschild et al., 2010; Tischbirek et al., 2019), thereby validating the reliability of our imaging/analysis approach. The results are presented in a new supplemental figure (Figure 2- figure supplementary 3).

      Related publications:

      (1) Gu M, Li X, Liang S, Zhu J, Sun P, He Y, Yu H, Li R, Zhou Z, Lyu J, Li SC, Budinger E, Zhou Y, Jia H, Zhang J, Chen X. 2023. Rabies virus-based labeling of layer 6 corticothalamic neurons for two-photon imaging in vivo. iScience 26: 106625. DOI: https://doi.org/10.1016/j.isci.2023.106625, PMID: 37250327

      (2) Bandyopadhyay S, Shamma SA, Kanold PO. 2010. Dichotomy of functional organization in the mouse auditory cortex. Nat Neurosci 13: 361-8. DOI: https://doi.org/10.1038/nn.2490, PMID: 20118924

      (3) Romero S, Hight AE, Clayton KK, Resnik J, Williamson RS, Hancock KE, Polley DB. 2020. Cellular and Widefield Imaging of Sound Frequency Organization in Primary and Higher Order Fields of the Mouse Auditory Cortex. Cerebral Cortex 30: 1603-1622. DOI: https://doi.org/10.1093/cercor/bhz190, PMID: 31667491

      (4) Rothschild G, Nelken I, Mizrahi A. 2010. Functional organization and population dynamics in the mouse primary auditory cortex. Nat Neurosci 13: 353-60. DOI: https://doi.org/10.1038/nn.2484, PMID: 20118927

      (5) Tischbirek CH, Noda T, Tohmi M, Birkner A, Nelken I, Konnerth A. 2019. In Vivo Functional Mapping of a Cortical Column at Single-Neuron Resolution. Cell Rep 27: 1319-1326 e5. DOI: https://doi.org/10.1016/j.celrep.2019.04.007, PMID: 31042460

      Figures 1D and G, the y-axis is Distance from pia (%). I'm not exactly sure what this means. How does % translate to real cortical thickness?

      We thank the reviewer for this question. For each mouse, the distance of labeled cells from the pia was normalized to the entire distance from the pia to the L6/WM border, following a previous study (Chang and Kawai, 2018). Across all mice tested, the distance from the pia to the L6/WM border was 826.5 ± 23.4 µm (range: 752.9 to 886.1 µm).
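      As a minimal sketch of this normalization (not the authors' analysis code), the conversion between absolute depth and percent distance from pia can be written as below. The function names are hypothetical, and the sketch assumes that both the cell depth and the pia-to-L6/WM distance are expressed in micrometres for the same mouse.

```python
def normalized_depth_percent(depth_um, pia_to_wm_um):
    """Express a cell's depth as a percentage of the pia-to-L6/WM distance.

    Both arguments are assumed to be in micrometres and measured in the
    same animal (hypothetical helper, for illustration only).
    """
    if pia_to_wm_um <= 0:
        raise ValueError("pia_to_wm_um must be positive")
    return 100.0 * depth_um / pia_to_wm_um


def percent_to_depth_um(percent, pia_to_wm_um):
    """Inverse mapping: recover absolute depth from a normalized percentage."""
    if pia_to_wm_um <= 0:
        raise ValueError("pia_to_wm_um must be positive")
    return percent / 100.0 * pia_to_wm_um


# Illustrative values only: a cell halfway through a cortex whose
# pia-to-L6/WM distance equals the reported group mean.
example_percent = normalized_depth_percent(413.25, 826.5)
example_depth = percent_to_depth_um(example_percent, 826.5)
```

      Because the normalization is done per mouse, the same percentage corresponds to slightly different absolute depths in animals with different cortical thickness.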

      Related publications:

      Chang M, Kawai HD. 2018. A characterization of laminar architecture in mouse primary auditory cortex. Brain Structure and Function 223: 4187-4209. DOI: https://doi.org/10.1007/s00429-018-1744-8, PMID: 30187193

      For Figure 2G and H, is each circle a neuron or an animal? Why are they staggered on top of each other on the x-axis? If the x-axis is the distance from caudal to rostral, each neuron should have a different distance? Also, it seems like it's because Figure 2H has more circles, which is why it has more variation, thus not significant (for example, at 600 or 900um, 2G seems to have fewer circles than 2H). 

      We sincerely appreciate the reviewer’s careful attention to the details of our figures. Each circle in Figure 2G and H represents an individual imaging focal plane; the planes come from different animals, and the median BFs of some focal planes are similar, leading to partial overlap. Where circles overlap, their brightness is additive.

      Since fewer CT neurons, compared to TR neurons, responded to pure tones within each focal plane, as shown in Figure 2- figure supplementary 2, a larger number of focal planes were imaged to ensure a consistent and robust analysis of the pure tone response characteristics. The higher variance and lack of correlation in CT neurons is a key biological finding, not an artifact of sample size. The data clearly show a wide spread of median BFs at any given location for CT neurons, a feature absent in the TR population.

      Similarly, in Figures 2J and L, why are the circles staggered on the y-axis now? And is each circle now a neuron or a trial? It seems they have many more circles than Figure 2G and 2H. Also, I don't think doing a correlation is the proper stats for this type of plot (this point applies to Figures 3H and 3J).

      We regret any confusion this may have caused. Figure 2 illustrates the tonotopic gradient of CT and TR neurons at different scales. Specifically, Figures 2E-H present the imaging from the focal plane perspective (23 focal planes in Figure 2G, 40 focal planes in Figure 2H), whereas Figures 2I-L provide a more detailed view at the single-cell level (481 neurons in Figure 2J, 491 neurons in Figure 2L). Figures 2J and L therefore do have more circles than Figures 2G and H. The analysis at both scales consistently reveals a tonotopic gradient in TR neurons, whereas such a gradient is absent in CT neurons.

      We used Pearson correlation as a standard and direct method to quantify the linear relationship between a neuron's anatomical position and its frequency preference, which is widely used in the field to provide a quantitative measure (R-value) and a significance level (p-value) for the strength of a tonotopic gradient. The same statistical logic applies to testing for spatial gradients in local heterogeneity in Figure 3. We are confident that this is an appropriate and informative statistical approach for these data.
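      To make the statistic concrete, the following is a minimal sketch of a Pearson correlation between position along the caudal-rostral axis and BF. It is not the authors' code: the function names and the toy data are hypothetical, and expressing BF in octaves relative to a 2 kHz reference is an assumption made here so that a log-spaced tonotopic axis appears linear.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def bf_in_octaves(bf_khz, ref_khz=2.0):
    """Convert best frequencies (kHz) to octaves above a reference frequency."""
    return [math.log2(f / ref_khz) for f in bf_khz]


# Toy data (hypothetical): BF doubles every 300 um along the axis,
# i.e. a perfectly ordered tonotopic gradient.
positions_um = [0, 300, 600, 900, 1200]
bf_khz = [2, 4, 8, 16, 32]
r = pearson_r(positions_um, bf_in_octaves(bf_khz))
```

      The sketch returns only the R-value. A significance level would typically be obtained from a statistics package, for example `scipy.stats.pearsonr`, which reports both the coefficient and a p-value.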

      What does the inter-quartile range of BF (IQRBF, in octaves) imply? What's the interpretation of this analysis? I am confused as to why TR neurons show high IQR in HF areas compared to LF areas, which means homogeneity among TR neurons (lines 213 - 216). On the same note, how is this different from the BF variability?  Isn't higher IQR equal to higher variability?

      We thank the reviewer for raising this important point. IQRBF is a measure of local tuning heterogeneity: it quantifies the diversity of BFs among neighboring neurons. A small IQRBF means neighbors are similarly tuned (an orderly, homogeneous map), while a large IQRBF means neighbors have very different BFs (a disordered, heterogeneous map) (Winkowski and Kanold, 2013; Zeng et al., 2019).

      From the BF position reconstruction of all TR neurons (Figure 2I), most TR neurons in the high-frequency (HF) region respond to high-frequency sounds, but some respond to low frequencies such as 2 kHz, which contributes to the high IQR in HF areas. This does not contradict our main conclusion that the TR population is significantly more homogeneous than the CT population. BF variability represents the stability of a neuron's BF over time, while IQR represents the variability of BF among different neurons within a certain range (Chambers et al., 2023).
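      A minimal sketch of how such a local heterogeneity measure can be computed is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, the neighborhood is simply whatever list of BFs is passed in, and the quartiles are computed with the inclusive method of Python's `statistics.quantiles`, which may differ from the quartile convention used in the manuscript.

```python
import math
import statistics


def iqr_bf_octaves(neighbor_bfs_khz):
    """Inter-quartile range of best frequencies, in octaves.

    neighbor_bfs_khz: BFs (kHz) of the neurons in one local neighborhood,
    for example one imaging field. BFs are log2-transformed so that the
    result is expressed in octaves.
    """
    if len(neighbor_bfs_khz) < 2:
        raise ValueError("need at least two neurons to compute an IQR")
    log_bfs = [math.log2(f) for f in neighbor_bfs_khz]
    q1, _median, q3 = statistics.quantiles(log_bfs, n=4, method="inclusive")
    return q3 - q1


# Hypothetical neighborhoods: one homogeneous, one heterogeneous.
homogeneous = iqr_bf_octaves([8, 8, 8, 8, 8])
heterogeneous = iqr_bf_octaves([2, 4, 8, 16, 32])
```

      In this toy case the homogeneous neighborhood yields an IQR of zero octaves, while the neighborhood spanning 2 to 32 kHz yields a clearly larger value, which is the sense in which a larger IQRBF indicates a more disordered local map.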

      Related publications:

      (1) Chambers AR, Aschauer DF, Eppler JB, Kaschube M, Rumpel S. 2023. A stable sensory map emerges from a dynamic equilibrium of neurons with unstable tuning properties. Cerebral Cortex 33: 5597-5612. DOI: https://doi.org/10.1093/cercor/bhac445, PMID: 36418925

      (2) Winkowski DE, Kanold PO. 2013. Laminar transformation of frequency organization in auditory cortex. Journal of Neuroscience 33: 1498-508. DOI: https://doi.org/10.1523/JNEUROSCI.3101-12.2013, PMID: 23345224

      (3) Zeng HH, Huang JF, Chen M, Wen YQ, Shen ZM, Poo MM. 2019. Local homogeneity of tonotopic organization in the primary auditory cortex of marmosets. Proceedings of the National Academy of Sciences of the United States of America 116: 3239-3244. DOI: https://doi.org/10.1073/pnas.1816653116, PMID: 30718428

      Figure 4A-B, there are no clear criteria on how the authors categorize V, I, and O shapes. The descriptions in the Methods (lines 721 - 725) are also very vague.

      We apologize for the initial vagueness and have replaced the descriptions in the Methods section. “V-shaped”: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. “I-shaped”: Neurons whose FRAs show constant frequency selectivity with increasing intensity. “O-shaped”: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level.

      To provide better visual intuition, we show multiple representative examples of each FRA type for both TR and CT neurons below. We are confident that these provide the necessary clarity and reproducibility for our analysis of receptive field properties.

      Author response image 1.

      Different FRA types within the dataset of TR and CT neurons. Each row shows 6 representative FRAs from a specific type. Types are V-shaped (‘V’), I-shaped (‘I’), and O-shaped (‘O’). The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities.

      Reviewer #2 (Public Review):

      Summary:

      Gu and Liang et al. investigated how auditory information is mapped and transformed as it enters and exits an auditory cortex. They use anterograde transsynaptic tracers to label and perform calcium imaging of thalamorecipient neurons in A1 and retrograde tracers to label and perform calcium imaging of corticothalamic output neurons. They demonstrate a degradation of tonotopic organization from the input to output neurons.

      Strengths:

      The experiments appear well executed, well described, and analyzed.

      Weaknesses:

      (1) Given that the CT and TR neurons were imaged at different depths, the question as to whether or not these differences could otherwise be explained by layer-specific differences is still not 100% resolved. Control measurements would be needed either by recording (1) CT neurons in upper layers, (2) TR in deeper layers, (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      We appreciate these constructive suggestions. To address this, we performed new experiments and analyses.

      Comparison of TR neurons across superficial layers: we analyzed our existing TR neuron dataset to see if response properties varied by depth within the superficial layers. We found no significant differences in the fraction of tuned neurons, field IQR, or maximum bandwidth (BWmax) between TR neurons in L2/3 and L4. This suggests a degree of functional homogeneity within the thalamorecipient population across these layers. The results are presented in new supplemental figures (Figure 2- figure supplementary 4).

      Necessary control experiments.

      (1) CT neurons in upper layers. CT neurons are thalamic projection neurons that only exist in the deeper cortex, so CT neurons do not exist in upper layers (Antunes and Malmierca, 2021).

      (2) TR neurons in deeper layers. As we mentioned in the manuscript, high-titer AAV1-Cre can label neurons both anterogradely and retrogradely, so it is challenging to identify TR neurons unambiguously in deeper layers.

      (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      To directly test if projection identity confers distinct functional properties within the same cortical layers, we performed the crucial control of comparing TR neurons to their neighboring non-TR neurons. We injected AAV1-Cre in MGB and a Cre-dependent mCherry into A1 to label TR neurons red. We then co-injected AAV-CaMKII-GCaMP6s to label the general excitatory population green.  In merged images, this allowed us to functionally image and directly compare TR neurons (yellow) and adjacent non-TR neurons (green). We separately recorded the responses of these neurons to pure tones using two-photon imaging. The results show that TR neurons are significantly more likely to be tuned to pure tones than their neighboring non-TR excitatory neurons. This finding provides direct evidence that a neuron's long-range connectivity, and not just its laminar location, is a key determinant of its response properties. The results are presented in new supplemental figures (Figure 2- figure supplementary 5).

      Related publications:

      Antunes FM, Malmierca MS. 2021. Corticothalamic Pathways in Auditory Processing: Recent Advances and Insights From Other Sensory Systems. Front Neural Circuits 15: 721186. DOI: https://doi.org/10.3389/fncir.2021.721186, PMID: 34489648

      (2) What percent of the neurons at the depths are CT neurons? Similar questions for TR neurons?

      We thank the reviewer for the comments. We performed histological analysis on brain slices from our experimental animals to quantify the density of these projection-specific populations. Our analysis reveals that CT neurons constitute approximately 25.47% (22.99%–36.50%) of all neurons in Layer 6 of A1. In the superficial layers (L2/3 and L4), TR neurons comprise approximately 10.66% (10.53%–11.37%) of the total neuronal population.

      Author response image 2.

      The fraction of CT and TR neurons. (A) Boxplots showing the fraction of CT neurons. N = 11 slices from 4 mice. (B) Boxplots showing the fraction of TR neurons. N = 11 slices from 4 mice.

      (3) V-shaped, I-shaped, or O-shaped is not an intuitively understood nomenclature, consider changing. Further, the x/y axis for Figure 4a is not labeled, so it's not clear what the heat maps are supposed to represent.

      The terms "V-shaped," "I-shaped," and "O-shaped" are an established nomenclature in the auditory neuroscience literature for describing frequency response areas (FRAs), and we use them for consistency with prior work (Rothschild et al., 2010). V-shaped: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. I-shaped: Neurons whose FRAs show constant frequency selectivity with increasing intensity. O-shaped: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level. We have included a more detailed description in the Methods.

      The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities. So, the heat map represents the FRA of neurons in A1, reflecting the responses for different frequencies and intensities of sound stimuli. In the revised manuscript, we have provided clarifications in the figure legend.

      (4) Many references about projection neurons and cortical circuits are based on studies from visual or somatosensory cortex. Auditory cortex organization is not necessarily the same as other sensory areas. Auditory cortex references should be used specifically, and not sources reporting on S1, and V1.

      We thank the reviewers for their valuable comments. We have made a concerted effort to ensure that claims about cortical circuit organization are supported by findings specifically from the auditory cortex wherever possible, strengthening the focus and specificity of our discussion.

      Reviewer #3 (Public Review):

      Summary:

      The authors performed wide-field and 2-photon imaging in vivo in awake head-fixed mice, to compare receptive fields and tonotopic organization in thalamocortical recipient (TR) neurons vs corticothalamic (CT) neurons of mouse auditory cortex. TR neurons were found in all cortical layers while CT neurons were restricted to layer 6. The TR neurons at nominal depths of 200-400 microns have a remarkable degree of tonotopy (as good if not better than tonotopic maps reported by multiunit recordings). In contrast, CT neurons were very heterogenous in terms of their best frequency (BF), even when focusing on the low vs high-frequency regions of the primary auditory cortex. CT neurons also had wider tuning.

      Strengths:

      This is a thorough examination using modern methods, helping to resolve a question in the field with projection-specific mapping.

      Weaknesses:

      There are some limitations due to the methods, and it's unclear what the importance of these responses are outside of behavioral context or measured at single timepoints given the plasticity, context-dependence, and receptive field 'drift' that can occur in the cortex.

      (1) Probably the biggest conceptual difficulty I have with the paper is comparing these results to past studies mapping auditory cortex topography, mainly due to differences in methods. Conventionally, the tonotopic organization is observed for characteristic frequency maps (not best frequency maps), as tuning precision degrades and the best frequency can shift as sound intensity increases. The authors used six attenuation levels (30-80 dB SPL) and reported that the background noise of the 2-photon scope is <30 dB SPL, which seems very quiet. The authors should at least describe the sound-proofing they used to get the noise level that low, and some sense of noise across the 2-40 kHz frequency range would be nice as a supplementary figure. It also remains unclear just what the 2-photon dF/F response represents in terms of spikes. Classic mapping using single-unit or multi-unit electrodes might be sensitive to single spikes (as might be emitted at characteristic frequency), but this might not be as obvious for Ca2+ imaging. This isn't a concern for the internal comparison here between TR and CT cells as conditions are similar, but is a concern for relating the tonotopy or lack thereof reported here to other studies.

      We sincerely thank the reviewer for the thoughtful evaluation of our manuscript and for your positive assessment of our work.

      (1)  Concern regarding Best Frequency (BF) vs. Characteristic Frequency (CF)

      Our use of BF, defined as the frequency eliciting the highest response averaged across all sound levels, is a standard and practical approach in 2-photon Ca²⁺ imaging studies (Issa et al., 2014; Rothschild et al., 2010; Schmitt et al., 2023; Tischbirek et al., 2019). This method is well-suited for functionally characterizing large numbers of neurons simultaneously, where determining a precise firing threshold for each individual cell can be challenging.

      (2) Concern regarding background noise of the 2-photon setup

      We have expanded the Methods section ("Auditory stimulation") to include a detailed description of the sound-attenuation strategies used during the experiments. A custom-built, double-walled sound-proof enclosure lined with wedge-shaped acoustic foam was used to reduce external noise interference. These strategies ensured that auditory stimuli were delivered under controlled, low-noise conditions, improving the reliability of the neural response measurements.

      (3) Concern regarding the relationship between dF/F and spikes

      While Ca²⁺ signals are an indirect and filtered representation of spiking activity, they are a powerful tool for assessing the functional properties of genetically-defined cell populations. As you note, the properties and limitations of Ca²⁺ imaging apply equally to both the TR and CT neuron groups we recorded. Therefore, the profound difference we observed—a clear tonotopic gradient in one population and a lack thereof in the other—is a robust biological finding and not a methodological artifact.

      Related publications:

      (1) Issa JB, Haeffele BD, Agarwal A, Bergles DE, Young ED, Yue DT. 2014. Multiscale optical Ca2+ imaging of tonal organization in mouse auditory cortex. Neuron 83: 944-59. DOI: https://doi.org/10.1016/j.neuron.2014.07.009, PMID: 25088366

      (2) Rothschild G, Nelken I, Mizrahi A. 2010. Functional organization and population dynamics in the mouse primary auditory cortex. Nat Neurosci 13: 353-60. DOI: https://doi.org/10.1038/nn.2484, PMID: 20118927

      (3) Schmitt TTX, Andrea KMA, Wadle SL, Hirtz JJ. 2023. Distinct topographic organization and network activity patterns of corticocollicular neurons within layer 5 auditory cortex. Front Neural Circuits 17: 1210057. DOI: https://doi.org/10.3389/fncir.2023.1210057, PMID: 37521334

      (4) Tischbirek CH, Noda T, Tohmi M, Birkner A, Nelken I, Konnerth A. 2019. In Vivo Functional Mapping of a Cortical Column at Single-Neuron Resolution. Cell Rep 27: 1319-1326 e5. DOI: https://doi.org/10.1016/j.celrep.2019.04.007, PMID: 31042460

      (2) It seems a bit peculiar that while 2721 CT neurons (N=10 mice) were imaged, less than half as many TR cells were imaged (n=1041 cells from N=5 mice). I would have expected there to be many more TR neurons even mouse for mouse (normalizing by number of neurons per mouse), but perhaps the authors were just interested in a comparison data set and not being as thorough or complete with the TR imaging?

      As shown in Figure 2- figure supplementary 2, a much higher fraction of TR neurons was "tuned" to pure tones (46% of 1041 neurons) compared to CT neurons (only 18% of 2721 neurons). To obtain a statistically robust and comparable number of tuned neurons for our core analysis (481 tuned TR neurons vs. 491 tuned CT neurons), it was necessary to sample a larger total population of CT neurons, which required imaging from more animals.
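      The sampling argument can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this response; the helper names are hypothetical, and the "neurons needed" estimate simply assumes that the tuned fraction stays constant as more neurons are imaged.

```python
import math


def tuned_fraction(n_tuned, n_total):
    """Fraction of imaged neurons classified as tuned."""
    if n_total <= 0 or not 0 <= n_tuned <= n_total:
        raise ValueError("need 0 <= n_tuned <= n_total and n_total > 0")
    return n_tuned / n_total


def neurons_needed(target_tuned, fraction):
    """Total neurons to image to expect `target_tuned` tuned cells,
    assuming the tuned fraction is constant (a simplifying assumption)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return math.ceil(target_tuned / fraction)


tr_fraction = tuned_fraction(481, 1041)   # TR: tuned / imaged
ct_fraction = tuned_fraction(491, 2721)   # CT: tuned / imaged

# How many CT neurons would have to be imaged to match the TR tuned count?
ct_needed_to_match_tr = neurons_needed(481, ct_fraction)
```

      With these counts, matching the number of tuned TR neurons requires imaging roughly two and a half times as many CT neurons, which is consistent with the larger CT sample and the larger number of animals.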

      (3) The authors' definitions of neuronal response type in the methods need more quantitative detail. The authors state: "Irregular" neurons exhibited spontaneous activity with highly variable responses to sound stimulation. "Tuned" neurons were responsive neurons that demonstrated significant selectivity for certain stimuli. "Silent" neurons were defined as those that remained completely inactive during our recording period (> 30 min). For tuned neurons, the best frequency (BF) was defined as the sound frequency associated with the highest response averaged across all sound levels.". The authors need to define what their thresholds are for 'highly variable', 'significant', and 'completely inactive'. Is best frequency the most significant response, the global max (even if another stimulus evokes a very close amplitude response), etc.

      We appreciate the reviewer's suggestions. We have added more detailed description in the Methods.

      Tuned neurons: A responsive neuron was further classified as "Tuned" if its responses showed significant frequency selectivity. We determined this using a one-way ANOVA on the neuron's response amplitudes across all tested frequencies (at the sound level that elicited the maximal response). If the ANOVA yielded a p-value < 0.05, the neuron was considered "Tuned".

      Irregular neurons: Responsive neurons that did not meet the statistical criterion for being "Tuned" (i.e., ANOVA p-value ≥ 0.05) were classified as "Irregular". This provides a clear, mutually exclusive category for sound-responsive but broadly-tuned or non-selective cells.

      Silent neurons: Neurons that were not responsive were classified as "Silent". This quantitatively defines them as cells that showed no significant stimulus-evoked activity during the entire recording session.

      Best frequency (BF): It is the frequency that elicited the maximal mean response, averaged across all sound levels.
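      The BF definition above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the layout of the FRA as a list of rows (one row per sound level, one column per frequency), the toy response values, and the tie-breaking rule (lowest frequency wins) are all assumptions made for this example.

```python
def best_frequency(fra, freqs_khz):
    """Return the frequency with the largest response averaged across levels.

    fra: list of rows, one per sound level; each row holds one response
         amplitude per tested frequency (assumed layout).
    freqs_khz: tested frequencies, in the same order as the FRA columns.
    Ties are resolved in favour of the lowest-index frequency.
    """
    n_freq = len(freqs_khz)
    if not fra or any(len(row) != n_freq for row in fra):
        raise ValueError("every FRA row must have one value per frequency")
    # Average each frequency column across all sound levels.
    mean_response = [sum(row[i] for row in fra) / len(fra) for i in range(n_freq)]
    best_index = max(range(n_freq), key=mean_response.__getitem__)
    return freqs_khz[best_index]


# Toy FRA (hypothetical values): 3 sound levels x 4 frequencies.
toy_freqs_khz = [4, 8, 16, 32]
toy_fra = [
    [0.0, 0.1, 0.0, 0.0],  # lowest level
    [0.0, 0.5, 0.2, 0.0],
    [0.1, 0.6, 0.9, 0.1],  # highest level
]
toy_bf = best_frequency(toy_fra, toy_freqs_khz)
```

      In this toy case the single largest response occurs at 16 kHz at the highest level, but the level-averaged response is largest at 8 kHz, so the two candidate definitions of BF raised by the reviewer give different answers. The level-averaged definition used here is the one stated in this response.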

      To provide greater clarity, we showed examples in the following figures.

      Author response image 3.

      Reviewer #1 (Recommendations For The Authors):

      (1) A1 and AuC were used exchangeably in the text.

      Thank you for pointing out this issue. Our terminological strategy was to remain faithful to the original terms used in the literature we cite, where "AuC" is often used more broadly. In the revised manuscript, we have performed a careful edit to ensure that we use the specific term "A1" (primary auditory cortex) when describing our own results and recording locations, which were functionally and anatomically confirmed.

      (2) Grammar mistakes throughout.

      We are grateful for the reviewer’s suggested improvement to our wording. The entire manuscript has undergone a thorough professional copyediting process to correct all grammatical errors and improve overall readability.

      (3) The discussion should talk more about how/why L6 CT neurons don't possess the tonotopic organization and what are the implications. Currently, it only says 'indicative of an increase in synaptic integration during cortical processing'...

      Thanks for this suggestion. We have substantially revised and expanded the Discussion section to explore the potential mechanisms and functional implications of the lack of tonotopy in L6 CT neurons.

      Broad pooling of inputs: We propose that the lack of tonotopy is an active computation, not a passive degradation. CT neurons likely pool inputs from a wide range of upstream neurons with diverse frequency preferences. This broad synaptic integration, reflected in their wider tuning bandwidth, would actively erase the fine-grained frequency map in favor of creating a different kind of representation.

      A shift from topography to abstract representation: This transformation away from a classic sensory map may be critical for the function of corticothalamic feedback. Instead of relaying "what" frequency was heard, the descending signal from CT neurons may convey more abstract, higher-order information, such as the behavioral relevance of a sound, predictions about upcoming sounds, or motor-related efference copy signals that are not inherently frequency-specific.

      Modulatory role of the descending pathway: The descending A1-to-MGB pathway is often considered to be modulatory, shaping thalamic responses rather than driving them directly. A modulatory signal designed to globally adjust thalamic gain or selectivity may not require, and may even be hindered by, a fine-grained topographical organization.

      Reviewer #2 (Recommendations For The Authors):

      (1) Given that the CT and TR neurons were imaged at different depths, the question as to whether or not these differences could otherwise be explained by layer-specific differences is still not 100% resolved. Control measurements would be needed either by recording (1) CT neurons in upper layers (2) TR in deeper layers (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      We appreciate these constructive suggestions. To address this, we performed new experiments and analyses.

      Comparison of TR neurons across superficial layers: we analyzed our existing TR neuron dataset to see if response properties varied by depth within the superficial layers. We found no significant differences in the fraction of tuned neurons, field IQR, or maximum bandwidth (BWmax) between TR neurons in L2/3 and L4. This suggests a degree of functional homogeneity within the thalamorecipient population across these layers.

      Necessary control experiments.

      (1) CT neurons in upper layers. CT neurons are thalamic projection neurons that only exist in the deeper cortex, so CT neurons do not exist in upper layers (Antunes and Malmierca, 2021).

      (2) TR neurons in deeper layers. As we mentioned in the manuscript, high-titer AAV1-Cre can label neurons both anterogradely and retrogradely, so it is challenging to identify TR neurons unambiguously in deeper layers.

      (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      To directly test if projection identity confers distinct functional properties within the same cortical layers, we performed the crucial control of comparing TR neurons to their neighboring non-TR neurons. We injected AAV1-Cre in MGB and a Cre-dependent mCherry into A1 to label TR neurons red. We then co-injected AAV-CaMKII-GCaMP6s to label the general excitatory population green.  In merged images, this allowed us to functionally image and directly compare TR neurons (yellow) and adjacent non-TR neurons (green). We separately recorded the responses of these neurons to pure tones using two-photon imaging. The results show that TR neurons are significantly more likely to be tuned to pure tones than their neighboring non-TR excitatory neurons. This finding provides direct evidence that a neuron's long-range connectivity, and not just its laminar location, is a key determinant of its response properties.

      Related publications:

      Antunes FM, Malmierca MS. 2021. Corticothalamic Pathways in Auditory Processing: Recent Advances and Insights From Other Sensory Systems. Front Neural Circuits 15: 721186. DOI: https://doi.org/10.3389/fncir.2021.721186, PMID: 34489648

      (3) V-shaped, I-shaped, or O-shaped is not an intuitively understood nomenclature, consider changing. Further, the x/y axis for Figure 4a is not labeled, so it's not clear what the heat maps are supposed to represent.

      The terms "V-shaped," "I-shaped," and "O-shaped" are an established nomenclature in the auditory neuroscience literature for describing frequency response areas (FRAs), and we use them for consistency with prior work (Rothschild et al., 2010). V-shaped: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. I-shaped: Neurons whose FRAs show constant frequency selectivity with increasing intensity. O-shaped: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level. We have included a more detailed description in the Methods.

      The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities. So, the heat map represents the FRA of neurons in A1, reflecting the responses for different frequencies and intensities of sound stimuli. In the revised manuscript, we have provided clarifications in the figure legend.

      (4) Many references about projection neurons and cortical circuits are based on studies from visual or somatosensory cortex. Auditory cortex organization is not necessarily the same as other sensory areas. Auditory cortex references should be used specifically, and not sources reporting on S1, V1.

      We thank the reviewers for their valuable comments. We have made a concerted effort to ensure that claims about cortical circuit organization are supported by findings specifically from the auditory cortex wherever possible, strengthening the focus and specificity of our discussion.

      Reviewer #3 (Recommendations For The Authors):

      I suggest showing some more examples of how different neurons and receptive field properties were quantified and statistically analyzed. Especially in Figure 4, but really throughout.

      We thank the reviewer for this valuable suggestion. To provide greater clarity, we have added more examples in the following figure.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary 

      The authors describe a method for gastruloid formation using mouse embryonic stem cells (mESCs) to study YS and AGM-like hematopoietic differentiation. They characterise the gastruloids during nine days of differentiation using a number of techniques including flow cytometry and single-cell RNA sequencing. They compare their findings to a published data set derived from E10-11.5 mouse AGM. At d9, gastruloids were transplanted under the adrenal gland capsule of immunocompromised mice to look for the development of cells capable of engrafting the mouse bone marrow. The authors then applied the gastruloid protocol to study overexpression of Mnx1 which causes infant AML in humans.

      In the introduction, the authors define their interpretation of the different waves of hematopoiesis that occur during development. 'The subsequent wave, known as definitive, produces: first, oligopotent erythro-myeloid progenitors (EMPs) in the YS (E8-E8.5); and later myelo-lymphoid progenitors (MLPs - E9.5-E10), multipotent progenitors (MPPs - E10-E11.5), and hematopoietic stem cells (HSCs - E10.5-E11.5), in the aorta-gonad-mesonephros (AGM) region of the embryo proper.' Herein they designate the yolk sac-derived wave of EMP hematopoiesis as definitive, according to convention, although paradoxically it does not develop from intra-embryonic mesoderm or give rise to HSCs.

      Our definition of primitive and definitive waves is widely used in the field (e.g. PMID: 18204427; PMID: 28299650; PMID: 33681211). Definitive haematopoiesis, encompassing EMP, MLP, MPP and HSC, highlights their origin from haemogenic endothelium, generation of mature cells with adult characteristics from progenitors with multilineage potential and direct and indirect developmental contributions to the intra-embryonic and time-restricted generation of HSCs. 

      General comments 

      The authors make the following claims in the paper: 

      (1) The development of a protocol for hemogenic gastruloids (hGx) that recapitulates YS and AGM-like waves of blood from HE.

      (2) The protocol recapitulates both YS and EMP-MPP embryonic blood development 'with spatial and temporal accuracy'.

      (3) The protocol generates HSC precursors capable of short-term engraftment in an adrenal niche.

      (4) Overexpression of MNX1 in hGx transforms YS EMP to 'recapitulate patient transcriptional signatures'.

      (5) hGx is a model to study normal and leukaemic embryonic hematopoiesis. 

      There are major concerns with the manuscript. The statements and claims made by the authors are not supported by the data presented, data is overinterpreted, and the conclusions cannot be justified. Furthermore, the data is presented in a way that makes it difficult for the reader to follow the narrative, causing confusion. The authors have not discussed how their hGx compares to the previously published mouse embryoid body protocols used to model early development and hematopoiesis.

      Specific points

      (1) It is claimed that HGxs capture cellularity and topography of developmental blood formation. The hGx protocol described in the manuscript is a modification of a previously published gastruloid protocol (Rossi et al 2022). The rationale for the protocol modifications is not fully explained or justified. There is a lack of novelty in the presented protocol as the only modifications appear to be the inclusion of Activin A and an extension of the differentiation period from 7 to 9 days of culture. No direct comparison has been made between the two versions of gastruloid differentiation to justify the changes.

      The Reviewer paradoxically claims that the protocol is not novel and that it differs from a previous publication in at least 2 ways – the patterning pulse and the length of the protocol. Of these, the patterning pulse is key. As documented in Fig. 1S1, we cannot obtain Flk1-GFP expression in the absence of Activin A (Fig. 1S1A), and the concentration of Activin A scales activity of the Flk1 locus (Fig. 1S1B). Expression of Flk1 is a fundamental step in haemato-endothelial specification and, accordingly, we do not see CD41 or CD45+ cells in the absence of Activin A. Furthermore, these markers also titrate with the dose of Activin A (in Fig. 1S1B).

      Also, in our hands, there is a clear time-dependent progression of marker expression, with sequential acquisition of CD41 and CD45, with the latter not detectable until 192h (Fig. 1C-D), another key difference relative to the Rossi et al (2022) protocol. We suggest, and present further evidence for in this rebuttal and the revised manuscript, that the 192h-timepoint captures the onset of AGM-like haematopoiesis. We have edited the manuscript to clarify the differences and novelty in our protocol (lines 132-143) and provided a more detailed comparison with the report from Rossi et al. (2022) in the Discussion (lines 574-586).

      The inclusion of Activin A at high concentration at the beginning of differentiation would be expected to pattern endoderm rather than mesoderm. BMP signaling is required to induce Flk1+ mesoderm, even in the presence of Wnt.

      Again, we call the Reviewer’s attention to Fig. 1S1A, which clearly shows that Activin A (with no BMP added) is required for induction of Flk1 expression in the presence of Wnt. Activin A, in combination with Wnt, is used in other protocols of haemato-endothelial differentiation from pluripotent cells, with no BMP added in the same step of patterning and differentiation (PMID: 39227582; PMID: 39223325). In the latter protocol, we also call the Reviewer’s attention to the fact that a higher concentration of Activin A precludes the need for BMP4 addition. Finally, one of us has recently reported that Activin A, on its own, will induce Flk1, as well as other anterior mesodermal progenitors (https://www.biorxiv.org/content/10.1101/2025.01.11.632562v1). In addressing the Reviewer’s concerns with the dose of Activin A used, we titrated its concentration against activation of Flk1, confirming optimal Flk1-GFP expression at the 100ng/ml dose used in the manuscript. We have included this data in the manuscript in Figure 1S1B.

      FACS analysis of the hGx during differentiation is needed to demonstrate the co-expression of Flk1-GFP and lineage markers such as CD34 to indicate patterning of endothelium from Flk1+ mesoderm. The FACS plots in Fig. 1 show C-Kit expression but very little VE-cadherin which suggests that CD34 is not induced. Early endoderm expresses C-Kit, CXCR4, and Epcam, but not CD34 which could account for the lack of vascular structures within the hGx as shown in Fig. 1E.

      We were surprised by the Reviewer’s comment that there are no endothelial structures in our haemogenic gastruloids. The presence of a Flk1-GFP+ network is visible in the GFP images in Fig. 1B, from 144h onwards, and is detailed in the revised Fig. 2A, which shows overlap between Flk1-GFP and the endothelial marker CD31. In addition, our single-cell RNA-seq data, included in the manuscript, confirms the presence of endothelial cells with a developing endothelial, including arterial, programme. This is now presented in the revised Fig. 3B-D of the manuscript, which updates a representation in the original manuscript. In contrast with the Reviewer’s claims that no endothelial cells are formed, the data show that Kdr (Flk1)+ cells co-express Cdh5/VE-Cadherin and indeed Cd34, attesting to the presence of an endothelial programme. Arterial markers Efnb2, Flt1, and Dll4 are present. A full-blown programme, which also includes haemogenic markers including Sox17, Esam, Cd44 and Mecom, is clear at early (144h) and, particularly, at late (192h) timepoints in cells sorted on detection of surface C-Kit (Fig. 3B-E in the manuscript). To address the specific point by the Reviewer, we also document co-expression of Flk1-GFP, CD34 and/or CD31 by flow cytometry (Fig. 2S1A-B in the revised manuscript).

      To summarise new and revised data in the manuscript in relation to this point:

      Immunofluorescence staining showing the Flk1-GFP-defined vascular network in Figure 1E and co-expression of endothelial marker CD31 in Figure 2A. In text: lines 159-163; 178-180.

      Flow cytometry analysis of co-expression of Flk1-GFP with CD31 and CD34 in Figure 2S1A-D, including controls. In text: 180-187.

      Real-time quantitative (q)PCR analysis showing time-dependent expression of haemato-endothelial and arterial markers in Figure 2F (specifically Dll4 and Mecom). In text: 200-209.

      An improved representation of our scRNA-seq data highlighting key haemato-endothelial markers in Figure 3B-D. In text: 268-304

      (2) The protocol has been incompletely characterised, and the authors have not shown how they can distinguish between either wave of Yolk Sac (YS) hematopoiesis (primitive erythroid/macrophage and erythro-myeloid EMP) or between YS and intraembryonic Aorta-Gonad-Mesonephros (AGM) hematopoiesis. No evidence of germ layer specification has been presented to confirm gastruloid formation, organisation, and functional ability to mimic early development. Furthermore, differentiation of YS primitive and YS EMP stages of development in vitro should result in the efficient generation of CD34+ endothelial and hematopoietic cells. There is no flow cytometry analysis showing the kinetics of CD34 cell generation during differentiation. Benchmarking the hGx against developing mouse YS and embryo data sets would be an important verification. 

      The Reviewer is correct that we have not provided detailed characterisation of the different germ layers, as this was not the focus of the study. In that context, we were surprised by the earlier comment assuming co-expression of C-Kit, Cxcr4 and Epcam, which we did not show, while overlooking the endothelial programme reiterated above, which we have presented. Given our focus on haemato-endothelial specification, we have started the single-cell RNA-seq characterisation of the haemogenic gastruloid at 120h and have not looked specifically at earlier timepoints of embryo patterning. This said, we show the presence of neuroectodermal cells in cluster 9; on the other hand, cluster 7 includes hepatoblast-like cells, denoting endodermal specification (Supplementary File S2). However, in the absence of earlier timepoints and given the bias towards mesodermal specification, we expect that specification of ectodermal and endodermal programmes may be incomplete. 

      In respect of the contention regarding the capture of YS-like and AGM-like haematopoiesis, we had presented evidence in the original version of the manuscript that haemogenic cells generated during gastruloid differentiation, particularly at late 192h and 216h timepoints, project onto highly purified C-Kit+ CD31+ Gfi1-expressing cells from mouse AGM (PMID: 38383534), providing support for at least partial recapitulation of the corresponding developmental stage. These projections are represented in Fig. 4A, right and 4S1C of the revised manuscript. In distinguishing between YS-like and AGM-like haematopoiesis, we call the Reviewer’s attention to the replotting of the single-cell RNA-seq data already in the manuscript, which we provided in response to point 1 (Fig. 3B-D and 3S2B), which highlights an increase in Sox17, but not Sox18, expression in the 192h haemogenic endothelium, which suggests an association with AGM haematopoiesis (PMID: 20228271). A significant association of Cd44 and Procr expression with the same time-point (Fig. 3B-D in the manuscript) further supports an AGM-like endothelial-to-haematopoietic transition at the 192h timepoint. We have re-analysed the scRNA-seq data to better represent the expression of these markers in Fig. 3A-E and 3S2B. We agree that it remains challenging to identify markers exclusive to AGM haematopoiesis, which is operationally equated with generation of transplantable haematopoietic stem cells. While HSC generation is a key event characteristic of the AGM, not all AGM haematopoiesis corresponds to HSCs, an important point in evaluating the data presented in the manuscript, and one that is acknowledged by us. The main text has been edited to clarify the experiments pertaining to distinguishing AGM and YS haematopoiesis, which are detailed in lines 180-187, 200-221, 268-304, and 315-356.

      Following on the Reviewer’s comments about Cd34, we also inspected co-expression of Cd34 with Cd41 and Cd45, the latter co-expression present in, although not necessarily exclusive to, AGM haematopoiesis. Reassuringly, we observed clear co-expression with both markers (Author response image 1), in addition to a CD41+CD34- population, which likely reflects YS EMP-independent erythropoiesis. Flow cytometry analysis of co-expression of CD31 and CD34 in CD41+ and CD45+ populations at 144h and 216h timepoints has been included in Fig. 2B-D, Fig. 2S1A-D, including controls. In text: 180-187. We have earlier on in the rebuttal highlighted the fact that marker expression is responsive to the levels of Activin A used in the patterning pulse, with the 100ng/ml Activin A used in our protocol superior to 75ng/ml.

      Author response image 1.

      Association of CD34 with CD41 and CD45 expression is Activin A-responsive and supports the presence of definitive haematopoiesis. A. Flow cytometry analysis of CD34 and CD41 expression in 216h-haemogenic gastruloids; two doses of Activin A were used in the patterning pulse with CHIR99021 between 48-72h. FMO controls shown. B. Flow cytometry analysis of CD34 and CD45 at 216h in the same experimental conditions.

      Given the centrality of this point in comments by all the Reviewers, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346). Focusing the analysis on the subsets of haemogenic gastruloid cells sorted as CD41+ (144h), C-Kit+ (144h and 192h) and CD45+ (192h and 216h) (now represented in Fig. 3A, and projected onto the studies in Fig. 4A), we show:

      (1) That a subset of haemato-endothelial cells from haemogenic gastruloids at 144h to 216h project onto intra-embryonic cells spanning E8.25 to E10 (revised Fig. 4A left and 4S1A). This is in agreement with our original interpretation that 216h haemogenic gastruloids are no later than the MPP/pre-HSC state of embryonic development, requiring further maturation to generate engrafting progenitors. We have nevertheless removed specific references to pre-HSC, and instead referred to HSPC/progenitors.

      (2) That haemogenic gastruloids contain YS-like (including EMP-like) and AGM-like haematopoietic cells (Fig. 4A centre and 4S1B). Significantly, some of the cells, particularly C-Kit-sorted cells with a candidate endothelial and HE-like signature, project onto AGM pre-HE and HE, as well as IAHC. Some 144h CD41+ and 192h CD45+ cells also project onto IAHC, suggesting that YS-like and AGM-like programmes arise independently and with partial time-dependent organisation in the haemogenic gastruloid model. Later, predominantly 216h cells, have characteristics of MPP/LMPP-like cells from the FL, suggesting a progenitor wave of differentiation.

      Altogether, the data support the notion that haemogenic gastruloids capture YS and AGM haematopoiesis until E10, as suggested by us in the manuscript. This re-analysis of the scRNA-seq data, which was indeed prompted by challenging and insightful comments from the Reviewers, has been incorporated in the manuscript as described above and further listed here:

      Re-clustering and highlights of specific markers in our scRNA-seq data in Figure 3A-E. In text: 268-304.

      Projections to mouse embryo datasets in Figure 4A (Figure 4S1A-C; Supplementary File 3). In text: 315-356. 

      Single-cell RNA sequencing was used to compare hGx with mouse AGM. The authors incorrectly conclude that ' ..specification of endothelial and HE cells in hGx follows with time-dependent developmental progression into putative AGM-like HE..' And, '...HE-projected hGx cells.......expressed Gata2 but not Runx1, Myb, or Gfi1b..' Hemogenic endothelium is defined by the expression of Runx1 and Gfi1b is downstream of Runx1.

      As a hierarchy of regulation, Gata2 precedes and drives Runx1 expression at the specification of HE (PMID: 17823307; PMID: 24297996), while Runx1 drives the EHT, upstream of Gfi1b in haematopoietic clusters (PMID: 34517413). Please note that the text segment the Reviewer refers to has been removed from the manuscript, as the analysis is no longer solely focused on projection to Thambyrajah et al (2024) data, and instead gained significantly from the projections on to the Hou et al (2020) and Zhu et al (2020) studies, as detailed above.

      (3) The hGx protocol 'generates hematopoietic SC precursors capable of short-term engraftment' is not supported by the data presented. Short-term engraftment would be confirmed by flow cytometric detection of hematopoietic cells within the recipient bone marrow, spleen, thymus, and peripheral blood that expressed the BFP transgene. This analysis was not provided. PCR detection of transcripts, following an unspecified number of amplification cycles, as shown in Figure 3G (incorrectly referred to as Figure 3F in the legend) is not acceptable evidence for engraftment.

      We provide the full flow cytometry analysis of spleen engraftment in the 5 mice which received implantation of 216h-haemogenic gastruloids in the adrenal gland and were analysed at 4 weeks; an additional (control) animal received adrenal injection of PBS (Fig. 4B-D in the revised manuscript). In this experiment, the bone marrow collection was limiting, and material was prioritised for PCR (Fig. 4C and full gels in 4S2C in the revised manuscript).

      We had previously provided only representative plots of flow cytometry analysis of bone marrow and spleen, which we described as low-level engraftment and were chosen conservatively. The analysis was meant to complement the genomic DNA PCR, where detection was present in only some of the replicates tested per animal. On this note, we confirm that PCR analysis used conventional 40 cycles; the sensitivity had already been shown in the earlier version of the manuscript and is again represented in Fig. 4S2B. We argue that the low level of cytometric and molecular engraftment at 4 weeks, from haemogenic gastruloid-derived progenitors that have not progressed beyond a stage equivalent to E10 (Fig. 4A and Supplementary File 3 in the revised manuscript from scRNA-seq projections), and that we have described as requiring additional maturation in vivo, is not surprising. Indeed, as previously shown and now repeated in Fig. 2B-E (controls in Fig. 2S1E-G) in the revised manuscript, no more than 7 CD45+CD144+ multipotent cells are present per haemogenic gastruloid. We are only able to implant 3 haemogenic gastruloids in the adrenal gland of each transplanted animal.

      We have rephrased Results and Discussion in lines 359-415 and 588-621, respectively, to rectify the nature of the engraftment, which we now attribute more generically to progenitors, also in light of the developmental time we could capture in the gastruloids prior to implantation.

      Transplanted hGx formed teratoma-like structures, with hematopoietic cells present at the site of transplant only analysed histologically. Indeed, the quality of the images provided does not provide convincing validation that donor-derived hematopoietic cells were present in the grafts.

      As stated in the text, the images are meant to illustrate that the haemogenic gastruloids developed in situ. Further analysis motivated by the Reviewers’ comments and indeed a subsequent experiment with analysis of engraftment at a later timepoint of 8 weeks (revised Fig. 4E and 4S2F-G) did not show a direct correspondence between engraftment and in vivo development or expansion, although this occurs in some cases. To be clearer, the observation of donor-derived blood cells in the implanted haemogenic gastruloids would not correspond to engraftment, as we have amply demonstrated that they have generated blood cells in vitro. There is no evidence that there are remaining pluripotent cells in the haemogenic gastruloid after 9 days of differentiation, and it is therefore not clear that the structures observed are teratomas. We specifically comment on this point in the revised manuscript – lines 601-607.

      There is no justification for the authors' conclusion that '... the data suggest that 216h hGx generate AGM-like pre-HSC capable of at least short-term multilineage engraftment upon maturation...'. Indeed, this statement is in conflict with previous studies demonstrating that pre-HSCs in the dorsal aorta of the mouse embryo are immature and actually incapable of engraftment.

      We have clearly stated that we do not see haematopoietic engraftment through transplantation of dissociated haemogenic gastruloids, which reach the E10 state containing pre-HSC (revised Fig 4A, 4S1A and Supplementary File 3). Instead, we observed rare myelo-erythroid (revised Fig. 4S2F-G) and myelo-lymphoid (revised Fig. 4E) engraftment upon in vivo maturation of haemogenic gastruloids with preserved 3D organisation. These statements are not contradictory. Nevertheless, we have now more cautiously attributed engraftment to the presence of progenitors as a generic designation, and not to pre-HSC (lines 412-414 and 588-592 in the revised manuscript).

      The statement '...low-level production of engrafting cells recapitulates their rarity in vivo, in agreement with the embryo-like qualities of the gastruloid system....' is incorrect. Firstly, no evidence has been provided to show the hGx has formed a dorsal aorta facsimile capable of generating cells with engrafting capacity. Secondly, although engrafting cells are rare in the AGM, approximately one per embryo, they are capable of robust and extensive engraftment upon transplantation.

      As indicated above, the statement in lines 412-414 now reads “Engraftment is erythromyeloid at 4 weeks and lympho-myeloid at 8 weeks, reflecting different classes of progenitors, putatively of YS-like and AGM-like affiliation.” To be clear, with our original statement we meant to highlight that the production of definitive AGM-like haematopoietic progenitors (not all of which are engrafting) in haemogenic gastruloids does not correspond to non-physiological single-lineage programming. We did and do not claim that we achieved production of HSC, which would be long-term engrafting.

      (4) Expression of MNX1 transcript and protein in hematopoietic cells in MNX1 rearranged acute myeloid leukaemia (AML) is one cause of AML in infants. In the hGX model of this disease, Mnx1 is overexpressed in the mESCs that are used to form gastruloids. Mnx1 overexpression seems to confer an overall growth advantage on the hGx and increase the serial replating capacity of the small number of hematopoietic cells that are generated. The inefficiency with which the hGx model generates hematopoietic cells makes it difficult to model this disease. The poor quality of the cytospin images prevents accurate identification of cells. The statement that the kit-expressing cells represent leukemic blast cells is not sufficiently validated to support this conclusion. What other stem cell genes are expressed? Surface kit expression also marks mast cells, frequently seen in clonogenic assays of blood cells. Flow cytometric and gene expression analyses using known markers would be required.

      The haemogenic gastruloid model generates haematopoietic and haemato-endothelial cells. MNX1 expands C-Kit+ cells at 144h, which we show to have a haemato-endothelial signature (see revised Fig. 3A-E, Supplementary File 2). We have added additional flow cytometry data showing that the replating cells from MNX1 express CD31 (Figure 6S1A-B).

      Serial replating of CFC assays is a conventional in vitro assay of leukaemia transformation. Critically, colony replating is not maintained in EV control cells, attesting to the transformation potential of MNX1. Although we have not fully traced the cellular hierarchy of MNX1-driven transformation in the haemogenic gastruloid system, the in vitro replating expands a C-Kit+ cell (revised Fig. 6E), which reflects the surface phenotype of the leukaemia, also recapitulated in the mouse model initiated by MNX1-overexpressing FL cells. Importantly, it recapitulates the transcriptional profile of MNX1 leukaemia patients (revised Fig. 7C), which is uniquely expressed by MNX1 144h and replated colony cells, but not by MNX1 216h gastruloid cells, arguing against a generic signature of MNX1 overexpression (revised Fig. 7B). Importantly, the MNX1-transformation of haemogenic gastruloid cells is superior to the FL leukaemia model at capturing the unique transcriptional features of MNX1-driven leukaemia, distinct from other forms of AML in the same age group (Fig 7 S1D-F). It is possible that this corresponds to a pre-leukaemia event, and we will explore this in future studies, which are beyond the proof-of-principle nature of this paper.

      (5) In human infant MNX1 AML, the mutation is thought to arise at the fetal liver stage of development. There is no evidence that this developmental stage is mimicked in the hGx model.

      We never claim that the haemogenic gastruloid model mimics the foetal liver. We propose that susceptibility to MNX1 is at the HE-to-EMP transition. Moreover, and importantly, contrary to the Reviewer’s statement, there is no evidence in the literature that the mutation arises in the foetal liver stage, just that the mutation arises before birth (PMID: 38806630), which is different. In a mouse model of MNX1 overexpression, the authors achieve leukaemia engraftment upon MNX1 overexpression in foetal liver, but not in bone marrow cells (PMID: 37317878). This is in agreement with a vulnerability of embryonic / foetal, but not adult cells to the MNX1 expression caused by the translocation. However, haematopoietic cells in the foetal liver originate from YS and AGM precursors, so the origin of the MNX1-susceptible cells can be in those locations, rather than the foetal liver itself.

      Reviewer #2 (Public review):

      Summary: 

      In this manuscript, the authors develop an exciting new hemogenic gastruloid (hGX) system, which they claim reproduces the sequential generation of various blood cell types. The key advantage of this cellular system would be its potential to more accurately recapitulate the spatiotemporal emergence of hematopoietic progenitors within their physiological niche compared to other available in vitro systems. The authors present a large set of data and also validate their new system in the context of investigating infant leukemia. 

      Strengths: 

      The development of this new in vitro system for generating hematopoietic cells is innovative and addresses a significant drawback of current in vitro models. The authors present a substantial dataset to characterize this system, and they also validate its application in the context of investigating infant leukemia. 

      Weaknesses: 

      The thorough characterization and full demonstration that the cells produced truly represent distinct waves of hematopoietic progenitors are incomplete. The data presented to support the generation of late yolk sac (YS) progenitors, such as lymphoid cells, and aortic-gonad-mesonephros (AGM)-like progenitors, including pre-hematopoietic stem cells (pre-HSCs), by this system are not entirely convincing. Given that this is likely the manuscript's most crucial claim, it warrants further scrutiny and direct experimental validation. Ideally, the identity of these progenitors should be further demonstrated by directly assessing their ability to differentiate into lymphoid cells or fully functional HSCs. Instead, the authors primarily rely on scRNA-seq data and a very limited set of markers (e.g., Ikzf1 and Mllt3) to infer the identity and functionality of these cells. Many of these markers are shared among various types of blood progenitors, and only a well-defined combination of markers could offer some assurance of the lymphoid and pre-HSC nature of these cells, although this would still be limited in the absence of functional assays.

      The identification of a pre-HSC-like CD45⁺CD41⁻/lo C-Kit⁺VE-Cadherin⁺ cell population is presented as evidence supporting the generation of pre-HSCs by this system, but this claim is questionable. This FACS profile may also be present in progenitors generated in the yolk sac such as early erythromyeloid progenitors (EMPs). It is only within the AGM context, and in conjunction with further functional assays demonstrating the ability of these cells to differentiate into HSCs and contribute to long-term repopulation, that this profile could be strongly associated with pre-HSCs. In the absence of such data, the cells exhibiting this profile in the current system cannot be conclusively identified as true pre-HSCs.

      We present 2 additional pieces of evidence to support our claims that we capture YS and AGM stages of haematopoietic development.

      (I) In the new Figures 4A and 4 S1A-C and Supplementary File 3 in the revised manuscript, we project our single-cell RNA-seq data onto (1) developing intra-embryonic pSP and AGM between E8 and E11 (Fig. 4A left, 4S1A) and (2) a single-cell RNA-seq study of HE development which combines haemogenic and haematopoietic cells from the YS, the developing HE and IAHC in the AGM, and FL (Fig. 4A centre, 4S1B). Our data maps E8.25-E10, and captures YS EMP and erythroid and myeloid progenitors, as well as AGM pre-HE, HE and IAHC, with some cells matching HSPC and LMPP, as suggested by the projection onto the Thambyrajah et al data set (already presented in the previous version of the manuscript, and now in Fig. 4A right and 4 S1C). The projection of the scRNA-seq data is presented in lines 314-355 of the revised manuscript. The scRNA-seq data itself was refocused on haemato-endothelial programmes as presented in the revised Fig. 3A-E, described in lines 267-303.

      (II) Given the difficulty in finding markers that specifically associate with AGM haematopoiesis, we inspected the possibility of capturing different regulatory requirements at different stages of gastruloid development mirroring differential effects in the embryo. Polycomb EZH2 is specifically required for EMP differentiation in the YS, but does not affect AGM-derived haematopoiesis; it is also not required for primitive erythroid cells (PMID: 29555646; PMID: 34857757). We treated haemogenic gastruloids from 120h onwards with either DMSO (0.05%) or GSK126 (0.5 µM), and inspected the cellularity of gastruloids at 144h, which we equate with YS-EMP, and 216h – putatively AGM haematopoiesis. We show that EZH2 inhibition / GSK126 treatment specifically reduces %CD41+ cells at 144h, but does not reduce %CD41+ or %CD45+ cells at 216h. We have included this experiment in the manuscript in Fig. 2 S2B-C (in text: 209-221).

      These data, together with the scRNA-seq projections described, provide evidence for our claim that 144h haemogenic gastruloids capture YS EMPs, while CD41+ and CD45+ cells isolated at 216h reflect AGM progenitors. We cannot conclude as to the functional nature of the AGM cells from this experiment. The main text has been edited to clarify the experiments pertaining to distinguishing AGM and YS haematopoiesis (lines 180-187; 200-221; 268-304; 315-356).

      The engraftment data presented are also not fully convincing, as the observed repopulation is very limited and evaluated only at 4 weeks post-transplantation. The cells detected after 4 weeks could represent the progeny of EMPs that have been shown to provide transient repopulation rather than true HSCs. 

      In the original version of the manuscript, we stated that there is low-level engraftment and did not claim to have generated HSC. Instead, we described cells with short-term engraftment potential. We agree with the Reviewer that the cells we show in the manuscript at 4 weeks could be EMPs (revised Fig. 4B-E and 4 S2D-G). Additionally, we now have 8-week analysis of implant recipients, in which we observed multi-lineage engraftment of the recipient bone marrow, again at low level, in 1:3 recipients (revised Fig. 4B-E and 4S2F-H). This engraftment is myeloid-lymphoid and therefore likely to have originated in a later progenitor. To be clear, we do not claim that this corresponds to the presence of HSC. It nevertheless supports the maturation of progenitors with engraftment potential. The limited material was prioritised for flow cytometry staining, which did not allow PCR analysis. We rephrased Results and Discussion in lines 359-414 and 588-621, respectively, to rectify the nature of the engraftment.

      Reviewer #3 (Public review):  

      In this study, the authors employ a mouse ES-derived "hemogenic gastruloid" model which they generated and which they claim to be able to deconvolute YS and AGM stages of blood production in vitro. This work could represent a valuable resource for the field. However, in general, I find the conclusions in this manuscript poorly supported by the data presented. Importantly, it isn't clear what exactly are the "YS" and the "AGM"-like stages identified in the culture and where is the data that backs up this claim. In my opinion, the data in this manuscript lack convincing evidence that can enable us to identify what kind of hematopoietic progenitor cells are generated in this system. Therefore, the statement that "our study has positioned the MNX1-OE target cell within the YS-EMP stage (line 540)" is not supported by the evidence presented in this study. Overall, the system seems to be very preliminary and requires further optimization before those claims can be made.

      Specific comments below: 

      (1) The flow cytometric analysis of gastruloids presented in Figure 1 C-D is puzzling. There is a large % of C-Kit+ cells generated, but few VE-Cad+ Kit+ double positive cells. Similarly, there are many CD41+ cells, but very few CD45+ cells, which one would expect to appear toward the end of the differentiation process if blood cells are actually generated. It would be useful to present this analysis as consecutive gating (i.e. evaluating CD41 and CD45 within VE-Cad+ Kit+ cells, especially if the authors think that the presence of VE-Cad+ Kit+ cells is suggestive of EHT). The quantification presented in D is misleading as the scale of each graph is different.

      Fig. 1C-D provides an overview of haemogenic markers during the timecourse of haemogenic gastruloid differentiation, and does indeed show a late up-regulation of CD45, as the Reviewer points out would be expected. The %CD45+ cells is indeed low. However, we should point out that the haemogenic gastruloid protocol, although biased towards mesodermal outputs, does not aim to achieve pure haematopoietic specification, but rather to place it in its embryo-like context. We disagree that the scale is misleading: different scales are necessary to represent the data in a way that is interpretable by the reader, and we made sure from the outset that the gates (in C) are truly representative and annotated, as are the plot axes (in D). Consecutive gating at the 216h-timepoint is shown and quantified in Fig. 2S1D-F, or in the alternative consecutive gating suggested by the Reviewer, in Author response image 2 below. At the request of Reviewer 1, we also analysed CD31 and CD34 within CD41 and CD45 populations, again as validation of the emergent haematopoietic character of the cells obtained. This new analysis is shown in revised Fig. 2B, quantified in 2C.

      Author response image 2.

      Flow cytometry analysis of VE-cadherin+ cells in haemogenic gastruloids at 216h of the differentiation protocol, probing co-expression of CD45, CD41 and C-Kit.

      (2) The imaging presented in Figure 1E is very unconvincing. C-Kit and CD45 signals appear as speckles and not as membrane/cell surfaces as they should. This experiment should be repeated and nuclear stain (i.e. DAPI) should be included.

      We included the requested immunofluorescence staining in Figure 1E (216h). We also show the earlier timepoint of 192h here as Author response image 3. In text: lines 158-162.

      Author response image 3.

      Confocal images of haematopoietic production in haemogenic gastruloids. Wholemount, cleared haemogenic gastruloids were stained for CD45 (pseudo-coloured red) and C-Kit antigens (pseudo-coloured yellow) with indirect staining, as described in the manuscript. Flk1-GFP signal is shown in green. Nuclei are contrasted with DAPI. (A) 192h. (B) 216h.

      (3) Overall, I am not convinced that hematopoietic cells are consistently generated in these organoids. The authors should sort hematopoietic cells and perform May-Grunwald Giemsa stainings as they did in Figure 6 to confirm the nature of the blood cells generated.

      The data are reproducible and complemented by functional assays shown in revised Fig. 2D-E, which clearly demonstrate haematopoietic output. The single-cell RNA-seq data also show expression of a haematopoietic programme, which we have complemented with biologically independent qRT-PCR analysis of the expression of key endothelial and haematopoietic marker and regulatory genes (revised Fig. 2F; in text: 200-209). As requested, we include Giemsa-Wright’s stained cytospins obtained at 216h to illustrate haematopoietic output. These are shown in revised Fig. 2S2A, in text: lines 194-199. Inevitably, the cytospins will be inconclusive as to the presence of endothelial-to-haematopoietic transition or the generation of haematopoietic stem/progenitor cells, as these cells do not have a distinctive morphology.

      (4) The scRNAseq in Figure 2 is very difficult to interpret. Specific points related to this: - Cluster annotation in Figure 2a is missing and should be included. 

      Why do the heatmaps show the expression of genes within sorted cells? Couldn't the authors show expression within clusters of hematopoietic cells as identified transcriptionally (which ones are they? See previous point)? Gene names are illegible.

      I see no expression of Hlf or Myb in CD45+ cells (Figure 2G). Hlf is not expressed by any of the populations examined (panels E, F, G). This suggests no MPP or pre-HSC are generated in the culture, contrary to what is stated in lines 242-245 (PMID 31076455 and 34589491). Later on, it is again stated that "hGx cells... lacked detection of HSC genes like Hlf, Gfi1, or Hoxa9" (lines 281-283). To me, this is proof of the absence of AGM-like hematopoiesis generated in those gastruloids.

      For a combination of logistic and technical reasons, we performed single-cell RNA-seq using the Smart-Seq2 platform, which is inherently low throughput. We overcame the issue of cell coverage by complementing whole-gastruloid transcriptional profiling at successive time-points with sorting of subpopulations of cells based on individual markers documented in Fig. 1. We clearly stated which platform was used as well as the number and type of cells profiled (Fig. 3S1 and lines 226-241 of the revised manuscript), and our approach is standard. Following suggestions of the Reviewers to further focus our analysis on the haemogenic cellular differentiation within the gastruloids, we revised the presentation of the scRNA-seq data to now provide UMAP projections with representation and quantification of individual genes, including the ones queried by the Reviewer in Fig. 3 and respective supplements. Specifically, re-clustering and highlighting of specific markers are shown in Figure 3A-D and presented in lines 267-303 of the revised manuscript. Complementary independent real-time quantitative (q)PCR analysis showing time-dependent expression of endothelial and haematopoietic markers is now in Figure 2F. In text: 200-208.

      (5) Mapping of scRNA-Seq data onto the dataset by Thambyrajah et al. is not proof of the generation of AGM HE. The dataset they are mapping to only contains AGM cells, therefore cells do not have the option to map onto something that is not AGM. The authors should try mapping to other publicly available datasets also including YS cells.

      We have done this and the data are presented in Figure 4A (Figure 4S1A) and Supplementary File. In text: 314-355. As detailed in response to Reviewer 1, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131) (revised Fig. 4A and 4 S1A), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346) (revised Fig. 4A and 4 S1B). Specifically, in answer to the Reviewer’s point, we show that different subsets of haemogenic gastruloid cells sorted on haemogenic surface markers C-Kit, CD41 and CD45 cluster onto pre-HE and HE, intra-aortic clusters and FL progenitor compartments, and onto YS EMP and erythroid and myeloid progenitors. This lends support to our claim that the haemogenic gastruloid system specifies both YS-like and AGM-like cells. Please note that we now do point out that some CD41+ cells at 144h project onto IAC, as do cells at the later timepoints, suggesting that AGM-like and YS-EMP-like waves may overlap at the 144h timepoint (lines…). In the future, we will address the specific location of these cells, but that requires a large-scale spatial transcriptomics analysis with extensive optimisation for section capture, which is beyond the scope of this manuscript and this revision.

      (6) Conclusions in Figure 3, named "hGx specify cells with preHSC characteristics" are not supported by the data presented here. Again, I am not convinced that hematopoietic cells can be efficiently generated in this system, and certainly not HSCs or pre-HSCs.

      We have provided evidence in the original manuscript, and now through additional experiments, that there is haematopoietic specification, including of progenitor cells, in the haemogenic gastruloid system. Molecular markers are shown in revised Fig. 2F and Fig. 3 and supplements; CFC assays are shown in revised Fig. 2D-E; cytospins are in revised Fig. 2 S2A; further analysis of 4-week implants and new analysis of 8-week implants (discussed below) are in revised Fig. 4 B-D and Fig. 4 S2 and we discussed the new scRNA-seq projections above. Importantly, we have never claimed, and again do not, that haemogenic gastruloids generate HSC. We accept the Reviewer’s comment that we have not provided sufficient evidence for the specification of pre-HSC-like cells and accordingly now refer more generically and conservatively to progenitors.

      FACS analysis in 3A is again very unconvincing. I do not think the population identified as C-Kit+ CD144+ is real. Also, why not try gating the other way around, as commonly done (e.g. VE-Cad+ Kit+ and then CD41/CD45)?

      Our gating strategy is not unconventional: gating was done from a more populated gate onto the less abundant one to ensure that the results are numerically more robust. In the case of haemogenic gastruloids, unlike the AGM preparations the Reviewer may be referring to, CD41+ and CD45+ cells are more abundant as there is no circulation of more differentiated haematopoietic cells away from the endothelial structures. This said, we did perform the gating as suggested (Rev Fig. 2), indeed confirming that most VE-cad+ Kit+ cells are CD45+. Interestingly, VE-cad+ Kit- cells are predominantly CD41+, reinforcing the haematopoietic nature of these cells.

      The authors must have tried really hard, but the lack of short- or long-engraftment in a number of immunodeficient mouse models (lines 305-313) really suggests that no blood progenitors are generated in their system. I am not familiar with the adrenal gland transplant system, but it seems like a very non-physiological system for trying to assess the maturation of putative pre-HSCs. The data supporting the engraftment of these mice, essentially seen only by PCR and in some cases with a very low threshold for detection, are very weak, and again unconvincing. It is stated that "BFP engraftment of the Spl and BM by flow cytometry was very low level albeit consistently above control (Fig. S4E)" (lines 337-338). I do not think that two dots in a dot plot can be presented as evidence of engraftment.

      We have presented the data with full disclosure and do not deny that the engraftment achieved is low-level and short-term, indicating incomplete maturation of definitive haematopoietic progenitors in the current haemogenic gastruloid system. Indeed, not wanting to overstate the finding, we were deliberately conservative in our representative flow cytometry plots and focused on the PCR for sensitivity. We now present the full flow cytometry analysis for spleen, where we preserved more cells after the genomic DNA extraction (revised Fig. 4C), and call the Reviewer’s attention to the fact that detection of BFP+ cells by PCR and flow cytometry in the recipient animals is consistent between the 2 methods (revised Fig. 4C and D; full gels previously presented now in Fig. 4S2C; sensitivity analysis was also previously available and is now in Fig. 4S2B). In addition, we have now also been able to detect low-level myelo-lymphoid engraftment in the bone marrow and spleen 8 weeks after adrenal implantation, again suggesting the presence of a small number of definitive haematopoietic progenitors that potentially mature from the 3 haemogenic gastruloids implanted (Fig. 4E and 4 S2F-G in the revised manuscript). We rephrased Results and Discussion at lines 359-414 and 589-621, respectively, to clarify the nature of the engraftment, which we attribute to progenitors.

      (7) Given the above, I find that the foundations needed for extracting meaningful data from the system when perturbed are very shaky at best. Nevertheless, the authors proceed to overexpress MNX1 by LV transduction, a system previously shown to transform fetal liver cells, mimicking the effect of the t(7;12) AML-associated translocation. Comments on this section:

      The increase in the size of the organoid when MNX1 is expressed is a very unspecific finding and not necessarily an indication of any hematopoietic effect of MNX1 OE.

      We agree with the Reviewer on this point; it is nevertheless a reproducible observation which we thought relevant to describe for completeness and data reproducibility.

      The mild increase of cKit+ cells (Figure 4E) at the 144hr timepoint and the lack of any changes in CD41+ or CD45+ cells suggests that the increase in Kit+ cells % is not due to any hematopoietic effect of MNX1 OE. No hematopoietic GO categories are seen in RNA seq analysis, which supports this interpretation. Could it be that just endothelial cells are being generated?

      The Reviewer is correct that the MNX1-overexpressing cells have a strong endothelial signature, which is present in patients (revised Fig. 5A). We investigated a potential link with C-Kit by staining cells from the replating colonies during the process of in vitro transformation with CD31. We observed that 40-50% of C-Kit+ cells (20-30% total colony cells) co-expressed CD31, at least at early plating. These cells co-exist with haematopoietic cells, namely Ter119+ cells, as expected from the YS-like erythroid and EMP-like affiliation of haematopoietic output from 144h-haemogenic gastruloids. These data are included in Fig. 6S1A-B (in text 506-507) of the revised manuscript.

      (8) There seems to be a relatively convincing increase in replating potential upon MNX1-OE, but this experiment has been poorly characterized. What type of colonies are generated? What exactly is the "proportion of colony forming cells" in Figures 5B-D? The colony increase is accompanied by an increase in Kit+ cells; however, the flow cytometry analysis has not been quantified.

      Given the inability to replate control EV cells, there is not a population to compare with in terms of quantification. The level of C-Kit+ represented in Fig. 6E of the revised manuscript is achieved at plate 2 or 3 (depending on the experiment), both of which are significantly enriched for colony-forming cells relative to control (revised Fig. 6B, D).  

      (9) Do hGx cells engraft upon MNX1-OE? This experiment, which appears not to have been performed, is essential to conclude that leukemic transformation has occurred.

      For the purpose of this study, we are satisfied with confirmation of in vitro transformation potential of MNX1 haemogenic gastruloids, which can be used for screening purposes. Although interesting, in vivo leukaemia engraftment from haemogenic gastruloids is beyond the scope of this study.

      Reviewer #2 (Recommendations for the authors):

      (1) Minor comments

      (a) I find the denomination "hGx" very confusing as it would suggest that these gastruloids are human, whereas, in fact, they are murine.

      We agree with the Reviewer that the nomenclature was confusing and have edited the manuscript to use “haemGx” instead.

      (b) I find the presence of mast cells in CFC of MNX1-OE cultures very puzzling as this does not bear any resemblance to human leukemia.

      We detect an enrichment of mast cell transcriptional programmes, as defined by the cell type repositories. While mast cells do not represent leukaemic cells in patients, this ontology is likely to reflect the developmental stage and origin of the progenitors that are affected by MNX1.

      (2) I have a few suggestions to improve figures and tables clarity, to help readers better follow the data presented.

      (a) To enhance readability, it would be beneficial to highlight the genes mentioned in the text within the scRNA-seq figures. Many figures currently display over 30-40 genes in small font sizes, making it difficult to quickly locate specific genes discussed in the text. Additionally, implementing a color-coding system to categorize these genes according to their proposed lineages would improve clarity and organization.

      We have now performed major re-organisation and re-analyses of the scRNA-seq data, which we believe has improved the readability and clarity of the corresponding sections of the manuscript.

      (b) The data presented in Supplementary Table 1, along with other supplementary tables, are challenging to interpret due to insufficient annotations. Enhancing these tables with clearer and more detailed annotations would significantly improve clarity and aid readers in understanding the supplementary materials.

      Descriptive text has been added to accompany each Supplementary File to aid in understanding the results reported therein.

      Reviewer #3 (Recommendations for the authors):

      In addition to what was written in the public review, I would suggest the authors simplify and shorten the text. Currently, a lot of unnecessary detail is included which makes the story very hard to follow. Moreover, the authors should modify the figures to make them more comprehensible, especially for RNA-seq data.

      We have significantly re-arranged and shortened parts of the manuscript, particularly by focusing the Discussion. Presentation of the Results has also been improved through additional analysis and graphic representation of the scRNA-seq data, which we believe has improved readability and clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Overall, the conclusions of the paper are mostly supported by the data but may be overstated in some cases, and some details are also missing or not easily recognizable within the figures. The provision of additional information and analyses would be valuable to the reader and may even benefit the authors' interpretation of the data. 

      We thank the reviewer for the thoughtful and constructive feedback. We are pleased that the reviewer found the overall conclusions of our paper to be well supported by the data, and we appreciate the suggestions for improving figure clarity and interpretive accuracy. Below, we address each point with corresponding revisions.

      The conclusion that DREADD expression gradually decreases after 1.5-2 years is only based on a select few of the subjects assessed; in Figure 2, it appears that only 3 hM4Di cases and 2 hM3Dq cases are assessed after the 2-year timepoint. The observed decline appears consistent within the hM4Di cases, but not for the hM3Dq cases (see Figure 2C: the AAV2.1-hSyn-hM3Dq-IRES-AcGFP line is increasing after 2 years.) 

      We agree that our interpretation should be stated more cautiously, given the limited number of cases assessed beyond the two-year timepoint. In the revised manuscript, we have clarified in the Results that the observed decline is based on a subset of animals. We have also added text stating that while a consistent decline was observed in hM4Di-expressing monkeys, the trajectory for hM3Dq expression was more variable, with at least one case showing an increased signal beyond two years.

      Revised Results section:

      Lines 140, “hM4Di expression levels remained stable at peak levels for approximately 1.5 years, followed by a gradual decline observed in one case after 2.5 years, and after approximately 3 years in the other two cases (Figure 2B, a and e/d, respectively). Compared with hM4Di expression, hM3Dq expression exhibited greater post-peak fluctuations. Nevertheless, it remained at ~70% of peak levels after about 1 year. This post-peak fluctuation was not significantly associated with the cumulative number of DREADD agonist injections (repeated-measures two-way ANOVA, main effect of activation times, F<sub>(1,6)</sub> = 5.745, P = 0.054). Beyond 2 years post-injection, expression declined to ~50% in one case, whereas another case showed an apparent increase (Figure 2C, c and m, respectively).”

      Given that individual differences may affect expression levels, it would be helpful to see additional labels on the graphs (or in the legends) indicating which subject and which region are being represented for each line and/or data point in Figure 1C, 2B, 2C, 5A, and 5B. Alternatively, for Figures 5A and B, an accompanying table listing this information would be sufficient. 

      We thank the reviewer for these helpful suggestions. In response, we have revised the relevant figures (Fig. 1C, 2B, 2C, and 5) as noted in the “Recommendations for the authors”, including simplifying visual encodings and improving labeling. We have also updated Table 2 to explicitly indicate the animal ID and brain regions associated with each data point shown in the figures.

      While the authors comment on several factors that may influence peak expression levels, including serotype, promoter, titer, tag, and DREADD type, they do not comment on the volume of injection. The range in volume used per region in this study is between 2 and 54 microliters, with larger volumes typically (but not always) being used for cortical regions like the OFC and dlPFC, and smaller volumes for subcortical regions like the amygdala and putamen. This may weaken the claim that there is no significant relationship between peak expression level and brain region, as volume may be considered a confounding variable. Additionally, because of the possibility that larger volumes of viral vectors may be more likely to induce an immune response, which the authors suggest as a potential influence on transgene expression, not including volume as a factor of interest seems to be an oversight. 

      We thank the reviewer for raising this important issue. We agree that injection volume could act as a confounding variable, particularly since larger volumes were used only in handheld cortical injections. This overlap makes it difficult to disentangle the effect of volume from those of brain region or injection method. Moreover, data points associated with these larger volumes also deviated when volume was included in the model.

      To address this, we performed a separate analysis restricted to injections delivered via microinjector, where a comparable volume range was used across cases. In this subset, we included injection volume as an additional factor in the model and found that volume did not significantly impact peak expression levels. Instead, the presence of co-expressed protein tags remained a significant predictor, while viral titer no longer showed a significant effect. These updated results have replaced the originals in the revised Results section and in the new Figure 5. We have also revised the Discussion to reflect these updated findings.

      The authors conclude that vectors encoding co-expressed protein tags (such as HA) led to reduced peak expression levels, relative to vectors with an IRES-GFP sequence or with no such element at all. While interesting, this finding does not necessarily seem relevant for the efficacy of long-term expression and function, given that the authors show in Figures 1 and 2 that peak expression (as indicated by a change in binding potential relative to non-displaced radioligand, or ΔBPND) appears to taper off in all or most of the constructs assessed. The authors should take care to point out that the decline in peak expression should not be confused with the decline in longitudinal expression, as this is not clear in the discussion; i.e. the subheading, "Factors influencing DREADD expression," might be better written as, "Factors influencing peak DREADD expression," and subsequent wording in this section should specify that these particular data concern peak expression only. 

      We appreciate this important clarification. In response, we have revised the title to "Protein tags reduce peak DREADD expression levels" in the Results section and “Factors influencing peak DREADD expression levels” in the Discussion section. Additionally, we specified that our analysis focused on peak ΔBP<sub>ND</sub> values around 60 days post-injection. We have also explicitly distinguished these findings from the later-stage changes in expression seen in the longitudinal PET data in both the Results and Discussion sections.

      Reviewer #1 (Recommendations for the authors):

      (1) Will any of these datasets be made available to other researchers upon request?

      All data used to generate the figures have been made publicly available via our GitHub repository (https://github.com/minamimoto-lab/2024-Nagai-LongitudinalPET.git). This has been stated in the "Data availability" section in the revised manuscript.

      (2) Suggested modifications to figures:

      a) In Figures 2B and C, the inclusion of "serotype" as a separate legend with individual shapes seems superfluous, as the serotype is also listed as part of the colour-coded vector

      We agree that the serotype legend was redundant since this information is already included in the color-coded vector labels. In response, we have removed the serotype shape indicators and now represent the data using only vector-construct-based color coding for clarity in Figure 2B and C.

      b) In Figures 3A and B, it would be nice to see tics (representing agonist administration) for all subjects, not just the two that are exemplified in panels C-D and F-H. Perhaps grey tics for the non-exemplified subjects could be used.

      In response, we have included black and white ticks to indicate all agonist administration across all subjects in Figure 3A and B, with the type of agonist clearly specified. 

      c) In Figure 4C, a Nissl- stained section is said to demonstrate the absence of neuronal loss at the vector injection sites. However, if the neuronal loss is subtle or widespread, this might not be easily visualized by Nissl. I would suggest including an additional image from the same section, in a non-injected cortical area, to show there is no significant difference between the injected and non-injected region.

      To better demonstrate the absence of neuronal loss at the injection site, we have included an image from the contralateral, non-injected region of the same section for comparison (Fig. 4C).

      d) In Figure 5A: is it possible that the hM3Dq construct with a titer of 5×10^13 gc/ml is an outlier, relative to the other hM3Dq constructs used?

      We thank the reviewer for raising this important observation. To evaluate whether the high-titer construct represented a statistical outlier that might artifactually influence the observed trends, we performed a permutation-based outlier analysis. This assessment identified the point in question, as well as one additional case (titer 4.6 x 10e13 gc/ml, #255, L_Put), as significant outliers relative to the distribution of the dataset.

      Accordingly, we excluded these two data points from the analysis. Importantly, this exclusion did not meaningfully alter the overall trend or the statistical conclusions; specifically, the significant effect of co-expressed protein tags on peak expression levels remains robust. We have updated the Methods section to describe this outlier handling and added a corresponding note in the figure legend.

      Reviewer #2 (Public review): 

      Weaknesses 

      This study is a meta-analysis of several experiments performed in one lab. The good side is that it combined a large amount of data that might not have been published individually; the downside is that all things were not planned and equated, creating a lot of unexplained variances in the data. This was yet judiciously used by the authors, but one might think that planned and organized multicentric experiments would provide more information and help test more parameters, including some related to inter-individual variability, and particular genetic constructs. 

      We thank the reviewer for bringing this important point to our attention. We fully acknowledge that the retrospective nature of our dataset, compiled from multiple studies conducted within a single laboratory, introduces variability related to differences in injection parameters and scanning timelines. While this reflects the practical realities and constraints of long-term NHP research, we agree that more standardized and prospectively designed studies would better control such sources of variance. To address this, we have added the following statement to the "Technical consideration" section in Discussion:

      Lines 297, "This study included a retrospective analysis of datasets pooled from multiple studies conducted within a single laboratory, which inherently introduced variability across injection parameters and scan intervals. While such an approach reflects real-world practices in long-term NHP research, future studies, including multicenter efforts using harmonized protocols, will be valuable for systematically assessing inter-individual differences and optimizing key experimental parameters."

      Reviewer #2 (Recommendations for the authors):

      I just have a few minor points that might help improve the paper:

      (1) Figure 1C y-axis label: should add deltaBPnd in parentheses for clarity.

      We have added “ΔBP<sub>ND</sub>” to the y-axis label for clarity.

      The choice of a sigmoid curve is the simplest clear fit, but it doesn't really consider the presence of the peak described in the paper. Would there be a way to fit the dynamic including fitting the peak?

      We agree that using a simple sigmoid curve for modeling expression dynamics is a limitation. In response to this and a similar comment from Reviewer #3, we tested a double logistic function (as suggested) to see if it better represented the rise and decline pattern. However, as described below, the original simple sigmoid curve was a better fit for the data. We have included a discussion regarding this limitation of this analysis. See Reviewer #3 recommendations (2) for details.
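
      To make the two candidate models concrete, they can be written out as below. This is only a minimal sketch, assuming the double logistic form quoted in Reviewer #3's recommendation (2); the parameter values are purely illustrative, are not fitted to the PET data, and the function names are hypothetical.

```python
import math

def sigmoid(t, a, b, c, d):
    """Single logistic: baseline a, amplitude b, rate c, midpoint d (days)."""
    return a + b / (1.0 + math.exp(-c * (t - d)))

def double_logistic(t, a, b, c, d, e, g, h):
    """First logistic models the rise to peak; the second is subtracted
    to model a partial decline of size e around day h."""
    rise = b / (1.0 + math.exp(-c * (t - d)))
    decline = e / (1.0 + math.exp(-g * (t - h)))
    return a + rise - decline

# Illustrative parameters only (assumptions, not values from the study).
rise_params = dict(a=0.0, b=1.0, c=0.2, d=30.0)
decline_params = dict(e=0.3, g=0.01, h=900.0)

for day in (0, 30, 60, 900, 2000):
    s = sigmoid(day, **rise_params)
    dl = double_logistic(day, **rise_params, **decline_params)
    print(f"day {day:4d}: sigmoid={s:.3f}  double_logistic={dl:.3f}")
```

      With these example values the single sigmoid plateaus after the rise, whereas the double logistic falls back towards a lower level at late timepoints; which form describes real data better is an empirical question that depends on the fit.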

      The colour scheme in Figure 1C should be changed to make things clearer, and maybe use another dimension (like dotted lines) to separate hM4Di from hM3Dq.

      We have improved the visual clarity of Figure 1C by modifying the color scheme to represent vector construct and using distinct line types (dashed for hM4Di and solid for hM3Dq data) to separate DREADD type.

      (2) Figure 2

      I don't understand how the referencing to 100 was made: was it by selecting the overall peak value or the peak value observed between 40 and 80 days? If the former then I can't see how some values are higher than the peak. If the second then it means some peak values occurred after 80 days and data are not completely re-aligned.

      We thank the reviewer for the opportunity to clarify this point. The normalization was based on the peak value observed between 40–80 days post-injection, as this window typically captured the peak expression phase in our dataset (see Figure 1). However, in some long-term cases where PET scans were limited during this period (e.g., with only one scan performed at day 40), it is possible that the actual peak occurred later. Therefore, instances where ΔBP<sub>ND</sub> values slightly exceeded the reference peak at later time points likely reflect this sampling limitation. We have clarified this methodological detail in the revised Results section to improve transparency.
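
      The normalization described above can be illustrated with a short sketch. It assumes each animal's data are a list of (day post-injection, ΔBP<sub>ND</sub>) pairs; the function name and the numbers are hypothetical and only show how a value above 100% can arise when the only scan inside the reference window precedes the true peak.

```python
def normalize_to_peak(scans, window=(40, 80)):
    """Express each scan as a percentage of the largest value observed
    inside the reference window (days post-injection, inclusive)."""
    lo, hi = window
    in_window = [value for day, value in scans if lo <= day <= hi]
    if not in_window:
        raise ValueError("no scan inside the reference window")
    reference = max(in_window)
    return [(day, 100.0 * value / reference) for day, value in scans]

# Hypothetical animal with a single scan in the 40-80 day window (day 40)
# and a higher value at a later scan.
scans = [(40, 0.8), (120, 1.0), (700, 0.6)]
for day, percent in normalize_to_peak(scans):
    print(f"day {day:3d}: {percent:.1f}% of reference peak")
```

      In this example the day-120 scan exceeds 100% because the reference value was taken from the day-40 scan, which mirrors the sampling limitation described in the response.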

      The methods section mentions the use of CNO but this is not in the main paper which seems to state that only DCZ was used: the authors should clarify this

      Although DCZ was the primary agonist used, CNO and C21 were also used in a few animals (e.g., monkeys #153, #221, and #207) for behavioral assessments. We have clarified this in the Results section and revised Figure 3 to indicate the specific agonist used for each subject. Additionally, we have updated the Methods section to clearly specify the use and dosage of DCZ, CNO, and C21, to avoid any confusion regarding the experimental design.

      Reviewer #3 (Public review): 

      Minor weaknesses are related to a few instances of suboptimal phrasing, and some room for improvement in time course visualization and quantification. These would be easily addressed in a revision. These findings will undoubtedly have a very significant impact on the rapidly growing but still highly challenging field of primate chemogenetic manipulations. As such, the work represents an invaluable resource for the community.

      We thank the reviewer for the positive assessment of our manuscript and for the constructive suggestions. We address each comment in the following point-by-point responses and have revised the manuscript accordingly.

      Reviewer #3 (Recommendations for the authors):

      (1) Please clarify the reasoning behind restricting the analysis in Figure 1 to only 7 monkeys with subcortical AAV injection.

      We focused the analysis shown in Figure 1 on 7 monkeys with subcortical AAV injections that received comparable injection volumes. These data were primarily part of vector test studies, allowing for repeated PET scans within 150 days post-injection. In contrast, monkeys with cortical injections, including larger volumes, were allocated to behavioral studies and therefore were not scanned as frequently during the early phase. We will clarify this rationale in the Results section.

      (2) Figure 1: Not sure if a simple sigmoid is the best model for these, mostly peaking and then descending somewhat, curves. I suggest testing a more complex model, for instance, double logistic function of a type f(t) = a + b/(1+exp(-c*(t-d))) - e/(1+exp(-g*(t-h))), with the first logistic term modeling the rise to peak, and the second term for partial decline and stabilization

      We appreciate the reviewer’s thoughtful suggestion to use a double logistic function to better model both the rising and declining phases of the expression curve. In response to this and similar comments from Reviewer #1, we tested the proposed model and found that, while it could capture the peak and subsequent decline, the resulting fit appeared less biologically plausible (see below). Moreover, model comparison using BIC favored the original simple sigmoid model (BIC = 61.1 vs. 62.9 for the simple and double logistic models, respectively). This information has been included in the revised figure legend for clarity.

      Given these results, we retained the original simple sigmoid function in the revised manuscript, as it provides a sufficient and interpretable approximation of the early expression trajectory—particularly the peak expression-time estimation, which was the main purpose of this analysis. We have updated the Methods section to clarify our modeling and rationale as follows:

      Lines 530, "To model the time course of DREADD expression, we used a single sigmoid function, referencing past in vivo fluorescent measurements (Diester et al., 2011). Curve fitting was performed using least squares minimization. For comparison, a double logistic function was also tested and evaluated using the Bayesian Information Criterion (BIC) to assess model fit."
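
      The model comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the four-parameter form of the single sigmoid, the least-squares form of the BIC (n·ln(RSS/n) + k·ln(n)), and all numerical values are assumptions made for the example, and the data points are synthetic. The double logistic follows the form proposed by the reviewer.

```python
import math


def single_sigmoid(t, a, b, c, d):
    """Single sigmoid: baseline a, amplitude b, slope c, midpoint d (assumed form)."""
    return a + b / (1.0 + math.exp(-c * (t - d)))


def double_logistic(t, a, b, c, d, e, g, h):
    """Double logistic from the reviewer's comment: rise term minus decline term."""
    rise = b / (1.0 + math.exp(-c * (t - d)))
    decline = e / (1.0 + math.exp(-g * (t - h)))
    return a + rise - decline


def bic_least_squares(observed, predicted, n_params):
    """BIC for a least-squares fit with Gaussian errors: n*ln(RSS/n) + k*ln(n)."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + n_params * math.log(n)


# Synthetic, made-up time points (days) and values; not data from the study.
days = [10, 20, 30, 40, 50, 60, 80, 100, 120, 150]
offsets = [0.5, -0.4, 0.3, -0.2, 0.6, -0.5, 0.2, -0.3, 0.4, -0.1]
sig_params = (0.0, 100.0, 0.1, 35.0)  # hypothetical a, b, c, d
observed = [single_sigmoid(t, *sig_params) + o for t, o in zip(days, offsets)]

pred_single = [single_sigmoid(t, *sig_params) for t in days]
# With e = 0 the decline term vanishes, so both models give the same residuals
# and only the parameter-count penalty differs.
pred_double = [double_logistic(t, *sig_params, 0.0, 0.1, 100.0) for t in days]

bic_single = bic_least_squares(observed, pred_single, n_params=4)
bic_double = bic_least_squares(observed, pred_double, n_params=7)
print(bic_single, bic_double)
```

      When the two models fit equally well, the BIC of the seven-parameter model exceeds that of the four-parameter model by 3·ln(n), which is why a more flexible curve must reduce the residuals appreciably before it is preferred.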

      We also acknowledge that a more detailed understanding of post-peak expression changes will require additional PET measurements, particularly between 60- and 120-days post-injection, across a larger number of animals. We have included this point in the revised Discussion to highlight the need for future work focused on finer-grained modeling of expression decline:

      Lines 317, “Although we modeled the time course of DREADD expression using a single sigmoid function, PET data from several monkeys showed a modest decline following the peak. While the sigmoid model captured the early-phase dynamics and offered a reliable estimate of peak timing, additional PET scans—particularly between 60- and 120-days post-injection—will be essential to fully characterize the biological basis of the post-peak expression trajectories.”

      Author response image 1.

      (3) Figure 2: It seems that the individual curves are for different monkeys, I counted 7 in B and 8 in C, why "across 11 monkeys"? Were there several monkeys both with hM4Di and hM3Dq? Does not look like that from Table 1. Generally, I would suggest associating specific animals from Tables 1 and 2 to the panels in Figures 1 and 2.

      Some animals received multiple vector types, leading to more curves than individual subjects. We have revised the figure legends and updated Table 2 to explicitly relate each curve with the specific animal and brain region.

      (4) I also propose plotting the average of (interpolated) curves across animals, to convey the main message of the figure more effectively.

      We agree that plotting the mean of the interpolated expression curves would help convey the group trend. We added averaged curves to Figure 2BC.

      (5) Similarly, in line 155 "We assessed data from 17 monkeys to evaluate ... Monkeys expressing hM4Di were assessed through behavioral testing (N = 11) and alterations in neuronal activity using electrophysiology (N = 2)..." - please explain how 17 is derived from 11, 2, 5 and 1. It is possible to glean from Table 1 that the calculation is 11 (including 2 with ephys) + 5 + 1 = 17, but it might appear as a mistake if one does not go deep into Table 1.

      We have clarified in both the text and Table 1 that some monkeys (e.g., #201 and #207) underwent both behavioral and electrophysiological assessments, resulting in the overlapping counts. Specifically, the dataset includes 11 monkeys for hM4Di-related behavior testing (two of which underwent electrophysiology testing), 5 monkeys assessed for hM3Dq with FDG-PET, and 1 monkey assessed for hM3Dq with electrophysiology, totaling 19 assessments across 17 monkeys. We have revised the Results section to make this distinction more explicit to avoid confusion, as follows:

      Lines 164, "Monkeys expressing hM4Di (N = 11) were assessed through behavioral testing, two of which also underwent electrophysiological assessment. Monkeys expressing hM3Dq (N = 6) were assessed for changes in glucose metabolism via [<sup>18</sup>F]FDG-PET (N = 5) or alterations in neuronal activity using electrophysiology (N = 1)."

      (6) Line 473: "These stock solutions were then diluted in saline to a final volume of 0.1 ml (2.5% DMSO in saline), achieving a dose of 0.1 ml/kg and 3 mg/kg for DCZ and CNO, respectively." Please clarify: the injection volume was always 0.1 ml? then it is not clear how the dose can be 0.1 ml/kg (for a several kg monkey), and why DCZ and CNO doses are described in ml/kg vs mg/kg?

      We thank the reviewer for pointing out this ambiguity. We apologize for the oversight and also acknowledge that we omitted mention of C21, which was used in a small number of cases. To address this, we have revised the “Administration of DREADD agonist” section of the Methods to clearly describe the preparation, the volume, and dosage for each agonist (DCZ, CNO, and C21) as follows:

      Lines 493, “Deschloroclozapine (DCZ; HY-42110, MedChemExpress) was the primary agonist used. DCZ was first dissolved in dimethyl sulfoxide (DMSO; FUJIFILM Wako Pure Chemical Corp.) and then diluted in saline to a final volume of 1 mL, with the final DMSO concentration adjusted to 2.5% or less. DCZ was administered intramuscularly at a dose of 0.1 mg/kg for hM4Di activation, and at 1–3 µg/kg for hM3Dq activation. For behavioral testing, DCZ was injected approximately 15 min before the start of the experiment unless otherwise noted. Fresh DCZ solutions were prepared daily.

      In a limited number of cases, clozapine-N-oxide (CNO; Toronto Research Chemicals) or Compound 21 (C21; Tocris) was used as an alternative DREADD agonist for some hM4Di experiments. Both compounds were dissolved in DMSO and then diluted in saline to a final volume of 2–3 mL, also maintaining DMSO concentrations below 2.5%. CNO and C21 were administered intravenously at doses of 3 mg/kg and 0.3 mg/kg, respectively.”
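
      To make the distinction between dose (mg/kg) and injection volume (mL) concrete, the arithmetic can be sketched as below. The doses, final volume, and DMSO limit are taken from the revised Methods text quoted above; the body weight and the helper function names are hypothetical and serve only as an illustration.

```python
def agonist_mass_mg(dose_mg_per_kg, body_weight_kg):
    """Total agonist mass to administer for a given dose and body weight."""
    return dose_mg_per_kg * body_weight_kg


def max_dmso_volume_ml(final_volume_ml, max_dmso_fraction=0.025):
    """Largest DMSO volume that keeps the final solution at or below 2.5% DMSO."""
    return final_volume_ml * max_dmso_fraction


weight_kg = 5.0  # hypothetical body weight, not a value from the study

dcz_mg = agonist_mass_mg(0.1, weight_kg)  # DCZ for hM4Di activation: 0.1 mg/kg
cno_mg = agonist_mass_mg(3.0, weight_kg)  # CNO: 3 mg/kg
c21_mg = agonist_mass_mg(0.3, weight_kg)  # C21: 0.3 mg/kg

dcz_dmso_ml = max_dmso_volume_ml(1.0)     # DCZ is diluted to a final volume of 1 mL

print(dcz_mg, cno_mg, c21_mg, dcz_dmso_ml)
```

      The injected volume is fixed by the preparation, whereas the dissolved mass scales with body weight, so the dose is naturally expressed in mg/kg rather than mL/kg.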

      (7) Figure 5A: What do regression lines represent? Do they show a simple linear regression (then please report statistics such as R-squared and p-values), or is it related to the linear model described in Table 3 (but then I am not sure how separate DREADDs can be plotted if they are one of the factors)?

      We thank the reviewer for the insightful question. In the original version of Figure 5A, the regression lines represented simple linear fits used to illustrate the relationship between viral titer and peak expression levels, based on our initial analysis in which titer appeared to have a significant effect without any notable interaction with other factors (such as DREADD type).

      However, after conducting a more detailed analysis that incorporated injection volume as an additional factor and excluded cortical injections and statistical outliers (as suggested by Reviewer #1), viral titer was no longer found to significantly predict peak expression levels. Consequently, we revised the figure to focus on the effect of reporter tag, which remained the most consistent and robust predictor in our model.

      In the updated Figure 5, we have removed the relationship between viral titer and expression level, along with its regression lines.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The objective of this study was to infer the population dynamics (rates of differentiation, division, and loss) and lineage relationships of clonally expanding NK cell subsets during an acute immune response. 

      Strengths: 

      A rich dataset and thorough analysis of a particular class of stochastic models. 

      We thank the reviewer for the positive comment.

      Weaknesses: 

      The stochastic models used are quite simple; each population is considered homogeneous with first-order rates of division, death, and differentiation. In Markov process models such as these, there is no dependence of cellular behavior on its history of divisions. In recent years models of clonal expansion and diversification, in the settings of T and B cells, have progressed beyond this picture. So I was a little surprised that there was no mention of the literature exploring the role of replicative history in differentiation (e.g. Bresser Nat Imm 2022), nor of the notion of family 'division destinies' (either in division number or the time spent proliferating, as described by the Cyton and Cyton2 models developed by Hodgkin and collaborators; e.g. Heinzel Nat Imm 2017). The emerging view is that variability in clone (family) size may arise predominantly from the signals delivered at activation, which dictate each precursor's subsequent degree of expansion, rather than from the fluctuations deriving from division and death modeled as Poisson processes. 

      As you pointed out, the Gerlach and Buchholz Science papers showed evidence for highly skewed distributions of family sizes and correlations between family size and phenotypic composition. Is it possible that your observed correlations could arise if the propensity for immature CD27+ cells to differentiate into mature CD27- cells increases with division number? The relative frequency of the two populations would then also be impacted by differences in the division rates of each subset - one would need to explore this. But depending on the dependence of the differentiation rate on division number, there may be parameter regimes (and time points) at which the more differentiated cells can predominate within large clones even if they divide more slowly than their immature precursors. One might not then be able to rule out the two-state model. I would like to see a discussion or rebuttal of these issues. 

      We thank the reviewer for the insightful comment and for drawing our attention to the Cyton models. We have discussed the Cyton models in the Introduction (lines 80-95) and the Discussion (lines 538-553) sections of the revised manuscript and carried out simulations for the variant of the Cyton model suggested by the reviewer. The two-state model showed that, for certain parameters, it can give rise to a negative correlation between the clone size and the percentage of immature (CD27+) NK cells in the absence of any death, suggesting the potential importance of division destiny along with stochastic fluctuations in giving rise to the heterogeneity observed in NK cell clone size distributions in the expansion phase. In addition, we also considered a two-state model where the NK cell activation time in individual cells varies following a log-normal distribution; this two-state model also shows the presence of negative correlations between clone sizes and the percentage of immature NK cells within the clones. We have added new results (Figs. S2-3) and discussed the results in the Results (lines 223-232) and the Discussion (lines 538-553) sections. We believe these additional simulations provide new insights into the results we obtained with our two- and three-state models. 
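
      As an illustration of the second variant described above, the following is a minimal Gillespie-type sketch of a two-state model without death in which each clone starts expanding after a log-normally distributed activation delay. It is not the authors' simulation code: all rate constants, the delay distribution parameters, the number of clones, and the cell cap are hypothetical, and the sign and size of the resulting correlation depend on these choices.

```python
import math
import random


def simulate_clone(rng, t_end, k_div_imm, k_div_mat, k_diff,
                   t_start=0.0, max_cells=100000):
    """Gillespie simulation of one clone founded by a single immature cell.

    Immature cells divide (k_div_imm) or differentiate (k_diff); mature cells
    divide (k_div_mat). Returns (n_immature, n_mature) at time t_end.
    """
    n_imm, n_mat = 1, 0
    t = t_start
    while n_imm + n_mat < max_cells:
        rates = (k_div_imm * n_imm, k_diff * n_imm, k_div_mat * n_mat)
        total = sum(rates)
        if total == 0.0:
            break
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t >= t_end:
            break
        u = rng.random() * total     # choose which reaction fires
        if u < rates[0]:
            n_imm += 1               # immature division
        elif u < rates[0] + rates[1]:
            n_imm -= 1               # differentiation to mature
            n_mat += 1
        else:
            n_mat += 1               # mature division
    return n_imm, n_mat


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
clones = []
for _ in range(200):
    delay = rng.lognormvariate(0.0, 0.5)  # hypothetical activation delay (days)
    clones.append(simulate_clone(rng, 8.0, 1.0, 0.8, 0.3, t_start=delay))

sizes = [imm + mat for imm, mat in clones]
pct_immature = [100.0 * imm / (imm + mat) for imm, mat in clones]
r = pearson([math.log10(s) for s in sizes], pct_immature)
print(r)
```

      Scanning the rates and the delay distribution in such a sketch is one way to see for which parameter regimes the correlation between clone size and the percentage of immature cells becomes negative.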

      Reviewer #2 (Public review): 

      Summary: 

      Wethington et al. investigated the mechanistic principles underlying antigen-specific proliferation and memory formation in mouse natural killer (NK) cells following exposure to mouse cytomegalovirus (MCMV), a phenomenon predominantly associated with CD8+ T cells. Using a rigorous stochastic modeling approach, the authors aimed to develop a quantitative model of NK cell clonal dynamics during MCMV infection. 

      Initially, they proposed a two-state linear model to explain the composition of NK cell clones originating from a single immature Ly49+CD27+ NK cell at 8 days post-infection (dpi). Through stochastic simulations and analytical investigations, they demonstrated that a variant of the two-state model incorporating NK cell death could explain the observed negative correlation between NK clone sizes at 8 dpi and the percentage of immature (CD27+) NK cells (Page 8, Figure 1e, Supplementary Text 1). However, this two-state model failed to accurately reproduce the first (mean) and second (variance and covariance) moments of the measured CD27+ and CD27- NK cell populations within clones at 8 dpi (Figure 1g). 

      To address this limitation, the authors increased the model's complexity by introducing an intermediate maturation state, resulting in a three-stage model with the transition scheme: CD27+Ly6C- → CD27-Ly6C- → CD27-Ly6C+. This three-stage model quantitatively fits the first and second moments under two key constraints: (i) immature CD27+ NK cells exhibit faster proliferation than CD27- NK cells, and (ii) there is a negative correlation (upper bound: -0.2) between clone size and the fraction of CD27+ cells. The model predicted a high proliferation rate for the intermediate stage and a high death rate for the mature CD27-Ly6C+ cells. 

      Using NK cell reporter mice data from Adams et al. (2021), which tracked CD27+/- cell population dynamics following tamoxifen treatment, the authors validated the three-stage model. This dataset allowed discrimination between NK cells originating from the bone marrow and those pre-existing in peripheral blood at the onset of infection. To test the prediction that mature CD27- NK cells have a higher death rate, the authors measured Ly49H+ NK cell viability in the mouse spleen at different time points post-MCMV infection. Experimental data confirmed that mature (CD27-) NK cells exhibited lower viability compared to immature (CD27+) NK cells during the expansion phase (days 4-8 post-infection). 

      Further mathematical analyses using a variant of the three-stage model supported the hypothesis that the higher death rate of mature CD27- cells contributes to a larger proportion of CD27- cells in the dead cell compartment, as introduced in the new variant model. 

      Altogether, the authors proposed a three-stage quantitative model of antigen-specific expansion and maturation of naïve Ly49H+ NK cells in mice. This model delineates a maturation trajectory: (i) CD27+Ly6C- (immature) → (ii) CD27-Ly6C- (mature I) → (iii) CD27-Ly6C+ (mature II). The findings highlight the highly proliferative nature of the mature I (CD27-Ly6C-) phenotype and the increased cell death rate characteristic of the mature II (CD27-Ly6C+) phenotype. 

      Strengths: 

      By designing models capable of explaining correlations, first and second moments, and employing analytical investigations, stochastic simulations, and model selection, the authors identified the key processes underlying antigen-specific expansion and maturation of NK cells. This model distinguishes the processes of antigen-specific expansion, contraction, and memory formation in NK cells from those observed in CD8+ T cells. Understanding these differences is crucial not only for elucidating the distinct biology of NK cells compared to CD8+ T cells but also for advancing the development of NK cell therapies currently under investigation. 

      We thank the reviewer for the positive comments.

      Weaknesses: 

      The conclusions of this paper are largely supported by the available data. However, a comparative analysis of model predictions with more recent works in the field would be desirable. Moreover, certain aspects of the simulations, parameter inference, and modeling require further clarification and expansion, as outlined below: 

      (1) Initial Conditions and Grassmann Data: The Grassmann data is used solely as a constraint, while the simulated values of CD27+/CD27- cells could have been directly fitted to the Grassmann data, which assumes a 1:1 ratio of CD27+/CD27- at t = 0. This approach would allow for an alternative initial condition rather than starting from a single CD27+ cell, potentially improving model applicability. 

      We fit the moments of the cell populations along with the ratio of resulting cells from an initial condition of a 1:1 ratio of CD27+/CD27- cells at t=0 in the model. The initial condition agrees with the experimental data. However, this fit produced parameter values that lead to greater growth of mature CD27- NK cells compared to that of immature CD27+ NK cells. This could result from the equal weights given to the ratio and to the different moments; a realistic parameter estimate could correspond to unequal weights between the ratio and the moments. Imposing the constraint Δ<sub>k</sub> > 0 in the fitting drives the parameter search into a region that appears to alleviate this issue and produces estimates of the rates consistent with higher growth of immature NK cells. We included Table S6 and an accompanying description to show this, as well as an additional section in the Materials and Methods (lines 669-676). 

      (2) Correlation Coefficients in the Three-State Model: Although the parameter scan of the three-state model (Figure 2) demonstrates the potential for achieving negative correlations between colony size and the fraction of CD27+ cells, the authors did not present the calculated correlation coefficients using the estimated parameter values from fitting the three-state model to the data. Including these simulations would provide additional insight into the parameter space that supports negative correlations and further validate the model.  

      We have included this figure (Figure 2d) in the revised manuscript.

      (3) Viability Dynamics and Adaptive Response: The authors measured the time evolution of CD27+/- dynamics and viability over 30 days post-infection (Figure 4). It would be valuable to test whether the three-state model can reproduce the adaptive response of CD27- cells to MCMV infection, particularly the observed drop in CD27- viability at 5 dpi (prior to the 8 dpi used in the study) and its subsequent rebound at 8 dpi. Reproducing this aspect of the experiment is critical to determine whether the model can simultaneously explain viability dynamics and moment dynamics. Furthermore, this analysis could enable sensitivity analysis of CD27- viability with respect to various model parameters. 

      We have compared the expansion kinetics of the adoptively transferred Ly49H+ NK cells (Figure 2) and endogenous Ly49H+ NK cells, where the endogenous NK cells show slower growth rates than their adoptively transferred counterparts (see lines 422-429). The data shown in Figure 4 refer to the relative percentage of the mature and immature endogenous NK cells, thus cannot be explained by the three-state model calibrated by the expansion of the adoptively transferred NK cells. One of the issues with using the viability data for parameter estimation for endogenous cells is the need to assume a model for dead cell clearance. We assume a model where dead cells are cleared according to a first-order decay reaction and vary the rate of this reaction to show that the qualitative results are in line with our model rates. This model cannot recreate the dip and rebound observed in the data, and instead monotonically and asymptotically approaches a percentage of live cells. We have attached a figure showing this behavior below. Rather, we intend to use this model as qualitative validation that the relative viability of mature NK cells is lower than that of immature NK cells. Models that include time-dependence of clearance of dead cells, or models with a higher-order (i.e. second) reaction for clearance of dead cells in which propensity for clearance is lower at early times and greater at later times may be better suited for this purpose but are beyond the scope of our validation. 
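
      The monotonic approach to a plateau under first-order clearance can be illustrated with a deliberately simplified sketch. It assumes a single exponentially growing population (division rate g, death rate d) whose dead cells are cleared at a first-order rate c, starting with no dead cells; this is a reduction of the multi-state model made only for illustration, and all rate values are hypothetical. Under these assumptions the dead-to-live ratio is d/(g − d + c)·(1 − exp(−(g − d + c)t)), so the live fraction falls monotonically towards (g − d + c)/(g + c).

```python
import math


def live_fraction(t, growth, death, clearance):
    """Fraction of live cells at time t for first-order death and clearance.

    Live cells: dL/dt = (growth - death) * L
    Dead cells: dD/dt = death * L - clearance * D, with D(0) = 0
    """
    net = growth - death + clearance
    if net <= 0.0:
        raise ValueError("this closed form assumes growth - death + clearance > 0")
    dead_per_live = death / net * (1.0 - math.exp(-net * t))
    return 1.0 / (1.0 + dead_per_live)


# Hypothetical rates (per day), chosen only to illustrate the shape of the curve.
growth, death, clearance = 1.0, 0.3, 0.5
times = list(range(0, 31))
viability = [live_fraction(t, growth, death, clearance) for t in times]
plateau = (growth - death + clearance) / (growth + clearance)
print(viability[0], viability[-1], plateau)
```

      Because the curve can only decrease towards its plateau, a model of this type cannot produce a dip followed by a rebound, which is consistent with the limitation stated above.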

      Author response image 1.

      Reviewer #1 (Recommendations for the authors):  

      I think the manuscript could be improved substantially by exploring alternative models that incorporate replicative history. At the very least it needs a deeper discussion of the literature relating to clonal expansion, putting the existing models in the context of these studies, and arguing convincingly that your conclusions are robust.  

      We have substantially expanded our explorations with alternative models; in particular, we considered a variant of the Cyton model suggested by Reviewer #1, a model where NK cells become activated at different times, and a model with asymmetric NK cell division. We have shown the results (Figs. S2-3) in the Supplementary material and discussed the results in the Results and Discussion sections. Please refer to our response #1 to Reviewer #1 for more details. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Possible Typo (Page 12, Line 254): 

      The phrase: "immature NK cells compared to their immature counterparts" appears to contain a typo. Consider rephrasing for clarity. 

      Done. Thanks for finding this. 

      (2) Clarification of Data Source and Computational Procedure: 

      In the statement: "The NK cell clones reported by Flommersfeld et al. contained mixtures of CD27+ and CD27- NK cells. We evaluated the percentage of CD27+ NK cells in each clone and computed the correlation (Csize-CD27+) of the size of the clone with the percentage of CD27+ NK cells in the clones." Please clarify the data source and computational methodology for evaluating the percentage of CD27+ cells within clones. Additionally, consider including the curated data in the supplementary materials. Since the data originates from different immune compartments, explain which compartments were used. If data from all compartments were included, discuss how the calculated correlation changes when stratifying data from different sources (e.g., spleen and lymph nodes).  

      We have clarified the data source (spleen) where appropriate.

      (3) Figure 1b (Correlation Coefficient): 

      While the correlation coefficient with p-value is mentioned, it would be beneficial to also provide the standard deviation of the correlation coefficient and a 95% confidence band for the fitted line. This is particularly relevant as the authors use -0.2 as the upper bound for the correlation coefficient when fitting the three-stage model. 

      We have included the CI and the p-value for the correlation shown in Figure 1b. The version of the figure with the 95% confidence band (appended below), in which both axes are on a linear scale, is not as visually clear as Figure 1b, in which the clone sizes are shown on a log scale. Thus, we did not include the confidence band in Figure 1b but display the CI and p-value on the figure. If the reviewer prefers, we can include the figure with the confidence band in the SI.

      Author response image 2.
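
      One common way to attach a confidence interval to a Pearson correlation coefficient is the Fisher z-transformation, sketched below. The response does not state how the CI in Figure 1b was computed, so this is an assumption offered for illustration; the values of r and n in the example are hypothetical.

```python
import math


def pearson_ci_fisher(r, n, z_crit=1.959963984540054):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("the Fisher approximation requires n > 3")
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical correlation and sample size, not values from the study.
lower, upper = pearson_ci_fisher(-0.4, 50)
print(lower, upper)
```

      Comparing the upper limit of such an interval with the bound of -0.2 used in the fitting constraint is one way to judge how conservative that bound is.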

      (4) Confidence Intervals in Tables: 

      If confidence intervals in the tables are calculated using bootstrapping, please mention this explicitly in the table headings for clarity. 

      Done.

      (5) Figure 2d-e (Simulation Method): 

      Specify the simulation method used (e.g., stochastic simulation algorithm [SSA], as mentioned in the materials and methods). Panel (e) lacks a caption; please provide one. Additionally, it would be interesting to include the correlation between clone size and the fraction of CD27+ cells in the clones (similar to the experimental data from Flommersfeld et al., 2021). 

      Done.

      (6) Figure 3 (Confidence Band): 

      Include a 95% confidence band for the simulated values to enhance the interpretability of the plots. 

      Done.

      (7) Materials and Methods Section:  Include a mathematical formula defining the metrics described, ensuring clarity and precision. 

      Done. See newly added lines 587-599, as well as existing content in the Supplementary Materials.

      (8) Supplementary Text 1 (Numerical Integration and AICc): 

      The section "Numerical Integration of Master Equation and Calculation of the AICc" is well done. However, given that the master equation involves a system of 106 coupled ODEs, it would be highly appreciated if the authors provided the formulation in matrix representation for better comprehension. 

      We have included a supplementary text (Supplementary Text I) and a schematic figure within the text to provide the details.
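
      For readers less familiar with the model-selection criterion named in this section, the standard small-sample corrected Akaike information criterion is AICc = 2k − 2·ln(L) + 2k(k + 1)/(n − k − 1), where k is the number of fitted parameters, n the number of observations, and L the maximized likelihood. The sketch below implements this textbook formula; the log-likelihoods, parameter counts, and sample size in the example are hypothetical and are not taken from the manuscript.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike information criterion for small samples."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    correction = (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return aic + correction


# Hypothetical comparison of a simpler and a more complex model.
aicc_simple = aicc(log_likelihood=-100.0, n_params=3, n_obs=50)
aicc_complex = aicc(log_likelihood=-99.0, n_params=6, n_obs=50)
print(aicc_simple, aicc_complex)
```

      The model with the lower AICc is preferred; the correction term grows quickly as the number of parameters approaches the number of observations.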

      (9) Figure S7b (Three-State Model Validation): 

      Given that the three-state model fits the data, assess whether it can also fit the first and second-moment data effectively. This validation would strengthen the robustness of the model.

      Although we showed that the best fit of the clonal burst data (moments) vastly overestimates the growth rates of endogenous cells (Figure S9a, previously Figure S7a), we did not fully emphasize the differences in the datasets that make fitting both with the same parameters impossible. We have added additional text in the main text where Figure S9a is located (lines 427-429) to discuss this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Strengths: 

      Sarpaning et al. provide a thorough characterization of putative Rnt1 cleavage of mRNA in S. cerevisiae. Previous studies have discovered Rnt1 mRNA substrates anecdotally, and this global characterization expands the known collection of putative Rnt1 cleavage sites. The study is comprehensive, with several types of controls to show that Rnt1 is required for several of these cleavages.

      Weaknesses: 

      (1) Formally speaking, the authors do not show a direct role of Rnt1 in mRNA cleavage - no studies were done (e.g., CLIP-seq or similar) to define direct binding sites. Is the mutant Rnt1 expected to trap substrates? Without direct binding studies, the authors rely on genetics and structure predictions for their argument, and it remains possible that a subset of these sites is an indirect consequence of rnt1. This aspect should be addressed in the discussion.

      We have added to this point in the discussion, as requested. We do not, however, agree that CLIP-seq or other methods are needed to address this point, or would even be helpful in the question the reviewer raises. 

      Importantly, we show that recombinant Rnt1 purified from E. coli cleaves the same sites as those mapped in vivo. This does provide direct evidence that Rnt1 directly binds those RNAs. Furthermore, it shows that it can bind these RNAs without the need of other proteins. Our observation that many mRNAs are cleaved at -14 and +16 positions from NGNN stem loops to leave 2-nt 3’ overhangs provides further support that these are the products of an RNase III enzyme, and Rnt1 is the only family member in yeast. Thus, we disagree with the reviewer that our studies do not show direct targeting.

      CLIP-seq experiments would be valuable, but they would address a different point. CLIP-seq measures protein binding to RNA targets, and it is likely that Rnt1 binds some RNAs without cleaving them. In addition, only a transient interaction is needed for cleavage, and such transient interactions might not be readily detected by CLIP-seq. Thus, CLIP-seq would reveal the RNAs bound by Rnt1, but would not help identify which ones are cleaved. Catala et al. (2004) showed that the catalytically inactive mutant of Rnt1 carries out some functions that are important for the cell cycle. CLIP-seq studies would be valuable to determine these non-catalytic roles of Rnt1, but we consider those questions beyond the scope of the current study.

      (2) The comprehensive list of putative Rnt1 mRNA cleavage sites is interesting insofar as it expands the repertoire of Rnt1 on mRNAs, but the functional relevance of the majority of these sites remains unknown. Along these lines, the authors should present a more thorough characterization of putative Rnt1 sites recovered from in vitro Rnt1 cleavage.

      We have included new data that confirm that YDR514C cleavage by Rnt1 is relevant to yeast cell physiology. We show that YDR514C overexpression is indeed toxic, as we previously postulated. More importantly, we generated an allele of YDR514C that has synonymous mutations designed to disrupt the stem-loop recognized by Rnt1. We show that at 37 °C, both the wild-type and mutant allele are toxic to rnt1∆ cells, but that in cells that express Rnt1, the wild-type cleavable allele is more toxic than the allele with the mutated stem-loop. This genetic interaction provides strong evidence that cleavage of YDR514C by Rnt1 is relevant to cell physiology. 

      We have also added PARE analysis of poly(A)-enriched and poly(A)-depleted reactions and show that compared to Dcp2, Rnt1 preferentially targets poly(A)+ mRNAs, consistent with it targeting nuclear RNAs. We discuss in more detail that by cleaving nuclear RNA, Rnt1 provides a kinetic proofreading mechanism for mRNA export competence.

      (3) The authors need to corroborate the rRNA 3'-ETS tetraloop mutations with a northern analysis of 3'-ETS processing to confirm an ETS processing defect (which might need to be done in decay mutants to stabilize the liberated ETS fragment). They state that the tetraloop mutation does not yield a growth defect and use this as the basis for concluding that rRNA cleavage is not the major role of Rnt1 in vivo, which is a surprising finding. But it remains possible that tetraloop mutations did not have the expected disruptive effect in vivo; if the ETS is processed normally in the presence of tetraloop mutations, it would undermine this interpretation. This needs to be more carefully examined.

      We have removed the rRNA 3'-ETS tetraloop mutations, because initial northern blot analysis indicated that Rnt1 cleavage is not completely blocked by the mutations we designed. Therefore, the reviewer is correct that tetraloop mutations did not have the expected disruptive effect in vivo. Future investigations will be required to fully understand this. This was a minor point, and removing it focuses the paper on its major contributions.

      (4) To support the assertion that YDR514C cleavage is required for normal "homeostasis," and more specifically that it is the major contributor to the rnt1∆ growth defect, the authors should express the YDR514C-G220S mutant in the rDNA∆ strains with mutations in the 3'-ETS (assuming they disrupt ETS processing, see above). This simple experiment should provide a relative sense of "importance" for one or the other cleavage being responsible for the rnt1∆ defect. Given the accepted role of Rnt1 cleavage in rRNA processing and a dogmatic view that this is the reason for the rnt1∆ growth defect, such a result would be surprising and elevate the functional relevance and significance of Rnt1 mRNA cleavage.

      We agree that the experiment proposed by the reviewer is very simple, but we are puzzled by the rationale. First, our experiments do not support that there is anything special about the G220S mutation in YDR514C. A complete loss of function (ydr514c∆) also suppresses the growth defect, suggesting that ydr514c-G220S is a simple loss of function allele. We have clarified that the G220S mutation is distant from the stem-loop recognized by Rnt1 and is unlikely to affect cleavage by Rnt1. Instead, Rnt1 cleavage and the G220S mutation are independent alternative ways to reduce Ydr514c function. We have clarified this point in the text. 

      As mentioned in response to point #3, we have included other additional experiments that address the same overall question raised here – the importance of YDR514C mRNA cleavage by Rnt1.    

      (5) Given that some Rnt1 mRNA cleavage is likely nuclear, it is possible that some of these targets are nascent mRNA transcripts, as opposed to mature but unexported mRNA transcripts, as proposed in the manuscript. A role for Rnt1 in co-transcriptional mRNA cleavage would be conceptually similar to Rnt1 cleavage of the rRNA 3'-ETS to enable RNA Pol I "torpedo" termination by Rat1, described by Proudfoot et al (PMID 20972219). To further delineate this point, the authors could e.g., examine the poly-A tails on abundant Rnt1 targets to establish whether they are mature, polyadenylated mRNAs (e.g., northern analysis of oligo-dT purified material). A more direct test would be PARE analysis of oligo-dT enriched or depleted material to determine the poly-A status of the cleavage products. Alternatively, their association with chromatin could be examined. 

      We have added the requested PARE analysis of oligo-dT enriched or depleted material to determine the polyA status of the cleavage products and related discussions. These confirm our proposal that Rnt1 cleaves mature but unexported mRNA transcripts.

      We also note that the northern blots shown in figures 2E, 4C, and 5B use oligo dT selected RNA because the signal was undetectable when we used total RNA. This suggests that the cleaved mRNAs are indeed polyadenylated. 

      The term nascent is somewhat ambiguous, but if the reviewer means RNA that is still associated with Pol II and has not yet been cleaved by the cleavage and polyadenylation machinery, we think that is inconsistent with our findings. We have also re-analyzed the NET-seq data from https://pubmed.ncbi.nlm.nih.gov/21248844/ and find no prominent peaks for our Rnt1 sites in Pol II associated RNAs, although for BDF2 NET-seq does suggest that “spliceosome-mediated decay” is co-transcriptional as would be expected. Altogether these data confirm our previous proposal that Rnt1 mainly cleaves mRNAs that have completed polyadenylation but are not yet exported.

      (6) While laboratory strains of budding yeast have a single RNase III ortholog Rnt1, several other budding yeast have a functional RNAi system with Dcr and Ago (PMID 19745116), and laboratory yeast strains are a derived state due to pressure from the killer virus to lose the RNAi system (PMID 21921191). The current study could provide new insight into the relative substrate preferences of Rnt1 and budding yeast Dicer, which could be experimentally confirmed by expressing Dcr in RNT1 and rnt1∆ strains. In lieu of experiments, discussion of the relevance of Rnt1 cleavage compared to yeast RNAi should be included in the discussion before the "human implications" section.

      The reviewer points out that most other eukaryotic species have multiple RNase III family members, which is a general point we discussed and have now expanded on. The reviewer specifically points to papers that study a species that was incorrectly referred to as Saccharomyces castellii in PMID 19745116, but whose current name is Naumovozyma castellii, reflecting that it is not that closely related to S. cerevisiae (diverged about 86 million years ago; for the correct species phylogeny, see http://ygob.ucd.ie/browser/species.html, as both of the published papers the reviewer cites have some errors in the phylogeny). 

      The other species discussed in PMID 19745116 (Vanderwaltozyma polyspora and Candida albicans) are even more distant. There have been several studies on substrate specificity of Dcr1 versus Rnt1 (including PMID 19745116). 

      The reviewer suggests that expressing Dcr1 in S. cerevisiae would be a valuable addition. However, we can’t envision a mechanism by which S. cerevisiae maintained physiologically relevant Dcr1 substrates in the absence of Dcr1. The results from the proposed study would, in our opinion, be limited to identifying RNAs that can be cleaved in this particular artificial system. We think an important implication of our work is that similar studies to ours should be carried out in rnt1∆, dcr1∆, and double mutants in either S. pombe or N. castellii, as well as in drosha knockouts in animals, and we discuss this in more detail in the revised paper. 

      (7) For SNR84 in Figure S3D, it appears that the TSS may be upstream of the annotated gene model. Does RNA-seq coverage (from external datasets) extend upstream to these additional mapped cleavages? The assertion that the mRNA is uncapped is concerning; an alternative explanation is that the nascent mRNA has a cap initially but is subsequently cleaved by Rnt1. This point should be clarified or reworded for accuracy.

      We agree with the reviewer that the most likely explanation is that the primary SNR84 transcript is capped, and 5’ end processed by Rnt1 and Rat1 to make a mature 5’ monophosphorylated SNR84 and have clarified the text accordingly. We suspect our usage of “uncapped” might have been confusing. “uncapped” was not meant to indicate that the primary transcript did not receive a cap, but instead that the mature transcript did not have a cap. We now use “5’ end processed” and “5’ monophosphorylated”. 

      Reviewer #2 (Public review):  

      The yeast double-stranded RNA endonuclease Rnt1, a homolog of bacterial RNase III, mediates the processing of pre-rRNA, pre-snRNA, and pre-snoRNA molecules. Cells lacking Rnt1 exhibit pronounced growth defects, particularly at lower temperatures. In this manuscript, Notice-Sarpaning examines whether these growth defects can be attributed at least in part to a function of Rnt1 in mRNA degradation. To test this, the authors apply parallel analysis of RNA ends (PARE), which they developed in previous work, to identify polyA+ fragments with 5' monophosphates in RNT1 yeast that are absent in rnt1Δ cells. Because such RNAs are substrates for 5' to 3' exonucleolytic decay by Rat1 in the nucleus or Xrn1 in the cytoplasm, these analyses were performed in a rat1-ts xrn1Δ background. The data recapitulate known Rnt1 cleavage sites in rRNA, snRNAs, and snoRNAs, and identify 122 putative novel substrates, approximately half of which are mRNAs. Of these, two-thirds are predicted to contain double-stranded stem loop structures with A/UGNN tetraloops, which serve as a major determinant of Rnt1 substrate recognition. Rnt1 resides in the nucleus, and it likely cleaves mRNAs there, but cleavage products seem to be degraded after export to the cytoplasm, as analysis of published PARE data shows that some of them accumulate in xrn1Δ cells. The authors then leverage the slow growth of rnt1Δ cells for experimental evolution. Sequencing analysis of thirteen faster-growing strains identifies mutations predominantly mapping to genes encoding nuclear exosome co-factors. Some of the strains have mutations in genes encoding a lariat-debranching enzyme, a ribosomal protein nuclear import factor, poly(A) polymerase 1, and the RNA-binding protein Puf4. In one of the puf4 mutant strains, a second mutation is also present in YDR514C, which the authors identify as an mRNA substrate cleaved by Rnt1. Deletion of either puf4 or ydr514C marginally improves the growth of rnt1Δ cells, which the authors interpret as evidence that mRNA cleavage by Rnt1 plays a role in maintaining cellular homeostasis by controlling mRNA turnover. 

      While the PARE data and their subsequent in vitro validation convincingly demonstrate Rnt1-mediated cleavage of a small subset of yeast mRNAs, the data supporting the biological significance of these cleavage events is substantially less compelling. This makes it difficult to establish whether Rnt1-mediated mRNA cleavage is biologically meaningful or simply "collateral damage" due to a coincidental presence of its target motif in these transcripts.

      We thank the reviewer and have added additional data to support our conclusion that mRNA cleavage, at least for YDR514C, is not simply collateral damage, but a physiologically relevant function of Rnt1. From an evolutionary perspective, cleavage of mRNAs by Rnt1 might have initially been collateral damage, but if there is a way to use this mechanism, evolution is probably going to use it.

      (1) A major argument in support of the claim that "several mRNAs rely heavily on Rnt1 for turnover" comes from comparing number of PARE reads at the transcript start site (as a proxy for fraction of decapped transcripts) and at the Rnt1 cleavage site (as a proxy for fraction of Rnt1-cleaved transcripts). The argument for this is that "the major mRNA degradation pathway is through decapping". However, polyA tail shortening usually precedes decapping, and transcripts with short polyA tails would be strongly underrepresented in PARE sequencing libraries, which were constructed after two rounds of polyA+ RNA selection. This will likely underestimate the fraction of decapped transcripts for each mRNA. There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells. Because the PARE data rely on the presence of a 5' phosphate to generate sequencing reads, they also cannot be used to estimate what fraction of a given mRNA transcript is actually cleaved by Rnt1. 

      The reviewer is correct that decapping preferentially affects mRNAs with shortened poly(A) tails, that Rnt1 cleavage likely affects mostly newly made mRNAs with long poly(A) tails, and that PARE may underestimate the decay of mRNAs with shortened poly(A) tails. We have reanalyzed our previously published data where we performed PARE on both the poly(A)-enriched fraction and the poly(A)-depleted fraction (that remains after two rounds of oligo dT selection). Rnt1 products are over-represented in the poly(A)-enriched fraction, while decapping products are enriched in the poly(A)-depleted fraction, providing further support to our conclusion that Rnt1 cleaves nuclear RNA. We have re-written key sections of the paper accordingly.

      The reviewer also points out that “There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells.” However, all of those methods measure mRNA degradation rates from the steady state pool, which is mostly cytoplasmic. We have, in different contexts, used these methods, but as we pointed out they are inappropriate to measure degradation of nuclear RNA. There are some studies that measure nuclear degradation rates, but this requires purifying nuclei. There are two major drawbacks to this. First, it cannot distinguish between degradation in the nucleus and export from the nucleus because both processes cause disappearance from the nucleus. Second, the purification of yeast nuclei requires “spheroplasting” or enzymatically removing the rigid cell wall. This spheroplasting is likely to severely alter the physiological state of the yeast cell. Given these significant drawbacks and the substantial time and money required, we chose not to perform this experiment.  

      (2) Rnt1 is almost exclusively nuclear, and the authors make a compelling case that its concentration in the cytoplasm would likely be too low to result in mRNA cleavage. The model for Rnt1-mediated mRNA turnover would therefore require mRNAs to be cleaved prior to their nuclear export in a manner that would be difficult to control. Alternatively, the Rnt1 targets would need to re-enter prior to cleavage, followed by export of the cleaved fragments for cytoplasmic decay. These processes would need to be able to compete with canonical 5' to 3' and 3' to 5' exonucleolytic decay to influence mRNA fate in a biologically meaningful way.

      We disagree that mRNA export would be difficult to control, as is elegantly demonstrated by the 13 kDa HIV Rev protein. The export of many other RNAs is tightly controlled such that many RNAs are rapidly degraded in the nucleus by, for example, Rat1 and the RNA exosome, while other RNAs are rapidly exported. Indeed, the competition between RNA export and nuclear degradation is generally thought to be an important quality control for a variety of mRNAs and ncRNAs. We do agree with the reviewer that re-import of mRNAs appears unlikely (which is why we do not discuss it), although it occurs efficiently for other Rnt1-cleaved RNAs such as snRNAs. We have clarified the text accordingly, including in the introduction, results, and discussion. 

      (3) The experimental evolution clearly demonstrates that mutations in nuclear exosome factors are the most frequent suppressors of the growth defects caused by Rnt1 loss. This can be rationalized by stabilization of nuclear exosome substrates such as misprocessed snRNAs or snoRNAs, which are the major targets of Rnt1. The rescue mutations in other pathways linked to ribosomal proteins (splicing, ribosomal protein import, ribosomal mRNA binding) support this interpretation. By contrast, the potential suppressor mutation in YDR514C does not occur on its own but only in combination with a puf4 mutation; it is also unclear whether it is located within the Rnt1 cleavage motif or if it impacts Rnt1 cleavage at all. This can easily be tested by engineering the mutation into the endogenous YDR514C locus with CRISPR/Cas9 or expressing wild-type and mutant YDR514C from a plasmid, along with assaying for Rnt1 cleavage by northern blot. Notably, the growth defect complementation of YDR514C deletion in rnt1Δ cells is substantially less pronounced than the growth advantage afforded by nuclear exosome mutations (Figure S9, evolved strains 1 to 5). These data rather argue for a primary role of Rnt1 in promoting cell growth by ensuring efficient ribosome biogenesis through pre-snRNA/pre-snoRNA processing. 

      The reviewer makes several points. 

      First, we have clarified that the ydr514c-G220S mutation is not near the Rnt1 cleavage motif and is unlikely to affect cleavage by Rnt1. This is exactly what would be expected for a mutation that was selected for in an rnt1∆ strain. Although the reviewer appears to expect it, a mutation that affects Rnt1 cleavage could not be selected for in a strain that lacks Rnt1.

      Second, the reviewer points out that the original ydr514c mutations arose in a strain that also had a puf4 deletion. However, we show that ydr514c∆ also suppresses rnt1∆. Furthermore, we have added additional data that overexpressing an uncleavable YDR514C mRNA affects yeast growth at 37 °C more than the wild-type cleavable form, further supporting that the cleavage of YDR514C by Rnt1 is physiologically relevant. 

      Reviewer #2 (Recommendations for the authors): 

      (1) The description of the PARE library construction protocol and data analysis workflow is insufficient to ensure their robustness and reproducibility. The library construction protocol should include details of the individual steps, and the data analysis workflow description should include package versions and exact commands used for each analysis step.

      We have clarified that the experiments were performed exactly as previously described and have included very detailed methods. The Galaxy server does not require commands and instead we have indicated the parameters chosen in the various steps. We have also added that the PARE libraries for poly(A)+ and poly(A)- fractions were generated in the lab of Pam Green according to their protocol, which is not exactly the same as ours. Nevertheless, the Rnt1 sites are also evident from those libraries, further demonstrating the robustness of our data. 

      (2) PARE signal is expressed as a ratio of sequencing coverage at a given nucleotide in RNT1 vs rnt1Δ cells. This poses challenges to estimating fold changes: by definition, there should be no coverage at Rnt1 cleavage sites in rnt1Δ cells, as there will not be any 5' monophosphate-containing mRNA fragments to be ligated to the library construction linker. This should be accounted for in the data analysis pipeline - the DESeq2 package, for example, handles this very well (https://support.bioconductor.org/p/64014/).

      The reviewer is correct and we have clarified how we do account for the possibility of having 0 reads by adding an arbitrary 0.01 cpm to all PARE scores for wild type and mutant. In the original manuscript this was not explicitly mentioned and the reader would have to go to our previous paper to learn about this detail. Adding this 0.01 cpm pseudocount avoids dividing by 0 when we calculate a comPARE score. This means we actually underestimate the fold change. As can be seen in the red line in the image below, the y-axis modified log2FC score maxes out along a diagonal line at log2([average RNT1 reads]/0.01) instead of at infinity. That is, at a wild type peak height of 1 cpm, the maximum possible score is log2(1.01/0.01), which equals 6.66, and at 10 cpm, the maximum score is ~10, etc. As can be seen, many of the scores fall along this diagonal, reflecting that indeed, there are 0 reads in the rnt1∆ samples.

      Author response image 1.
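
      The pseudocount arithmetic described above can be reproduced with a minimal sketch. It assumes that the score is simply the log2 ratio of the pseudocounted average cpm values; the function names `modified_log2fc` and `max_possible_score` are hypothetical and are not taken from the authors' pipeline. Only the 0.01 cpm pseudocount and the two worked values (6.66 at 1 cpm, ~10 at 10 cpm) come from the response.

```python
import math

# Pseudocount stated in the response (0.01 cpm added to wild type and mutant).
PSEUDOCOUNT_CPM = 0.01


def modified_log2fc(wt_cpm: float, mutant_cpm: float,
                    pseudocount: float = PSEUDOCOUNT_CPM) -> float:
    """Log2 fold change of wild type over mutant after adding a pseudocount.

    Assumed form of the score; adding the pseudocount to both values
    avoids division by zero when the mutant has no reads at a site.
    """
    return math.log2((wt_cpm + pseudocount) / (mutant_cpm + pseudocount))


def max_possible_score(wt_cpm: float,
                       pseudocount: float = PSEUDOCOUNT_CPM) -> float:
    """Upper bound on the score, reached when the mutant has 0 reads."""
    return modified_log2fc(wt_cpm, 0.0, pseudocount)


if __name__ == "__main__":
    for peak in (1.0, 10.0):
        print(f"{peak:>5} cpm -> max score {max_possible_score(peak):.2f}")
```

      Under this assumed form, the ceiling grows with wild-type peak height, which is why sites with 0 reads in the mutant line up along a diagonal rather than at infinity.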

      There are multiple ways to deal with this issue, and ours is not uncommon. DESeq2, suggested by the reviewer, uses a different method, which relies on the assumption that the dispersion of read counts for genes of any given expression strength is constant, and then uses that dispersion to “correct” the 0 read counts. While this is a valid way for differential gene expression when comparing similar RNAs, the underlying assumption that the dispersion of expression of all genes is similar for similar expression level is questionable for comparing, for example, mRNAs, snoRNAs, and snRNAs. Thus, we are not convinced that this is a better way to deal with 0 counts. Our analysis accepts that 0 might be the best estimate for the number of counts that are expected from rnt1∆ samples. 

      (3) The analysis in Figure S8 is insufficient to demonstrate that the four mRNAs depicted are significantly more abundant in rnt1Δ vs RNT1 cells - differences in coverage could simply be a result of different sequencing depth. Please use an appropriate method for estimating differential expression from RNA-Seq data (e.g., DESeq2). 

      Unfortunately, the previously published data we included as figure S8 (now figure S9) did not include replicates, and we agree that it does not rigorously show an effect. The reviewer suggests that we analyze the data by DESeq2, which requires replicates, and thus, cannot be done. Instead we have clarified this. If the reviewer is not satisfied with this, we are prepared to delete it.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      Overall, the manuscript reveals the role of actin polymerization to drive the fusion of myoblasts during adult muscle regeneration. This pathway regulates fusion in many contexts, but whether it was conserved in adult muscle regeneration remained unknown. Robust genetic tools and histological analyses were used to support the claims convincingly. 

      We very much appreciate the positive comments from this Reviewer.

      There are a few interpretations that could be adjusted. 

      The beginning of the results about macrophages traversing ghost fibers after regeneration was a surprise given the context in the abstract and introduction. These results also lead to new questions about this biology that would need to be answered to substantiate the claims in this section. Also, it is unclear the precise new information learned here because it seems obvious that macrophages would need to extravasate the basement membrane to enter ghost fibers and macrophages are known to have this ability. Moreover, the model in Figure 4D has macrophages and BM but there is not even mention of this in the legend. The authors may wish to consider removing this topic from the manuscript. 

      We appreciate this comment and acknowledge that the precise behavior of macrophages when they infiltrate and/or exit the ghost fibers during muscle regeneration is not the major focus of this study. However, we think that visualizing macrophages squeezing through tiny openings on the basement membrane to infiltrate and/or exit from the ghost fibers is valuable. Thus, we have moved the data from the original main Figure 2 to the new Figure S1. 

      Regarding the model in Figure 4D, we have removed the macrophages because the depicted model represents a stage after the macrophages’ exit from the ghost fiber. 

      Which Pax7CreER line was used? In the methods, the Jax number provided is the Gaka line but in the results, Lepper et al 2009 are cited, which is not the citation for the Gaka line. 

      The Pax7<sup>CreER</sup> line used in this study is the one generated in Lepper et al. 2009. We corrected this information in “Material and Methods” of the revised manuscript. 

      Did the authors assess regeneration in the floxed mice that do not contain Cre as a control? Or is it known these alleles do not perturb the function of the targeted gene? 

      We examined muscle regeneration in the floxed mice without Cre. As shown in Figure 1 below, none of the homozygous ArpC2<sup>fl/fl</sup>, N-WASP<sup>fl/fl</sup>, CYFIP1<sup>fl/fl</sup> or N-WASP<sup>fl/fl</sup>;CYFIP1<sup>fl/fl</sup> alleles affected muscle regeneration, indicating that these alleles do not perturb the function of the targeted gene.  

      Author response image 1.

      Muscle regeneration was normal in mice with only floxed target gene(s). Cross sections of TA muscles were stained with anti-Dystrophin and DAPI at dpi 14. n = 3 mice of each genotype, and > 80 ghost fibers in each mouse were examined. Mean ± s.d. values are shown in the dot-bar plot, and significance was determined by two-tailed Student’s t-test. ns: not significant. Scale bar: 100 μm.

      The authors comment: 'Interestingly, expression of the fusogenic proteins, MymK and MymX, was up-regulated in the TA muscle of these mice (Figure S4F), suggesting that fusogen overexpression is not able to rescue the SCM fusion defect resulted from defective branched actin polymerization.' It is unclear if fusogens are truly overexpressed because the analysis is performed at dpi 4 when the expression of fusogens may be decreased in control mice because they have already fused. Also, only two animals were analyzed and it is unclear if MymX is definitively increased. The authors should consider adjusting the interpretation to SCM fusion defect resulting from defective branched actin polymerization is unlikely to be caused by a lack of fusogen expression. 

      We agree with the Reviewer that fusogen expression may simply persist till later time points in fusion mutants without being up-regulated. We have modified our interpretation according to the Reviewer’s suggestion. 

      Regarding the western blots in the original Figure S4F, we now show one experiment from each genotype, and include the quantification of MymK and MymX protein levels from 3 animals in the revised manuscript (new Figure S5F-S5H). 

      Reviewer #1 (Recommendations for the authors): 

      (1) The ArpC2 cKO data could be presented in a clearer fashion. In the text, ArpC2 is discussed but in the figure, there are many other KOs presented and ArpC2 is the fourth one shown in the figure. The other KOs are discussed later. It may be worthwhile for the authors to rearrange the figures to make it easier for readers. 

      Thank you for this suggestion. We have rearranged the genotypes in the figures accordingly and placed ArpC2 cKO first. 

      The authors comment: 'Since SCM fusion is mostly completed at dpi 4.5 (Figure 1B) (Collins et al. 2024)'. This is not an accurate statement of the cited paper. While myofibers are formed by dpi 4.5 with centralized nuclei, there are additional fusion events through at least 21dpi. The authors should adjust their statement to better reflect the data in Collins et al 2024, which could include mentioning that primary fusions could be completed at dpi 4.5 and this is the process they are studying. 

      We have adjusted our statement accordingly in the revised manuscript.

      The authors comment: 'Consistent with this, the frequency distribution of SCM number per ghost fiber displayed a dramatic shift toward higher numbers in the ArpC2<sup>cKO</sup> mice (Figure S5C). These results indicate that the actin cytoskeleton plays an essential role in SCM fusion as the fusogenic proteins.' Should it read 'These results indicate that the actin cytoskeleton plays AS an essential role in SCM fusion as the fusogenic proteins'? 

      Yes, and we adjusted this statement accordingly in the revised manuscript. 

      Minor comments 

      (1) In the results the authors state 'To induce genetic deletion of ArpC2 in satellites....'; 'satellites' is a term not typically used for satellite cells. 

      Thanks for catching this. We changed “satellites” to satellite cells.

      (2) In the next sentence, the satellite should be capitalized. 

      Done.

      (3) The cross-section area should be a 'cross-sectional area'. 

      Changed.

      Reviewer #2 (Public review):

      To fuse, differentiated muscle cells must rearrange their cytoskeleton and assemble actin-enriched cytoskeletal structures. These actin foci are proposed to generate mechanical forces necessary to drive close membrane apposition and fusion pore formation. 

      While the study of these actin-rich structures has been conducted mainly in Drosophila, the present manuscript presents clear evidence that this mechanism is necessary for the fusion of adult muscle stem cells in vivo, in mice. 

      We thank this Reviewer for the positive comment.

      However, the authors need to tone down their interpretation of their findings and remember that genetic proof for cytoskeletal actin remodeling to allow muscle fusion in mice has already been provided by different labs (Vasyutina E, et al. 2009 PMID: 19443691; Gruenbaum-Cohen Y, et al., 2012 PMID: 22736793; Hamoud et al., 2014 PMID: 24567399). In the same line of thought, the authors write they "demonstrated a critical function of branched actin-propelled invasive protrusions in skeletal muscle regeneration". I believe this is not a premiere, since Randrianarison-Huetz V, et al., previously reported the existence of finger-like actin-based protrusions at fusion sites in mice myoblasts (PMID: 2926942) and Eigler T, et al., live-recorded said "fusogenic synapse" in mice myoblasts (PMID: 34932950). Hence, while the data presented here clearly demonstrate that ARP2/3 and SCAR/WAVE complexes are required for differentiating satellite cell fusion into multinucleated myotubes, this is an incremental story, and the authors should put their results in the context of previous literature. 

      In this study, we focused on elucidating the mechanisms of myoblast fusion during skeletal muscle regeneration, which remained largely unknown. Thus, we respectfully disagree with this Reviewer that “this is an incremental story” for the following reasons – 

      First, while we agree with this Reviewer that “genetic proof for cytoskeletal actin remodeling to allow muscle fusion in mice has already been provided by different labs”, most of the previous genetic studies, including ours (Lu et al. 2024), characterizing the roles of actin regulators (Elmo, Dock180, Rac, Cdc42, WASP, WIP, WAVE, Arp2/3) in mouse myoblast fusion were conducted during embryogenesis (Laurin et al. 2008; Vasyutina et al. 2009; Gruenbaum-Cohen et al. 2012; Tran et al. 2022; Lu et al. 2024), instead of during adult muscle regeneration, the latter of which is the focus of this study. 

      Second, prior to this study, several groups tested the roles of SRF, CaMKII theta and gamma, Myo10, and Elmo, which affect actin cytoskeletal dynamics, in muscle regeneration. These studies have shown that knocking out SRF, CaMKII, Myo10, or Elmo caused defects in mouse muscle regeneration, based on measuring the cross-sectional diameters of regenerated myofibers only (Randrianarison-Huetz et al. 2018; Eigler et al. 2021; Hammers et al. 2021; Tran et al. 2022). However, none of these studies visualized myoblast fusion at the cellular and subcellular levels during muscle regeneration in vivo. For this reason, it remained unclear whether the muscle regeneration defects in these mutants were indeed due to defects in myoblast fusion, in particular, defects in the formation of invasive protrusions at the fusogenic synapse. Thus, the previous studies did not demonstrate a direct role for the actin cytoskeleton, as well as the underlying mechanisms, in myoblast fusion during muscle regeneration in vivo.

      Third, regarding actin-propelled invasive protrusions at the fusogenic synapse, our previous study (Lu et al. 2024) revealed these structures by fluorescent live cell imaging and electron microscopy (EM) in cultured muscle cells, as well as EM studies in mouse embryonic limb muscle, firmly establishing a direct role for invasive protrusions in mouse myoblast fusion in cultured muscle cells and during embryonic development. Randrianarison-Huetz et al. (2018) reported the existence of finger-like actin-based protrusions at cell contact sites of cultured mouse myoblasts. It was unclear from their study, however, if these protrusions were at the actual fusion sites and if they were invasive (Randrianarison-Huetz et al. 2018). Eigler et al. (2021) reported protrusions at fusogenic synapse in cultured mouse myoblasts. It was unclear from their study, however, if the protrusions were actin-based and if they were invasive (Eigler et al. 2021). Neither Randrianarison-Huetz et al. (2018) nor Eigler et al. (2021) characterized protrusions in developing mouse embryos or regenerating adult muscle. 

      Taken together, to our knowledge, this is the first study to characterize myoblast fusion at the cellular and subcellular level during mouse muscle regeneration. We demonstrate that branched actin polymerization promotes invasive protrusion formation and myoblast fusion during the regeneration process. We believe that this work has laid the foundation for additional mechanistic studies of myoblast fusion during skeletal muscle regeneration.

      The citations in the original manuscript were primarily focused on previous in vivo studies of Arp2/3 and the actin nucleation-promoting factors (NPFs), N-WASP and WAVE (Richardson et al. 2007; Gruenbaum-Cohen et al. 2012), and of invasive protrusions mediating myoblast fusion in intact animals (Drosophila, zebrafish and mice) (Sens et al. 2010; Luo et al. 2022; Lu et al. 2024). We agree with this reviewer, however, that it would be beneficial to the readers if we provide a more comprehensive summary of previous literature, including studies of both intact animals and cultured cells, as well as studies of additional actin regulators upstream of the NPFs, such as small GTPases and their GEFs. Thus, we have significantly expanded our Introduction to include these studies and cited the corresponding literature in the revised manuscript.

      Reviewer #2 (Recommendations for the authors): 

      (1) I am concerned that the authors did not evaluate the efficiency of target allele deletion following Pax7-CreER activation. The majority, if not all, of the published work using this genetic strategy presents the knockdown efficiency by genotyping PCR, immunolocalization, western blot, etc.

      (2) Can the authors provide evidence that the N-WASP, CYFIP1, and ARPC2 proteins are depleted in TAM-treated tissue? Alternatively, can the authors perform RT-qPCR on freshly isolated MuSCs to validate the absence of N-WASP, CYFIP1, and ARPC2 mRNA expression?

      Thank you for these comments. We have assessed the target allele deletion efficiency with isolated satellite cells from TAM-injected mice in which Pax7-CreER is activated. Western blot analyses showed that the protein levels of N-WASP, CYFIP1, and ArpC2 significantly decreased in the satellite cells of knockout mice. Please see the new Figure S2.

      Reviewer #3 (Public review): 

      The manuscript by Lu et al. explores the role of the Arp2/3 complex and the actin nucleators N-WASP and WAVE in myoblast fusion during muscle regeneration. The results are clear and compelling, effectively supporting the main claims of the study. However, the manuscript could benefit from a more detailed molecular and cellular analysis of the fusion synapse. Additionally, while the description of macrophage extravasation from ghost fibers is intriguing, it seems somewhat disconnected from the primary focus of the work.

      Despite this, the data are robust, and the major conclusions are well supported. The mechanism of muscle fusion is still a largely unexplored topic in the field, and the authors make important progress in this domain.

      We appreciate the positive comments from this Reviewer.

      We agree with this Reviewer and Reviewer #1 that the macrophage study is not the primary focus of the work. However, we think that visualizing macrophages squeezing through tiny openings on the basement membrane to infiltrate and/or exit from the ghost fibers is valuable. Thus, we have moved the data from the original main Figure 2 to the new Figure S1. 

      I have a few suggestions that might strengthen the manuscript as outlined below.  

      (1) Could the authors provide more detail on how they defined cells with "invasive protrusions" in Figure 4C? Membrane blebs are commonly observed in contacting cells, so it would be important to clarify the criteria used for counting this specific event. 

      Thanks for this suggestion. We define invasive protrusions as finger-like protrusions projected by a cell into its fusion partner. Based on our previous studies (Sens et al. 2010; Luo et al. 2022; Lu et al. 2024), these invasive protrusions are narrow (with 100-250 nm diameters) and propelled by mechanically stiff actin bundles. In contrast, membrane blebs are spherical protrusions formed by the detachment of the plasma membrane from the underlying actin cytoskeleton. In general, the blebs are not as mechanically stiff as invasive protrusions and would not be able to project into neighboring cells. Thus, we do not think that the protrusions in Figure 4B are membrane blebs. We clarified the criteria in the text and figure legends of the revised manuscript.

      (2) Along the same line, please clarify what each individual dot represents in Figure 4C. The authors mention quantifying approximately 83 SCMs from 20 fibers. I assume each dot corresponds to data from individual fibers, but if that's the case, does this imply that only around four SCMs were quantified per fiber? A more detailed explanation would be helpful. 

      To quantitatively assess invasive protrusions in Ctrl and mutant mice, we analyzed 20 randomly selected ghost fibers per genotype. Within each ghost fiber, we examined randomly selected SCMs in a single cross section (a total of 83, 147 and 93 SCMs in Ctrl, ArpC2<sup>cKO</sup> and MymX<sup>cKO</sup> mice were examined, respectively). 

      In Figure 4C, each dot was intended to represent the percentage of SCMs with invasive protrusions in a single cross section of a ghost fiber. However, we mistakenly inserted a wrong graph in the original Figure 4C. We sincerely apologize for this error and have replaced it with the correct graph in the new Figure 4C.
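The per-fiber quantification described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names and the per-fiber counts in the usage example are hypothetical, and only the group totals (83, 147 and 93 SCMs across 20 ghost fibers per genotype) come from the response.

```python
from typing import List, Tuple

# Totals stated in the response: SCMs examined across 20 ghost fibers per genotype.
N_FIBERS = 20
TOTAL_SCMS = {"Ctrl": 83, "ArpC2_cKO": 147, "MymX_cKO": 93}


def mean_scms_per_fiber(total_scms: int, n_fibers: int = N_FIBERS) -> float:
    """Average number of SCMs examined in one ghost-fiber cross section."""
    if n_fibers <= 0:
        raise ValueError("n_fibers must be positive")
    return total_scms / n_fibers


def percent_with_protrusions(counts: List[Tuple[int, int]]) -> List[float]:
    """One value per ghost fiber (one 'dot' in the plot).

    counts holds (SCMs with invasive protrusions, SCMs examined) for each
    ghost-fiber cross section; fibers with no examined SCMs are skipped.
    """
    return [100.0 * with_p / examined for with_p, examined in counts if examined > 0]


if __name__ == "__main__":
    for genotype, total in TOTAL_SCMS.items():
        print(genotype, round(mean_scms_per_fiber(total), 2))
    # Hypothetical counts for three fibers, for illustration only.
    print(percent_with_protrusions([(1, 4), (2, 5), (0, 3)]))
```

With the stated totals, the average works out to roughly four to seven SCMs per cross section, which matches the reviewer's estimate of about four SCMs per fiber in the control group.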

      (3) Localizing ArpC2 at the invasive protrusions would be a strong addition to this study. Furthermore, have the authors examined the localization of Myomaker and Myomixer in ArpC2 mutant cells? This could provide insights into potential disruptions in the fusion machinery.

      We have examined the localization of the Arp2/3 complex on the invasive protrusions in cultured SCMs and included the data in Figure 4A of the original manuscript. Specifically, we showed enrichment of mNeonGreen-tagged Arp2, a subunit of the Arp2/3 complex, on the invasive protrusions at the fusogenic synapse of cultured SCMs (see the enlarged panels on the right; also see supplemental video 4). The small size of the invasive protrusions on SCMs prevented a detailed analysis of the precise Arp2 localization along the protrusions. Please see our recently published paper (Lu et al. 2024) for the detailed localization and function of the Arp2/3 complex during invasive protrusion formation in cultured C2C12 cells.

      We have also attempted to localize the Arp2/3 complex in the regenerating muscle in vivo using an anti-ArpC2 antibody (Millipore, 07-227-I), which has been used in many studies to visualize the Arp2/3 complex in cultured cells. Unfortunately, the antibody detected non-specific signals in the regenerating TA muscle of the ArpC2<sup>cKO</sup> animals. Thus, it cannot be used to detect specific ArpC2 signals in muscle tissues. Besides the specificity issue of the antibody, it is technically challenging to visualize invasive protrusions with an F-actin probe at the fusogenic synapses of regenerating muscle by light microscopy, due to the high background F-actin signal within the muscle cells.

      Regarding the fusogens, we show that both are present in the TA muscle of the ArpC2<sup>cKO</sup> animals by western blot (Figure S5F-S5H). Thus, the fusion defect in these animals is not due to the lack of fusogen expression. Since the focus of this study is on the role of the actin cytoskeleton in muscle regeneration, the subcellular localization of the fusogens was not investigated in the current study. 

      (4) As a minor curiosity, can ArpC2 WT and mutant cells fuse with each other?

      Our previous work in Drosophila embryos showed that Arp2/3-mediated branched actin polymerization is required in both the invading and receiving fusion partners (Sens et al. 2010). To address this question in mouse muscle cells, we co-cultured GFP<sup>+</sup> WT cells with mScarleti<sup>+</sup> WT cells (or mScarleti<sup>+</sup> ArpC2<sup>cKO</sup> cells) in vitro and assessed their ability to fuse with one another. We found that ArpC2<sup>cKO</sup> cells could barely fuse with WT cells (new Figure 3F and 3G), indicating that Arp2/3-mediated branched actin polymerization is required in both fusion partners. This result is consistent with our findings in Drosophila embryos.

      (5) The authors report a strong reduction in CSA at 14 dpi and 28 dpi, attributing this defect primarily to failed myoblast fusion. Although this claim is supported by observations at early time points, I wonder whether the Arp2/3 complex might also play roles in myofibers after fusion. For instance, Arp2/3 could be required for the growth or maintenance of healthy myofibers, which could also contribute to the reduced CSA observed, since regenerated myofibers inherit the ArpC2 knockout from the stem cells. Could the authors address or exclude this possibility? This is rather a broader criticism of how things are being interpreted in general beyond this paper. 

      This is an interesting question. It is possible that Arp2/3 plays a role in the growth or maintenance of healthy myofibers. However, the muscle injury and regeneration process may not be the best system to address this question because of the indispensable early step of myoblast fusion. Ideally, one would knock out Arp2/3 in myofibers of young healthy mice, observe fiber growth in the absence of muscle injury, and compare it to that of wild-type littermates. Since these experiments are beyond the scope of this study, we revised our conclusion to state that the fusion defect in ArpC2<sup>cKO</sup> mice should account, at least in part, for the strong reduction in CSA at 14 dpi and 28 dpi, without excluding additional possibilities such as Arp2/3’s potential role in the growth or maintenance of healthy myofibers.

      References:

      Eigler T, Zarfati G, Amzallag E, Sinha S, Segev N, Zabary Y, Zaritsky A, Shakked A, Umansky KB, Schejter ED et al. 2021. ERK1/2 inhibition promotes robust myotube growth via CaMKII activation resulting in myoblast-to-myotube fusion. Dev Cell 56: 3349-3363 e3346.

      Gruenbaum-Cohen Y, Harel I, Umansky KB, Tzahor E, Snapper SB, Shilo BZ, Schejter ED. 2012. The actin regulator N-WASp is required for muscle-cell fusion in mice. Proc Natl Acad Sci U S A 109: 11211-11216.

      Hammers DW, Hart CC, Matheny MK, Heimsath EG, Lee YI, Hammer JA, 3rd, Cheney RE, Sweeney HL. 2021. Filopodia powered by class x myosin promote fusion of mammalian myoblasts. Elife 10.

      Laurin M, Fradet N, Blangy A, Hall A, Vuori K, Cote JF. 2008. The atypical Rac activator Dock180 (Dock1) regulates myoblast fusion in vivo. Proc Natl Acad Sci U S A 105: 15446-15451.

      Lu Y, Walji T, Ravaux B, Pandey P, Yang C, Li B, Luvsanjav D, Lam KH, Zhang R, Luo Z et al. 2024. Spatiotemporal coordination of actin regulators generates invasive protrusions in cell-cell fusion. Nat Cell Biol 26: 1860-1877.

      Luo Z, Shi J, Pandey P, Ruan ZR, Sevdali M, Bu Y, Lu Y, Du S, Chen EH. 2022. The cellular architecture and molecular determinants of the zebrafish fusogenic synapse. Dev Cell 57: 1582-1597 e1586.

      Randrianarison-Huetz V, Papaefthymiou A, Herledan G, Noviello C, Faradova U, Collard L, Pincini A, Schol E, Decaux JF, Maire P et al. 2018. Srf controls satellite cell fusion through the maintenance of actin architecture. J Cell Biol 217: 685-700.

      Richardson BE, Beckett K, Nowak SJ, Baylies MK. 2007. SCAR/WAVE and Arp2/3 are crucial for cytoskeletal remodeling at the site of myoblast fusion. Development 134: 4357-4367.

      Sens KL, Zhang S, Jin P, Duan R, Zhang G, Luo F, Parachini L, Chen EH. 2010. An invasive podosome-like structure promotes fusion pore formation during myoblast fusion. J Cell Biol 191: 1013-1027.

      Tran V, Nahle S, Robert A, Desanlis I, Killoran R, Ehresmann S, Thibault MP, Barford D, Ravichandran KS, Sauvageau M et al. 2022. Biasing the conformation of ELMO2 reveals that myoblast fusion can be exploited to improve muscle regeneration. Nat Commun 13: 7077.

      Vasyutina E, Martarelli B, Brakebusch C, Wende H, Birchmeier C. 2009. The small G-proteins Rac1 and Cdc42 are essential for myoblast fusion in the mouse. Proc Natl Acad Sci U S A 106: 8935-8940.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      EnvA-pseudotyped glycoprotein-deleted rabies virus has emerged as an essential tool for tracing monosynaptic inputs to genetically defined neuron populations in the mammalian brain. Recently, in addition to the SAD B19 rabies virus strain first described by Callaway and colleagues in 2007, the CVS N2c rabies virus strain has become popular due to its low toxicity and high trans-synaptic transfer efficiency. However, despite its widespread use in the mammalian brain, particularly in mice, the application of this cell-type-specific monosynaptic rabies tracing system in zebrafish has been limited by low labeling efficiency and high toxicity. In this manuscript, the authors aimed to develop an efficient retrograde monosynaptic rabies-mediated circuit mapping tool for larval zebrafish. Given the translucent nature of larval zebrafish, whole-brain neuronal activities can be monitored, perturbed, and recorded over time. Introducing a robust circuit mapping tool for larval zebrafish would enable researchers to simultaneously investigate the structure and function of neural circuits, which would be of significant interest to the neural circuit research community. Furthermore, the ability to track rabies-labeled cells over time in the transparent brain could enhance our understanding of the trans-synaptic retrograde tracing mechanism of the rabies virus. 

      To establish an efficient rabies virus tracing system in the larval zebrafish brain, the authors conducted meticulous side-by-side experiments to determine the optimal combination of trans-expressed rabies G proteins, TVA receptors, and recombinant rabies virus strains. Consistent with observations in the mouse brain, the CVS N2c strain trans-complemented with N2cG was found to be superior to the SAD B19 combination, offering lower toxicity and higher efficiency in labeling presynaptic neurons. Additionally, the authors tested various temperatures for the larvae post-virus injection and identified 36℃ as the optimal temperature for improved virus labeling. They then validated the system in the cerebellar circuits, noting evolutionary conservation in the cerebellar structure between zebrafish and mammals. The monosynaptic inputs to Purkinje cells from granule cells were neatly confirmed through ablation experiments.

      However, there are a couple of issues that this study should address. Additionally, conducting some extra experiments could provide valuable information to the broader research field utilizing recombinant rabies viruses as retrograde tracers.

      (1) It was observed that many radial glia were labeled, which casts doubt on the specificity of trans-synaptic spread between neurons. The issues of transneuronal labeling of glial cells should be addressed and discussed in more detail. In this manuscript, the authors used a transgenic zebrafish line carrying a neuron-specific Cre-dependent reporter and EnvA-CVS N2c(dG)-Cre virus to avoid the visualization of virally infected glial cells. However, this does not solve the real issue of glial cell labeling and the possibility of a nonsynaptic spread mechanism.

      In agreement with the reviewer’s suggestion, we have incorporated a standalone section in the revised Discussion (page 9) to address the issue of transneuronal glial labeling, including its spatial distribution, temporal dynamics, potential mechanisms, and possible strategies for resolving it.

      Regarding the specificity of trans-synaptic spread between neurons, we have demonstrated that our transsynaptic tracing system reliably and specifically labels input neurons. Structurally, we only observed labeling of inferior olivary cells (IOCs) outside the cerebellum, which are the only known extracerebellar inputs to Purkinje cells (PCs), while all other traced neurons remained confined within the cerebellum throughout the observation period (see Figure 2G–I). Functionally, we verified that the traced neurons formed synaptic connections with the starter PCs (see Figure 2J–M). Together, these findings support the conclusion that our system enables robust and specific retrograde monosynaptic tracing of neurons in larval zebrafish.

      Regarding the transneuronal labeling of radial glia cells, we observed that their distribution closely correlates with the location of neuronal somata and dendrites (see Author response image 2). In zebrafish, radial glial cells are considered functional analogs of astrocytes and are often referred to as radial astroglia. The adjacent labeled astroglia may participate in tripartite synapses with the starter neurons and express viral receptors that enable RV particle entry at postsynaptic sites. This suggests that rabies-based tracing in zebrafish may serve as a valuable tool for identifying synaptically associated and functionally connected glia. Leveraging this approach to investigate glia–neuron interactions represents a promising direction for future research.

      In our system, the glial labeling diminishes at later larval stages, likely due to abortive infection (see Author response image 3 and relevant response). However, the eventual clearance of infection does not preclude the initial infection of glial cells, which may compete with neuronal labeling and reduce overall tracing efficiency. Notably, transneuronal infection of glial cells by RV has also been observed in mammals (Marshel et al., 2010). To minimize such off-target labeling, future work should focus on elucidating the mechanisms underlying glial susceptibility, such as receptor-mediated viral entry, and developing strategies to suppress receptor expression specifically in glia, thereby improving the specificity and efficiency of neuronal circuit tracing.

      In addition, wrong citations in Line 307 were made when referring to previous studies discovering the same issue of RVdG-based transneuronal labeling radial glial cells. "The RVdG-based transneuronal labeling of radial glial cells was commonly observed in larval zebrafish29,30".

      The cited work was conducted using vesicular stomatitis virus (VSV). A more thorough analysis and/or discussion on this topic should be included.

      We thank the reviewer for pointing out the citation inaccuracy. The referenced study employed vesicular stomatitis virus (VSV), which, like RV, is a member of the Rhabdoviridae family. We have revised the text accordingly, from "RVdG-based transneuronal labeling of radial glial cells…" to "Transneuronal labeling of radial glial cells mediated by VSV, a member of the Rhabdoviridae family like RV, has been commonly observed in larval zebrafish" (page 9, line 347).

      Several key questions should be addressed:

      Does the number of labeled glial cells increase over time? 

      Yes, as shown in Figure 2—figure supplement 1C and G, the number of labeled radial glial cells significantly increased from 2 to 6 days post-injection (dpi). This phenomenon has been addressed in the revised Discussion section (page 9, line 357).

      Do they increase at the same rate over time as labeled neurons?

      Although glial cell labeling continued to increase over time, we observed a slowdown in labeling rate between 6 and 10 dpi, as shown in Figure 2—figure supplement 1C and G. Therefore, we divided the timeline into two intervals (2–6 and 6–10 dpi) to compare the rate of increase in labeling between neurons and glia. The rate (R) was defined as the daily change in convergence index. To quantify the difference between neuronal and glial labeling rates, we calculated a labeling rate index, R<sub>g</sub> − R<sub>n</sub>, where R<sub>g</sub> and R<sub>n</sub> denote the rates for glia and neurons, respectively (Author response image 1). Our analysis revealed that, between 2 and 6 dpi, glial cells exhibited a higher labeling rate than neurons. However, this trend reversed between 6 and 10 dpi, with neurons surpassing glial cells in labeling rate. These findings have been included in the revised Discussion section (page 9).

      Author response image 1.

      Labeling rate index of glia and neurons across two time intervals. Data points represent the mean labeling rate index for each tracing strategy within each time interval. *P < 0.05 (nonparametric two-tailed Mann-Whitney test).  
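The labeling rate index defined above (R<sub>g</sub> − R<sub>n</sub>, with R taken as the daily change in convergence index) can be illustrated with a short sketch. The function names and all convergence index values below are hypothetical and chosen only to show the arithmetic; they are not data from the study.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    """A tracing interval in days post-injection (dpi)."""
    start_dpi: int
    end_dpi: int


def labeling_rate(ci_start: float, ci_end: float, interval: Interval) -> float:
    """Rate R: daily change in convergence index over the interval."""
    days = interval.end_dpi - interval.start_dpi
    if days <= 0:
        raise ValueError("interval must span at least one day")
    return (ci_end - ci_start) / days


def labeling_rate_index(glia_ci: Tuple[float, float],
                        neuron_ci: Tuple[float, float],
                        interval: Interval) -> float:
    """Index R_g - R_n; positive when glia are labeled faster than neurons."""
    r_g = labeling_rate(glia_ci[0], glia_ci[1], interval)
    r_n = labeling_rate(neuron_ci[0], neuron_ci[1], interval)
    return r_g - r_n


if __name__ == "__main__":
    early, late = Interval(2, 6), Interval(6, 10)
    # Hypothetical (start, end) convergence index values for each interval.
    print(labeling_rate_index((1.0, 3.0), (0.5, 1.5), early))
    print(labeling_rate_index((3.0, 3.5), (1.5, 2.5), late))
```

Under this sign convention, a positive index in the 2–6 dpi interval and a negative index in the 6–10 dpi interval would correspond to the reversal described in the response.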

      Are the labeled glial cells only present around the injection site?

      We believe the reviewer is inquiring whether labeled glial cells are spatially restricted to the vicinity of starter neurons. The initial infection is determined by the expression of TVA rather than the injection site. For example, injecting a high volume of virus into the anterior hindbrain resulted in the infection of TVA-expressing cells in distant regions, including the tectum and posterior hindbrain (Author response image 2).

      Regarding glial labeling, PC starter experiments showed that labeled glial cells (i.e., Bergmann glia) were predominantly localized within the cerebellum, likely due to the confinement of PC dendrites to this region. When using vglut2a to define starter neurons, glial labeling was frequently observed near the soma and dendrites of starter cells (14 out of 17 cases; Author response image 2). These observations suggest that transneuronally labeled glial cells may be synaptically associated with the starter neurons. We have included this point in the revised Discussion section (page 9).

      Author response image 2.

      Location of transneuronally labeled glial cells. (a and b) Confocal images showing the right tectum (a) and posterior hindbrain (b) of different WT larvae expressing EGFP and TVA using UGNT in randomly sparse neurons (vglut2a<sup>+</sup>) and infected with CVSdG-tdTomato[EnvA] (magenta) injected into the anterior hindbrain. Dashed yellow circles, starter neurons (EGFP<sup>+</sup>/tdTomato<sup>+</sup>); gray arrows, transneuronally labeled radial glia (tdTomato<sup>+</sup>/EGFP<sup>−</sup>); dashed white lines, tectum or hindbrain boundaries. C, caudal; R, rostral. Scale bars, 20 μm.

      Can the phenomenon of transneuronal labeling of radial glial cells be mitigated if the tracing is done in slightly older larvae?

      Yes, we agree. As elaborated in the following response, we hypothesize that the loss of fluorescence in radial glial cells at later developmental stages is due to abortive infection (see Author response image 3 and associated response). This supports the notion that abortive infection becomes increasingly pronounced as larvae mature, potentially explaining the negligible glial labeling observed in adult zebrafish (Dohaku et al., 2019; Satou et al., 2022). However, as noted in our response to the first comment, the disappearance of fluorescence does not indicate the absence of viral entry. Viral receptors may be expressed on glial cells, allowing initial infection despite a failure in subsequent replication. Consequently, glial infection, though abortive, may still compete with neuronal infection and reduce tracing efficiency.

      What is the survival rate of the infected glial cells over time?

      We observed the disappearance of glial fluorescence after transneuronal labeling, but we did not observe the punctate fluorescent debris typically indicative of apoptotic cell death. Therefore, we favor the hypothesis that the loss of glial fluorescence results from abortive infection rather than cell death. Abortive infection refers to a scenario in which viral replication is actively suppressed by host antiviral responses, preventing the production of infectious viral particles. For example, recent studies have shown that lab-attenuated rabies virus (RV) induces the accumulation of aberrant double-stranded RNA in astrocytes, which activates mitochondrial antiviral-signaling protein (MAVS) and subsequent interferon expression (Tian et al., 2018). This antiviral response inhibits RV replication, ultimately resulting in abortive infection.

      In addition, we quantified the proportion of glial cells labeled at 2 dpi and 4 dpi that retained fluorescence over time. By 6 dpi (approximately 11 dpf), glial labeling had largely diminished in both groups (Author response image 3). These results suggest that the decline in glial fluorescence is more closely linked to larval age than to the duration of glial infection, supporting the notion of abortive infection. This also addresses the reviewer’s earlier concern and indicates that glial labeling is mitigated in older larvae.

      Author response image 3.

      Fraction of glial cells with fluorescence retention. (a and b) Proportion of glial cells labeled at 2 dpi (a) and 4 dpi (b) that retained fluorescence over time. Data are from the CVS|N2cG|36°C group. In boxplots: center, median; bounds of box, first and third quartiles; whiskers, minimum and maximum values. n.s., not-significant; *P < 0.05, **P < 0.01 (nonparametric two-tailed Mann-Whitney test).

      If an infected glial cell dies due to infection or gets ablated, does the rabies virus spread from the dead glial cells?

      In our system, glial cells do not express the rabies glycoprotein (G). Therefore, even if glial cells are transneuronally infected, they cannot support viral budding or assembly of infectious particles due to the absence of G (Mebatsion et al., 1996), preventing further viral propagation to neighboring cells.

      If TVA and rabies G are delivered to glial cells, followed by rabies virus injection, will it lead to the infection of other glial cells or neurons?

      We have conducted experiments in which TVA and rabies G were specifically expressed in astroglia using the gfap promoter, followed by RVdG-mCherry[EnvA] injection. This resulted in initial infection of TVA-positive astroglia and occasional subsequent labeling of nearby TVA-negative astroglia (Author response image 4), suggesting astroglia-to-astroglia transmission. Notably, no neuronal labeling was observed. This glial-to-glial spread is consistent with previous rabies tracing studies reporting similar phenomena involving the interaction of astrocytes with astrocytes and microglia (Clark et al., 2021). However, the underlying mechanism remains unclear, and we have discussed this in response to the first comment.

      Author response image 4.

      Viral tracing initiated from astroglia. (a) Confocal images of the tectum of a larva expressing EGFP and TVA using UGBT in randomly sparse astroglia (gfap<sup>+</sup>) and infected by SADdG-mCherry[EnvA] (magenta) injected into the anterior hindbrain. (b) Confocal images of the posterior hindbrain of a larva expressing EGFP and TVA using UGNT in randomly sparse astroglia (gfap<sup>+</sup>) and infected by CVSdG-tdTomato[EnvA] (magenta) injected into the anterior hindbrain. Dashed yellow circles, starter astroglia (EGFP<sup>+</sup>/mCherry<sup>+</sup> or EGFP<sup>+</sup>/tdTomato<sup>+</sup>); gray arrows, transneuronally labeled astroglia (tdTomato<sup>+</sup>/EGFP<sup>−</sup>); dashed white lines, tectum or hindbrain boundaries. C, caudal; R, rostral. Scale bars, 20 μm.

      Answers to any of these questions could greatly benefit the broader research community.

      (2) The optimal virus tracing effect has to be achieved by raising the injected larvae at 36°C. Since the routine temperature of zebrafish culture is around 28°C, a more thorough characterization of the effect on the health of zebrafish should be conducted.

      Yes, 36°C is required to achieve optimal labeling efficiency. Although this is above the standard zebrafish culture temperature (28°C), previous work (Satou et al., 2022) and our observations indicate that this transient elevation does not adversely affect larval health within the experimental time window. 

      In the previous study, Satou et al. reported no temperature-dependent effects on swimming behavior, social interaction, or odor discrimination in adult fish maintained at 28°C and 36°C. In larvae, both non-injected and virus-injected fish showed a decrease in survival at later time points (7 dpi), with slightly increased mortality observed at elevated temperatures.

      In our study, we raised the same batch of non-virus-injected larvae at 28°C and 36°C, and found no mortality over a 10-day period. For CVS-N2c-injected larvae, electrode insertion caused injury, but survival rates remained around 80% at both temperatures (see Figure 3A). Moreover, we successfully maintained CVS-N2c-injected larvae at 36°C for over a month, indicating that elevated temperature does not adversely affect fish health. Notably, higher temperatures were associated with an accelerated developmental rate. 

      This point was briefly addressed in the previous version and has now been further elaborated in the revised Discussion section (page 8).

      (3) Given the ability of time-lapse imaging of the infected larval zebrafish brain, the system can be taken advantage of to tackle important issues of rabies virus tracing tools.

      a) Toxicity. 

      The toxicity of rabies viruses is an important issue that limits their application and affects the interpretation of traced circuits. For example, if a significant proportion of starter cells die before analysis, the traced presynaptic networks cannot be reliably assigned to a "defined" population of starter cells. In this manuscript, the authors did an excellent job of characterizing the effects of different rabies strains, G proteins derived from various strains, and levels of G protein expression on starter cell survival. However, an additional parameter that should be tested is the dose of rabies virus injection. The current method section states that all rabies virus preparations were diluted to 2x10^8 infection units per ml, and 2-5 nl of virus suspension was injected near the target cells. It would be interesting to know the impact of the dose/volume of virus injection on retrograde tracing efficiency and toxicity. Would higher titers of the virus lead to more efficient labeling but stronger toxicities? What would be the optimal dose/volume to balance efficiency and toxicity? Addressing these questions would provide valuable insights and help optimize the use of rabies viruses for circuit tracing.

      This is an important concern. Viral cytotoxicity is primarily driven by the level of viral transcription and replication, which inhibits host protein synthesis (Komarova et al., 2007). The RVdG-EnvA typically infects cells at a rate of one viral particle per cell (Zhang et al., 2024), suggesting that increasing viral concentration does not proportionally increase per-cell infection. Accordingly, viral titer and injection volume are unlikely to influence cytotoxicity at the single-cell level. In our experiments, injection volumes up to 20 nl (i.e., 4 to 10 times the standard injection volume) did not affect starter cell survival. However, higher titers or volumes may increase the number of initially infected starter cells, potentially leading to greater overall mortality in larval zebrafish.
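As a back-of-the-envelope check of the doses discussed here, the sketch below converts titer and injection volume into the number of infectious units delivered. It assumes the titer (2x10^8 infectious units per ml) and volumes (2-5 nl standard, up to 20 nl) quoted in the reviewer's comment and this response; the function names are illustrative only.

```python
NL_PER_ML = 1_000_000  # 1 ml = 10^6 nl


def infectious_units(titer_per_ml: float, volume_nl: float) -> float:
    """Infectious units delivered by one injection of volume_nl nanoliters."""
    if titer_per_ml < 0 or volume_nl < 0:
        raise ValueError("titer and volume must be non-negative")
    return titer_per_ml * volume_nl / NL_PER_ML


def fold_over_standard(volume_nl: float, standard_nl: float) -> float:
    """How many times larger volume_nl is than a standard injection volume."""
    if standard_nl <= 0:
        raise ValueError("standard volume must be positive")
    return volume_nl / standard_nl


if __name__ == "__main__":
    titer = 2e8  # infectious units per ml, as quoted from the Methods
    for volume in (2, 5, 20):
        print(f"{volume} nl -> {infectious_units(titer, volume):.0f} infectious units")
    print(fold_over_standard(20, 5), fold_over_standard(20, 2))
```

At this titer, a standard 2-5 nl injection delivers on the order of 400 to 1,000 infectious units, and a 20 nl injection about 4,000, which is consistent with the "4 to 10 times the standard injection volume" stated above.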

      Similarly, given that rabies virus typically infects cells at one particle per cell, increasing viral titer alone is unlikely to enhance tracing efficiency once the virus type is fixed. In contrast, the level of G protein expression significantly influences tracing efficiency (see Figure 2D). However, excessive G protein expression reduces the survival of starter cells (see Figure 3D). Therefore, careful control of G protein levels is essential to balance tracing efficiency and cytotoxicity.

      Notably, regardless of whether infected cells undergo apoptosis or necrosis due to cytotoxicity, the resulting disruption of the plasma membrane severely impairs viral budding. As a result, the formation of intact, G protein-enveloped viral particles is prevented, limiting further infection of neighboring neurons.

      The latest second-generation ΔGL RV vectors (Jin et al., 2024), which lack both the G and L (viral polymerase) genes, have been shown to markedly reduce cytotoxicity. These improved tracing strategies may be explored in future zebrafish studies to further optimize labeling efficiency and cell viability.

      The issue of viral titer and volume has been addressed in the revised Discussion section (page 10).
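
      As a rough illustration of the doses discussed above, the nominal number of infectious units delivered per injection is the product of titer and injected volume. The sketch below is only a worked calculation using the titer and volumes quoted in this response; the function name is hypothetical, and the result is the nominal dose in the injected bolus, not the number of particles that actually infect a given cell.

```python
def injected_infectious_units(titer_iu_per_ml, volume_nl):
    """Nominal infectious units in one injection (1 mL = 1e6 nL)."""
    return titer_iu_per_ml * volume_nl / 1e6

# Titer and volumes quoted in the text: 2 x 10^8 IU/mL, 2-5 nl standard, up to 20 nl tested.
standard_low = injected_infectious_units(2e8, 2)    # about 400 IU
standard_high = injected_infectious_units(2e8, 5)   # about 1000 IU
largest_tested = injected_infectious_units(2e8, 20) # about 4000 IU

print(standard_low, standard_high, largest_tested)
```

      Under these numbers, the 20 nl injections correspond to 4 to 10 times the nominal dose of the standard 2-5 nl injections, matching the range stated above.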

      b) Primary starters and secondary starters: 

      Given that the trans-expression of TVA and G is widespread, there is the possibility of coexistence of starter cells from the initial infection (primary starters) and starter cells generated by rabies virus spreading from the primary starters to presynaptic neurons expressing G. This means that the labeled input cells could be a mixed population connected with either the primary or secondary starter cells.

      It would be immensely interesting if time-lapse imaging could be utilized to observe the appearance of such primary and secondary starter cells. Assuming there is a time difference between the initial appearance of these two populations, it may be possible to differentiate the input cells wired to these populations based on a similar temporal difference in their initial appearance. This approach could provide valuable insights into the dynamics of rabies virus spread and the connectivity of neural circuits.

      The reviewer's suggestion is valuable. Regarding the use of Purkinje cells (PCs) as starter cells, we consider the occurrence of secondary PCs to be extremely rare. Although previous evidence suggests that PCs can form synaptic connections with one another (Chang et al., 2020), our sparse labeling strategy (typically involving fewer than 10 labeled cells) significantly reduces the likelihood of viral transmission between PC starter cells. In addition, if secondary starter PCs were frequently generated, we would expect increased tracing efficiency at 10 dpi compared to 6 dpi. However, our results show no significant difference (see Figure 2—figure supplement 1C and G).

      Given the restricted expression of TVA and G in PCs, even if a limited number of secondary starters were generated, the labeled inputs would predominantly be granule cells (GCs), thereby preserving the cell-type identity of upstream inputs. Secondary starters do raise a potential concern regarding overestimation of the convergence index (CI). However, within the GC-PC circuit, individual GCs often project to multiple PCs. Consequently, a GC labeled via a secondary PC may also be a bona fide presynaptic partner of the primary starter population. This overlap could mitigate the overestimation of CI. Taken together, we believe that the CI values reported in this study provide a reasonable approximation of monosynaptic connectivity.

      In scenarios where TVA and G are broadly expressed—for example, under the control of vglut2a promoter—secondary starter cells may arise frequently. In such cases, long-term time-lapse imaging in the zebrafish whole brain presents a promising strategy to distinguish primary and secondary starter cells, along with their respective input populations, based on the timing of their appearance. This approach potentially enables multi-step circuit tracing within individual animals. An alternative strategy is to use an EnvA-pseudotyped, G-competent rabies virus, which allows targeted initial infection while supporting multisynaptic propagation. When combined with temporally resolved imaging, this strategy could facilitate direct labeling of higher-order circuits and allow clear differentiation between multi-order inputs and the original starter population over time.

      In conclusion, we find this suggestion compelling and will explore these strategies in future studies to optimize and broaden the application of rabies virus-based circuit tracing.

      Reviewer #2 (Public Review):

      The study by Chen, Deng et al. aims to develop an efficient viral transneuronal tracing method that allows efficient retrograde tracing in the larval zebrafish. The authors utilize pseudotyped-rabies virus that can be targeted to specific cell types using the EnvA-TvA systems. Pseudotyped rabies virus has been used extensively in rodent models and, in recent years, has begun to be developed for use in adult zebrafish. However, compared to rodents, the efficiency of the spread in adult zebrafish is very low (~one upstream neuron labeled per starter cell). Additionally, there is limited evidence of retrograde tracing with pseudotyped rabies in the larval stage, which is the stage when most functional neural imaging studies are done in the field. In this study, the authors systematically optimized several parameters of rabies tracing, including different rabies virus strains, glycoprotein types, temperatures, expression construct designs, and elimination of glial labeling. The optimal configurations developed by the authors are up to 5-10 fold higher than more typically used configurations.

      The results are solid and support the conclusions. However, the methods should be described in more detail to allow other zebrafish researchers to apply this method in their own work.

      Additionally, some findings are presented anecdotally, i.e., without quantification or sufficient detail to allow close examinations. Lastly, there is concern that the reagents created by the authors will not be easily accessible to the zebrafish community.

      (1) The titer used in each experiment was not stated. In the methods section, it is stated that aliquots are stored at 2x10e8. Is it diluted for injection? Are all of the experiments in the manuscripts with the same titer?

      We injected all three viral vectors as undiluted stock aliquots. The titers of SADdG-mCherry[EnvA], CVSdG-tdTomato[EnvA], and CVSdG-mCherry-2A-Cre[EnvA] were 2 × 10^8, 2 × 10^8, and 3 × 10^8 infectious units/mL, respectively. This has been clarified in the updated Methods section (page 12).

      (2) The age for injection is quite broad (3-5 dpf in Fig 1 and 4-6 dpf in Fig 2). Given that viral spread efficiency is usually more robust in younger animals, describing the exact injection age for each experiment is critical.

      We appreciate the reviewer’s suggestions. For the initial experiments in Figure 1, which traced from randomly labeled neurons, the injection age was primarily 3–4 dpf, a difference of one day. Because PCs develop more slowly, the injection age for the experiments in Figures 2, 3, and 4 was mainly 5 dpf. To clarify the developmental stages at the time of injection for each experiment, we have added tables (see Figure 1,2—table supplement 2) listing the number of fish used at each injection age for all experimental groups shown in Figures 1 and 2.

      (3) More details should be provided for the paired electrical stimulation-calcium imaging study. How many GC cells were tested? How many had corresponding PC cell responses? What is the response latency? For example, images of stimulated and recorded GCs and PCs should be shown.

      Yes, these are important details for the paired electrical stimulation-calcium imaging study. We stimulated 33 GCs from 32 animals and detected calcium responses in putative postsynaptic PCs in 15 cases. Among these, we successfully ablated the single GC in 11 pairs and observed a weakened calcium response in PCs following ablation (see Figure 2M). The response latency was determined as the first calcium imaging frame where ΔF/F exceeded the baseline (pre-stimulus average) by 3 times the standard deviation. Imaging was performed at 5 Hz, and as shown in Figure 2L, the calculated average response latency was 152 ± 35 ms (mean ± SEM), indicating an immediate response with calcium intensity from the first post-stimulus imaging frame consistently exceeding the threshold.
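
      The latency criterion described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name, the use of per-frame timestamps, and the choice of the sample standard deviation of the pre-stimulus frames are assumptions made for the example.

```python
from statistics import mean, stdev

def response_latency_ms(times_ms, dff, stim_time_ms, n_sd=3.0):
    """Time from stimulus to the first frame whose dF/F exceeds
    baseline mean + n_sd * baseline SD; None if no frame does."""
    # Baseline = all frames acquired before the stimulus (assumed definition).
    baseline = [v for t, v in zip(times_ms, dff) if t < stim_time_ms]
    if len(baseline) < 2:
        raise ValueError("need at least two pre-stimulus frames")
    threshold = mean(baseline) + n_sd * stdev(baseline)
    for t, v in zip(times_ms, dff):
        if t >= stim_time_ms and v > threshold:
            return t - stim_time_ms
    return None

# Hypothetical trace sampled at 5 Hz (one frame every 200 ms), stimulus at 950 ms.
times = [200.0 * i for i in range(10)]
trace = [0.0, 0.01, -0.01, 0.0, 0.01, 0.5, 0.4, 0.3, 0.2, 0.1]
print(response_latency_ms(times, trace, 950.0))
```

      At 5 Hz the frame interval is 200 ms, so a latency measured this way is bounded by one frame interval whenever the first post-stimulus frame already crosses the threshold.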

      We have added additional details to the Results (page 5), Discussion (page 9), and Methods (page 15) sections. A representative image showing both the stimulated GC and the recorded PC has been added to Figure 2 in the revised manuscript (see Figure 2K).

      (4) It is unclear how connectivity between specific PC and GC is determined for single neuron connectivity. In other images (Figure 4C), there are usually multiple starter cells and many GCs. It was not shown that the image resolution can establish clear axon dendritic contacts between cell pairs.

      In our experiments, sparse labeling typically results in 1–10 starter cells per fish. Regarding the case shown in Figure 4C (right column), only two PC starters were labeled, which simplifies the assignment of presynaptic inputs to individual PCs. Connectivity is determined based on clear axon-dendritic or axon-cell body apposition between GCs and PCs. We have accordingly added more details to the Methods (page 16) section regarding how we determined connectivity between specific PCs and GCs.

      Reviewer #2 (Recommendations For The Authors):

      To enable broader use of this technique, I would encourage the authors to submit their zebrafish lines, plasmids, and plasmid sequences to public repositories such as ZIRC and  Addgene. Additionally, there is no mention of how viral vectors will be shared.

      We have deposited the related zebrafish lines at CZRC (China Zebrafish Resource Center) and uploaded plasmid maps and sequences to Addgene. The viral vectors are available through BrainCase (Shenzhen, China). We have included the information in the revised manuscript.

      Reviewer #3 (Public Review):

      Summary:

      The authors establish reagents and define experimental parameters useful for defining neurons retrograde to a neuron of interest.

      Strengths:

      A clever approach, careful optimization, novel reagents, and convincing data together lead to convincing conclusions.

      Weaknesses: 

      In the current version of the manuscript, the tracing results could be better centered with respect to past work, certain methods could be presented more clearly, and other approaches are worth considering.

      Appraisal/Discussion:

      Trans-neuronal tracing in the larval zebrafish preparation has lagged behind rodent models, limiting "circuit-cracking" experiments. Previous work has demonstrated that pseudotyped rabies virus-mediated tracing could work, but published data suggested that there was considerable room for optimization. The authors take a major step forward here, identifying a number of key parameters to achieve success and establishing new transgenic reagents that incorporate modern intersectional approaches. As a proof of concept, the manuscript concludes with a rough characterization of inputs to cerebellar Purkinje cells. The work will be of considerable interest to neuroscientists who use the zebrafish model.

      Reviewer #3 (Recommendations For The Authors):

      The main limitations of the work are as follows:

      (1) The optimizations might differ for different neurons. Purkinje cells are noteworthy because they develop considerably during the time window detailed here, almost doubling in number between 7-14dpf. Presumably, connectivity follows. This sort of neurogenesis is much less common elsewhere. It would be useful to show similar results in, say, tectal neurons, which would have spatially-restricted retinal ganglion cells labelled.

      We acknowledge that Purkinje cells (PCs) undergo significant development between 7–14 dpf, which may influence synaptic connectivity and result in differences in tracing efficiency. However, all experimental conditions were standardized across groups, and the selection of starter PCs was unbiased, typically focusing on PCs in the lateral region of the CCe (corpus cerebelli) subregion, ensuring that the relative comparisons remain valid. 

      We agree that testing other neuronal populations would be valuable, as tracing efficiency is influenced by multiple factors, such as the number of endogenous inputs, synaptic maturation, and developmentally regulated synaptic strength. Tectal neurons, which receive spatially restricted retinal ganglion cell inputs, would be a suitable choice for further investigation. However, due to the various tectal cell types and the opacity of the eyeball, such studies present additional technical challenges and are beyond the scope of this paper.

      (2) The virus is delivered by means of microinjection near the cell. This is invasive and challenging for labs that don't routinely perform electrophysiology. It would be useful to know if coarser methods of viral delivery (e.g. intraventricular injection) would be successful.

      Our protocol does not require the level of precision needed for electrophysiology. The procedure can be performed using a standard high-magnification upright (135× magnification, Nikon SMZ18) or inverted fluorescence microscope (200× magnification, Olympus IX51). The virus suspension was loaded into a glass micropipette with a ~10 µm tip diameter and directly microinjected into the target region using a micromanipulator. The procedure was comparable to embryonic microinjection in terms of precision and operational control. Notably, direct contact with the target cells is not necessary, as the injected virus solution can diffuse and effectively infect nearby cells.  

      We had attempted intraventricular injection as an alternative, but it failed to produce robust labeling, reinforcing the necessity for direct tissue injection. 

      We have now included additional methodological details in the Methods section (page 13). 

      (3) Because of the combination of transgenic lines, plasmid injection, and viral type, it is often confusing to follow exactly what is being done for a particular experiment. It would be useful to specify the transgenic background used for each experiment using standard nomenclature e.g. "Plasmids were injected into Tg(elavl3:GAL4) fish." This is particularly important for the experiments in Figure 4: it isn't clear what the background used for the sparse labels was.

      We thank the reviewer for bringing this issue to our attention. To improve clarity, we have revised the figure legends to explicitly state the transgenic background, injected plasmids, and viral type used in each experiment, particularly for Figure 4.

      (4) Plasmids should be deposited with Addgene along with maps specifying the particular "codon-optimized Tetoff" per 388. 

      We confirm that all plasmids, including those containing codon-optimized Tetoff constructs, have been uploaded to Addgene along with detailed maps.

      (5) It would be useful to know if there were more apoptotic cells after transfection -- an acridine orange or comparable assay is recommended, rather than loss of fluorescence. 

      We appreciate the reviewer’s suggestion to assess apoptosis using acridine orange staining or comparable assays. We agree that such methods can provide more direct detection of apoptotic events. However, we believe that the difference in cytotoxicity is already evident in our current data: SAD-infected cells exhibit greater loss than CVS-infected cells (see Figure 3D). This is consistent with previous observations in mice, where greater toxicity of SAD compared to CVS was demonstrated using propidium iodide (PI) staining in cultured cells (Reardon et al., 2016).

      (6) Line 219-228 Hibi's lab has described the subtypes of granule cells in detail already; the work should discuss the tracings with respect to previous characterizations instead of limiting that work to a citation.

      Thank you for raising this point. We have expanded the Results section (page 6) to discuss the subtypes of GCs and PCs in relation to previously reported characterizations.

      (7) "Activities" is often used when "activity" is correct. The use of English in the manuscript is, by and large, excellent, but it's worth running the text through software like Grammarly to catch the occasional error.

      We have carefully edited the manuscript using professional language editing tools to correct any grammatical issues.

      (8) The experiments in 2J-2L would be more convincing if they were performed on inferior olive inputs as well -- especially given the small size of the granule cells. 

      We acknowledge the reviewer's observation that granule cells (GCs) are relatively small, which may underlie the finding that, out of 33 stimulated GCs, only 15 were capable of eliciting calcium responses in putative postsynaptic PCs. However, in all 11 pairs where a single GC was successfully ablated, we observed a weakened calcium response in PCs after the ablation (see Figure 2M), suggesting our tracing approach specifically identifies synaptically coupled neurons. We have clarified this point in the revised manuscript (page 5).

      We agree that verifying the IO inputs to PCs would strengthen the validity of our findings. However, in our experiments, the probability of tracing upstream IO cells was relatively low. This may be due to the developmental immaturity of the synapse and the fact that each PC typically receives input from a single IO cell. Additionally, the deep and distant anatomical location of the IO presents technical challenges for paired electrical stimulation-calcium imaging studies. To address these limitations, we are currently exploring the integration of viral tracing and optogenetics to further investigate IO-PC connectivity in future studies.

      (9) It would be useful if the manuscript discussed the efficacy of trans-synaptic labelling. What fraction of granule cell / olivary inputs to a particular Purkinje cell do the authors think their method captures?

      This is an important point for assessing the efficacy of our trans-synaptic labeling. Ideally, electron microscopy (EM) data would provide the most precise evaluation. In the absence of EM data, we estimated the number of GCs, IOs and PCs using light microscopy-based cell counting. 

      At approximately 7 dpf, we manually counted 327 ± 14 PCs and 2318 ± 70 GCs in the Tg(2×en.cpce-E1B:tdTomato-CAAX) and Tg(cbln12:GAL4FF);Tg(5×UAS:EGFP) zebrafish cerebellum, across all subregions (Va, CCe, EG, and LCa). Given the developmental increase in the number of GCs, the fact that some GCs have exclusively ipsilateral projections, and the fact that a single PC would not receive input from all parallel fibers, we estimate that by 10–14 dpf a single PC receives approximately 1000–2000 GC inputs. Under optimal tracing conditions, we observed an average of 20 labeled GC inputs per PC, yielding a capture fraction of ~1–2%. Although this represents only a subset of total inputs, it is consistent with mammalian studies (Wall et al., 2010; Callaway et al., 2015), suggesting inherent limitations of this viral labeling approach.

      For IO inputs, we counted 325 ± 26 inferior olivary neurons in Tg(elavl3:H2B-GCaMP6s) fish. A single PC likely receives input from one IO neuron, though an IO neuron may innervate multiple PCs. Accordingly, the observed capture rate for IO inputs was lower (7 out of 248 starters). 
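
      The capture estimates above follow from simple ratios. The sketch below only reproduces that arithmetic from the numbers quoted in this response; the function name is hypothetical, and the estimated range of 1000–2000 GC inputs per PC is taken as given.

```python
def capture_fraction(labeled_inputs, estimated_total_inputs):
    """Fraction of a cell's estimated inputs that were labeled by tracing."""
    return labeled_inputs / estimated_total_inputs

# Granule cell inputs: ~20 labeled per PC against an estimated 1000-2000 per PC.
gc_fraction_low = capture_fraction(20, 2000)   # 0.01, i.e. ~1%
gc_fraction_high = capture_fraction(20, 1000)  # 0.02, i.e. ~2%

# Inferior olive inputs: 7 traced IO cells across 248 starter PCs.
io_per_starter = 7 / 248                       # ~0.028 traced IO cells per starter

print(f"GC capture: {gc_fraction_low:.1%} to {gc_fraction_high:.1%}")
print(f"IO cells traced per starter: {io_per_starter:.3f}")
```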

      Further optimization is required to enhance the tracing efficiency. We have now incorporated a Discussion on this point in the revised manuscript (page 8).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, Ana Lapao et al. investigated the roles of the Rab27 effector SYTL5 in cellular membrane trafficking pathways. The authors found that SYTL5 localizes to mitochondria in a Rab27A-dependent manner. They demonstrated that SYTL5-Rab27A positive vesicles containing mitochondrial material are formed under hypoxic conditions, and thus they speculate that SYTL5 and Rab27A play roles in mitophagy. They also found that both SYTL5 and Rab27A are important for normal mitochondrial respiration. Cells lacking SYTL5 undergo a shift from mitochondrial oxygen consumption to glycolysis, which is a common process known as the Warburg effect in cancer cells. Based on the cancer patient database, the authors noticed that low SYTL5 expression is related to reduced survival for adrenocortical carcinoma patients, indicating SYTL5 could be a negative regulator of the Warburg effect and potentially tumorigenesis.

      Strengths:

      The authors take advantage of multiple techniques and novel methods to perform the experiments.

      (1) Live-cell imaging revealed that stably inducible expression of SYTL5 co-localized with filamentous structures positive for mitochondria. This result was further confirmed by using correlative light and EM (CLEM) analysis and western blotting from purified mitochondrial fraction.

      (2) In order to investigate whether SYTL5 and Rab27A are required for mitophagy in hypoxic conditions, two established mitophagy reporter U2OS cell lines were used to analyze the autophagic flux.

      Weaknesses:

      This study revealed a potential function of SYTL5 in mitophagy and mitochondrial metabolism. However, the mechanistic evidence that establishes the relationship between SYTL5/Rab27A and mitophagy is insufficient. The involvement of SYTL5 in ACC needs more investigation. Furthermore, images and results supporting the major conclusions need to be improved.

      We thank the reviewer for their constructive comments. We agree that a complete understanding of the mechanism by which SYTL5 and Rab27A are recruited to the mitochondria and subsequently involved in mitophagy requires further investigation. Here, we have shown that SYTL5 recruitment to the mitochondria requires both its lipid-binding C2 domains and the Rab27A-binding SHD domain (Figure 1G-H). This implies a coincidence detection mechanism for mitochondrial localisation of SYTL5.  Additionally, we find that mitochondrial recruitment of SYTL5 is dependent on the GTPase activity and mitochondrial localisation of Rab27A (Figure 2D-E). We also identified proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      However, fewer details are known about the mitochondrial localisation of Rab27A. To investigate this, we have now performed a mass spectrometry analysis to identify the interactome of Rab27A (see Author response table 1 below). U2OS cells with stable expression of mScarlet-Rab27A or mScarlet only were subjected to immunoprecipitation, followed by MS analysis. Of the 32 significant Rab27A-interacting hits (compared to control), two are located in the inner mitochondrial membrane (IMM): ATP synthase F(1) complex subunit alpha (P25705), and mitochondrial very long-chain specific acyl-CoA dehydrogenase (VLCAD)(P49748). However, as these IMM proteins are not likely involved in the mitochondrial recruitment of Rab27A observed under basal conditions, we chose not to include these data in the manuscript.

      It is known that other RAB proteins are recruited to the mitochondria. During parkin-mediated mitophagy, RABGEF1 (a guanine nucleotide exchange factor) is recruited through its ubiquitin-binding domain and directs mitochondrial localisation of RAB5, which subsequently leads to recruitment of RAB7 by the MON1/CCZ1 complex[1]. As already mentioned in the discussion (p. 12), ubiquitination of the Rab27A GTPase activating protein alpha (TBC1D10A) is reduced in the brain of Parkin KO mice compared to controls[35], suggesting a possible connection of Rab27A with regulatory mechanisms that are linked with mitochondrial damage and dysfunction. While this is an interesting avenue to explore, in this paper we will not follow up further on the mechanism of mitochondrial recruitment of Rab27A.

      Author response table 1.

      Rab27A interactome. Proteins co-immunoprecipitated with mScarlet-Rab27A vs mScarlet-expressing control. The data show the average of three replicates.

      To investigate the role of SYTL5 in the context of ACC, we acquired the NCI-H295R cell line isolated from the adrenal gland of an adrenal cancer patient. The cells were cultured as recommended by ATCC using DMEM/F-12 supplemented with NuSerum and ITS+ premix. It is important to note that the H295R cells were adapted to grow as an adherent monolayer from the H295 cell line, which grows in suspension. However, there can still be many viable H295R cells in the media.

      We attempted to conduct OCR and ECAR measurements using the Seahorse XF upon knockdown of SYTL5 and/or Rab27A in H295R cells. For these assays, it is essential that the cells be seeded in a monolayer at 70-90% confluency with no cell clusters[4]. Poor adhesion of the cells can cause inaccurate measurements by the analyser. Unfortunately, the results of the five replicates we carried out were highly inconsistent: the same knockdown produced trends in opposite directions in different replicates. This is likely due to problems with seeding the cells. Despite our best efforts to optimise the seeding number and to pre-coat the plate with poly-D-lysine[5], we observed poor attachment of cells and an inability to form a monolayer.

      To study the localisation of SYTL5 and Rab27A in an ACC model, we transduced the H295R cells with lentiviral particles to overexpress pLVX-SV40-mScarlet-I-Rab27A and pLVX-CMV-SYTL5-EGFP-3xFLAG. Again, this proved unsuccessful after numerous attempts at optimising transduction. 

      These issues limited our investigation into the role of SYTL5 in ACC to the cortisol assay (Supplementary Figure 6). For this assay, the H295R cells were an appropriate model, as they are able to produce an array of adrenal cortex steroids[6], including cortisol[7]. Measurements are taken from cell culture supernatants, and the cortisol concentration was normalised to total protein per sample, so variation in cell confluency does not prevent consistent results. With this assay we were able to rule out a role for SYTL5 and Rab27A in the secretion of cortisol.

      Another consideration when investigating the involvement of SYTL5 in ACC, is that in general ACC cells should have a low expression of SYTL5 as is seen from the patient expression data (Figure 6B).

      The reviewer also writes “Furthermore, images and results supporting the major conclusions need to be improved.”. We have tried several times, without success, to generate U2OS cells with CRISPR/Cas9-mediated C-terminal tagging of endogenous SYTL5 with mNeonGreen, using an approach that has been successfully implemented in the lab for other genes. This failure is likely due to a lack of suitable sgRNAs targeting the C-terminal region of SYTL5: the available candidates have a low predicted efficiency score and a large number of predicted off-target sites in the human genome, including exons and introns of several other genes (see Author response image 2).

      We have also included new data (Supplementary Figure 4B) showing that some of the hypoxia-induced SYTL5-Rab27A-positive vesicles stain positive for the autophagy markers p62 and LC3B when inhibiting lysosomal degradation, further strengthening our data that SYTL5 and Rab27A function as positive regulators of mitophagy.  

      Reviewer #2 (Public review): 

      Summary:

      The authors provide convincing evidence that Rab27 and SYTL5 work together to regulate mitochondrial activity and homeostasis.

      Strengths:

      The development of models that allow the function to be dissected, and the rigorous approach and testing of mitochondrial activity.

      Weaknesses:

      There may be unknown redundancies in both pathways in which Rab27 and SYTL5 are working which could confound the interpretation of the results.

      Suggestions for revision:

      Given that Rab27A and SYTL5 are members of protein families it would be important to exclude any possible functional redundancies coming from Rab27B expression or one of the other SYTL family members. For Rab27 this would be straightforward to test in the assays shown in Figure 4 and Supplementary Figure 5. For SYTL5 it might be sufficient to include some discussion about this possibility.

      We thank the reviewer for pointing out the potential redundancy issue for Rab27A and SYTL5. There are multiple studies demonstrating redundancy between Rab27A and Rab27B. For example, in a study of Griscelli syndrome, a disease caused by Rab27A loss of function, expression of either Rab27A or Rab27B rescued the healthy phenotype, indicating redundancy[8]. This redundancy, however, applies only to certain functions and cell types. In fact, in a study of hair growth, knockdown of Rab27B had the opposite effect to knockdown of Rab27A[9].

      In this paper, we conducted all assays in U2OS cells, in which the expression of Rab27B is very low. The Human Protein Atlas reports expression of 0.5 nTPM for Rab27B, compared to 18.4 nTPM for Rab27A. We also observed this low level of expression of Rab27B compared to Rab27A by qPCR in U2OS cells. Therefore, there would be very little endogenous Rab27B expression in cells depleted of Rab27A (with siRNA or KO). In line with this, Rab27B peptides were not detected in our SYTL5 interactome MS data (Table 1 in paper). Moreover, as Rab27A depletion inhibits mitochondrial recruitment of SYTL5 and mitophagy, it is not likely that Rab27B provides functional redundancy. It is possible that Rab27B overexpression could rescue mitochondrial localisation of SYTL5 in Rab27A KO cells, but this was not tested as we do not have any evidence for a role of Rab27B in these cells. Taken together, we believe our data imply that Rab27B is very unlikely to provide any functional redundancy to Rab27A in our experiments.

      For the SYTL family, all five members are Rab27 effectors, binding to Rab27 through their SHD domain. Together with Rab27, all SYTLs have been implicated in exocytosis in different cell types: for example, SYTL1 in exocytosis of azurophilic granules from neutrophils[10], SYTL2 in secretion of glucagon granules from pancreatic α cells[11], SYTL3 in secretion of lytic granules from cytotoxic T lymphocytes[12], SYTL4 in exocytosis of dense hormone-containing granules from endocrine cells[13], and SYTL5 in secretion of the RANKL cytokine from osteoblasts[14]. This indicates a potential for redundancy through their binding to Rab27 and function in vesicle secretion/trafficking. However, one study found that different Rab27 effectors have distinct functions at different stages of exocytosis[15].

      Very little is known about redundancy or hierarchy between these proteins. Differences in function may be due to the variation in gene expression profile across tissues for the different SYTLs (see Author response image 1 below). SYTL5 is enriched in the brain, unlike the others, suggesting possible tissue-specific functions. There are also differences in the binding affinities and calcium sensitivities of the C2A and C2B domains between the SYTL proteins[16].

      Author response image 1.

      GTEx Multi Gene Query for SYTL1-5

      All five SYTLs are expressed in the U2OS cell line, with nTPM values according to the Human Protein Atlas of SYTL1: 7.5, SYTL2: 13.4, SYTL3: 14.2, SYTL4: 8.7, and SYTL5: 4.8. In line with this, in the Rab27A interactome, when comparing cells overexpressing mScarlet-Rab27A with control cells, we detected all five SYTLs as specific Rab27A-interacting proteins (see Author response table 1 above). In the SYTL5 interactome, by contrast, we did not detect any other SYTL protein (Table 1 in paper), confirming that they do not form a complex with SYTL5.

      We have included the following text in the discussion (p. 12): “SYTL5 and Rab27A are both members of protein families, suggesting possible functional redundancies from Rab27B or one of the other SYTL isoforms. While Rab27B has a very low expression in U2OS cells, all five SYTL’s are expressed. However, when knocking out or knocking down SYTL5 and Rab27A we observe significant effects that we presume would be negated if their isoforms were providing functional redundancies. Moreover, we did not detect any other SYTL protein or Rab27B in the SYTL5 interactome, confirming that they do not form a complex with SYTL5.”

      Suggestions for Discussion: 

      Both Rab27A and SYTL5 localize to other membranes, including the endolysosomal compartments. How do the authors envisage the mechanism or cellular modifications that allow these proteins, either individually or in complex, to function also to regulate mitochondrial function? It would be interesting to have some views.

      We agree that it would be interesting to better understand the mechanism involved in modulation of the localisation and function of SYTL5 and Rab27A at different cellular compartments, including the mitochondria. Here, we have shown that SYTL5 recruitment to the mitochondria involves coincidence detection, as both its lipid-binding C2 domains and the Rab27A-binding SHD domain are required (Figure 1G-H). Both these domains also seem required for localisation of SYTL5 to vesicles, and we can only speculate that binding to different lipids (Figure 1F) may regulate SYTL5 localisation. Additionally, we find that mitochondrial recruitment of SYTL5 is dependent on the GTPase activity and mitochondrial localisation of Rab27A (Figure 2D-E). However, this also seems to be the case for vesicular recruitment of SYTL5, although a few SYTL5-Rab27A (T23N) positive vesicles were seen (Figure 2E).

      To characterise the mechanisms involved in mitochondrial localisation of Rab27A, we have performed mass spectrometry analysis to identify the interactome of Rab27A (see Author response table 1 above). U2OS cells with stable expression of mScarlet-Rab27A or mScarlet only were subjected to immunoprecipitation, followed by MS analysis.  Of the 32 significant Rab27A-interacting hits (compared to control), two of the hits localise in the inner mitochondrial membrane (IMM); ATP synthase F(1) complex subunit alpha (P25705), and mitochondrial very long-chain specific acyl-CoA dehydrogenase (VLCAD)(P49748). However, as these IMM proteins are not likely involved in mitochondrial recruitment of Rab27A, observed under basal conditions, we chose not to include these data in the manuscript. 

      It is known that other RAB proteins are recruited to the mitochondria by regulation of their GTPase activity. During parkin-mediated mitophagy, RABGEF1 (a guanine nucleotide exchange factor) is recruited through its ubiquitin-binding domain and directs mitochondrial localisation of RAB5, which subsequently leads to recruitment of RAB7 by the MON1/CCZ1 GEF complex[1]. As already mentioned in the discussion (p.12), ubiquitination of the Rab27A GTPase activating protein alpha (TBC1D10A) is reduced in the brain of Parkin KO mouse compared to controls[35], suggesting a possible connection of Rab27A with regulatory mechanisms that are linked with mitochondrial damage and dysfunction. While this is an interesting avenue to explore, it is beyond the scope of this paper.

      Our data suggest that SYTL5 functions as a negative regulator of the Warburg effect, the switch from OXPHOS to glycolysis. While both SYTL5 and Rab27A seem required for mitophagy of selective mitochondrial components, and their depletion leads to reduced mitochondrial respiration and ATP production, only depletion of SYTL5 caused a switch to glycolysis. The mechanisms involved are unclear, but we found several proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      We have addressed this comment in the discussion on p.12 

      Reviewer #3 (Public review):

      Summary:

      In the manuscript by Lapao et al., the authors uncover a role for the Rab27A effector protein SYTL5 in regulating mitochondrial function and turnover. The authors find that SYTL5 localizes to mitochondria in a Rab27A-dependent way and that loss of SYTL5 (or Rab27A) impairs lysosomal turnover of an inner mitochondrial membrane mitophagy reporter but not a matrix-based one. As the authors see no co-localization of GFP/mScarlet tagged versions of SYTL5 or Rab27A with LC3 or p62, they propose that lysosomal turnover is independent of the conventional autophagy machinery. Finally, the authors go on to show that loss of SYTL5 impacts mitochondrial respiration and ECAR and as such may influence the Warburg effect and tumorigenesis. Of relevance here, the authors go on to show that SYTL5 expression is reduced in adrenocortical carcinomas and this correlates with reduced survival rates.

      Strengths:

      There are clearly interesting and new findings here that will be relevant to those following mitochondrial function, the endocytic pathway, and cancer metabolism.

      Weaknesses:

      The data feel somewhat preliminary in that the conclusions rely on exogenously expressed proteins and reporters, which do not always align.

      As the authors note there are no commercially available antibodies that recognize endogenous SYTL5, hence they have had to stably express GFP-tagged versions. However, it appears that the level of expression dictates co-localization from the examples the authors give (though it is hard to tell as there is a lack of any kind of quantitation for all the fluorescent figures). Therefore, the authors may wish to generate an antibody themselves or tag the endogenous protein using CRISPR.

      We agree that the level of SYTL5 expression is likely to affect its localisation. As suggested by the reviewer, we have tried hard, without success, to generate U2OS cells with CRISPR knock-in of a mNeonGreen tag at the C-terminus of endogenous SYTL5, using an approach that has been successfully implemented in the lab for other genes. This failure is likely due to a lack of suitable sgRNAs targeting the C-terminal region of SYTL5: the available sgRNAs have a low predicted efficiency score and a large number of predicted off-target sites in the human genome, including several other gene exons and introns (see Author response image 2).

      Author response image 2.

      Overview of sgRNAs targeting the C-terminal region of SYTL5 

      Although the SYTL5 expression level might affect its cellular localization, we also found the mitochondrial localisation of SYTL5-EGFP to be strongly increased in cells co-expressing mScarlet-Rab27A, supporting our findings of Rab27A-mediated mitochondrial recruitment of SYTL5. We have also included new data (Supplementary Figure 4B) showing that some of the hypoxia-induced SYTL5-Rab27A-positive vesicles stain positive for the autophagy markers p62 and LC3B when inhibiting lysosomal degradation, further strengthening our data that SYTL5 and Rab27A function as positive regulators of mitophagy.

      In relation to quantitation, the authors found that SYTL5 localizes to multiple compartments or potentially a few compartments that are positive for multiple markers. Some quantitation here would be very useful as it might inform on function. 

      We find that SYTL5-EGFP localizes to mitochondria, lysosomes and the plasma membrane in U2OS cells with stable expression of SYTL5-EGFP and in SYTL5/Rab27A double knock-out cells rescued with SYTL5-EGFP and mScarlet-Rab27A. We also see colocalization of SYTL5-EGFP with endogenous p62, LC3 and LAMP1 upon induction of mitophagy. However, as these cell lines comprise a heterogeneous pool with high variability, we do not believe that quantification of the overexpressing cell lines would provide beneficial information in this scenario. As described above, we have tried several times to generate SYTL5 knock-in cells without success.

      The authors find that upon hypoxia/hypoxia-like conditions that punctate structures of SYTL5 and Rab27A form that are positive for Mitotracker, and that a very specific mitophagy assay based on pSu9-Halo system is impaired by siRNA of SYTL5/Rab27A, but another, distinct mitophagy assay (Matrix EGFP-mCherry) shows no change. I think this work would strongly benefit from some measurements with endogenous mitochondrial proteins, both via immunofluorescence and western blot-based flux assays. 

      In addition to the western blotting for different endogenous ETC proteins showing significantly increased levels of MTCO1 in cells depleted of SYTL5 and/or Rab27A (Figure 5E-F), we have now blotted for the endogenous mitochondrial proteins, COXIV and BNIP3L, in DFP and DMOG conditions upon knockdown of SYTL5 and/or Rab27A (Figure 5G and Supplementary Figure 5A). Although there was a trend towards increased levels, we did not see any significant changes in total COXIV or BNIP3L levels when SYTL5, Rab27A or both are knocked down compared to siControl. Blotting for endogenous mitochondrial proteins is however not the optimum readout for mitophagy. A change in mitochondrial protein level does not necessarily result from mitophagy, as other factors such as mitochondrial biogenesis and changes in translation can also have an effect. Mitophagy is a dynamic process, which is why we utilise assays such as the HaloTag and mCherry-EGFP double tag as these indicate flux in the pathway. Additionally, as mitochondrial proteins have different half-lives, with many long-lived mitochondrial proteins[17], differences in turnover rates of endogenous proteins make the results more difficult to interpret. 

      A really interesting aspect is the apparent independence of this mitophagy pathway on the conventional autophagy machinery. However, this is only based on a lack of co-localization between p62 or LC3 with LAMP1 and GFP/mScarlet tagged SYTL5/Rab27A. However, I would not expect them to greatly colocalize in lysosomes as both the p62 and LC3 will become rapidly degraded, while the eGFP and mScarlet tags are relatively resistant to lysosomal hydrolysis. -/+ a lysosome inhibitor might help here and ideally, the functional mitophagy assays should be repeated in autophagy KOs.

      We thank the reviewer for this suggestion. We have now repeated the colocalisation studies in cells treated with DFP with the addition of bafilomycin A1 (BafA1) to inhibit the lysosomal V-ATPase. Indeed, we find that a few of the SYTL5/Rab27A/MitoTracker positive structures also stain positive for p62 and LC3 (Supplementary Figure 4B). As expected, the occurrence of these structures was rare, as BafA1 was only added for the last 4 hrs of the 24 hr DFP treatment. However, we cannot exclude the possibility that there are two different populations of these vesicles.

      The link to tumorigenesis and cancer survival is very interesting but it is not clear if this is due to the mitochondrially-related aspects of SYTL5 and Rab27A. For example, increased ECAR is seen in the SYTL5 KO cells but not in the Rab27A KO cells (Fig.5D), implying that mitochondrial localization of SYTL5 is not required for the ECAR effect. More work to strengthen the link between the two sections in the paper would help with future directions and impact with respect to future cancer treatment avenues to explore.

      We agree that the role of SYTL5 in ACC requires future investigation. While we observe reduced OXPHOS levels in both SYTL5 and Rab27A KO cells (Figure 5B), glycolysis was only increased in SYTL5 KO cells (Figure 5D). Because ECAR was unchanged in both the Rab27A KO and Rab27A/SYTL5 dKO cells, Rab27A appears to be required for the increase in ECAR when SYTL5 is depleted, which we interpret as SYTL5 negatively regulating Rab27A. The mechanism involved is unclear, but we found several proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      To investigate the link to cancer further, we tested the effect of knockdown of SYTL5 and/or Rab27A on the levels of mitochondrial ROS. ROS levels were measured by flow cytometry using the MitoSOX Red dye, together with the MitoTracker Green dye to normalise ROS levels to the total mitochondria. Cells were treated with the antioxidant N-acetylcysteine (NAC)[18] as a negative control and menadione as a positive control, as menadione induces ROS production via redox cycling[19]. We must consider that there is also a lot of autofluorescence from cells that makes it impossible to get a level of ‘zero ROS’ in this experiment. We did not see a change in ROS with knockdown of SYTL5 and/or Rab27A compared to the NAC treated or siControl samples (see Author response image 3 below). The menadione samples confirm the success of the experiment as ROS accumulated in these cells. Thus, based on this, we do not believe that low SYTL5 expression would affect ROS levels in ACC tumours.

      Author response image 3.

      Mitochondrial ROS production normalised to total mitochondria
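      The normalisation described above can be sketched as follows. This is a minimal illustration only, assuming each sample is summarised by one MitoSOX Red intensity and one MitoTracker Green intensity (for example, a median over gated cells). The function names, the fold-change relative to siControl, and all numbers are hypothetical and are not taken from our data or analysis pipeline.

```python
def normalised_ros(mitosox_red, mitotracker_green):
    """Return the MitoSOX Red signal per unit of MitoTracker Green signal."""
    if mitotracker_green <= 0:
        raise ValueError("MitoTracker Green intensity must be positive")
    return mitosox_red / mitotracker_green


def relative_to_control(samples, control="siControl"):
    """Express each sample's normalised ROS as a fold change over the control.

    `samples` maps a sample name to a (MitoSOX Red, MitoTracker Green) pair.
    """
    ratios = {name: normalised_ros(red, green)
              for name, (red, green) in samples.items()}
    reference = ratios[control]
    return {name: ratio / reference for name, ratio in ratios.items()}


# Hypothetical intensities, for illustration only.
example = {
    "siControl": (200.0, 100.0),
    "siSYTL5": (210.0, 100.0),
    "menadione": (800.0, 100.0),
}
fold_changes = relative_to_control(example)
```

      Dividing by the MitoTracker Green signal is intended to keep differences in mitochondrial mass from being read as differences in ROS; it does not remove the autofluorescence floor mentioned above.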

      As discussed in our response to Reviewer #1, we tried hard to characterise the role of SYTL5 in the context of ACC using the NCI-H295R cell line isolated from the adrenal gland of an adrenal cancer patient. We attempted to conduct OCR and ECAR measurements using the Seahorse XF upon knockdown of SYTL5 and/or Rab27A in H295R cells without success, due to poor attachment of the cells and inability to form a monolayer. We also transduced the H295R cells with lentiviral particles to overexpress pLVX-SV40-mScarlet-I-Rab27A and pLVX-CMV-SYTL5-EGFP-3xFLAG to study the localisation of SYTL5 and Rab27A in an ACC model. Again, this proved unsuccessful after numerous attempts at optimising the transduction. These issues limited our investigation into the role of SYTL5 in ACC to the cortisol assay (Supplementary Figure 6). For this, the H295R cells were an appropriate model as they are able to produce an array of adrenal cortex steroids[6] including cortisol[7]. In this assay, measurements are taken from cell culture supernatants, so the confluency of the cells does not prevent consistent results, as the cortisol concentration was normalised to total protein per sample. With this assay we were able to rule out a role for SYTL5 and Rab27A in the secretion of cortisol.

      Another consideration when investigating the involvement of SYTL5 in ACC is that, in general, ACC cells should have a low expression of SYTL5, as seen in the patient expression data (Figure 6B).

      Further studies into the link between SYTL5/Rab27A and cancer are beyond the scope of this paper as we are limited to the tools and expertise available in the lab.

      References

      (1) Yamano, K. et al. Endosomal Rab cycles regulate Parkin-mediated mitophagy. eLife 7 (2018). https://doi.org/10.7554/eLife.31326

      (2) Carré, M. et al. Tubulin is an inherent component of mitochondrial membranes that interacts with the voltage-dependent anion channel. The Journal of biological chemistry 277, 33664-33669 (2002). https://doi.org/10.1074/jbc.M203834200

      (3) Hoogerheide, D. P. et al. Structural features and lipid binding domain of tubulin on biomimetic mitochondrial membranes. Proceedings of the National Academy of Sciences 114, E3622-E3631 (2017). https://doi.org/10.1073/pnas.1619806114

      (4) Plitzko, B. & Loesgen, S. Measurement of Oxygen Consumption Rate (OCR) and Extracellular Acidification Rate (ECAR) in Culture Cells for Assessment of the Energy Metabolism. Bio Protoc 8, e2850 (2018). https://doi.org/10.21769/BioProtoc2850

      (5) Yavin, E. & Yavin, Z. Attachment and culture of dissociated cells from rat embryo cerebral hemispheres on polylysine-coated surface. The Journal of cell biology 62, 540-546 (1974). https://doi.org/10.1083/jcb.62.2.540

      (6) Wang, T. & Rainey, W. E. Human adrenocortical carcinoma cell lines. Mol Cell Endocrinol 351, 58-65 (2012). https://doi.org/10.1016/j.mce.2011.08.041

      (7) Rainey, W. E. et al. Regulation of human adrenal carcinoma cell (NCI-H295) production of C19 steroids. J Clin Endocrinol Metab 77, 731-737 (1993). https://doi.org/10.1210/jcem.77.3.8396576

      (8) Barral, D. C. et al. Functional redundancy of Rab27 proteins and the pathogenesis of Griscelli syndrome. J. Clin. Invest. 110, 247-257 (2002). https://doi.org/10.1172/jci15058

      (9) Ku, K. E., Choi, N. & Sung, J. H. Inhibition of Rab27a and Rab27b Has Opposite Effects on the Regulation of Hair Cycle and Hair Growth. Int. J. Mol. Sci. 21 (2020). https://doi.org/10.3390/ijms21165672

      (10) Johnson, J. L., Monfregola, J., Napolitano, G., Kiosses, W. B. & Catz, S. D. Vesicular trafficking through cortical actin during exocytosis is regulated by the Rab27a effector JFC1/Slp1 and the RhoA-GTPase–activating protein Gem-interacting protein. Mol. Biol. Cell 23, 1902-1916 (2012). https://doi.org/10.1091/mbc.e11-12-1001

      (11) Yu, M. et al. Exophilin4/Slp2-a targets glucagon granules to the plasma membrane through unique Ca2+-inhibitory phospholipid-binding activity of the C2A domain. Mol. Biol. Cell 18, 688-696 (2007). https://doi.org/10.1091/mbc.e06-10-0914

      (12) Kurowska, M. et al. Terminal transport of lytic granules to the immune synapse is mediated by the kinesin-1/Slp3/Rab27a complex. Blood 119, 3879-3889 (2012). https://doi.org/10.1182/blood-2011-09-382556

      (13) Zhao, S., Torii, S., Yokota-Hashimoto, H., Takeuchi, T. & Izumi, T. Involvement of Rab27b in the regulated secretion of pituitary hormones. Endocrinology 143, 1817-1824 (2002). https://doi.org/10.1210/endo.143.5.8823

      (14) Kariya, Y. et al. Rab27a and Rab27b are involved in stimulation-dependent RANKL release from secretory lysosomes in osteoblastic cells. J Bone Miner Res 26, 689-703 (2011). https://doi.org/10.1002/jbmr.268

      (15) Zhao, K. et al. Functional hierarchy among different Rab27 effectors involved in secretory granule exocytosis. eLife 12 (2023). https://doi.org/10.7554/eLife.82821

      (16) Izumi, T. Physiological roles of Rab27 effectors in regulated exocytosis. Endocr J 54, 649-657 (2007). https://doi.org/10.1507/endocrj.kr-78

      (17) Bomba-Warczak, E. & Savas, J. N. Long-lived mitochondrial proteins and why they exist. Trends in cell biology 32, 646-654 (2022). https://doi.org/10.1016/j.tcb.2022.02.001

      (18) Curtin, J. F., Donovan, M. & Cotter, T. G. Regulation and measurement of oxidative stress in apoptosis. Journal of Immunological Methods 265, 49-72 (2002). https://doi.org/10.1016/S0022-1759(02)00070-4

      (19) Criddle, D. N. et al. Menadione-induced Reactive Oxygen Species Generation via Redox Cycling Promotes Apoptosis of Murine Pancreatic Acinar Cells. Journal of Biological Chemistry 281, 40485-40492 (2006). https://doi.org/10.1074/jbc.M607704200

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Turner et al. present an original approach to investigate the role of Type-1 nNOS interneurons in driving neuronal network activity and in controlling vascular network dynamics in awake head-fixed mice. Selective activation or suppression of Type-1 nNOS interneurons has previously been achieved using chemogenetics, optogenetics, or local pharmacology. Here, the authors took advantage of the fact that Type-1 nNOS interneurons are the only cortical cells that express the tachykinin receptor 1 to ablate them with a local injection of saporin conjugated to substance P (SP-SAP). SP-SAP causes cell death in 90% of Type-1 nNOS interneurons without affecting microglia, astrocytes, and neurons. The authors report that the ablation has no major effects on sleep or behavior. Refining the analysis by scoring neural and hemodynamic signals with electrode recordings, calcium signal imaging, and wide-field optical imaging, the authors observe that Type-1 nNOS interneuron ablation does not change the various phases of the sleep/wake cycle. However, it does reduce low-frequency neural activity, irrespective of the classification of arousal state. Analyzing neurovascular coupling using multiple approaches, they report small changes in resting-state neural-hemodynamic correlations across arousal states, primarily mediated by changes in neural activity. Finally, they show that Type-1 nNOS interneurons play a role in controlling interhemispheric coherence and vasomotion.

      In conclusion, these results are interesting, use state-of-the-art methods, and are well supported by the data and their analysis. I have only a few comments on the stimulus-evoked haemodynamic responses, and these can be easily addressed.

      We thank the reviewer for their positive comments on our work.

      Reviewer #2 (Public review):

      Summary:

      This important study by Turner et al. examines the functional role of a sparse but unique population of neurons in the cortex that express Nitric oxide synthase (Nos1). To do this, they pharmacologically ablate these neurons in the focal region of whisker-related primary somatosensory (S1) cortex using a saporin-substance P conjugate. Using widefield and 2-photon microscopy, as well as field recordings, they examine the impact of this cell-specific lesion on blood flow dynamics and neuronal population activity. Locally within the S1 cortex, they find changes in neural activity patterns, decreased delta band power, and reduced sensory-evoked changes in blood flow (specifically eliminating the sustained blood flow change after stimulation). Surprisingly, given the tiny fraction of cortical neurons removed by the lesion, they also find far-reaching effects on neural activity patterns and blood volume oscillations between the cerebral hemispheres.

      Strengths:

      This was a technically challenging study and the experiments were executed in an expert manner. The manuscript was well written and I appreciated the cartoon summary diagrams included in each figure. The analysis was rigorous and appropriate. Their discovery that Nos1 neurons can have far-reaching effects on blood flow dynamics and neural activity is quite novel and surprising (to me at least) and should seed many follow-up, mechanistic experiments to explain this phenomenon. The conclusions were justified by the convincing data presented.

      Weaknesses:

      I did not find any major flaws in the study. I have noted some potential issues with the authors' characterization of the lesion and its extent. The authors may want to re-analyse some of their data to further strengthen their conclusions. Lastly, some methodological information was missing, which should be addressed.

      We thank the reviewer for their enthusiasm for our work.

      Reviewer #3 (Public review):

      The role of type-I nNOS neurons is not fully understood. The data presented in this paper addresses this gap through optical and electrophysiological recordings in adult mice (awake and asleep).

      This manuscript reports on a study on type-I nNOS neurons in the somatosensory cortex of adult mice, from 3 to 9 months of age. Most data were acquired using a combination of IOS and electrophysiological recordings in awake and asleep mice. Pharmacological ablation of the type-I nNOS populations of cells led to decreased coherence in gamma band coupling between left and right hemispheres; decreased ultra-low frequency coupling between blood volume in each hemisphere; decreased (superficial) vascular responses to sustained sensory stimulus and abolishment of the post-stimulus CBV undershoot. While the findings shed new light on the role of type-I nNOS neurons, the etiology of the discrepancies between current observations and literature observations is not clear and many potential explanations are put forth in the discussion.

      We thank the reviewer for their comments.

      Reviewer #1 (Recommendations for the authors):  

      (1) Figure 3, Type-1 nNOS interneuron ablation has complex effects on neural and vascular responses to brief (0.1 s) and prolonged (5 s) whisker stimulation. During 0.1 s stimulation, ablation of type 1 nNOS cells does not affect the early HbT response but only reduces the undershoot. What is the pan-neuronal calcium response? Is the peak enhanced, as might be expected from the removal of inhibition? The authors need to show the GCaMP7 trace obtained during this short stimulation.

      Unfortunately, we did not perform brief stimulation experiments in GCaMP-expressing mice. As we did not see a clear difference in the amplitude of the stimulus-evoked response with our initial electrophysiology recordings (Fig. 3a), we suspected that an effect might be visible with longer duration stimuli and thus pivoted to a pulsed stimulation over the course of 5 seconds for the remaining cohorts. It would have been beneficial to interweave short-stimulus trials for a direct comparison between the complementary experiments, but we did not do this.

      During 5s stimulation, both the early and delayed calcium/vascular responses are reduced. Could the authors elaborate on this? Does this mean that increasing the duration of stimulation triggers one or more additional phenomena that are sensitive to the ablation of type 1 nNOS cells and mask what is triggered by the short stimulation? Are astrocytes involved? How do they interpret the early decrease in neuronal calcium?

      As our findings show that ablation reduces the calcium/vascular response more prominently during prolonged stimulation, we do suspect that this is due to additional NO-dependent mechanisms or downstream responses. NO is a modulator of neural activity, generally increasing excitability (Kara and Friedlander 1999, Smith and Otis 2003), so any manipulation that changes NO levels will change (likely decrease) the excitability of the network, potentially resulting in a smaller hemodynamic response to sensory stimulation secondary to this decrease. While short stimuli engage rapid neurovascular coupling mechanisms, longer duration (>1s) stimulation could introduce additional regulatory elements, such as astrocytes, that operate on a slower time scale. In Author response image 1, we show a comparison of the control groups from Fig. 3a and 3b plotted together, with vertical bars aligned to the peak. During the 5s stimulation, the time-to-peak is roughly 830 milliseconds later than the 0.1s stimulation, meaning it’s plausible that the signals don’t separate until later. Our interpretation is that the NVC mechanisms responsible for brief stimulus-evoked change are either NO-independent or are compensated for in the SSP-SAP group by other means due to the chronic nature of the ablation.

      We have added the following text to the Discussion (Line 368): “Loss of type-I nNOS neurons drove minimal changes in the vasodilation elicited by brief stimulation, but led to decreased vascular responses to sustained stimulation, suggesting that the early phase of neurovascular coupling is not mediated by these cells, consistent with the multiple known mechanisms for neurovascular coupling (Attwell et al 2010, Drew 2019, Hosford & Gourine 2019) acting through both neurons and astrocytes with multiple timescales (Le Gac et al 2025, Renden et al 2024, Schulz et al 2012, Tran et al 2018).”

      Author response image 1.

      (2) In Figures 4d and e, it is unclear to me why the authors use brief stimulation to analyze the relationship between HbT and neuronal activity (gamma power) and prolonged stimulation for the relationship between HbT and GCaMP7 signal. Could they compare the curves with both types of stimulation?

      As discussed previously, we did not use the same stimulation parameters across cohorts. The mice with implanted electrodes received only brief stimulation, while those undergoing calcium imaging received longer-duration stimuli.

      Reviewer #2 (Recommendations for the authors):

      (1) Results, how far-reaching is the cell-specific ablation? Would it be possible to estimate the volume of the cortex where Nos1 cells are depleted based on histology? Were there signs of neuronal injury more remotely, for example, beading of dendrites?

      We regularly see a region of cell ablation 1-2 mm in diameter within the somatosensory cortex of each animal, which is consistent with the spread of small molecules. Ribosome inactivating proteins like SAP are smaller than AAVs (~5 nm compared to ~25 nm in diameter) and thus diffuse slightly further. We observed no obvious indication of neuronal injury more remotely or in other brain regions, but we did not image or characterize dendritic beading, as this would require a sparse labeling of neurons to clearly see dendrites (NeuN only stains the cell body). Our histology shows no change in cell numbers.

      We have added the following text to the Results (Line 124): “Immunofluorescent labeling in mice injected with Blank-SAP showed labeling of nNOS-positive neurons near the injection site. In contrast, mice injected with SP-SAP showed a clear loss in nNOS-labeling, with a typical spread of 1-2 mm from the injection site, though nNOS-positive neurons both subcortically and in the entirety of the contralateral hemisphere remaining intact.”

      (2) For histological analysis of cell counts after the lesion, more information is needed. How was the region of interest for counting cells determined (e.g. 500 µm radius from needle/pipette tract?) and what volume was analysed?

      The region of interest for both SSP-SAP and Blank SAP injections was a 1 mm diameter circle centered around the injection site and averaged across sections (typically 3-5 when available). In most animals, the SSP-SAP had a lateral spread greater than 500 microns and encompassed the entire depth of cortex (1-1.5 mm in SI, decreasing in the rostral to caudal direction). The counts within the 1 mm diameter ROI were averaged across sections and then converted into cells per mm² as presented. Note the consistent decrease in type I nNOS cells seen across mice in Fig 1d, Fig S1b.

      We have added the following text in the Materials & Methods (Line 507): “The region of interest for analysis of cell counts was determined based on the injection site for both SP-SAP and Blank SAP injections, with a 1 mm diameter circle centered around the injection site and averaged across 3-5 sections where available. In most animals, the SP-SAP had a lateral spread greater than 500 microns and encompassed the entire depth of cortex (1-1.5 mm in SI).”
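      As an illustration of the conversion from counts to density, the arithmetic can be sketched as below. This is a minimal sketch under stated assumptions: the ROI is treated as a flat circle of 1 mm diameter, counts are averaged across sections before dividing by the ROI area, and the function names and example counts are hypothetical rather than taken from our analysis code or data.

```python
import math


def roi_area_mm2(diameter_mm=1.0):
    """Area of a circular region of interest, in mm^2."""
    return math.pi * (diameter_mm / 2.0) ** 2


def cell_density_per_mm2(counts_per_section, diameter_mm=1.0):
    """Average cell counts across sections and convert to cells per mm^2."""
    if not counts_per_section:
        raise ValueError("at least one section is required")
    mean_count = sum(counts_per_section) / len(counts_per_section)
    return mean_count / roi_area_mm2(diameter_mm)


# Hypothetical counts from three sections of one animal.
density = cell_density_per_mm2([4, 6, 5])
```

      A 1 mm diameter circle has an area of about 0.785 mm², so a mean of 5 cells per section corresponds to roughly 6.4 cells per mm² in this example.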

      (3) Based on Supplementary Figure 1, it appears that the Saponin conjugate not only depletes Nos neurons but also may affect vascular (endothelial perhaps) Nos expression. Some quantification of this effect and its extent may be insighIul in terms of ascribing the effects of the lesion directly on neurons vs indirectly and perhaps more far-reaching via vascular/endothelial NOS.

      Thank you for this comment. While this is a possibility, while we have found that the high nNOS expression of type-I nnoos neurons makes NADPH diaphorase a good stain for detecting them, it is less useful for cell types that expres NOS at lower levels.  We have found that the absolute intensity of NADPH diaphorase staining is somewhat variable from section to section. Variability in overall NADPH diaphorase intensity is likely due to several factors, such as duration of staining, thickness of the section, and differences in PFA concentration within the tissue and between animals. As NADPH diaphorase staining is highly sensitive to amount PFA exposure, any small differences in processing could affect the intensity, and slight differences in perfusion quality and processing could account. A second, perhaps larger issue could be due to differences in the number of arteries (which will express NOS at much higher levels than veins, and thus will appear darker) in the section. We did not stain for smooth muscle and so cannot differentiate arteries and veins.  Any difference in vessel intensity could be due to random variations in the numbers of arteries/veins in the section. While we believe that this is a potentially interesting question, our histological experiments were not able to address it.

      (4) The assessment for inflammation took place 1 month after the lesion, but the imaging presumably occurred ~ 2 weeks after the lesion. Note that it seemed somewhat ambiguous as to when approximately, the imaging, and electrophysiology experiments took place relative to the induction of the lesion. Presumably, some aspects of inflammation and disruption could have been missed, at the time when experiments were conducted, based on this disparity in assessment. The authors may want to raise this as a possible limitation.

      We apologize for our unclear description of the timeline. We began imaging experiments at least 4 weeks after ablation, the same time frame as when we performed our histological assays.

      We have added the following text to the Discussion (Line 379): “With imaging beginning four weeks after ablation, there could be compensatory rewiring of local and/or network activity following type-I nNOS ablation, where other signaling pathways from the neurons to the vasculature become strengthened to compensate for the loss of vasodilatory signaling from the type-I nNOS neurons.”

      (5) Results Figure 2, please define "P or delta P/P". Also, for Figure 2c-f, what do the black vertical ticks represent?

      ∆P/P is the change in the gamma-band power relative to the resting-state baseline, and black tick marks indicate binarized periods of vibrissae motion (‘whisking’). We have clarified this in Figure caption 2 (Line 174).
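As a minimal sketch of the two quantities defined above, the following Python computes a fractional power change and a binarized movement trace. The function names, the scalar resting baseline, and the fixed threshold are illustrative assumptions, not the authors' analysis code.

```python
def fractional_power_change(power, resting_baseline):
    """Return dP/P: the change in band power relative to a resting baseline.

    `resting_baseline` is assumed here to be a single positive number
    (for example, the mean gamma-band power during rest).
    """
    return [(p - resting_baseline) / resting_baseline for p in power]


def binarize_motion(motion, threshold):
    """Mark samples where the motion signal exceeds a threshold (1) or not (0).

    The threshold is a hypothetical parameter; the source does not state
    how whisking periods were detected.
    """
    return [1 if m > threshold else 0 for m in motion]
```

A value of 0 means power equal to the resting baseline, and a value of 1 means a doubling of power.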

      (6) Figure 3b-e, is there not an undershoot (eventually) after 5s of stimulation that could be assessed?

      Previous work has shown that there is no undershoot in response to whisker stimulations of a few seconds (Drew, Shih, Kleinfeld, PNAS, 2011). The undershoot for brief stimuli happens within ~2.5 s of the onset/cessation of the brief stimulation; this is clearly lacking in the response to the 5 s stimulation (Fig. 3). The neurovascular coupling mechanisms recruited during the short stimulation are different from those recruited during the long stimulus, making a comparison of the undershoot between the two stimulation durations problematic.

      For Figures 3e and 6 how was surface arteriole diameter or vessel tone measured? 2P imaging of fluorescent dextran in plasma? Please add the experimental details of 2P imaging to the methods. Including some 2P images in the figures couldn't hurt to help the reader understand how these data were generated.

      We have added details about our 2-photon imaging (FITC-dextran, full-width at half-maximum calculation for vessel diameter) as well as a trace and vessel image to Figure 2.

      We have added the following text to the Materials & Methods (Line 477): “In two-photon experiments, mice were briefly anesthetized and retro-orbitally injected with 100 µL of 5% (weight/volume) fluorescein isothiocyanate–dextran (FITC) (FD150S, Sigma-Aldrich, St. Louis, MO) dissolved in sterile saline.”

      We have added the following text to the Materials & Methods (Line 532): “A rectangular box was drawn around a straight, evenly-illuminated vessel segment and the pixel intensity was averaged along the long axis to calculate the vessel’s diameter from the full-width at half-maximum (https://github.com/DrewLab/Surface-Vessel-FWHM-Diameter; (Drew, Shih et al. 2011)).”
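The full-width at half-maximum calculation described above can be sketched as follows. This is an illustrative stdlib-only Python version written under stated assumptions (a bright vessel on a dark background, rows running along the vessel's long axis, linear interpolation at the half-maximum crossings); it is not the code in the linked repository, and all function names are hypothetical.

```python
def average_along_vessel(box):
    """Average a rectangular box of pixel intensities along the vessel's long axis.

    `box` is a list of rows; each row is one cross-section of the vessel.
    Returns a single intensity profile across the vessel.
    """
    n_rows = len(box)
    return [sum(column) / n_rows for column in zip(*box)]


def fwhm_diameter(profile, pixel_size_um):
    """Estimate the diameter (in µm) as the full-width at half-maximum of a profile."""
    baseline, peak = min(profile), max(profile)
    if peak == baseline:
        raise ValueError("flat profile: no vessel edge to measure")
    half = baseline + 0.5 * (peak - baseline)

    def crossing(i, j):
        # Linearly interpolate where the profile equals `half` between samples i and j.
        yi, yj = profile[i], profile[j]
        return i + (half - yi) / (yj - yi) * (j - i)

    above = [i for i, value in enumerate(profile) if value >= half]
    left, right = above[0], above[-1]
    left_edge = crossing(left - 1, left) if left > 0 else float(left)
    right_edge = crossing(right + 1, right) if right < len(profile) - 1 else float(right)
    return (right_edge - left_edge) * pixel_size_um
```

Averaging along the long axis before measuring the width reduces pixel noise, which is presumably why a straight, evenly illuminated segment is required.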

      (7) Did the authors try stimulating other body parts (eg. limb) to estimate how specific the effects were, regionally? This is more of a curiosity question that the authors could comment on, I am not recommending new experiments.

      We did measure changes in [HbT] in the FL/HL representation of SI during locomotion (Line 205), which is known to increase neural activity in the somatosensory cortex (Huo, Smith and Drew, Journal of Neuroscience, 2014; Zhang et al., Nature Communications 2019). We observed a similar but not statistically significant trend of decreased [HbT] in SP-SAP compared to control. This may have been due to the sphere of influence of the ablation being centered on the vibrissae representation and not having fully encompassed the limb representation. We agree with the referee that it would be interesting to characterize these effects on other sensory regions as well as brain regions associated with tasks such as learning and behavior.

      (8) Regarding vasomotion experiments, are there no other components of this waveform that could be quantified beyond just variance? Amplitude, frequency? Maybe these don't add much but would be nice to see actual traces of the diameter fluctuations. Further, where exactly were widefield-based measures of vasomotion derived from? From some seed pixel or ~1mm ROI in the center of the whisker barrel cortex? Please clarify.

      The reviewer’s point is well taken. We have added power spectra of the resting-state data, which provide amplitude and frequency information. The integrated area under the curve of the power spectrum is equal to the variance. Widefield-based measures of vasomotion were taken from the 1 mm ROI in the center of the whisker barrel cortex.

      We have added the following text to the Materials & Methods (Line 560): “Variance during the resting-state for both ∆[HbT] and diameter signals (Fig. 7) was taken from resting-state events lasting ≥10 seconds in duration. Average ∆[HbT] from within the 1 mm ROI over the vibrissae representation of SI during each arousal state was taken with respect to awake resting baseline events ≥10 seconds in duration.” 
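The statement that the area under the power spectrum equals the variance is Parseval's theorem. The sketch below checks the discrete version with a plain periodogram in stdlib Python. The spectral estimator actually used for the figures is not specified here, so the periodogram and its normalization are assumptions chosen only to make the identity explicit.

```python
import cmath
import math


def population_variance(signal):
    """Variance of the signal around its mean (divides by n)."""
    n = len(signal)
    mean = sum(signal) / n
    return sum((value - mean) ** 2 for value in signal) / n


def periodogram(signal):
    """Two-sided periodogram of the mean-subtracted signal.

    Each bin is |X_k|^2 / n^2, so the bins sum to the population variance.
    Uses a direct O(n^2) DFT, which is fine for short illustrative signals.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [value - mean for value in signal]
    spectrum = []
    for k in range(n):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n ** 2)
    return spectrum
```

For any finite signal, `sum(periodogram(x))` matches `population_variance(x)` up to floating-point error, so a drop in variance corresponds to a drop in total spectral power.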

      (9) On page 13, the title seems a bit strong. The data show a change in variance but that does not necessarily mean a change in absolute amplitude. Also, I did not see any reports of absolute vessel widths between groups from 2P experiments so any difference in the sampling of larger vs smaller arterioles could have affected the variance (ie. % changes could be much larger in smaller arterioles).

      We have updated the title of Figure 7 to specifically state power (which is equivalent to the variance) rather than amplitude (Line 331). We have also added absolute vessel widths to the Results (Line 340): “There was no difference in resting-state (baseline) diameter between the groups, with Blank-SAP having a diameter of 24.4 ± 7.5 μm and SP-SAP having a diameter of 23.0 ± 9.4 μm (ttest, p = 0.61).”

      (10) Big picture question. How could a manipulation that affects so few cells in 1 hemisphere (below 0.5% of total neurons in a region comprising 1-2% of the volume of one hemisphere) have such profound effects in both hemispheres? The authors suggest that some may have long-range interhemispheric projections, but that is presumably a fraction of the already small fraction of Nos1 neurons. Perhaps these neurons have specialized projections to subcortical brain nuclei (Nucleus Basalis, Raphe, Locus Coeruleus, reticular thalamus, etc) that then project widely to exert this outsized effect? Has there not been a detailed anatomical characterization of their efferent projections to cortical and sub-cortical areas? This point could be raised in the discussion.

      We apologize for the lack of clarity of our work on this point. We would like to clarify that the only analysis showing a change involving the unablated hemisphere is the coherence/correlation analysis between the two hemispheres. Other metrics (LFP power and CBV power spectra) do not change in the hemisphere contralateral to the injection site, as we show in data added in two supplementary figures (Figs. S4 and S7). The coherence/correlation is a measure of the correlated dynamics in the two hemispheres. For this metric to change, there only needs to be a change in the dynamics of one hemisphere relative to the other. If some aspects of the synchronization of neural and vascular dynamics across hemispheres are mediated by concurrent activation of type-I nNOS neurons in both hemispheres, ablating them in one hemisphere will decrease synchrony. It is possible that type-I nNOS neurons make some subcortical projections that were not reported in previous work (Tomioka 2005, Ruff 2024), but if these exist they are likely to be very small in number as they were not noted.

      We have added the text in the Results (Line 228): “In contrast to the observed reductions in LFP in the ablated hemisphere, we noted no gross changes in the power spectra of neural LFP in the unablated hemisphere (Fig. S7) or power of the cerebral blood volume fluctuations in either hemisphere (Fig. S4).”

      And in the Results (Line 335): “The variance in ∆[HbT] during rest, a measure of vasomotion amplitude, was significantly reduced following type-I nNOS ablation (Fig. 7a), dropping from 40.9 ± 3.4 μM² in the Blank-SAP group (N = 24, 12M/12F) to 23.3 ± 2.3 μM² in the SP-SAP group (N = 24, 11M/13F) (GLME p = 6.9×10⁻⁵) with no significant difference in the unablated hemisphere (Fig. S7).”
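The argument in this reply, that interhemispheric correlation can fall when only one hemisphere changes, can be illustrated with a toy simulation. The model below (a shared drive plus independent noise in each hemisphere, with a reduced gain on the shared drive in one hemisphere) and every parameter value in it are hypothetical and are not fitted to any data in the manuscript.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def bilateral_correlation(shared_gain_ablated, n_samples=20000, seed=0):
    """Correlate an intact and a manipulated hemisphere in a toy model.

    Both hemispheres receive the same shared drive plus independent noise.
    Only the manipulated hemisphere's gain on the shared drive is changed.
    """
    rng = random.Random(seed)
    intact, manipulated = [], []
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)
        intact.append(shared + rng.gauss(0.0, 1.0))
        manipulated.append(shared_gain_ablated * shared + rng.gauss(0.0, 1.0))
    return pearson(intact, manipulated)
```

In this toy model the intact hemisphere's own statistics are identical in both conditions, yet `bilateral_correlation(0.3)` is lower than `bilateral_correlation(1.0)`.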

      Reviewer #3 (Recommendations for the authors):

      (1) The reporting would be greatly strengthened by following ARRIVE guidelines 2.0: https://arriveguidelines.org/: attrition rates and source of attrition, justification for the use of 119 (beyond just consistent with previous studies), etc.

      We performed a power analysis prior to our study aiming to detect a physiologically-relevant effect size of (Cohen’s d) = 1.3, or 1.3 standard deviations from the mean. Alpha and Power were set to the standard 0.05 and 0.80 respectively, requiring around 8 mice per group (SP-SAP, Blank, and for histology, naïve animals) for multiple independent groups (ephys, GCaMP, histology). To potentially account for any attrition due to failures in Type-I nNOS neuron ablation or other problems (such as electrode failure or window issues) we conservatively targeted a dozen mice for each group. Of mice that were imaged (1P/2P), two SP-SAP mice were removed from the dataset (24 SP-SAP remaining) post-histological analysis due to not showing ablation of nNOS neurons, an attrition rate of approximately 8%.

      We have added the following text to the Materials & Methods (Line 441): “Sample sizes are consistent with previous studies (Echagarruga et al 2020, Turner et al 2023, Turner et al 2020, Zhang et al 2021) and based on a power analysis requiring 8-10 mice per group (Cohen’s d = 1.3, α = 0.05, (1 - β) = 0.800). Experimenters were not blind to experimental conditions or data analysis except for histological experiments. Two SP-SAP mice were removed from the imaging datasets (24 SP-SAP remaining) due to not showing ablation of nNOS neurons during post-histological analysis, an attrition rate of approximately 8%.”
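For readers who want to reproduce the order of magnitude of these numbers, a rough sketch follows. It uses the textbook normal approximation for a two-group comparison, which is an assumption: the response does not state which test or software the power analysis used, and an exact t-distribution calculation typically gives slightly larger values. The attrition denominator (removed plus remaining SP-SAP imaging mice) is likewise an assumption.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate animals per group needed to detect a given Cohen's d.

    Normal approximation: n = 2 * ((z_alpha + z_beta) / d) ** 2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def attrition_rate(removed, remaining):
    """Fraction of animals removed, relative to all animals that entered the dataset."""
    return removed / (removed + remaining)
```

Under this approximation, d = 1.3 with α = 0.05 and power 0.80 gives 10 per group for a two-sided test and 8 for a one-sided test, which brackets the quoted 8-10 range, and `attrition_rate(2, 24)` is about 0.077.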

      (2) Intro, line 38: Description of the importance of neurovascular coupling needs improvement. Coordinated haemodynamic activity is vital for maintaining neuronal health and the energy levels needed.

      We have added a sentence to the introduction (Line 41): “Neurovascular coupling plays a critical role in supporting neuronal function, as tightly coordinated hemodynamic activity is essential for meeting energy metabolism and maintaining brain health (Iadecola et al 2023, Schaeffer & Iadecola 2021).”

      (3) Given the wide range of mice ages, how was the age accounted for/its effects examined?

      Previous work from our lab has shown that there is no change in hemodynamic responses in awake mice over a wide range of ages (2-18 months), so the age range we used (between 3 and 9 months of age) should not impact this.

      We have added the following text in the Results (Line 437): “Previous work from our lab has shown that the vasodilation elicited by whisker stimulation is the same in 2–4-month-old mice as in 18-month-old mice (Bennett, Zhang et al. 2024). As the age range used here is spanned by this time interval, we would not expect any age-related differences.”

      (4) How was the susceptibility of low-frequency neuronal coupling signals to noise managed? How were the low-frequency bands results validated?

      We are not sure what the referee is asking here. Our electrophysiology recordings were made differentially using stereotrodes with tips separated by ~100 µm, which provides excellent common-mode rejection of noise and a localized LFP signal. Previous publications from our lab (Winder et al., Nature Neuroscience 2017; Turner et al., eLife 2020) and others (Tu, Cramer, Zhang, eLife 2024) have repeatedly shown that there is a very weak correlation between the power in the low frequency bands and hemodynamic signals, so our results are consistent with this previous work.

      (5) It would be helpful to demonstrate the selectivity of cell *death* (as opposed to survival) induced by SP-SAP injections via assessments using markers of cell death.

      We agree that this would be a helpful complement to our histological studies, which show loss of type-I nNOS neurons but no loss of other cells and minimal inflammation with SP-saporin injections. However, we did not perform histology looking at cell death, only at surviving cells, given that we see no obvious inflammation or cell loss, which would be triggered by nonspecific cell death. Previous work has established that saporin is cytotoxic and specific only to cells that internalize it. Internalization of saporin causes cell death via apoptosis (Bergamaschi, Perfetti et al. 1996), and the substance P receptor is internalized when the receptor is bound (Mantyh, Allen et al. 1995). Saporin treatment generates cellular debris that is phagocytosed by microglia, consistent with cell death (Seeger, Hartig et al. 1997). While it is possible that SP-saporin treatment causes type-I nNOS neurons to stop expressing nitric oxide synthase (which would make them disappear from our IHC staining), we think that this is unlikely given that the literature shows internalized saporin is clearly cytotoxic.

      We have added the following text to the Results (Line 131): “It is unlikely that the disappearance of type-I nNOS neurons is because they stopped expressing nNOS, as internalized saporin is cytotoxic. Exposure to SP-conjugated saporin causes rapid internalization of the SP receptor-ligand complex (Mantyh, Allen et al. 1995), and internalized saporin causes cell death via apoptosis (Bergamaschi, Perfetti et al. 1996). In the brain, the resulting cellular debris from saporin administration is then cleared by microglia phagocytosis (Seeger, Hartig et al. 1997).”

      (6) Was the decrease in inter-hemispheric correlation associated with any changes to the corpus callosum?

      We noted no gross changes to the structure of the corpus callosum in any of our histological reconstructions following SP-SAP administration; however, we did not specifically test for this. Again, as we note in our reply to reviewer 2, the decrease in interhemispheric synchronization does not imply that there are changes in the corpus callosum and could be mediated by the changes in neural activity in the hemisphere in which the type-I nNOS neurons were ablated.

      (7) How were automated cell counts validated?

      Criteria used for automated cell counts were validated with comparisons of manual counting as described in previous literature. We have added additional text describing the process in the Materials & Methods (Line 510): “For total cell counts, a region of interest (ROI) was delineated, and cells were automatically quantified under matched criteria for size, circularity and intensity. Image threshold was adjusted until absolute value percentages were between 1-10% of the histogram density. The function Analyze Particles was then used to estimate the number of particles with a size of 100-99999 pixels^2 and a circularity between 0.3 and 1.0 (Dao, Suresh Nair et al. 2020, Smith, Anderson et al. 2020, Sicher, Starnes et al. 2023). Immunoreactivity was quantified as mean fluorescence intensity of the ROI (Pleil, Rinker et al. 2015).”
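The size and circularity criteria quoted above can be expressed as a small filter. This is a hedged sketch in stdlib Python, not the ImageJ implementation: it assumes each candidate particle has already been segmented and summarized by an area (in pixels²) and a perimeter (in pixels), and it uses the usual circularity definition 4π·area/perimeter². All function names are hypothetical.

```python
import math


def circularity(area, perimeter):
    """Circularity = 4*pi*area / perimeter**2; 1.0 for a perfect circle."""
    value = 4.0 * math.pi * area / perimeter ** 2
    # Clamp to 1.0 to guard against floating-point or pixelation overshoot.
    return min(value, 1.0)


def passes_criteria(area, perimeter, min_area=100, max_area=99999,
                    min_circularity=0.3, max_circularity=1.0):
    """Apply the size (pixels^2) and circularity limits quoted in the Methods text."""
    if not (min_area <= area <= max_area):
        return False
    return min_circularity <= circularity(area, perimeter) <= max_circularity


def count_cells(particles):
    """Count (area, perimeter) pairs that satisfy both criteria."""
    return sum(1 for area, perimeter in particles if passes_criteria(area, perimeter))
```

A round profile of radius 10 pixels passes both limits, whereas a 100 × 2 pixel streak is rejected on circularity and a round profile of radius 3 pixels is rejected on size.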

      (8) Given the weighting of the vascular IOS readout to the superficial tissue, it is important to qualify the extent of the hemodynamic contrast, ie the limitations of this readout.

      We have added the following text to the Discussion (Line 385): “Intrinsic optical signal readout is primarily weighted toward superficial tissue given the absorption and scattering characteristics of the wavelengths used. While surface vessels are tightly coupled with neural activity, it is still a matter of debate whether surface or intracortical vessels are a more reliable indicator of ongoing activity (Goense et al 2012; Huber et al 2015; Poplawsky & Kim 2014).”

      (9) Partial decreases observed through type-I nNOS neuronal ablation suggest other factors also play a role in regulating neural and vascular dynamics: data presented thus do *not* "indicate disruption of these neurons in diseases ranging from neurodegeneration to sleep disturbances," as currently stated. Please revise.

      We agree with the reviewer. We have changed the abstract sentence to read (Line 30): “This demonstrates that a small population of nNOS-positive neurons are indispensable for regulating both neural and vascular dynamics in the whole brain, raising the possibility that loss of these neurons could contribute to the development of neurodegenerative diseases and sleep disturbances.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This work starts with the observation that embryo polarization is asynchronous starting at the early 8-cell stage, with early polarizing cells being biased towards producing the trophectoderm (TE) lineage. The authors further found that reduced CARM1 activity and upregulation of its substrate BAF155 promote early polarization and TE specification. This evidence connects to the previous finding that Carm1 heterogeneity at the 4-cell stage guides later cell lineages, with the higher Carm1-expressing blastomeres biased towards the ICM lineage. Thus, this work provides a link between asymmetries at the 4-cell stage and polarization at the 8-cell stage, providing a cohesive explanation regarding the first lineage allocation in mouse embryos.

      Strengths:

      In addition to what has been put in the summary, the advanced 3D image-based analysis has found that early polarization is associated with a change in cell geometry in blastomeres, regarding the ratio of the long axis to the short axis. This is considered a new observation that has not been identified.

      Weaknesses:

      Although the microinjection-based method of overexpressing/deleting proteins has been shown to be effective in early embryos and is widely used, it may not fully represent the in vivo situation in some cases, compared to other strategies such as the use of knock-in mice. This is a minor weakness; it would be good to include some sentences in the discussion on the potential caveats.

      We thank the reviewer for their insightful summary of our work, and their adjudication on the novelty of our research. We agree with the reviewer that microinjection-based methods, whilst being the standard and widely used in the field, have their weaknesses. In this study, we have primarily used microinjection of previously tested and known constructs which may help mitigate these concerns, and have referenced numerous studies in which these constructs have been used and tested. Nevertheless, the authors are aware of this drawback and have tried to address this previously in other research using novel artificial intelligence techniques (Shen and Lamba et al., 2022 – cited in the manuscript) and this continues to be an active area of investigation for us.

      Reviewer #2 (Public review):

      Summary:

      In this study, Lamba and colleagues suggest a molecular mechanism to explain cell heterogeneity in cell specification during pre-implantation development. They show that embryo polarization is asynchronous. They propose that reduced CARM1 activity and upregulation of its substrate BAF155 promote early polarization and trophectoderm specification.

      Strengths:

      The authors use appropriate and validated methodology to address their scientific questions. They also report excellent live imaging. Most of the data are accompanied by careful quantifications.

      Weaknesses:

      I think this manuscript requires some more quantification, an increased number of embryos in the evaluations, and a clear statement of the number of embryos evaluated per experiment.

      We thank the reviewer for these thoughtful comments on our work, their kind assessment of the strength of our research, and their notes on the weaknesses. We have replied to their points raised below.

      Here are some points:

      (1) It should be clearly stated in all figure legends and in the text how many cells from how many embryos were analyzed.

      We appreciate this suggestion to provide detailed quantification for every experiment in the paper and to state the numbers of embryos (for whole-embryo experiments) or blastomeres used for statistical tests and displayed in the graphs.

      (2) I think that the number of embryos sometimes are too low. These are mouse embryos easily accessible and the methods used are well established in this lab, so the authors should make an effort to have at least 10/15 embryos per experiment. For example: "In agreement with this, hybridization chain reaction (HCR) RNA fluorescence in situ hybridization of early 8-cell stage embryos revealed that the number of CDX2 mRNA puncta was higher in polarized blastomeres with a PARD6-positive apical domain than in unpolarized blastomeres, for 5 out of 6 embryos with EP cells (Figure 3A, B)." Similarly, for the data in Figure 4, we know how many cells were analyzed but not how many embryos.

      We appreciate the reviewer’s comment regarding the number of embryos used in the hybridization chain reaction (HCR) experiment. We agree that increasing the number of embryos could, in principle, further add statistical power. However, both first authors have since left the lab to begin their postdoctoral training or join a company, and it is not feasible for us to generate additional embryos at this stage.

      Importantly, we believe the number of embryos included in the current manuscript is sufficient to support our conclusions, especially when considered in the context of the broader experimental design, the timing of the study, and our ethical commitment to minimizing animal use.

      Notably, the initial HCR experiment targeting Cdx2 mRNA served as a key indication that prompted further investigation of CDX2 at the protein level. These follow-up experiments were conducted with increased numbers of embryos and/or cells and are presented in Figure 3 and the associated supplementary figures (we now have 124 cells (including 23 EP cells) from 16 embryos), thereby strengthening and confirming the conclusion suggested by the HCR data.

      (3) It would be useful to see in Figure 4 an example of asymmetric cell division as done for symmetric cell division in panel 4B. This could really help the reader to understand how the authors assessed this.

      We used live imaging to track cell division patterns. Cells expressing RFP-tagged polarity proteins were observed during division to identify the resulting daughter cells. Immediately after cytokinesis, we assessed the polarity status of each daughter cell. If both daughter cells were polarized, the division was classified as symmetric; if only one was polarized, it was classified as asymmetric.

      Author response image 1.

      8-cell stage embryos expressing Ezrin-RFP (fire colour) were imaged during the 8-16 cell stage division. Top panel arrows indicate a symmetric cell division in which the polarity domain was partitioned into both daughter cells; the bottom panel indicates an asymmetric division in which the polarity domain was inherited by only one of the two daughter cells.
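The scoring rule described in this reply (both daughters polarized is symmetric, exactly one is asymmetric) can be written as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the "unclassified" label for divisions with no polarized daughter is an assumption, since the response does not say how such cases were handled.

```python
from collections import Counter


def classify_division(daughter_a_polarized, daughter_b_polarized):
    """Classify one division from the polarity status of its two daughters."""
    if daughter_a_polarized and daughter_b_polarized:
        return "symmetric"
    if daughter_a_polarized != daughter_b_polarized:
        return "asymmetric"
    # Neither daughter polarized: not covered by the rule in the text.
    return "unclassified"


def tally_divisions(divisions):
    """Count division classes over a list of (daughter_a, daughter_b) booleans."""
    return Counter(classify_division(a, b) for a, b in divisions)
```

Scoring immediately after cytokinesis, as described, keeps later de novo polarization of a daughter cell from being mistaken for inheritance of the parental domain.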

      (4) Figure 5C there is a big disproportion of the number of EP and LP identified. Could the authors increase the number of embryos quantified and see if they can increase EP numbers?

      We thank the reviewer for this comment and want to clarify an important detail: EP cells are a phenomenon with an average cellular frequency of less than 10%, compared to LP cells (the other 90%). Therefore, when investigating natural embryo development without bias or exclusion, there will likely be an imbalance in the number of EP and LP cells, as is the case for Figure 5C. In this case, morphological differences and clear statistical significance were seen between the shape of EP and LP cells within the cells quantified, and therefore we decided not to expend further mice for this particular experiment. We agree with the comment, however, that in most cases additional embryos would help strengthen our conclusions further.

      (5) Could the authors give more details about how they mount the embryos for live imaging? With agarose or another technique? In which dishes? Overlaid with how much medium and oil? This could help other labs that want to replicate the live imaging in their labs. Also, was it a z-stack analysis? If yes, how many um per stack? Ideally, if they also know the laser power used (at least a range) it would be extremely useful.

      We thank the reviewer for this comment and have provided additional detail here and in the Methods section. For live imaging of our embryos, we used glass-bottom 35 mm dishes. We fixed a small cut square of nylon mesh (5 mm to 1 cm in width and height) onto the centre of the dish using silicon; the mesh served as a grid (diameter of approximately 150 micrometres) for deposition of embryos. After drying of the silicon (overnight) and washing with water, the grid was overlaid with a drop of 100 microlitres of KSOM and then covered with mineral oil until the KSOM drop was submerged. After incubation under conditions for live imaging, single embryos were deposited in each ‘well’ of the grid before being placed in the microscope, which was equilibrated at the correct temperature and CO2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors use electrophysiological and behavioral measurements to examine how animals could reliably determine odor intensity/concentration across repeated experiences. Because stimulus repetition leads to short-term adaptation evidenced by reduced overall firing rates in the antennal lobe and firing rates are otherwise concentration-dependent, there could be an ambiguity in sensory coding between reduced concentration or more recent experience. This would have a negative impact on the animal's ability to generate adaptive behavioral responses that depend on odor intensities. The authors conclude that changes in concentration alter the constituent neurons contributing to the neural population response, whereas adaptation maintains the 'activated ensemble' but with scaled firing rates. This provides a neural coding account of the ability to distinguish odor concentrations even after extended experience. Additional analyses attempt to distinguish hypothesized circuit mechanisms for adaptation but are inconclusive. A larger point that runs through the manuscript is that overall spiking activity has an inconsistent relationship with behavior and that the structure of population activity may be the more appropriate feature to consider.

      To my knowledge, the dissociation of effects of odor concentration and adaptation on olfactory system population codes was not previously demonstrated. This is a significant contribution that improves on any simple model based on overall spiking activity. The primary result is most strikingly supported by visualization of a principal components analysis in Figure 4. However, there are some weaknesses in the data and analyses that limit confidence in the overall conclusions.

      We thank the reviewer for evaluating our work and highlighting its strengths and deficiencies. We have revised the manuscript with expanded behavioral datasets and additional analyses that we believe convincingly support our conclusion. 

      (1) Behavioral work interpreted to demonstrate discrimination of different odor concentrations yields inconsistent results. Only two of the four odorants follow the pattern that is emphasized in the text (Figure 1F). Though it's a priori unlikely that animals are incapable of distinguishing odor concentrations at any stage in adaptation, the evidence presented is not sufficient to reach this conclusion.

      We have expanded our dataset and now show that the behavioral response is significantly different for high and low concentration exposures of the same odorant. This was observed for all four odorants in our study (refer to Revised Fig. 1F).

      (2) While conclusions center on concepts related to the combination of activated neurons or the "active ensemble", this specific level of description is not directly demonstrated in any part of the results. We see individual neural responses and dimensional reduction analyses, but we are unable to assess to what extent the activated ensemble is maintained across experience.

      We have done several additional analyses (see provisional response). Notably, we have corroborated our dimensionality reduction and correlation analysis results with a quantitative classification analysis that convincingly demonstrates that odor identity and intensity of the odorant can be decoded from the ensemble neural activity, and this could be achieved in an adaptation-invariant fashion (refer to Revised Supplementary Fig. 4). 

      (3) There is little information about the variance or statistical strength of results described at the population level. While the PCA presents a compelling picture, the central point that concentration changes and adaptation alter population responses across separable dimensions is not demonstrated quantitatively. The correlation analysis that might partially address this question is presented to be visually interpreted with no additional testing.

      We have included a plot that compares the odor-evoked responses across all neurons (mean ± variance) at both intensity levels for each odorant (Revised Supplementary Fig. 5). This plot clearly shows how the ensemble neural activity profile varies with odor intensity and how these response patterns are robustly maintained across trials. 

      (4) Results are often presented separately for each odor stimulus or for separate datasets including two odor stimuli. An effort should be made to characterize patterns of results across all odor stimuli and their statistical reliability. This concern arises throughout all data presentations.

      We had to incorporate a 15-minute window between presentations of odorants to reset adaptation. Because of this, we were unable to record extracellular responses to all four odorants at two intensities in a single experiment (~3.5 hours of recording for just 2 odorants at two intensities, with one odorant at the higher intensity repeated at the end; Fig. 2a). Therefore, we recorded two datasets. Each dataset captured the responses of ~80 PNs to two odorants at two intensities, with one odorant at the higher concentration repeated at the end of the experiment to show repeatability of changes due to adaptation.

      (5) The relevance of the inconclusive analysis of inferred adaptation mechanisms in Figure 2d-f and the single experiment including a complex mixture in Figure 7 to the motivating questions for this study are unclear.

      Figure 2d-f has been revised. While we agree that the adaptation mechanisms are not fully clear, there is a trend that the most active PNs are the neurons that change the most across trials. This change and the response in the first trial are negatively correlated, indicating that vesicle depletion could be an important contributor to the observed results. However, neurons that adapt strongly at higher intensities are not the ones that adapt at lower intensities. This complicates the understanding of how neural responses vary with intensities and the adaptation that happens due to repetition. This has been highlighted in the revised manuscript. 

      Regarding Figure 7, we wanted to examine the odor-specificity of the changes that happen due to repeated encounters of an odorant. Specifically, we wondered whether the neural response reduction and behavioral enhancement reflected a global, non-specific state change in the olfactory system brought about by the repetition of any odorant, or whether the observed neural and behavioral response changes were odor-specific.

      (6) Throughout the description of the results, typical standards for statistical reporting (sample size, error bars, etc.) are not followed. This prevents readers from assessing effect sizes and undermines the ability to assign a confidence to any particular conclusion.

      We have revised the manuscript to fix these issues and included sample size and error bars in our plots.  

      Reviewer #2 (Public Review):

      Summary:

      The authors' main goal was to evaluate how both behavioral responses to odor, and their early sensory representations are modified by repeated exposure to odor, asking whether the process of adaptation is equivalent to reducing the concentration of an odor. They open with behavioral experiments that actually establish that repeated odor presentation increases the likelihood of evoking a behavioral response in their experimental subjects - locusts. They then examine neural activity patterns at the second layer of the olfactory circuit. At the population level, repeated odor exposure reduces total spike counts, but at the level of individual cells there seems to be no consistent guiding principle that describes the adaptation-related changes, and therefore no single mechanism could be identified.

      Both population vector analysis and pattern correlation analysis indicate that odor intensity information is preserved through the adaptation process. They make the closely related point that responses to an odor in the adapted state are distinct from responses to lower concentration of the same odor. These analyses are appropriate, but the point could be strengthened by explicitly using some type of classification analysis to quantify the adaptation effects. e.g. a confusion matrix might show if there is a gradual shift in odor representations, or whether there are trials where representations change abruptly.

      Strengths:

      One strength is that the work has both behavioral read-out of odor perception and electrophysiological characterization of the sensory inputs and how both change over repeated stimulus presentations. It is particularly interesting that behavioral responses increase while neuronal responses generally decrease. Although the behavioral effect could occur fully downstream of the sensory responses the authors measure, at least those sensory responses retain the core features needed to drive behavior despite being highly adapted.

      Weaknesses:

      Ultimately no clear conceptual framework arises to understand how PN responses change during adaptation. Neither the mechanism (vesicle depletion versus changes in lateral inhibition) nor even a qualitative description of those changes. Perhaps this is because much of the analysis is focused on the entire population response, while perhaps different mechanisms operate on different cells making it difficult to understand things at the single PN level.

      From the x-axis scale in Fig 2e,f it appeared to me that they do not observe many strong PN responses to these stimuli, everything being < 10 spikes/sec. So perhaps a clearer effect would be observed if they managed to find the stronger responding PNs than captured in this dataset.

      We thank the reviewer for his/her evaluation of our work. Indeed, our work does not clarify the mechanism that underlies the adaptation over trials, or how this mechanism accounts for the adaptation that is observed at two different intensities of the same odorant. However, as we highlight in the revised manuscript, there is some evidence for the vesicle depletion hypothesis. For the plots shown in Fig. 2, the firing rates were calculated after averaging across time bins and trials, which is why the plotted rates are low. The peak firing rates of the most active neurons are ~100 Hz. So, we are certain that we are collecting responses from a representative ensemble of neurons in this circuit.
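
To illustrate why a rate averaged across time bins and trials can sit far below the peak rate, here is a minimal sketch. The spike counts, the 50 ms bin width, and the function name `firing_rates` are hypothetical choices made for illustration and are not taken from the recorded data.

```python
def firing_rates(spike_counts, bin_width_s):
    """Return (mean_rate_hz, peak_rate_hz) for one neuron.

    spike_counts: list of trials, each a list of spike counts per time bin.
    bin_width_s:  width of each time bin in seconds.
    """
    n_trials = len(spike_counts)
    n_bins = len(spike_counts[0])
    # Trial-averaged firing rate (Hz) in each time bin.
    psth = [
        sum(trial[b] for trial in spike_counts) / (n_trials * bin_width_s)
        for b in range(n_bins)
    ]
    mean_rate = sum(psth) / n_bins  # averaged across bins and trials
    peak_rate = max(psth)           # highest trial-averaged bin
    return mean_rate, peak_rate


# Hypothetical neuron: 5 spikes in the first 50 ms bin, then silence
# for the rest of a 4 s window (80 bins), identical on 5 trials.
trial = [5] + [0] * 79
mean_rate, peak_rate = firing_rates([trial] * 5, bin_width_s=0.05)
print(mean_rate, peak_rate)  # mean is roughly 1.25 Hz, peak roughly 100 Hz
```

In this toy case a brief transient near 100 Hz yields a window-averaged rate close to 1 Hz, so a plot of averaged rates below 10 spikes/s is compatible with strongly responding neurons.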

      Reviewer #3 (Public Review):

      Summary:

      How does the brain distinguish stimulus intensity reduction from response reductions due to adaptation? Ling et al study whether and how the locust olfactory system encodes stimulus intensity and repetition differently. They show that these stimulus manipulations have distinguishable effects on population dynamics.

      Strengths:

      (1) Provides a potential strategy with which the brain can distinguish intensity decrease from adaptation. -- while both conditions reduce overall spike counts, intensity decrease can also change which neurons are activated and adaptation only changes the response magnitude without changing the active ensemble.

      (2) By interleaving a non-repeated odor, they show that these changes are odor-specific and not a non-specific effect.

      (3) Describes how proboscis orientation response (POR) changes with stimulus repetition. Unlike the spike counts, POR increases in probability with stimulus. The data portray the variability across subjects in a clear way.

      We thank the reviewer for the summary and for highlighting the strengths of our work.

      Weaknesses:

      (1) Behavior

      a. While the "learning curve" of the POR is nicely described, the behavior itself receives very little description. What are the kinematics of the movement, and do these vary with repetition? Is the POR all-or-nothing or does it vary trial to trial?

      The behavioral responses were monitored in unconditioned/untrained locusts. Hence, these are innate responses to the odorants. These innate responses are usually brief and occur after the onset of the stimulus. However, there is variability across locusts and trials (see Revised Supplementary Fig. 1). When the same odorant is conditioned with a food reward, the POR responses become more stereotyped and occur rapidly, within a few hundred milliseconds.

      Author response image 1.

      POR response dynamics in a conditioned locust. The palps were painted in this case (left panel), and the distance between the palps was tracked as a function of time (right panel).

      b. What are the reaction times? This can constrain what time window is relevant in the neural responses. E.g., if the reaction time is 500 ms, then only the first 500 ms of the ensemble response deserves close scrutiny. Later spikes cannot contribute.

      This is an interesting point. We had done this analysis for conditioned POR responses. For innate POR, as we noted earlier, there is variability across locusts. Many responses occur rapidly after odor onset (<1 s), while some responses do occur later during odor presentation and in some cases after odor termination. It is important to note that these dynamical aspects of the POR response, while super interesting, should occur at a much faster time scale compared to the adaptation that we are reporting across trials or repeated encounters of an odorant.

      c. The behavioral methods are lacking some key information. While references are given to previous work, the reader should not be obligated to look at other papers to answer basic questions: how was the response measured? Video tracking? Hand scored?

      We agree and apologize for the oversight. We have revised the methods and added a video to show the POR responses. Videos were hand-scored. 

      d. Can we be sure that this is an odor response? Although airflow out of the olfactometer is ongoing throughout the experiment, opening and closing valves usually creates pressure jumps that are likely to activate mechanosensors in the antennae.

      Interesting. We have added a new Supplementary Fig. 2 that shows that the POR is negligible even for presentations of paraffin oil (solvent; control). This should confirm that the POR is a behavioral response to the odorant.

      Furthermore, all other potential confounds identified by the reviewer are present for every odorant and every concentration presented.  However, the POR varies in an odor-identity and intensity-specific manner. 

      e. What is the baseline rate of PORs in the absence of stimuli?

      Almost zero. 

      f. What can you say about the purpose of the POR? I lack an intuition for why a fly would wiggle the maxillary palps. This is a question that is probably impossible to answer definitively, but even a speculative explanation would help the reader better understand.

      The locusts use these finger-like maxillary palps to grab a grass blade while eating. Hence, we believe that this might be a preparatory response to feeding. We have noted that the PORs are elicited more by food-related odorants. Hence, we think it is a measure of odor appetitiveness. This has been added to the manuscript. 

      (2) Physiology

      a. Does stimulus repetition affect "spontaneous" activity (i.e., firing in the interstimulus interval)? To study this question, in Figures 2b and c, it would be valuable to display more of the prestimulus period, and a quantification of the stability or lability of the inter-stimulus activity.

      Done. Yes, the spontaneous activity does appear to change in an odor-specific manner. We have done some detailed analysis of the same in this preprint:

      Ling D, Moss EH, Smith CL, Kroeger R, Reimer J, Raman B, Arenkiel BR. Conserved neural dynamics and computations across species in olfaction. bioRxiv [Preprint]. 2023 Apr 24:2023.04.24.538157. doi: 10.1101/2023.04.24.538157. PMID: 37162844; PMCID: PMC10168254

      b. When does the response change stabilize? While the authors compare repetition 1 to repetition 25, from the rasters it appears that the changes have largely stabilized after the 3rd or 4th repetition. In Figure 5, there is a clear difference between repetitions 1-3 or so and the rest. Are successive repetitions more similar than more temporally-separated repetitions (e.g., is rep 13 more similar to 14 than to 17?). I was not able to judge this based on the dendrograms of Figure 5. If the responses do stabilize as it appears, it would be more informative to focus on the dynamics of the first few repetitions.

      The reviewer makes an astute observation. Yes, the changes in firing rates are larger in the first three trials (Fig. 3c). The ensemble activity patterns, though, are relatively stable across all trials as indicated by the PCA plots and classification analysis results.

      Author response image 2.

      Correlation as a function of trial number. All correlations were made with respect to the odor-evoked responses in the last odor trial of hex(H) and bza(H).
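
The analysis described in this caption can be sketched as follows, assuming each trial is summarized by a population vector (one spike count per PN) and that similarity is measured with the Pearson correlation. The helper names and the numbers in the example are hypothetical and are not taken from the hex(H) or bza(H) datasets.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_to_last_trial(trials):
    """Correlate every trial's population vector with the last trial's."""
    reference = trials[-1]
    return [pearson(trial, reference) for trial in trials]


# Hypothetical population vectors (spike counts of 4 PNs) on 3 trials.
# Trial 1 is the last trial scaled by 2: a larger response, same pattern.
trials = [
    [10.0, 2.0, 6.0, 1.0],
    [6.0, 2.0, 4.0, 1.0],
    [5.0, 1.0, 3.0, 0.5],
]
correlations = correlation_to_last_trial(trials)
print(correlations)
```

Because the Pearson correlation ignores overall scaling, a uniform reduction in response magnitude across the ensemble leaves the correlation with the last trial near 1, whereas a change in which neurons respond lowers it.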

      c. How do temporal dynamics change? Locust PNs have richly varied temporal dynamics, but how these may be affected is not clear. The across-population average is poorly suited to capture this feature of the activity. For example, the PNs often have an early transient response, and these appear to be timed differently across the population. These structures will be obscured in a cross population average. Looking at the rasters, it looks like the initial transient changes its timing (e.g., PN40 responses move earlier; PN33 responses move later.). Quantification of latency to first spike after stimulus may make a useful measure of the dynamics.

      As noted earlier, to keep our story simple in this manuscript, we have only focused on the variations across trials (i.e., much slower response dynamics). We did this as we are not recording neural and behavioral responses from the same locust. We plan to do this and directly compare the neural and behavioral dynamics in the same locust.
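
For reference, the latency measure raised in the comment above could be computed along the following lines. This is only a sketch under stated assumptions: spike times and stimulus onset are in seconds on a common clock, and the function name `first_spike_latency` and the 1 s search window are hypothetical choices, not part of the analysis reported in the manuscript.

```python
def first_spike_latency(spike_times, stim_onset, window=1.0):
    """Latency (s) from stimulus onset to the first spike inside the window.

    Returns None if the neuron fires no spike within
    [stim_onset, stim_onset + window).
    """
    for t in sorted(spike_times):
        if stim_onset <= t < stim_onset + window:
            return t - stim_onset
    return None


# Hypothetical spike train: one spontaneous spike before odor onset at
# 2.0 s, then two odor-evoked spikes.
latency = first_spike_latency([0.5, 2.12, 2.4], stim_onset=2.0)
no_response = first_spike_latency([0.5, 3.5], stim_onset=2.0)
print(latency, no_response)
```

Computing this value per neuron and per trial would give a latency-by-trial profile that could be compared across repetitions.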

      d. How legitimate is the link between POR and physiology? While their changes can show a nice correlation, the fact the data were taken from separate animals makes them less compelling than they would be otherwise. How feasible is it to capture POR and physiology in the same prep?

      This would be most helpful, but I suspect may be too technically challenging to be within scope.

      The antennal lobe activity is the input about the volatile chemicals encountered by the locust. The POR is a behavioral output. Hence, we believe that examining the correlation between the olfactory system's input and output is a valid approach. However, we have only compared the mean trends in neural and behavioral datasets, and dynamics on a much slower timescale. We are currently developing the capability to record neural responses in behaving animals. This turned out to be a bit more challenging than we had envisioned. We plan to do fine-grained comparisons of the neural and behavioral dynamics, recommended by this reviewer, in those preparations.

      Further, we will also be able to examine whether the variability in behavioral responses could be predicted from neural activity changes in that prep.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigated the mechanism underlying boundary formation necessary for proper separation of vestibular sensory end organs. In both chick and mouse embryos, it was shown that a population of cells abutting the sensory (marked by high Sox2 expression) /nonsensory cell populations (marked by Lmx1a expression) undergo apical expansion, elongation, alignment and basal constriction to separate the lateral crista (LC) from the utricle. Using Lmx1a mouse mutant, organ cultures, pharmacological and viral-mediated Rock inhibition, it was demonstrated that the Lmx1a transcription factor and Rock-mediated actomyosin contractility is required for boundary formation and LC-utricle separation.

      Strengths:

      Overall, the morphometric analyses were done rigorously and revealed novel boundary cell behaviors. The requirement of Lmx1a and Rock activity in boundary formation was convincingly demonstrated.

      Weaknesses:

      However, the precise roles of Lmx1a and Rock in regulating cell behaviors during boundary formation were not clearly fleshed out. For example, phenotypic analysis of Lmx1a was rather cursory; it is unclear how Lmx1a, expressed in half of the boundary domain, controls boundary cell behaviors and prevents cell mixing between Lmx1a+ and Lmx1a- compartments. Well-established mechanisms and molecules for boundary formation were not investigated (e.g. differential adhesion via cadherins, cell repulsion via ephrin-Eph signaling). Moreover, within the boundary domain, it is unclear whether apical multicellular rosettes and basal constrictions are drivers of boundary formation, as a boundary can still form when these cell behaviors are inhibited. Involvement of other cell behaviors, such as radial cell intercalation and oriented cell division, also warrants consideration. With these lingering questions, the mechanistic advance of the present study is somewhat incremental.

      We have acknowledged the lingering questions this referee points out in our Discussion and agree that the roles of differential cell adhesion and cell intercalation would be worth exploring in further studies. Despite these remaining questions, the conceptual advances are significant, since this study provides the first evidence that a tissue boundary forms in between segregating sensory organs in the inner ear (there are only a handful of embryonic tissues in which a tissue boundary has been found in vertebrates) and highlights the evolutionary conservation of this process. This work also provides a strong descriptive basis for any future study investigating the mechanisms of tissue boundary formation in the mouse and chicken embryonic inner ear. 

      Reviewer #2 (Public review):

      Summary:

      Chen et al. describe the mechanisms that separate the common pan-sensory progenitor region into individual sensory patches, which presage the formation of the sensory epithelium in each of the inner ear organs. By focusing on the separation of the anterior and then lateral cristae, they find that long supra-cellular cables form at the interface of the pansensory domain and the forming cristae. They find that at these interfaces, the cells have a larger apical surface area, due to basal constriction, and Sox2 is down-regulated. Through analysis of Lmx1 mutants, the authors suggest that while Lmx1 is necessary for the complete segregation of the sensory organs, it is likely not necessary for the initial boundary formation, and the down-regulation of Sox2.

      Strengths:

      The manuscript adds to our knowledge and provides valuable mechanistic insight into sensory organ segregation. Of particular interest are the cell biological mechanisms: The authors show that contractility directed by ROCK is important for the maintenance of the boundary and segregation of sensory organs.

      Weaknesses:

      The manuscript would benefit from a more in-depth look at contractility - the current images of PMLC are not too convincing. Can the authors look at p or ppMLC expression in an apical view? Are they expressed in the boundary along the actin cables? Does Y-27632 inhibit this expression?

      The authors suggest that one role for ROCK is the basal constriction. I was a little confused about basal constriction. Are these the initial steps in the thinning of the intervening nonsensory regions between the sensory organs? What happens to the basally constricted cells as this process continues?

      In our hands, the PMLC immunostaining gave a punctate staining in epithelial cells and was difficult to image and interpret in whole-mount preparations, which did not allow us to investigate its specific association to the actin-cable-like structures. It is a very valuable suggestion to try alternative methods of fixation to improve the quality of the staining and images in future work. 

      The basal constriction of the cells at the border of the sensory organs was not always clearly visible in freshly-fixed samples, and was absent in the majority of short-term organotypic cultures in control medium, which made it impossible to ascertain the role of ROCK in its formation using pharmacological approaches in vitro (see Figure 7 and corresponding Result section).  On the other hand, the overexpression of a dominant-negative form of ROCK (RCII-GFP) in ovo using RCAS revealed a persistence of basal constriction in transfected cells despite a disorganisation of the boundary domain (Figure 8). We conclude from these experiments that ROCK activity is not necessary for the formation and maintenance of the basal constriction. We also remain uncertain about the exact role of this basal constriction. It could be either a cause or consequence of the expansion of the apical surface of cells in the boundary domain, it could contribute to the limitation of cell intermingling and the formation of the actin-cable-like structure at the interface of Lmx1a-expressing and non-expressing cells, and may indeed prefigure some of the further changes in cell morphology occurring in non-sensory domains separating the sensory organs (cell flattening and constrictions of the epithelial walls in between sensory organs). 

      The steps the authors explore happen after boundaries are established. This correlates with a down-regulation of Sox2, and the formation of a boundary. What is known about the expression of molecules that may underlie the apparent interfacial tension at the boundaries? Is there any evidence for differential adhesion or for Eph-Ephrin signalling? Is there a role for Notch signalling or a role for Jag1 as detailed in the group's 2017 paper?

      Great questions. It is indeed likely that some form of differential cell tension and/or adhesion participates in the formation and maintenance of this boundary, and we have mentioned in the discussion some of the usual suspects (cadherins, Eph/ephrin signalling,…) although it is beyond the scope of this paper to determine their roles in this context.

      As we have discussed in this paper and in our 2017 study (see also Ma and Zhang, Development,  2015 Feb 15;142(4):763-73. doi: 10.1242/dev.113662) we believe that Notch signalling is maintaining prosensory character, and its down-regulation by Lmx1a/b expression is required for the specification of the non-sensory domains in between segregating sensory organs. Although we have not tested this directly in this study, any disruption in Notch signalling would be expected to affect indirectly the formation or maintenance of the boundary domain. 

      A comment on whether cellular intercalation/rearrangements may underlie some of the observed tissue changes.

      We have not addressed this topic directly in the present study but we have included a brief comment on the potential implication of cellular intercalation and rearrangements in the discussion: “It is also possible that the repositioning of cells through medial intercalation could contribute to the straightening of the boundary as well as the widening of the nonsensory territories in between sensory patches.”

      The change in the long axis appears to correlate with the expression of Lmx1a (Fig 5d). The authors could discuss this more. Are these changes associated with altered PCP/Vangl2 expression?

      We are not sure about the first point raised by the referee. We have quantified cell elongation and orientation in Lmx1a-GFP heterozygous and homozygous (null) mice, and our results suggest that the elongation of the cells occurs throughout the boundary domain, and is probably not dependent on Lmx1a expression (boundary cells are in fact more elongated in the Lmx1a mutant).  We have not investigated the expression of components of the planar cell polarity pathway. This is a very interesting suggestion, worth exploring in further studies.

      Reviewer #3 (Public review):

      Summary:

      Lmx1a is an orthologue of apterous in flies, which is important for dorsal-ventral border formation in the wing disc. Previously, this research group has described the importance of the chicken Lmx1b in establishing the boundary between sensory and non-sensory domains in the chicken inner ear. Here, the authors described a series of cellular changes during border formation in the chicken inner ear, including alignment of cells at the apical border and concomitant constriction basally. The authors extended these observations to the mouse inner ear and showed that these morphological changes occurred at the border of Lmx1a positive and negative regions, and these changes failed to develop in Lmx1a mutants. Furthermore, the authors demonstrated that the ROCK-dependent actomyosin contractility is important for this border formation and blocking ROCK function affected epithelial basal constriction and border formation in both in vitro and in vivo systems.

      Strengths:

      The morphological changes described during border formation in the developing inner ear are interesting. Linking these changes to the function of Lmx1a and ROCK dependent actomyosin contractile function are provocative.

      Weaknesses:

      There are several outstanding issues that need to be clarified before one could pin the morphological changes observed being causal to border formation and that Lmx1a and ROCK are involved.

      We have addressed the specific comments and suggestions of the reviewer below. We wish, however, to point out that we do not think that ROCK activity is required for the formation or maintenance of the basal constriction at the interface of Lmx1a-expressing and nonexpressing cells (see previous answer to referee #2).

      Reviewer #1 (Recommendations for the authors):

      Specific comments:

      (1) Figures 1 and 2, and related text. Based on the whole-mount images shown, the anterior otocyst appeared to be a stratified epithelium with multiple cell layers. If so, it should be clarified whether the x-y views in the "apical" and "basal" planes are from cells residing in the apical and basal layers, respectively. Moreover, it would be helpful to include a "stage 4", a later stage to show if and when basal constrictions resolve.

      In fact, at these early stages of development, the otic epithelium is “pseudostratified”: it is formed by a single layer of irregularly shaped cells, each extending from the base to the apical aspect of the epithelium, but with their nuclei residing at distinct positions along this basal-apical axis as mitotic cells progress through the cell cycle. The nuclei divide at the surface of the epithelium, then move back to the most basal planes within daughter cells during interphase. This process, known as interkinetic nuclear migration, has been well described in the embryonic neural tube and occurs throughout the developing otic epithelium (e.g. Orr, Dev Biol. 1975, 47, 325-340; Ohta et al., Dev Biol. 2010 Sep 15;347(2):369–381. doi: 10.1016/j.ydbio.2010.09.002). Consequently, the nuclei visible in apical or basal planes in x-y views belong to cells extending from the base to the apex of the epithelium, but which are at different stages of the cell cycle.

      We have not included a late stage of sensory organ segregation in this study (apart from a P0 stage in the mouse inner ear, see Figure 4) since data about later stages of sensory organ morphogenesis are available in other studies, including our Mann et al. eLife 2017 paper describing Lmx1a-GFP expression in the embryonic mouse inner ear.

      (2) Related to above, the observed changes in cell organization raised the possibility that the apical multicellular rosettes and basal constrictions observed in Stage 3 (and 2) could be intermediates of radial cell intercalations, which would lead to expansion of the space between sensory organs and thinning of the boundary domains. To see if it might be happening, it would be helpful to include DAPI staining to show the overall tissue architecture at different stages and use optical reconstruction to assess the thickness of the epithelium in the presumptive boundary domain over time.

      We agree with this referee. Besides cell addition by proliferation and/or changes in cell morphology, radial cell intercalations could indeed contribute to the spatial segregation of inner ear sensory organs (a brief statement on this possibility was added to the Discussion). It is clear from images shown in Figure 4 (and from other studies) that the non-sensory domain separating the cristae from the utricle gets flatter and its cells also enlarge as development proceeds. We do not think that DAPI staining is required to demonstrate this. Perhaps the best way to show that radial cell intercalations occur would be to perform live imaging of the otic epithelium, but this is technically challenging in the mouse or chicken inner ear. An alternative model system might be the zebrafish inner ear, in which some live-imaging data have shown a progressive down-regulation of Jag1 expression during sensory organ segregation (and a flattening of “boundary domains”), suggesting a conservation of the basic mechanisms at play (Ma and Zhang, Development, 2015 Feb 15;142(4):763-73. doi: 10.1242/dev.113662).

      (3) Similarly, it would be helpful to include the DAPI counterstain in Figures 4, 7, and 8 to show the overall tissue architecture.

      We do not have DAPI staining for these particular images but in most cases, Sox2 immunostaining gives a decent indication of tissue morphology. 

      (4) Figure 2(z) and Figure 4d. The arrows pointing at the basal constrictions are obstructing the view of the basement membrane area, making it difficult to appreciate the morphological changes. They should be moved to the side. Can the authors comment whether they saw evidence for radial intercalations (e.g. thinning of the boundary domain) or partial unzippering of adjoining compartments along the basal constrictions?

      The arrows in Figure 2(z) and Figure 4d have been moved to the side of the panels. 

      See previous comment. Besides the presence of multicellular rosettes, we have not seen direct evidence of radial cell intercalation – this would be best investigated using live imaging. As development proceeds, the epithelial domain separating adjoining sensory organs becomes wider. The cells that compose it gradually enlarge and flatten, as can be seen for example at P0 in the mouse inner ear (Figure 4g).

      (5) Figures 3 and 5, and related text. It should be clarified whether the measurements were all taken from the surface cells. For Fig. 3e and 5d, the mean alignment angles of the cell long axis in the boundary regions should be provided in the text.

      The sensory epithelium in the otocyst is pseudostratified, hence, the measurement was taken from the surface of all epithelial cells labelled with F-actin. 

      We have added histograms representing the angular distribution of the cell long axis orientations in the boundary region to Figure 3 and Figure 5 Supplementary 1. We believe that this type of representation is more informative than the numerical value of the mean alignment angles of the cell long axis for defined sub-domains. 
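
As an illustration of this kind of representation, here is a minimal sketch of an orientation histogram. It assumes that long-axis orientations are axial data (an axis at 10° is the same as one at 190°), so angles are folded into the range 0° to 180° before binning. The function name, the 30° bin width, and the example angles are hypothetical and are not the values or parameters used in Figure 3 or Figure 5 Supplementary 1.

```python
def orientation_histogram(angles_deg, bin_width=30):
    """Fraction of cells per orientation bin, with angles folded to [0, 180).

    bin_width (degrees) is assumed to divide 180 evenly.
    """
    n_bins = 180 // bin_width
    counts = [0] * n_bins
    for angle in angles_deg:
        folded = angle % 180.0  # axial data: theta and theta + 180 coincide
        # The trailing modulo guards against a folded value rounding to 180.
        counts[int(folded // bin_width) % n_bins] += 1
    total = len(angles_deg)
    return [count / total for count in counts]


# Hypothetical long-axis orientations (degrees) of four boundary cells.
histogram = orientation_histogram([10, 190, -170, 95], bin_width=30)
print(histogram)  # → [0.75, 0.0, 0.0, 0.25, 0.0, 0.0]
```

A distribution concentrated in one or two bins indicates aligned cells, while a flat distribution indicates no preferred orientation, which is information a single mean angle cannot convey.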

      (6) It would be helpful to also quantify basal constrictions using the cell skeleton analysis. In addition, it would be helpful to show x-y views of cell morphology at the level of basal constrictions in the mouse tissue, similar to the chick otocyst shown in Figure 2.

      The data that we have collected do not allow a precise quantification of basal constrictions with cell skeleton analysis, due to the generally fuzzy nature of F-actin staining in the basal planes of the epithelium. However, we have followed the referee’s advice and analysed F-actin staining in x-y views in the Lmx1a-GFP knock-in (heterozygous) mice. We found that the first signs of basal F-actin enrichment and multicellular actin-cable-like structures at the interface of Lmx1a-positive and negative cells are visible at E11.5, and F-actin staining in the basal planes increases in intensity and extent at E13.5 (shown in new Figure 4 – Supplementary Figure 1).

      (7) Figure 5 and related text. It would be informative to analyze Lmx1a mutants at early stages (E11-E13) to pinpoint cell behavior defects during boundary formation.

      We chose the E15 stage because it is one at which we can unequivocally recognize and easily image and analyse the boundary domain from a cytoarchitectural point of view. We recognize that it would have been worth including earlier stages in this analysis but have not been able to perform these additional studies due to time constraints and unavailability of biological material. 

      (8) Figure 5-Figure S1, the quantifications suggest that Lmx1a loss had both cell-autonomous and non-autonomous effects on boundary cell behaviors. This is an interesting finding, and its implication should be discussed.

      It is well-known that the absence of Lmx1a function induces a very complex (and variable) phenotype in terms of inner ear morphology and patterning defects. It is also clear from this study that the absence of Lmx1 causes non-cell autonomous defects in the boundary domain and we have already mentioned this in the discussion: “Finally, the patterning abnormalities in Lmx1a<sup>GFP/GFP</sup> samples occurred in both GFP-positive and negative territories, which points at some type of interaction between Lmx1a-expressing and nonexpressing cells, and the possibility that the boundary domain is also a signalling centre influencing the differentiation of adjacent territories.”

      (9) Figure 6 and related text. To correlate myosin II activity with boundary cell behaviors, it would be important to immunolocalize pMLC in the boundary domain in whole-mount otocyst preparations from stage 1 to stage 3.

      We tried to perform the suggested immunostaining experiments, but in our hands at least, the antibody used did not produce good quality staining in whole-mount preparations. We have therefore included images of sectioned otic tissue, which show some enrichment in pMLC immunostaining at the interface of segregating organs (Figure 6).

      (10) Figures 7 and 8. A caveat of long-term Rock inhibition is that it can affect cell proliferation and differentiation of both sensory and non-sensory cells, which would cause secondary effects on boundary formation. This caveat was not adequately addressed. For example, does Rock signaling control either the rate or the orientation of cell division to promote boundary formation? Together with the mild effect of acute Rock inhibition, the precise role of Rock signaling in boundary formation remains unclear.

      We absolutely agree that the exact function of ROCK could not be ascertained in the in vitro experiments, for the reasons we have highlighted in the manuscript (no clear effect in short term treatments, great level of tissue disorganisation in long-term treatments). This prompted us to turn to an in ovo approach. The picture remains uncertain in relation to the role of ROCK in regulating cell division/intercalation but we have been at least able to show a requirement for the maintenance of an organized and regular boundary. 

      (11) Figure 8. RCII-GFP likely also has non-autonomous effects on cell apical surface area. In 8d, it would be informative to include cell area quantifications of the GFP control for comparison.

      It is possible that some non-autonomous effects are produced by RCII-GFP expression, but these were not the focus of the present study and are not particularly relevant in the context of large patches of overexpression, as obtained with RCAS vectors. 

      We have added cell surface area quantifications of the control RCAS-GFP construct for comparison (Figure 8e).

      (12) The significance of the presence of cell divisions shown in Figure 9 is unclear. It would be informative to include some additional analysis, such as a) quantify orientation of cell divisions in and around the boundary domain and b) determine whether patterns of cell division in the sensory and nonsensory regions are disrupted in Lmx1a mutants.

      These are indeed fascinating questions, but they would require considerable work to answer and are beyond the scope of this paper.

      Minor comments:

      (1) Figure 1. It should be clarified whether e', h' and k' are showing cortical F-actin of surface cells. Do the arrowheads in i' and l' correspond to the position of either of the arrowheads in h' and k', respectively?

      The epithelium in the otocyst is pseudostratified. Therefore, images e’, h’, k’ display F-actin labelling on the surface of tissue composed of a single cell layer. We have added arrows to images e”, h”, and k” to indicate the corresponding position of z-projections and included appropriate explanation in the legend of Figure 1: “Black arrows on the side of images e”, h”, and k” indicate the corresponding position of z-projections.”

      (2) Figure 3-Figure S1. Please mark the orientation of the images shown.

      We labelled the sensory organs in the figure to allow for recognizing the orientation. 

      (3) Figure 4. Orthogonal reconstructions should be labeled (z) to be consistent with other figures.

      We have corrected the labelling in the orthogonal reconstruction to (z). 

      (4) Figure 4g. It is not clear what is in the dark area between the two bands of Lmx1a+ cells next to the utricle and the LC. Are those cells Lmx1a negative? It is unclear whether a second boundary domain formed or the original boundary domain split into two between E15 and P0? Showing the E15 control tissue from Figure 5 would be more informative than P0.

      In this particular sample there seems to be a folding of the tissue (visible in z-reconstructions) that could affect the appearance of the projection shown in 4g. We believe the P0 is a valuable addition to the E15 data, showing a slightly later stage in the development of the vestibular organs.

      (5) Figure 5a, e. Magnified regions shown in b and f should be boxed correspondingly.

      This figure has been revised. We realized that the previous low-magnification shown in (e) (now h) was from a different sample than the one shown in the high-magnification view. The new figure now includes the right low-magnification sample (in h) and the regions shown in the high-magnification views have been boxed.

      (6) Figure 8f, h, j. Magnified regions shown in g, i and k should be boxed correspondingly.

      The magnified regions were boxed in Figure 8 f, h, and j. Additionally, black arrows have been placed next to images 8g", 8i", and 8k" to highlight the positions of the z-projections. An appropriate explanation has also been added to the figure legend.

      (9) Figure 8. It would be helpful to show merged images of GFP and F-actin, to better appreciate cell morphology of GFP+ and GFP- cells.

      As requested, we have added images showing overlap of GFP and F-actin channels in Figure 8.

      Reviewer #2 (Recommendations for the authors):

      The PMLC staining could be improved. Two decent antibodies are the p-MLC and pp-MLC antibodies from CST. pp-MLC works very well after TCA fixation as detailed in https://www.researchsquare.com/article/rs-2508957/latest . As phalloidin does not work well after TCA fixation, afadin works very well for segmenting cells.

      If the authors do not wish to repeat the pMLC staining, the details of the antibody used should be mentioned.

      We used mouse IgG1 Phospho-Myosin Light Chain 2 (Ser19) from Cell Signaling Technology (catalogue number #3675) in our immunohistochemistry for PMLC. This is one of the two antibodies recommended by Reviewer #2. Information about this antibody has now been included in the Materials and Methods. This antibody has been referenced in many manuscripts, but unfortunately, in our hands at least, it did not perform well in whole-mount preparations.

      A statement on the availability of the data should be included.

      We have included a statement on the data availability: “All data generated or analysed during this study is available upon request.”

      Reviewer #3 (Recommendations for the authors):

      Outstanding issues:

      (1) Morphological description: The apical alignment of epithelial cells at the border is clear but not the upward pull of the basal lamina. Very often, it seems to be the Sox2 staining that shows the upward pull better than the F-actin staining. Perhaps, adding an anti-laminin staining to indicate the basement membrane may help.

      Indeed, the upward pull of the basement membrane is not always very clear. We performed some anti-laminin immunostaining on mouse cryosections and provide below (Figure 1) an example of such an experiment. The results appear to confirm an upward displacement of the basement membrane in the region separating the lateral crista from the utricle in the E13 mouse inner ear, but given the preliminary nature of these experiments, we believe that these results do not warrant inclusion in the manuscript. The term “pull” somewhat implies that the epithelial cells are responsible for the upward movement of the basement membrane, but since we do not have direct evidence that this is the case, we have replaced “pull” with “displacement” throughout the text.

      (2) It is not clear how well the cellular changes are correlated with the timing of border formation as some of the ages shown in the study seem to be well after the sensory patches were separated and the border was established.

      For some experiments (for example E15 in the comparison of mouse Lmx1a-GFP heterozygous and homozygous inner ear tissue; E6 for the RCAS experiments), the early stages of boundary formation are not covered because we decided to focus our analysis on the late consequences of manipulating Lmx1a/ROCK activity in terms of sensory organ segregation. The dataset is more comprehensive for the control developmental series in the chicken and mouse inner ear. 

      (3) The Lmx1a data, as they currently stand, could be explained by Lmx1a being required for non-sensory development and not necessarily border formation. Additionally, the relationship between ROCK and Lmx1a was not investigated. Since the investigators have established the molecular mechanisms of Lmx1 function using the chicken system previously, the authors could try to correlate the morphological events described here with the molecular evidence for Lmx1 functioning during border formation in the same chicken system. Right now, only the expression of Sox2 is used to correlate with the cellular events, and not Lmx1, Jag1 or Notch.

      These are valid points. Exploring in detail the epistatic relationships between Notch signalling/Lmx1a/ROCK/boundary formation in the chicken model would indeed be very interesting but would require extensive work using both gain and loss-of-function approaches, combined with the analysis of multiple markers (Jag1/Sox2/Lmx1b/PMLC/F-actin, etc.). At this point, and in agreement with the referee’s comment, we believe that Lmx1a is above all required for the adoption of the non-sensory fate. The loss of Lmx1a function in the mouse inner ear produces defects in the patterning and cellular features of the boundary domain, but these may be late consequences of the abnormal differentiation of the non-sensory domains that separate sensory organs. Furthermore, ROCK activity does not appear to be required for Sox2 expression (i.e. adoption or maintenance of the sensory fate) since the overexpression of RCII-GFP does not prevent Sox2 expression in the chicken inner ear. This fits with a model in which Notch/Lmx1a regulate cell differentiation whilst ROCK acts independently or downstream of these factors during boundary formation.

      Specific comments:

      (1) Figure 1. The downregulation of Sox2 is consistent between panels h and k, but not between panels e and h. The orthogonal sections showing basal constriction in h' and k' are not clear.

      The downregulation is noticeable along the lower edge of the crista shown in h; the region selected for the high-magnification view sits at an intermediate level of segregation (and Sox2 downregulation). 

      The basal constriction is not very clear in h, but becomes easier to visualize in k. We have displaced the arrow pointing at the constriction, which hopefully helps. 

      (2) Figure 2. Where was the Z axis taken from? One seems to be able to imagine the basal constriction better in the anti-Sox2 panel than the F-actin panel. A stain outlining the basement membrane better could help.

      Arrows have been added on the side of the horizontal views to mark the location of the z-reconstruction. See our previous replies to comments addressing the upward displacement of the basement membrane.

      (3) Figure 4

      I question the ROI being chosen in this figure, which seems to be in the middle of a triad between LC, prosensory/utricle and the AC, rather than between AC and LC. If so, please revise the title of the figure. This could also account for the better evidence of the apical alignment in the upper part of the f panel.

      We have corrected the text. 

      In this figure, the basal constriction is a little clearer in the orthogonal cuts, but it is not clear where these sections were taken from.

      We have added black arrows next to images 4c’, 4f’, and 4i’ to indicate the positions of the z-projections.

      By E13.5, the LC is a separate entity from the utricle, it makes one wonder how well the basal constriction is correlated with border formation. The apical alignment is also present by P0, which raises the question that the apical alignment and basal restriction may be more correlated with differentiation of non-sensory tissue rather than associated with border formation.

      We agree E13.5 is a relatively late stage, and the basal constriction was not always very pronounced. The new data included in the revised version include images of basal planes of the boundary domain at E11.5, which reveal F-actin enrichment and the formation of an actin-cable-like structure (Figure 4 – Supplementary Figure 1). Furthermore, the chicken dataset shows that the changes in cell size, alignment, and the formation of an actin-cable-like structure precede sensory patch segregation and are visible when Sox2 expression starts to be downregulated in prospective non-sensory tissue (Figure 1, Figure 2). Considering the results from both species, we conclude that these localised cellular changes occur relatively early in the sequence of events leading to sensory patch segregation, as opposed to being a late consequence of the differentiation of the non-sensory territories.

      I don't follow the (x) cuts for panels h and i, as to where they were taken from and why there seems to be an epithelial curvature and what it was supposed to represent.

      We have added black arrows next to the panels 4c’, 4f’, and 4i’ to indicate the positions of the z-projections and modified the legend accordingly. The epithelial curvature is probably due to the folding of the tissue bordering the sensory organs during the manipulation/mounting of the tissue for imaging.

      (4) Figure 5. The control images do not show the apical alignment and the basal constriction well. This could be because the age of choice, E15, was a little late. Unfortunately, the lack of clarity in the control results makes it difficult to illustrate the lack of cellular changes in the mutant. The only take-home message that one could extract from this figure is a mild mixing of Sox2 and Lmx1a-Gfp cells in the mutant and not much else. Also, please indicate the level where (x) was taken from.

      Black arrows have been placed next to images 5e and 5l to highlight the positions of the z-projections. The stage E15 chosen for analysis was appropriate to compare the boundary domains once segregation is normally completed. We believe the results show some differences in the cellular features of the boundary domain in the Lmx1a-null mouse, and we have in fact quantified this using Epitool in Figure 5 – Suppl. Fig 1. Cells are more elongated and better aligned in the Lmx1a-null than in the heterozygous samples.

      (5) Figure 7. I think the cellular disruption caused by the ROCK inhibitor, shown in q', is too severe to be able to pin to a specific effect of ROCK on border formation. In that regard, the ectopic expression of the dominant negative form of ROCK using RCAS approach is better, even though because it is a replication competent form of RCAS, it is still difficult to correlate infected cells to functional disruption.

      We used a replication-competent construct to induce a large patch of infection, increasing our chances of observing a defect in sensory organ segregation and boundary formation. We agree that this approach does not allow us to control the timing of overexpression, but the mosaicism in gene expression, allowing us to compare in the same tissue large regions with/without perturbed ROCK activity, proved more informative than the pharmacological/in vitro experiments.

      (6) Figure 8. Outline the ROI of i in h, and k in j. Outline in k the comparable region in k'. In k", F-actin staining is not uniform. Indicate where (x) was taken from in k.

      The magnified regions were boxed in Figure 8 f, h, and j. Region outlined in figures k’-k” has also been outlined in corresponding region in figure k. Additionally, black arrows have been placed next to images 8g", 8i", and 8k" to highlight the positions of the z-projections. An appropriate explanation has also been added to the figure legend.

      Minor comments:

      (1) P.18, 1st paragraph, extra bracket at the end of the paragraph.

      Bracket removed

      (2) P.22, line 11, in ovo may be better than in vivo in this case.

      We agree, this has been corrected. 

      (3) P.25, be consistent whether it is GFP or EGFP.

      Corrected to GFP.

      (4) P.26, line 5. Typo on "an"

      Corrected to “and”

      Author response image 1.

      Expression of Laminin and Sox2 in the E13 mouse inner ear. a-a’’’) Low magnification view of the utricle, the lateral crista, and the non-sensory (Sox2-negative) domain separating these. Laminin staining is detected at relatively high levels in the basement membrane underneath the sensory patches. At higher magnification (b-b’’’), an upward displacement of the basement membrane (arrow) is visible in the region of reduced Sox2 expression, corresponding to the “boundary domain” (bracket). 

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary: As TDP-43 mislocalization is a hallmark of multiple neurodegenerative diseases, the authors seek to identify pathways that modulate TDP-43 levels. To do this, they use a FACS based genome wide CRISPR KD screen in a Halo tagged TDP-43 KI iPSC line. Their screen identifies a number of genetic modulators of TDP-43 expression including BORC which plays a role in lysosome transport.

      Strengths:

      Genome wide CRISPR based screen identifies a number of modulators of TDP-43 expression to generate hypotheses regarding RNA BP regulation and perhaps insights into disease.

      Weaknesses:

      It is unclear how altering TDP-43 levels may relate to disease where TDP-43 is not altered in expression but mislocalized. This is a solid cell biology study, but the relation to disease is not clear without providing evidence of BORC alterations in disease or manipulation of BORC reversing TDP-43 pathology in disease.

      We thank the reviewer for this comment and have expanded the discussion to cover the role TDP-43 may play in the BORCS8-associated neurodegenerative disorder and how understanding the way lysosome localization changes TDP-43 levels may help patients (lines 313-321).

      The mechanisms by which BORC and lysosome transport modulate TDP-43 expression are unclear. Presumably, this may be through altered degradation of TDP protein but this is not addressed.

      We agree with the reviewer that understanding the mechanism by which lysosome transport regulates TDP-43 levels is important and plan to examine this in future studies.

      Previous studies have demonstrated that TDP-43 levels can be modulated by altering lysosomal degradation so the identification of lysosomal pathways is not particularly novel.

      We thank the reviewer for this comment and have updated the text to make this clearer (lines 310-313). What hasn’t been observed previously is a change in lysosome localization affecting TDP-43 levels.

      It is unclear whether this finding is specific to TDP-43 levels or whether lysosome localization may more broadly impact proteostasis in particular of other RNA BPs linked to disease.

      We agree that this is an interesting question and something that should be investigated in future studies.

      Unclear whether BORC depletion alters lysosome function or simply localization.

      We thank the reviewer for this comment. Lysosome function related to protein turnover has not yet been examined in the literature after loss of BORC, but other aspects of lysosome function (including lipid metabolism and autophagic flux) have been shown to be disrupted upon loss of BORC. We have updated the discussion to address this (lines 292-296).

      Reviewer #2 (Public review):

      Summary: The authors employ a novel CRISPRi FACS screen and uncover the lysosomal transport complex BORC as a regulator of TDP-43 protein levels in iNeurons. They also find that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels. This is highly significant for the field given that a) other proteins could also be regulated in this way, b) understanding mechanisms that influence TDP-43 levels is significant given that its dysregulation is considered a major driver of several neurodegenerative diseases and c) the novelty of the proposed mechanism.

      Strengths:

      The novelty and information provided by the CRISPRi screen. The authors provide evidence indicating that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels and show a mechanistic link between lysosome mislocalization and TDP-43 dysregulation. The study highlights the importance of localized lysosome activity in axons and suggests that lysosomal dysfunction could drive TDP-43 pathologies associated with neurodegenerative diseases like FTD/ALS. Further, the methods and concepts will have an impact to the larger community as well. The work also sets up for further work to understand the somewhat paradoxical findings that even though the tagged TDP-43 protein is reduced in the screen, it does not alter cryptic exon splicing and there is a longer TDP-43 half-life with BORC KD.

      Weaknesses:

      While the data is very strong, the work requires some additional clarification.

      We thank the reviewer for these comments. Our detailed responses are included below in the “recommendations for authors” section.

      Reviewer #3 (Public review):

      Summary: In this work, Ryan et al. have performed a state-of-the-art full genome CRISPR-based screen of iNeurons expressing a tagged version of TDP-43 in order to determine expression modifiers of this protein. Unexpectedly, using this approach the authors have uncovered a previously undescribed role of the BORC complex in affecting the levels of TDP-43 protein, but not mRNA expression. Taken together, these findings represent a very solid piece of work that will certainly be important for the field.

      Strengths:

      BORC is a novel TDP-43 expression modifier that has never been described before and it seemingly acts on regulating protein half life rather than transcriptome level. It has been long known that different labs have reported different half-lives for TDP-43 depending on the experimental system but no work has ever explained these discrepancies. Now, the work of Ryan et al. has for the first time identified one of these factors which could account for these differences and play an important role in disease (although this is left to be determined in future studies).

      The genome wide CRISPR screening has been demonstrated to yield novel results with high reproducibility and could eventually be used to search for expression modifiers of many other proteins involved in neurodegeneration or other diseases.

      Weaknesses:

      The fact that TDP-43 mRNA does not change following BORCS6 KD is based on a single qRT-PCR that does not really cover all possibilities. For example, the mRNA total levels may not change but the polyA sites may have switched from the highly efficient pA1 to the less efficient and nuclear retained pA4. There are therefore a few other experiments that could have been performed to make this conclusion more compelling, maybe also performing RNAscope experiments to make sure that no change occurred in TDP-43 mRNA localisation in cells.

      We thank the reviewer for this comment. To address this point, we performed an analysis of polyA sites on our RNA sequencing data using REPAC and did not find a change in TDP-43 polyadenylation after BORC KD (Figure S6C). Other transcripts do have altered polyA sites, which are summarized in Figure S6C. We also performed HCR FISH for TARDBP mRNA in TDP-43 and BORC KD neurons. While we did not see a difference in RNA localization (see A below, numbers on brackets indicate p-values), we also were not able to detect a significant difference in total TARDBP mRNA levels upon TDP-43 KD (see B below, numbers on brackets indicate p-values), suggesting that some of the signal detected is non-specific to TARDBP. Because of this, we cannot conclusively say that BORC KD does not alter TARDBP mRNA localization using the available tools.

      Author response image 1.

      Even assuming that the mRNA does not change, no explanation for the change in TDP-43 protein half life has been proposed by the authors. This will presumably be addressed in future studies: for example, are mutants that lack different domains of TDP-43 equally affected in their half-lives by BORC KD? Alternatively, can a mass-spec be attempted to see whether TDP-43 PTMs change following BORCS6 KD?

      We agree with the reviewer that these are important experiments that could be done in the future to further examine the mechanism by which loss of BORC alters TDP-43 half-life. We examined our proteomics data for differential phosphorylation and ubiquitination in NT vs BORC KD (Figure S7G-H). We were unable to detect PTMs on TDP-43, so we cannot say if they contribute to the change in TDP-43 half-life we observed.

      Reviewer #1 (Recommendations for the authors):

      Recommendations are detailed in the public review.

      Reviewer #2 (Recommendations for the authors):

      Ryan et al, employ a CRISPRi FACS screen and uncover the lysosomal transport complex BORC as a regulator of TDP-43 protein levels in iNeurons. The authors provide strong evidence indicating that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels. The authors then provided additional evidence of TDP-43 perturbations under lysosome-inhibiting drug conditions, underscoring a mechanistic link between lysosome mislocalization and TDP-43 dysregulation. The study highlights the importance of localized lysosome activity in axons and suggests that lysosomal dysfunction could drive TDP-43 pathologies associated with neurodegenerative diseases like FTD/ALS. The work is exciting and could be highly informative for the field.

      Concerns: There are some disconnects between the figures and the main text that can benefit from refining of the figures to align better with the main text. This does not require additional experiments other than perhaps Figure 4B. The impact of the work could be further discussed - it is an interesting disconnect between the fact BORC KD causes decreased IF of the Halo-tagged TDP-43 and lysosomal transport, however this reduction does not impact cryptic exon expression and also increases TDP-43 half life (and of other proteins). It is a very interesting and potentially informative part of the manuscript.

      We thank the reviewer for their detailed reading of our manuscript. We have endeavored to better match the figures and the text and have added more discussion of the impact of the work.

      Minor:

      (1) Suggestion: relating to the statement "Gene editing was efficient, with almost all selected clones correctly edited." - please provide values or %.

      We updated the text to remove the statement about the editing efficiency, instead saying we identified a clone that was correct for both sequence and karyotype (lines 83-85).

      (2) Relating to Figure 1A: Please provide clarification regarding tagging strategy with the halotag - e.g. why in front of exon2.

      We updated the figure legend to reflect that the start codon for TDP-43 is in exon 2, hence why we placed the HaloTag there.

      (3) Relating to Figure S1: A and B seems to have been swapped.

      We thank the reviewer for catching this mistake and have fixed the figure/text.

      (4) Relating to Figure 1B: figure legend does not indicate grayscale coloring of TDP-43 signal.

      We have added text in the figure legend to indicate that the Halo signal is shown in grayscale in the left-hand panels.

      (5) Relating to Figure 1C: can the authors clarify abbreviation for 'NT' in text and legend.

      We thank the reviewer for catching this and have indicated in the text and figure legend that NT refers to the non-targeting sgRNA that was used as a control for comparison to the TDP-43 KD sgRNA.

      (6) Relating to figure 2B and S2A: main text mentioned "Non-targeting Guides" however the figure does not show non-targeting guides to confirm.

      We thank the reviewer for catching this oversight; we updated the figure legends for these figures to indicate that the non-targeting (NT) guides are shown in gray on the rank plot. They cluster towards the middle, more horizontal portion of the graphs, showing that the more vertical sections of the graph are hits.

      (7) Suggestion: To make it easier on the reader, please provide overlap numbers for the following statement ..."In comparing the top GO terms associated with genes that increase or decrease Halo-TDP-43 levels in iNeurons, we found that almost none altered Halo-TDP-43 levels in iPSCs...".

      We thank the reviewer for this comment and have updated the text to indicate that only a single term is shared between the iPSC and iNeuron screens (lines 113-117).

      (8) Relating to the statement "We cloned single sgRNA plasmids for 59 genes that either increased or decreased Halo-TDP-43 in iNeurons but not in iPSCs." Can the authors provide a list of the 59 genes.

      We have included a new column in the supplemental table S1 indicating the result of the Halo microscopy validation to hopefully clarify which genes lead to a validated phenotype and which did not.

      (9) Relating to the statement "To rule out the possibility of neighboring gene or off-target effects of CRISPRi, as has been reported previously<sup>15</sup>, we examined the impact of BORC knockout (KO) on TDP-43 levels. Using the pLentiCRISPR system, which expresses the sgRNA of interest on the same plasmid as an active Cas9<sup>16</sup> we found that KO of BORCS7 using two different sgRNAs decreased TDP-43 levels by immunofluorescence (Figure 5C-D)." Please provide clarification as to why BORCS7 was chosen out of all the BORCS? From the data presentation thus far (Figure 4B & 5A), the reader might have anticipated testing BORCS6 for panels 5C-D.

      We thank the reviewer for this comment. We tried a couple of BORC subunits with the pLentiCRISPR system, but BORCS7 was the only one for which we were convinced we had achieved a functional knockout, based on lysosome localization. We think that either the guides were not ideal for the other BORC components we tried, or we did not get efficient gene editing across the population of cells tested. Because we had previously been working with knockdown, and CRISPRi guides are not the same as CRISPR knockout guides, we could not use the existing guide sequences that we know work well for BORC. Since loss of one BORC gene causes functional loss of the complex and restricts lysosomes to the soma, we did not feel it necessary to assay all 8 genes.

      (10) Relating to the statement "We treated Halo-TDP-43 neurons with various drugs that disrupt distinct processes in the lysosome pathway and asked if Halo-TDP-43 levels changed. Chloroquine (decreases lysosomal acidity), CTSBI (inhibits cathepsin B protease), ammonium chloride (NH4Cl, inhibits lysosome-phagosome fusion), and GPN (ruptures lysosomal membranes) all consistently decreased Halo-TDP-43 levels (Figure 6A-B, S5A-C)" Please provide interpretations for Figures S5A and S5C in text.

      We thank the reviewer for catching this oversight and have updated the text accordingly (lines 183-191).

      (11) Relating to figure 6E: please provide in legend what the different colors used correlate with (i.e. green/brown for BORCS7 KD)?

      We thank the reviewer for pointing this out. These colors were mistakenly left in the figure from a version looking to see if the observed effects were driven by a single replicate rather than a consistent change (each replicate has a slightly different color). As the colors are intermingled and not separated, we concluded the effect was not driven by a single replicate. The colors have been removed from the updated figure for simplicity.

      (12) Relating to the statement "We observed a similar trend for many proteins in the proteome (Figure 8B)" This statement can benefit from stating which trend the authors are referring to, it is currently unclear from the volcano plot shown for Figure 8B.

      We thank the reviewer for catching this and have updated the text accordingly.

      (13) Relating to the statement "For almost every gene, we observed an increase or decrease in Halo-TDP-43 levels without a change in Halo-TDP-43 localization or compartment specific level changes (Figure 4B)." Please provide: (1) the number of genes examined, (2) additional clarification of "localization" and "compartment specific" level changes, (3) some quantification and or additional supporting data of the imaging results. Figures 5A-B presents with the same concern relating to the comment "To determine if results from Halo-TDP-43 expression assays also applied to endogenous, untagged TDP-43 levels, we selected 22 genes that passed Halo validation and performed immunofluorescence microscopy for endogenous (untagged) TDP-43 (Figure 4D-G,5A-B, S4E-F)." please clarify further.

      We thank the reviewer for requesting this clarification. This statement refers to all 59 genes tested by Halo imaging; only one (MFN2) showed any hints of aggregation or changes in localization, while every other gene (58) showed what appeared to be global changes in Halo-TDP-43 levels. We were initially intrigued by the MFN2 phenotype; however, we were unable to replicate it on endogenous TDP-43 and thus concluded that this might be an effect specific to the tagged protein. The images shown in Figure 4B are representative of the changes we observed across all 59 genes tested (if changes were present). From the 59 genes for which we observed a change in Halo-TDP-43 levels by microscopy, we selected a smaller number to move forward to immunofluorescence for TDP-43. We picked a subset of genes from each of the different categories we had identified (mitochondria, m6A, ubiquitination, and some miscellaneous) to validate by immunofluorescence, thinking that genes in the same pathway would act similarly. We have added a column to the supplemental table S1 indicating which genes were tested by immunofluorescence and what the result was. We have also attempted to clarify the results section to make the above clearer.

      (14) Relating to the statement "To determine if results from Halo-TDP-43 expression assays also applied to endogenous, untagged TDP-43 levels, we selected 22 genes that passed Halo validation and performed immunofluorescence microscopy for endogenous (untagged) TDP-43 (Figure 4D-G, 5A-B, S4E-F). Of these, 18 (82%) gene knockdowns showed changes in endogenous TDP-43 levels (Figure 4D-G, S4E-F)." It is difficult to identify the 18 or 22 genes in the figures as described in the main text.

      We added columns to the supplemental table S1 listing the genes and the result in each assay.

      (15) Relating to figures S7A and 8A and the first part of the section "TDP-43, like the proteome, shows longer turnover time in BORC KD neurons" Can the authors provide clarification why the SunTag assay was performed with BORCS6 KD (S7A) but the follow-up experiment (8A) was performed with BORCS7 KD. Does BORCS6 KD show similar results as BORCS7 with the SunTag assay, and does TDP-43 protein abundance with BORCS7 KD show similar results as BORCS6?

      Because loss of any of the 8 BORC genes causes functional loss of BORC and lysosomes to be restricted to the peri-nuclear space, we used BORC KDs interchangeably. Additionally, all BORC KDs had similar effects on Halo-TDP-43 levels.

      Reviewer #3 (Recommendations for the authors):

      Adding more control experiments that TDP-43 mRNA is really not affected following BORC KD

      We performed a FISH experiment to examine TARDBP mRNA localization upon BORC KD but were unable to conclusively say whether BORC KD changes TARDBP mRNA localization (see above). We also analyzed our RNA sequencing experiment for alternative polyadenylation sites upon BORC KD. Results are in Figure S6C.

      Although this could be part of a future study, the authors should try and determine what are the changes to TDP-43 that drive a change in the half-life.

      We agree with the reviewer that these are important experiments and hope to figure this out in the future.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weakness:

      Although a familiarity preference is not found, it is possible that this is related to the nature of the stimuli and the amount of learning that they offer. While infants here are exposed to the same perceptual stimulus repeatedly, infants can also be familiarised to more complex stimuli or scenarios. Classical statistical learning studies for example expose infants to specific pseudo-words during habituation/familiarisation, and then test their preference for familiar vs novel streams of pseudo-words. The amount of learning progress in these probabilistic learning studies is greater than in perceptual studies, and familiarity preferences may thus be more likely to emerge there. For these reasons, I think it is important to frame this as a model of perceptual habituation. This would also fit well with the neural net that was used, which is processing visual stimuli rather than probabilistic structures. If statements in the discussion are limited to perceptual paradigms, they would make the arguments more compelling. 

      Thank you for your thoughtful feedback. We have now qualified our claims more explicitly throughout the manuscript to clarify the scope of our study. Specifically, we have made the following revisions:

      (1) Title Update: We have modified the title to “A stimulus-computable rational model of visual habituation in infants and adults” to explicitly specify the domain of our model.

      (2) Qualifying Language Throughout Introduction: We have refined our language throughout the introduction to ensure the scope of our claims is clear. Specifically, we have emphasized that our model applies to visual habituation paradigms by incorporating qualifying language where relevant. At the end of Section 1, we have revised the statement to: "Habituation and dishabituation to sequential visual stimuli are well described by a rational analysis of looking time." This clarification makes sure that our model is framed within the context of visual habituation paradigms, particularly those involving structured sequences of stimuli, while acknowledging that habituation extends beyond the specific cases we study.

      (3) New Paragraph on Scope in the Introduction: We have added language in the Introduction acknowledging that while visual habituation is a fundamental mechanism for learning, it is not the only form of habituation. Specifically, we highlight that: “While habituation is a broadly studied phenomenon across cognitive domains—including language acquisition, probabilistic learning, and concept formation—our focus here is on visual habituation, where infants adjust their attention based on repeated exposure to a visual stimulus.”

      (4) New Paragraph on Scope in the General Discussion: We have also revisited this issue in the General Discussion. We added a dedicated paragraph discussing the scope: “This current work focuses on visual habituation, a fundamental but specific form of habituation that applies to sequential visual stimuli. While habituation has been studied across various domains, our model is specifically designed to account for looking time changes in response to repeated visual exposure. This focus aligns with our choice of perceptual representations derived from CNNs, which process visual inputs rather than abstract probabilistic structures. Visual habituation plays a foundational role in infant cognition, as it provides a mechanism for concept learning based on visual experience. However, it does not encompass all forms of habituation, particularly those involving complex rule learning or linguistic structures. Future work should investigate whether models like RANCH can be extended to capture habituation mechanisms in other learning contexts.”

      Reviewer #2 (Public review):

      There are no formal tests of the predictions of RANCH against other leading hypotheses or models of habituation. This makes it difficult to evaluate the degree to which RANCH provides an alternative account that makes distinct predictions from other accounts. I appreciate that because other theoretical descriptions haven't been instantiated in formal models this might be difficult, but some way of formalising them to enable comparison would be useful. 

      We appreciate the reviewer's concern regarding formal comparisons between RANCH and other leading hypotheses of habituation. A key strength of RANCH is that it provides quantitative, stimulus-computable predictions of looking behavior, something that existing theoretical accounts do not offer. Because previous models cannot generate predictions about behavior, we cannot directly compare them with RANCH.

      The one formal model that the reviewer might be referring to is the Goldilocks model, discussed in the introduction and shown in Figure 1. We did in fact spend considerable time in an attempt to implement a version of the Goldilocks model as a stimulus-computable framework for comparison. However, we found that it required too many free parameters, such as the precise shape of the inverted U-shape that the Goldilocks model postulates, making it difficult to generate robust predictions that we would feel confident attributing to this model specifically. This assertion may come as a surprise to a reader who expects that formal models should be able to make predictions across many situations, but prior models 1) cannot be applied to specific stimuli, and 2) do not generate dynamics of looking time within each trial. These are both innovations of our work. Instead, even prior formal proposals derive metrics (e.g., surprisal) that can only be correlated with aggregate looking time. And prior, non-formalized theories, such as the Hunter and Ames model, are simply not explicit enough to implement. 

      To clarify this point, we have now explicitly stated in the Introduction that existing models are not stimulus-computable and do not generate predictions for looking behavior at the level of individual trials: 

      “Crucially, RANCH is the first stimulus-computable model of habituation, allowing us to derive quantitative predictions from raw visual stimuli. Previous theoretical accounts have described broad principles of habituation, but they do not generate testable, trial-by-trial predictions of looking behavior. As a result, direct comparisons between RANCH and these models remain challenging: existing models do not specify how an agent decides when to continue looking or disengage, nor do they provide a mechanistic link between stimulus properties and looking time. By explicitly modeling these decision processes, RANCH moves beyond post-hoc explanations and offers a computational framework that can be empirically validated and generalized to new contexts.” 

      We also highlight that our empirical comparisons in Figure 1 evaluate theoretical predictions based on existing conceptual models using behavioral data, rather than direct model-to-model comparisons: 

      “Addressing these three challenges allowed us to empirically test competing hypotheses about habituation and dishabituation using our experimental data (Figure \ref{fig:conceptual}). However, because existing models do not generate quantitative predictions, we could not directly compare RANCH to alternative computational models. Instead, we evaluated whether RANCH accurately captured key behavioral patterns in looking time.”

      The justification for using the RMSEA fitting approach could also be stronger - why is this the best way to compare the predictions of the formal model to the empirical data? Are there others? As always, the main issue with formal models is determining the degree to which they just match surface features of empirical data versus providing mechanistic insights, so some discussion of the level of fit necessary for strong inference would be useful. 

      Thank you for recommending additional clarity on our choice of evaluation metrics. RMSE is a very standard measure (for example, it’s the error metric used in fitting standard linear regression!). On the other hand, it captures absolute rather than relative errors. Correlation-based measures (e.g., r and R²-type measures) provide a measure of relative distance between predictive measures. In our manuscript we reported both RMSE and R². In the revised manuscript, we have now:

      (1) Added a paragraph in the main text explaining that RMSE captures the absolute error in the same units as looking time, whereas R² reflects the relative proportion of variance explained by the model:

      “RANCH predictions qualitatively matched habituation and dishabituation in both infants and adults. To quantitatively evaluate these predictions, we fit a linear model (adjusting model‐generated samples by an intercept and scaling factor) and then assessed two complementary metrics. First, the root mean squared error (RMSE) captures the absolute error in the same units as looking time. Second, the coefficient of determination ($R^2$) measures the relative variation in looking time that is explained by the scaled model predictions. Since each metric relies on different assumptions and highlights distinct aspects of predictive accuracy, they together provide a more robust assessment of model performance. We minimized overfitting by employing cross‐validation—using a split‐half design for infant data and ten‐fold for adult data—to compute both RMSE and $R^2$ on held‐out samples.”

      (2) Updated Table 1 to include both RMSE and R² for each model variant and linking hypothesis. We now report both RMSE and R² across the two experiments.

      We hope these revisions address your concerns by offering a more comprehensive and transparent assessment of our model’s predictive accuracy.
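
      The evaluation procedure described above (linear rescaling of model output, then RMSE and R² on held-out data) can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names, the `model_samples` and `looking_times` arguments, and the fold assignment are all assumptions made for the example. Setting `k=2` corresponds to a split-half design and `k=10` to ten-fold cross-validation.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(y_true, y_pred):
    """Absolute error, in the same units as y_true."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate(model_samples, looking_times, k, seed=0):
    """Fit intercept and slope on k-1 folds, then score the held-out fold.

    Hypothetical helper: returns one (RMSE, R^2) pair per fold.
    """
    idx = list(range(len(model_samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([model_samples[i] for i in train],
                        [looking_times[i] for i in train])
        y_true = [looking_times[i] for i in held_out]
        y_pred = [a + b * model_samples[i] for i in held_out]
        scores.append((rmse(y_true, y_pred), r_squared(y_true, y_pred)))
    return scores
```

      Because the intercept and scaling factor are estimated only on the training folds, both metrics in this sketch are out-of-sample estimates, which is the property that makes them comparable across model variants.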

      Regarding your final question, the desired level of fit for insight, our view is that – at least in theory development – measures of fit should always be compared between alternatives (rather than striving for some absolute level of prediction). We have attempted to do this by comparing fit within- and across-samples and via various ablation studies. We now make this point explicit in the General Discussion:

      More generally, while there is no single threshold for what constitutes a “good” model fit, the strength of our approach lies in the relative comparisons across model variants, linking hypotheses, and ablation studies. In this way, we treat model fit not as an absolute benchmark, but as an empirical tool to adjudicate among alternative explanations and assess the mechanistic plausibility of the model’s components.

      The difference in model predictions for identity vs number relative to the empirical data seems important but isn't given sufficient weight in terms of evaluating whether the model is or is not providing a good explanation of infant behavior. What would falsification look like in this context? 

      We appreciate the reviewer’s observation regarding the discrepancy between model predictions and the empirical data for identity vs. number violations. We were also very interested in this particular deviation and we discuss it in detail in the General Discussion, noting that RANCH is currently a purely perceptual model, whereas infants’ behavior on number violations may reflect additional conceptual factors. Moreover, because this analysis reflects an out-of-sample prediction, we emphasize the overall match between RANCH and the data (see our global fit metrics) rather than focusing on a single data point. Infant looking time data also exhibit considerable noise, so we caution against over-interpreting small discrepancies in any one condition. In principle, a more thorough “falsification” would involve systematically testing whether larger deviations persist across multiple studies or stimulus sets, which is beyond the scope of the current work.

      For the novel image similarity analysis, it is difficult to determine whether any differences are due to differences in the way the CNN encodes images vs in the habituation model itself - there are perhaps too many free parameters to pinpoint the nature of any disparities. Would there be another way to test the model without the CNN introducing additional unknowns? 

      Thank you for raising this concern. In our framework, the CNN and the habituation model operate jointly to generate predictions, so it can be challenging to parse out whether any mismatches arise specifically from one component or the other. However, we are not worried that the specifics of our CNN procedure introduce free parameters because:

      (1) The CNN introduces no additional free parameters in our analyses, because it is a pre‐trained model not fitted to our data.

      (2) We tested multiple CNN embeddings and observed similar outcomes, indicating that the details of the CNN are unlikely to be driving performance (Figure 12).

      Moreover, the key contribution of our second study is precisely that the model can generalize to entirely novel stimuli without any parameter adjustments. By combining a stable, off‐the‐shelf CNN with our habituation model, we can make out‐of‐sample predictions—an achievement that, to our knowledge, no previous habituation model has demonstrated.

      Related to that, the model contains lots of parts - the CNN, the EIG approach, and the parameters, all of which may or may not match how the infant's brain operates. EIG is systematically compared to two other algorithms, with KL working similarly - does this then imply we can't tell the difference between an explanation based on those two mechanisms? Are there situations in which they would make distinct predictions where they could be pulled apart? Also in this section, there doesn't appear to be any formal testing of the fits, so it is hard to determine whether this is a meaningful difference. However, other parts of the model don't seem to be systematically varied, so it isn't always clear what the precise question addressed in the manuscript is (e.g. is it about the algorithm controlling learning? or just that this model in general when fitted in a certain way resembles the empirical data?) 

      Thank you for highlighting these points about the model’s components and the comparison of EIG- vs. KL-based mechanisms. Regarding the linking hypotheses (EIG, KL, and surprisal), our primary goal was to assess whether rational exploration via noisy perceptual sampling could account for habituation and dishabituation phenomena in a stimulus-computable fashion. Although RANCH contains multiple elements—including the CNN for perceptual embedding, the learning model, and the action policy (EIG or KL)—we did systematically vary the “linking hypothesis” (i.e., whether sampling is driven by EIG, KL, or surprisal). We found that EIG and KL gave very similar fits, while surprisal systematically underperformed.

      We agree that future experiments could be designed to produce diverging predictions between EIG and KL, but examining these subtle differences is beyond the scope of our current work. Here, we sought to establish that a rational model of habituation, driven by noisy perceptual sampling, can deliver strong quantitative predictions—even for out-of-sample stimuli—rather than to fully disentangle forward- vs. backward-looking information metrics.
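
      To make the distinction between the three linking hypotheses concrete, the sketch below computes surprisal, a KL-based belief update, and expected information gain (EIG) for a toy learner. Everything here is an assumption made for illustration: the learner tracks a single Bernoulli parameter on a discrete grid, which is far simpler than RANCH's learning model over CNN embeddings, and the function names are hypothetical. EIG is taken, as is commonly done, to be the KL update averaged over the learner's own predictions for the next observation.

```python
import math

# Toy learner: belief over a Bernoulli parameter theta on a discrete grid.
# This is an illustrative stand-in, not RANCH's actual learning model.
GRID = [(i + 0.5) / 100 for i in range(100)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def predictive(belief, x):
    """Probability that the next binary observation equals x."""
    return sum(b * (t if x == 1 else 1 - t) for b, t in zip(belief, GRID))


def update(belief, x):
    """Bayes' rule for one binary observation x."""
    return normalize([b * (t if x == 1 else 1 - t) for b, t in zip(belief, GRID)])


def kl(p, q):
    """KL divergence D(p || q) in nats between two beliefs on the grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surprisal(belief, x):
    """Backward-looking: how unexpected was the observation just made?"""
    return -math.log(predictive(belief, x))


def kl_update(belief, x):
    """Backward-looking: how far did the observation move the belief?"""
    return kl(update(belief, x), belief)


def expected_information_gain(belief):
    """Forward-looking: expected belief change from one more observation."""
    return sum(predictive(belief, x) * kl_update(belief, x) for x in (0, 1))
```

      Starting from a uniform belief and feeding this toy learner the same observation repeatedly, EIG shrinks as the belief sharpens, while surprisal stays low for the repeated observation and rises sharply for a novel one. Because EIG is an average of KL updates, the two quantities tend to move together in simple settings like this one, which is consistent with the similar fits reported above.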

      We disagree, however, that we did not evaluate or formally compare other aspects of the model. In Table 1 we report ablation studies of different aspects of the model architecture (e.g., removal of learning and noise components). Further, the RMSE and R² values reported in Table 1 and Section 4.2.3 can be treated as out-of-sample estimates of performance and used for direct comparison (because Table 1 uses cross-validation and Section 4.2.3 reports out of sample predictions). 

      Perhaps the reviewer is interested in statistical hypothesis tests, but we do not believe these are appropriate here. Cross-validation provides a metric of out-of-sample generalization and model selection based on the resulting numerical estimates. Significance testing is not typically recommended, except in a limited subset of cases (see e.g. Vanwinckelen & Blockeel, 2012 and Raschka, 2018).

      Reviewer #1 (Recommendations for the authors):

      "We treat the number of samples for each stimulus as being linearly related to looking time duration." Looking times were not log transformed? 

      Thank you for your question. The assumption of a linear relationship between the model’s predicted number of samples and looking time duration is intended as a measurement transformation, not a strict assumption about the underlying distribution of looking times. This linear mapping is used simply to establish a direct proportionality between model-generated samples and observed looking durations.

      However, in our statistical analyses, we do log-transform the empirical looking times to account for skewness and stabilize variance. This transformation is standard practice when analyzing infant looking time data but is independent of how we map model predictions to observed times. Since there is no a priori reason to assume that the number of model samples must relate to looking time in a strictly log-linear way, we retained a simple linear mapping while still applying a log transformation in our analytic models where appropriate.

      It would be nice to have figures showing the results of the grid search over the parameter values. For example, a heatmap with sigma on x and eta on y, and goodness of fit indicated by colour, would show the quality of the model fit as a function of the parameters' values, but also if the parameters estimates are correlated (they shouldn't be). 

      Thank you for the suggestion. We agree that visualizing the grid search results can provide a clearer picture of how different parameter values affect model fit. In the supplementary materials, we already present analyses where we systematically search over one parameter at a time to find the best-fitting values.

      We also explored alternative visualizations, including heatmaps where sigma and eta are mapped on the x and y axes, with goodness-of-fit indicated by color. However, we found that the goodness of fit was very similar across parameter settings, making the heatmaps difficult to interpret due to minimal variation in color. This lack of variation in fit reflects the observation that our model predictions are robust to changes in parameter settings, which allows us to report strong out of sample predictions in Section 4. Instead, we opted to use histograms to illustrate general trends, which provide a clearer and more interpretable summary of the model fit across different parameter settings. Please see the heatmaps below, if you are interested. 

      Author response image 1.

      Model fit (measured by RMSE) across a grid of prior values for Alpha, Beta, and V shows minimal variation. This indicates that the model’s performance is robust to changes in prior assumptions.
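
      A grid search of the kind summarized in the image above can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' code: `simulate` stands for any hypothetical callable that maps a pair of parameter values to predictions aligned with the observed data, and the parameter names are placeholders.

```python
import itertools
import math


def rmse(observed, predicted):
    """Root mean squared error between two aligned sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def grid_search(simulate, observed, first_values, second_values):
    """RMSE for every pair of parameter values.

    `simulate(a, b)` is a hypothetical callable returning predictions
    aligned with `observed`.
    """
    return {
        (a, b): rmse(observed, simulate(a, b))
        for a, b in itertools.product(first_values, second_values)
    }


def summarize(fits):
    """Best cell, its RMSE, and the spread of RMSE across the grid."""
    best = min(fits, key=fits.get)
    spread = max(fits.values()) - min(fits.values())
    return best, fits[best], spread
```

      The spread returned by `summarize` is the quantity that determines how informative a heatmap will be: when it is small relative to the best RMSE, every cell is drawn in nearly the same color.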

      Regarding section 5.4, paragraph 2: It might be interesting to notice that a potential way to decorrelate these factors is to look at finer timescales (see Poli et al., 2024, Trends in Cognitive Sciences), which the current combination of neural nets and Bayesian inference could potentially be adapted to do. 

      Thank you for this insightful suggestion. We agree that examining finer timescales of looking behavior could provide valuable insights into the dynamics of attention and learning. In response, we have incorporated language in Section 5.4 to highlight this as a potential future direction: 

      Another promising direction is to explore RANCH’s applicability to finer timescales of looking behavior, enabling a more detailed examination of within-trial fluctuations in attention. Recent work suggests that analyzing moment-by-moment dynamics can help disentangle distinct learning mechanisms \autocite{poli2024individual}. Since RANCH models decision-making at the level of individual perceptual samples, it is well-suited to capture these fine-grained attentional shifts.

      Previous work integrating neural networks with Bayesian (like) models could be better acknowledged: Blakeman, S., & Mareschal, D. (2022). Selective particle attention: Rapidly and flexibly selecting features for deep reinforcement learning. Neural Networks, 150, 408-421. 

      Thank you for this feedback. We have now incorporated this citation into our discussion section: 

      RANCH integrates structured perceptual representations with Bayesian inference, allowing for stimulus-computable predictions of looking behavior and interpretable parameters at the same time. This integrated approach has been used to study selective attention \autocite{blakeman2022selective}.

      Unless I missed it, I could not find an OSF repository (although the authors refer to an OSF repository for a previous study that has not been included). In general, sharing the code would greatly help with reproducibility. 

      Thanks for this comment. We apologize that, although all of our code and data were available through GitHub, we did not provide links in the manuscript. We have now added these at the end of the introduction section.

      Reviewer #2 (Recommendations for the authors):

      Page 7 "infants clearly dishabituated on trials with longer exposures" - what are these stats comparing? Novel presentation to last familiar? 

      Thank you for pointing out this slightly confusing passage. The statistics reported compare looking time between the novel and familiar test trials after longer exposures. We have now added the following language:

      Infants clearly dishabituated on trials with longer exposures, looking longer at the novel stimulus than the familiar stimulus after long exposure.

      Order effects were covaried in the model - does the RANCH model predict similar order effects to those observed in the empirical data, ie can it model more generic changes in attention as well as the stimulus-specific ones? 

      Thank you for this question. If we understand correctly, you are asking whether RANCH can capture order effects over the course of the experiment, such as general decreases in attention across blocks. Currently, RANCH does not model these block-level effects—it is designed to predict stimulus-driven looking behavior rather than more general attentional changes that occur over time such as fatigue. In our empirical analysis, block number was included as a covariate to account for these effects statistically, but RANCH itself does not have a mechanism to model block-to-block attentional drift independent of stimulus properties. This is an interesting direction for future work, where a model could integrate global attentional dynamics alongside stimulus-specific learning. To address this, we have added a sentence in the General Discussion saying:

      Similarly, RANCH does not capture more global attention dynamics, such as block-to-block attentional drift independent of stimulus properties.

      "We then computed the root mean squared error (RMSE) between the scaled model results and the looking time data." Why is this the most appropriate approach to considering model fit? Would be useful to have a brief explanation. 

      Thank you for pointing this out. We believe that we have now addressed this issue in Response to Comment #2 from Reviewer 1. 

      The title of subsection 3.3 made me think that you would be comparing RANCH to alternate hypotheses or models but this seems to be a comparison of ways of fitting parameters within RANCH - I think worth explaining that. 

      We have now added a sentence in the subsection to make the content of the comparison more explicit: 

      Here we evaluated different ways of specifying RANCH's decision-making mechanism (i.e., different "linking hypotheses" within RANCH).

      3.5 would be useful to have some statistics here - does performance significantly improve? 

      As discussed above, we systematically compared model variants using cross-validated RMSE and R² values, which provide quantitative evidence of improved performance. While these differences are substantial, we do not report statistical hypothesis tests, as significance testing is not typically appropriate for model comparison based on cross-validation (see Vanwinckelen & Blockeel, 2012; Raschka, 2018). Instead, we rely on out-of-sample predictive performance as a principled basis for evaluating model variants.

      It would be very helpful to have a formal comparison of RANCH and other models - this seems to be largely descriptive at the moment (3.6).

      We believe that we have now addressed this issue in our response to the first comment.

      Does individual infant data show any nonlinearities? Sometimes the position of the peak look is very heterogenous and so overall there appears to be no increase but on an individual level there is. 

      Thank you for your question. Given our experimental design, each exposure duration appears in separate blocks rather than in a continuous sequence for each infant. Because of this, the concept of an individual-level nonlinear trajectory over exposure durations does not directly apply. Instead, each infant contributes looking time data to multiple distinct conditions, rather than following a single increasing-exposure sequence. Any observed nonlinear trend across exposure durations would therefore be a group-level effect rather than a within-subject pattern.

      In 4.1, why 8 or 9 exposures rather than a fixed number? 

      We used slightly variable exposure durations to reduce the risk that infants develop fixed expectations about when a novel stimulus will appear. We have now clarified this point in the text.

      Why do results differ for the model vs empirical data for identity? Is this to do with semantic processing in infants that isn't embedded in the model? 

      Thank you for your comment. The discrepancy between the model and empirical data for identity violations is related to the discrepancy we discussed for number violations in the General Discussion. As noted there, RANCH relies on perceptual similarity derived from CNN embeddings, which may not fully capture distinctions that infants make.

      The model suggests the learner’s prior on noise is higher in infants than adults, so produces potentially mechanistic insights. 

      We agree! One of the key strengths of RANCH is its ability to provide mechanistic insights through interpretable parameters. The finding that infants have a higher prior on perceptual noise than adults aligns with previous research suggesting that early visual processing in infants is more variable and less precise.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      LRRK2 protein is familially linked to Parkinson's disease by the presence of several gene variants that all confer a gain-of-function effect on LRRK2 kinase activity. 

      The authors examine the effects of BDNF stimulation in immortalized neuron-like cells, cultured mouse primary neurons, hIPSC-derived neurons, and synaptosome preparations from the brain. They examine an LRRK2 regulatory phosphorylation residue, LRRK2 binding relationships, and measures of synaptic structure and function. 

      Strengths: 

      The study addresses an important research question: how does a PD-linked protein interact with other proteins, and contribute to responses to a well-characterized neuronal signalling pathway involved in the regulation of synaptic function and cell health? 

      They employ a range of good models and techniques to fairly convincingly demonstrate that BDNF stimulation alters LRRK2 phosphorylation and binding to many proteins. Some effects of BDNF stimulation appear impaired in (some of the) LRRK2 knock-out scenarios (but not all). A phosphoproteomic analysis of PD mutant Knock-in mouse brain synaptosomes is included. 

      We thank this Reviewer for pointing out the strengths of our work. 

      Weaknesses: 

      The data sets are disjointed, conclusions are sweeping, and not always in line with what the data is showing. Validation of 'omics' data is very light. Some inconsistencies with the major conclusions are ignored. Several of the assays employed (western blotting especially) are likely underpowered, findings key to their interpretation are addressed in only one or other of the several models employed, and supporting observations are lacking. 

      We appreciate the Reviewer’s overall evaluation. In this revised version, we have provided several novel results that strengthen the omics data and the mechanistic experiments and bring the conclusions in line with the data.

      As examples to aid reader interpretation: (a) pS935 LRRK2 seems to go up at 5 minutes but goes down below pre-stimulation levels after (at times when BDNF-induced phosphorylation of other known targets remains very high). This is ignored in favour of discussion/investigation of initial increases, and the fact that BDNF does many things (which might indirectly contribute to initial but unsustained changes to pLRRK2) is not addressed.  

      We thank the Reviewer for raising this important point, which we agree deserves additional investigation. Although phosphorylation does decrease below pre-stimulation levels, a reduction is also observed for ERK/AKT upon sustained exposure to BDNF in our experimental paradigm (Figure 1F-G). This phenomenon is well known in response to a number of extracellular stimuli and can be explained by mechanisms related to cellular negative feedback regulation, receptor desensitization (e.g. phosphorylation or internalization), or cellular adaptation. The effect on pSer935, however, is peculiar in that phosphorylation goes below the unstimulated level, as pointed out by the Reviewer. In contrast to ERK and AKT, whose phosphorylation is almost absent under unstimulated conditions (Figure 1F-G), the stoichiometry of Ser935 phosphorylation under unstimulated conditions is high. This observation is consistent with MS determination of the relative abundance of pSer935 (e.g. in whole brain, LRRK2 is nearly 100% phosphorylated at Ser935, see Nirujogi et al., Biochem J 2021). Thus, we hypothesized that the modest increase in phosphorylation driven by BDNF likely reflects a saturation or ceiling effect, indicating that the phosphorylation level is already near its maximum under resting conditions. Prolonged BDNF stimulation would then bring phosphorylation below pre-stimulation levels through the negative feedback mechanisms (e.g. phosphatase activity) explained above. To test this hypothesis, we conducted an experiment in which cells were pretreated for 90 minutes with the LRRK2 inhibitor MLi-2 to reduce basal phosphorylation of S935. After MLi-2 washout, we stimulated with BDNF for different times. We used GFP-LRRK2 stable lines for this experiment, since the ceiling effect was particularly evident (Figure S1A) and this model has been used for the interactomic study. As shown below (and incorporated in Fig. S1B in the manuscript), LRRK2 responds robustly to BDNF stimulation both in terms of pSer935 and pRABs. Phosphorylation peaks at 5-15 minutes, while it decreases to unstimulated levels at 60 and 180 minutes. Notably, while the peak of pSer935 at 5-15 minutes is similar to the untreated condition (supporting that Ser935 is nearly saturated in unstimulated conditions), the phosphorylation of RABs during this time period exceeds unstimulated levels. These findings support the notion that, under basal conditions, RAB phosphorylation is far from saturation. The antibodies used to detect RAB phosphorylation are the following: RAB10 (Abcam #ab230261) and RAB8 (pan-RAB; Abcam #ab230260).

      Given the robust response of RAB10 phosphorylation upon BDNF stimulation, we further investigated RAB10 phosphorylation during BDNF stimulation in naïve SH-SY5Y cells. We confirmed that the increase in pSer935 is coupled to an increase in pT73-RAB10. In this case too, RAB10 phosphorylation does not go below the unstimulated level, which aligns with the low pRAB10 stoichiometry in brain (Nirujogi et al., Biochem J 2021). This experiment adds the novel and exciting finding that BDNF stimulation increases LRRK2 kinase activity (RAB phosphorylation) in neuronal cells.

      Note that new supplemental figure 1 now includes: A) a comparison of LRRK2 pS935 and total protein levels before and after RA differentiation; B) differentiated GFP-LRRK2 SH-SY5Y (unstimulated, BDNF, MLi-2, BDNF+MLi-2); C) the kinetics of the BDNF response in differentiated GFP-LRRK2 SH-SY5Y.

      (b) Drebrin coIP itself looks like a very strong result, as does the increase after BDNF, but this was only demonstrated with a GFP over-expression construct despite several mouse and neuron models being employed elsewhere and available for coIP of endogenous LRRK2. Also, the coIP is only demonstrated in one direction. Similarly, the decrease in drebrin levels in mice is not assessed in the other model systems, coIP wasn't done, and mRNA transcripts are not quantified (even though others were). Drebrin phosphorylation state is not examined.

      We appreciate the Reviewer’s suggestions and have provided additional experimental evidence supporting the functional relevance of the LRRK2-drebrin interaction.

      (1) As suggested, we performed qPCR and observed that 1 month-old KO midbrain and cortex express lower levels of Dbn1 as compared to WT brains (Figure 5G). This result is in agreement with the western blot data (Figure 5H). 

      (2) To further validate the physiological relevance of the LRRK2-drebrin interaction, we performed two experiments:

      i) Western blots looking at pSer935 and pRab8 (pan Rab) in Dbn1 WT and knockout brains. As reported and quantified in Figure 2I, we observed a significant decrease in pSer935 and a trend toward decreased pRab8 in Dbn1 KO brains. This finding supports the notion that drebrin forms a complex with LRRK2 that is important for its activity, e.g. upon BDNF stimulation.

      ii) Reverse co-immunoprecipitation of YFP-drebrin full-length, N-terminal domain (1-256 aa) and C-terminal domain (256-649 aa) (plasmids kindly received from Professor Phillip R. Gordon-Weeks, Worth et al., J Cell Biol, 2013) with Flag-LRRK2 co-expressed in HEK293T cells. As shown in supplementary Fig. S2C, we confirm that YFP-drebrin binds LRRK2, with the N-terminal region of drebrin appearing to be the major contributor to this interaction. This result is important as the N-terminal region contains the ADF-H (actin-depolymerising factor homology) domain and a coiled-coil region known to directly bind actin (Shirao et al., J Neurochem 2017; Koganezawa et al., Mol Cell Neurosci. 2017). Interestingly, both full-length drebrin and its C-terminally truncated construct cause the same morphological changes in F-actin, indicating that drebrin-induced morphological changes in F-actin are mediated by its N-terminal domains rather than its intrinsically disordered C-terminal region (Shirao et al., J Neurochem, 2017; Koganezawa et al., Mol Cell Neurosci. 2017). Given the role of LRRK2 in actin-cytoskeletal dynamics and its binding to multiple actin-related proteins (Fig. 2 and Meixner et al., Mol Cell Proteomics. 2011; Parisiadou and Cai, Commun Integr Biol 2010), these results suggest the possibility that LRRK2 controls actin dynamics by competing with drebrin for binding to actin, and open new avenues for future studies.

      (3) To address the request to examine the drebrin phosphorylation state, we decided to perform another phosphoproteomic experiment, leveraging a parallel analysis incorporated in our latest manuscript (Chen et al., Mol Ther 2025). In this experiment, we isolated total striatal proteins from WT and G2019S KI mice and enriched the phosphopeptides. Unlike the experiment presented in Fig. 7, phosphopeptides were enriched from total striatal lysates rather than synaptosomal fractions, and phosphorylation levels were normalized to the corresponding total protein abundance. This approach was intended to avoid bias toward synaptic proteins, allowing for the analysis of a broader pool of proteins derived from a heterogeneous ensemble of cell types (neurons, glia, endothelial cells, pericytes etc.). We were pleased to find that this new experiment confirmed drebrin S339 as a differentially phosphorylated site, with a 3.7-fold higher abundance in G2019S Lrrk2 KI mice. The fact that this experiment showed an increased rather than a decreased phosphorylation stoichiometry in G2019S mice is likely due to the normalization of each peptide to its corresponding total protein. Gene ontology analysis of differentially phosphorylated proteins using a stringent term size (<200 genes) showed post-synaptic spines and presynaptic active zones as enriched categories (Fig. 3F). A SynGO analysis confirms both pre- and postsynaptic categories, with high significance for terms related to the postsynaptic cytoskeleton (Fig. 3G). This is particularly interesting because the starting material was whole striatal tissue – not synaptosomes as previously – indicating that the most significant phosphorylation differences occur in synaptic compartments. This once again reinforces our hypothesis that LRRK2 has a prominent role in the synapse.
      Overall, we confirmed with an independent phosphoproteomic analysis that LRRK2 kinase activity influences the phosphorylation state of proteins related to synaptic function, particularly the postsynaptic cytoskeleton. For clarity in data presentation, as mentioned by the Reviewers, we removed Figure 7 and incorporated this new analysis in figure 3, alongside the synaptic cluster analysis.
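
      To illustrate why normalizing a phosphopeptide to its total protein can reverse the apparent direction of a change, the following minimal sketch uses purely hypothetical intensities (arbitrary units, not values from our dataset). The names `phospho`, `total` and `log2_fold_change` are assumptions introduced only for this example.

```python
import math


def log2_fold_change(mutant: float, wild_type: float) -> float:
    """Return the log2 ratio of a mutant signal over a wild-type signal."""
    return math.log2(mutant / wild_type)


# Hypothetical intensities (arbitrary units); NOT measurements from the study.
phospho = {"WT": 100.0, "KI": 80.0}  # raw phosphopeptide signal
total = {"WT": 100.0, "KI": 40.0}    # total protein signal for the same protein

# Without normalization, the phosphopeptide appears decreased in the KI sample.
raw_lfc = log2_fold_change(phospho["KI"], phospho["WT"])

# After dividing each phosphopeptide signal by its total protein signal,
# the phosphorylation stoichiometry appears increased in the KI sample.
norm_lfc = log2_fold_change(
    phospho["KI"] / total["KI"],
    phospho["WT"] / total["WT"],
)

print(f"raw log2FC = {raw_lfc:.2f}, normalized log2FC = {norm_lfc:.2f}")
```

      In this toy case the raw signal falls while the normalized ratio rises, because the total protein signal falls further than the phosphopeptide signal.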

      Altogether, three independent omics approaches – (i) experimental LRRK2 interactomics in neuronal cells, (ii) a literature-based LRRK2 synaptic/cytoskeletal interactor cluster, and (iii) a phosphoproteomic analysis of striatal proteins from G2019S KI mice (to model LRRK2 hyperactivity) – converge on the synaptic actin cytoskeleton as a key hub of LRRK2 neuronal function.

      (c) The large differences in the CRISPR KO cells in terms of BDNF responses are not seen in the primary neurons of KO mice, suggesting that other differences between the two might be responsible, rather than the lack of LRRK2 protein. 

      Considering that some variability is expected for these types of cultures and across different species, any difference in response magnitude and kinetics could be attributed to the levels of TrkB and downstream components expressed by the two cell types.

      We are confident that differentiated SH-SY5Y cells provide a reliable model for our study, as we could translate the results obtained in SH-SY5Y cells to other models. However, to rule out the possibility that the more pronounced effect observed in SH-SY5Y KO cells with respect to Lrrk2 KO primary neurons was due to a CRISPR off-target effect, we performed an off-target analysis. Specifically, we selected the first 8 putative off-targets exhibiting a CFD (Cutting Frequency Determination) off-target score >0.2.

      As shown in supplemental file 1, sequence disruption was observed only in the LRRK2 on-target site in LRRK2 KO SH-SY5Y cells, while the 8 off-target regions remained unchanged across the genotypes and relative to the reference sequence.
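
      The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `OffTarget` record, the `select_off_targets` helper and the ranking by descending score are hypothetical and are not taken from the tool actually used to predict the off-target sites.

```python
from typing import List, NamedTuple


class OffTarget(NamedTuple):
    """Hypothetical record for one predicted off-target site."""
    locus: str
    cfd_score: float


def select_off_targets(
    candidates: List[OffTarget],
    threshold: float = 0.2,
    limit: int = 8,
) -> List[OffTarget]:
    """Keep sites with a CFD score above `threshold` and return the
    `limit` highest-scoring ones (ranking by score is an assumption)."""
    passing = [c for c in candidates if c.cfd_score > threshold]
    passing.sort(key=lambda c: c.cfd_score, reverse=True)
    return passing[:limit]
```

      Each selected region would then be sequenced in WT and KO cells and compared with the reference sequence.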

      (d) No validation of hits in the G2019S mutant phosphoproteomics, and no other assays related to the rest of the paper/conclusions. Drebrin phosphorylation is different but unvalidated, or related to previous data sets beyond some discussion. The fact that LRRK2 binding occurs, and increases with BDNF stimulation, should be compared to its phosphorylation status and the effects of the G2019S mutation. 

      As illustrated in the response to point (b), we performed a new phosphoproteomics investigation – with total striatal lysates instead of striatal synaptosomes and normalization of phosphopeptides to total protein – and found that S339 phosphorylation increases when LRRK2 kinase activity increases (G2019S). Regarding the request to validate drebrin phosphorylation, the main limitation is that no antibodies against phospho-Ser339 are available. We tried Phos-tag gels with striatal lysates, but could not detect any reliable and specific signal with the same drebrin antibody used for western blot (Thermo Fisher Scientific: MA120377) due to technical limitations of the Phos-tag method. We are confident that phosphorylation at S339 has physiological relevance, as it was identified 67 times across multiple proteomic discovery studies and is among the most frequently phosphorylated sites in drebrin (https://www.phosphosite.org/proteinAction.action?id=2675&showAllSites=true).

      To infer a possible role of this phosphorylation, we looked at the predicted pathogenicity of substitutions at this site using AlphaMissense (Cheng et al., Science 2023). As shown in the new supplementary figure (Fig. S3), amino acid substitutions within this site are predicted not to be pathogenic, a prediction that may also reflect the low confidence of the AlphaFold structure.

      Ser339 in human drebrin is located just before the proline-rich region (PP domain) of the protein. This region is situated between the actin-binding domains and the C-terminal Homer-binding sequences and plays a role in protein-protein interactions and cytoskeletal regulation (Worth et al., J Cell Biol, 2013). Of interest, this region was previously shown to be the interaction site of afadin (AFDN), a protein involved in multiple cytoskeleton-related processes, including synapse formation and function, through its regulation of puncta adherentia junctions, presynaptic differentiation, and cadherin complex assembly, which are essential for hippocampal excitatory synapses, spine formation, and learning and memory processes (Beaudoin, G. M., 3rd et al., J Neurosci, 2013). Of note, afadin is in the list of LRRK2-interacting proteins (https://www.ebi.ac.uk/intact/home), supporting a possible functional relevance of LRRK2-mediated drebrin phosphorylation in afadin-drebrin complex formation. This is now addressed in the Discussion section.

      The aim of this MS analysis in G2019S KI mice – now included in figure 3 – was to further validate the crucial role of LRRK2 kinase activity in the context of synaptic regulation, rather than to discover and characterize novel substrates. Consequently, Figure 7 has been eliminated. 

      Reviewer #2 (Public Review):  

      Taken as a whole, the data in the manuscript show that BDNF can regulate PD-associated kinase LRRK2 and that LRRK2 modifies the BDNF response. The chief strength is that the data provide a potential focal point for multiple observations across many labs. Since LRRK2 has emerged as a protein that is likely to be part of the pathology in both sporadic and LRRK2 PD, the findings will be of broad interest. At the same time, the data used to imply a causal throughline from BDNF to LRRK2 to synaptic function and actin cytoskeleton (as in the title) are mostly correlative and the presentation often extends beyond the data. This introduces unnecessary confusion. There are also many methodological details that are lacking or difficult to find. These issues can be addressed. 

      We appreciate the Reviewer’s positive feedback on our study. We also value the suggestion to present the data in a more streamlined and coherent way. In response, we have updated the title to better reflect our overall findings: “LRRK2 Regulates Synaptic Function through Modulation of Actin Cytoskeletal Dynamics.” Additionally, we have included several experiments that we believe enhance and unify the study.

      (1) The writing/interpretation gets ahead of the data in places and this was confusing. For example, the abstract highlights prior work showing that Ser935 LRRK2 phosphorylation changes LRRK2 localization, and Figure 1 shows that BDNF rapidly increases LRRK2 phosphorylation at this site. Subsequent figures highlight effects at synapses or with synaptic proteins. So is the assumption that LRRK2 is recruited to (or away from) synapses in response to BDNF? Figure 2H shows that LRRK2-drebrin interactions are enhanced in response to BDNF in retinoic acid-treated SH-SY5Y cells, but are synapses generated in these preps? How similar are these preps to the mouse and human cortical or mouse striatal neurons discussed in other parts of the paper (would it be anticipated that BDNF act similarly?) and how valid are SHSY5Y cells as a model for identifying synaptic proteins? Is drebrin localization to synapses (or its presence in synaptosomes) modified by BDNF treatment +/- LRRK2? Or do LRRK2 levels in synaptosomes change in response to BDNF? The presentation requires re-writing to stay within the constraints of the data or additional data should be added to more completely back up the logic. 

      We thank the Reviewer for the thorough suggestions and comments. We have extensively revised the text to accurately reflect our findings without overinterpreting. In particular, we agree with the Reviewer that differentiated SH-SY5Y cells are not identical to primary mouse or human neurons; however, both neuronal models respond to BDNF. In support of our observations, SH-SY5Y cells are known to respond to BDNF: a common protocol for differentiating SH-SY5Y cells involves BDNF in combination with retinoic acid (Martin et al., Front Pharmacol, 2022; Kovalevich et al., Methods Mol Biol, 2013). Additionally, it has been reported that SH-SY5Y cells can form functional synapses (Martin et al., Front Pharmacol, 2022). While we are aware that BDNF, drebrin or LRRK2 can also affect non-synaptic pathways, we focused on synapses when we moved to mouse models since: (i) MS and phosphoMS identified several cytoskeletal proteins enriched at the synapse; (ii) we and others have previously reported a role for LRRK2 in governing synaptic and cytoskeleton-related processes; (iii) the synapse is a critical site that becomes dysfunctional in the early stages of PD. We have now clarified and adjusted the text as needed. We have also performed additional experiments to address the Reviewer’s concern:

      (1) “Is the assumption that LRRK2 is recruited to (or away from) synapses in response to BDNF”? This is a very important point. There is consensus in the field that detecting endogenous LRRK2 in brain slices or in primary neurons via immunofluorescence is very challenging with the commercially available antibodies (Fernandez et al., J Parkinsons Dis, 2022). We established a method in our previous studies to detect LRRK2 biochemically in synaptosomes (Cirnaru et al., Front Mol Neurosci, 2014; Belluzzi et al., Mol Neurodegener., 2016). While these data indicate that LRRK2 is present in the synaptic compartments, it would be quite challenging to apply this method to the present study. In fact, applying acute BDNF stimulation in vivo and then isolating synaptosomes is a complex experiment beyond the timeframe of the revision, as it requires ethical approval for mouse work. However, this is definitely an intriguing angle to explore in the future.

      (2) “Is drebrin localization to synapses (or its presence in synaptosomes) modified by BDNF treatment +/- LRRK2?” To try to address this question, we adapted a previously published assay to measure drebrin exodus from dendritic spines. During calcium entry and LTP, drebrin exits dendritic spines and accumulates in the dendritic shafts and cell body (Koganezawa et al., 2017). This facilitates the reorganization of the actin cytoskeleton (Shirao et al., 2017). Given the known role of drebrin and its interaction with LRRK2, we hypothesized that LRRK2 loss might affect drebrin relocalization during spine maturation.

      To test this, we treated DIV14 primary cortical neurons from Lrrk2 WT and KO mice with BDNF for 5 minutes, 15 minutes, and 24 hours, then performed confocal imaging of drebrin localization (Author response image 1). Neurons were transfected at DIV4 with GFP (cell filler) and PSD95 (dendritic spines) for visualization, and endogenous drebrin was stained with an anti-drebrin antibody. We then measured drebrin's overlap with PSD95-positive puncta to track its localization at the spine.

      In Lrrk2 WT neurons, drebrin relocalized from spines after BDNF stimulation, with the exodus peaking at 15 minutes, and showed higher co-localization with PSD95 at 24 hours, indicating that spine remodeling occurred. In contrast, Lrrk2 KO neurons showed no drebrin exodus. These findings support the notion that LRRK2's interaction with drebrin is important for BDNF-induced spine remodeling. However, additional experiments with larger sample sizes are needed, which were not feasible within the revision timeframe (here n=2 experiments with independent neuronal preparations, n=4-7 neurons analyzed per experiment). Thus, we included the relevant figure as Author response image 1 but chose not to add it to the manuscript (figure 3).

      Author response image 1.

      Lrrk2 affects drebrin exodus from dendritic spines. Primary neurons from Lrrk2 WT and KO mice were transfected with GFP and PSD95 at DIV4, exposed to BDNF for different times (5 minutes, 15 minutes and 24 hours) at DIV14, and stained for endogenous drebrin. The amount of drebrin localizing in dendritic spines outlined by PSD95 was then assessed. The graph shows a pronounced decrease in drebrin content in WT neurons during short treatments and an increase after 24 hours. KO neurons present no evident variations in drebrin localization upon BDNF stimulation. Scale bar: 4 μm.

      (2) The experiments make use of multiple different kinds of preps. This makes it difficult at times to follow and interpret some of the experiments, and it would be of great benefit to more assertively insert "mouse" or "human" and cell type (cortical, glutamatergic, striatal, gabaergic) etc. 

      We thank the Reviewer for pointing this out. We have now more clearly specified the cell type and species identity throughout the text to improve clarity and interpretation.

      (3) Although BDNF induces quantitatively lower levels of ERK or Akt phosphorylation in LRRK2KO preps based on the graphs (Figure 4B, D), the western blot data in Figure 4C make clear that BDNF does not need LRRK2 to mediate either ERK or Akt activation in mouse cortical neurons and in 4A, ERK in SH-SY5Y cells. The presentation of the data in the results (and echoed in the discussion) writes of a "remarkably weaker response". The data in the blots demand more nuance. It seems that LRRK2 may potentiate a response to BDNF that in neurons is independent of LRRK2 kinase activity (as noted). This is more of a point of interpretation, but the words do not match the images.  

      We thank the Reviewer for pointing this out. We have rephrased our data presentation to better convey our findings. We were not surprised to find that loss of LRRK2 causes only a reduction of ERK and AKT activation upon BDNF stimulation rather than a complete loss. This is because these pathways are complex and redundant and are activated by a number of cellular effectors. The fact that LRRK2 is one among many players whose function can be compensated by other signaling molecules is also supported by the phenotype of Lrrk2 KO mice, which is measurable at 1 month but disappears in adulthood (4 and 18 months) (figure 5).

      Moreover, we removed the sentence “Of note, 90 mins of Lrrk2 inhibition (MLi-2) prior to BDNF stimulation did not prevent phosphorylation of Akt and Erk1/2, suggesting that LRRK2 participates in BDNF-induced phosphorylation of Akt and Erk1/2 independently from its kinase activity but dependently from its ability to be phosphorylated at Ser935 (Fig. 4C-D and Fig. 1B-C)” since the MLi-2 treatment prior to BDNF stimulation was not quantified and our new data point to an involvement of LRRK2 kinase activity upon BDNF stimulation.

      (4) Figure 4F/G shows an increase in PSD95 puncta per unit length in response to BDNF in mouse cortical neurons. The data do not show spine induction/dendritic spine density/or spine morphogenesis as suggested in the accompanying text (page 8). Since the neurons are filled/express gfp, spine density could be added or spines having PSD95 puncta. However, the data as reported would be expected to reflect spine and shaft PSDs and could also include some nonsynaptic sites. 

      The Reviewer is right. We have rephrased the text to reflect an increase in postsynaptic density (PSD) sites, which may include both spine and shaft PSDs, as well as potential nonsynaptic sites.

      (5) Experimental details are missing that are needed to fully interpret the data. There are no electron microscopy methods outside of the figure legend. And for this and most other microscopy-based data, there are few to no descriptions of what cells/sites were sampled, how many sites were sampled, and how regions/cells were chosen. For some experiments (like Figure 5D), some detail is provided in the legend (20 segments from each mouse), but it is not clear how many neurons this represents, where in the striatum these neurons reside, etc. For confocal z-stacks, how thick are the optical sections and how thick is the stack? The methods suggest that data were analyzed as collapsed projections, but they cite Imaris, which usually uses volumes, so this is confusing. The guide (sgRNA) sequences that were used should be included. There is no mention of sex as a biological variable. 

      We thank the Reviewer for pointing out this missing information. We have now included:

      (1) EM methods (page 24)

      (2) Methods for ICC and confocal microscopy now incorporates the Z-stack thickness (0.5 μm x 6 = 3 μm) on page 23.

      (3) Methods for Golgi-Cox staining now incorporates the Z-stack thickness and number of neurons and segments per neuron analyzed. 

      (4) The sex of mice is mentioned in the material and methods (page 17): “Approximately equal numbers of males and females were used for every experiment”.

      (6) For Figures 1F, G, and E, how many experimental replicates are represented by blots that are shown? Graphs/statistics could be added to the supplement. For 1C and 1I, the ANOVA p-value should be added in the legend (in addition to the post hoc value provided). 

      The blots in figure 1E, F and G are representative of several blots (at least n=5). The same readouts are part of figure 4, where quantifications are provided. We added the ANOVA p-value in the legend for figure 1C, 1I and 1K.

      (7) Why choose 15 minutes of BDNF exposure for the mass spec experiments when the kinetics in Figure 1 show a peak at 5 mins?  

      This is an important point. We repeated the experiment in GFP-LRRK2 SH-SY5Y cells (figure S1C) and included the 15 min time point. In addition to confirming that pSer935 increases similarly at 5 and 15 minutes, we also observed an increase in RAB phosphorylation at these time points. As mentioned in our response to Reviewer 1, we pretreated with MLi-2 for 90 minutes in this experiment to reduce the high basal phosphorylation stoichiometry of pSer935.

      (8) The schematic in Figure 6A suggests that iPSCs were plated, differentiated, and cultured until about day 70 when they were used for recordings. But the methods suggest they were differentiated and then cryopreserved at day 30, and then replated and cultured for 40 more days. Please clarify if day 70 reflects time after re-plating (30+70) or total time in culture (70). If the latter, please add some notes about re-differentiation, etc. 

      We thank the reviewer for providing further clarity on the iPSC methodology. In the submitted manuscript 70DIV represents the total time in vitro and the process involved a cryostorage event at 30DIV, with a thaw of the cells and a further 40 days of maturation before measurement.  We have adjusted the methods in both the text and figure (new schematic) to clarify this.  The cryopreservation step has been used in other iPSC methods to great effect (Drummond et al., Front Cell Dev Biol, 2020). Due to the complexity and length of the iPSC neuronal differentiation process, cryopreservation represents a useful method with which to shorten and enhance the ability to repeat experiments and reduce considerable variation between differentiations. User defined differences in culture conditions for each batch of neurons thawed can usefully be treated as a new and separate N compared to the next batch of neurons.

      (9) When Figures 6B and 6C are compared it appears that mEPSC frequency may increase earlier in the LRRK2KO preps than in the WT preps since the values appear to be similar to WT + BDNF. In this light, BDNF treatment may have reached a ceiling in the LRRK2KO neurons.

      We thank the reviewer for his/her comment and observations about the ceiling effects. It is indeed possible that the loss of LRRK2 and the application of BDNF could cause the same elevation in synaptic neurotransmission. In such a situation, the increased activity as a result of BDNF treatment would be masked by the increased activity  observed as a result of LRRK2 KO. To better visualize the difference between WT and KO cultures and the possible ceiling effect, we merged the data in one single graph.  

      (10) Schematic data in Figures 5A and C and Figures 5B and E are too small to read/see the data. 

      We thank the Reviewer for this suggestion. We have now enlarged figure 5A and moved the graph of figure 5D in supplemental figure S5, since this analysis of spine morphology is secondary to the one shown in figure 5C.

      Reviewer #1 (Recommendations For The Authors): 

      Please forgive any redundancy in the comments, I wanted to provide the authors with as much information as I had to explain my opinion. 

      Primary mouse cortical neurons at div14, 20% transient increase in S935 pLRRK2 5min after BDNF, which then declines by 30 minutes (below pre-stim levels, and maybe LRRK2 protein levels do also). 

      In differentiated SHSY5Y cells there is a large expected increase in pERK and pAKT that is sustained way above pre-stim for 60 minutes. There is a 50% initial increase in pLRRK2 (but the blot is not very clear and no double band in these cells), which then looks like reduced well below pre-stim by 30 & 60 minutes. 

      We thank the Reviewer for bringing up this important point. We have extensively addressed this issue in the public review rebuttal. In essence, the phosphorylation of Ser935 is near saturation under unstimulated conditions, as evidenced by its high basal stoichiometry, whereas Rab phosphorylation is far from saturation, showing an increase upon BDNF stimulation before returning to baseline levels. This distinction highlights that while pSer935 exhibits a ceiling effect due to its near-maximal phosphorylation at rest, pRab responds dynamically to BDNF, indicating low basal phosphorylation and a significant capacity for increase. Figure 1 in the rebuttal summarizes the new data collected.

      GFP-fused overexpressed LRRK2 coIPs with drebrin, and this is double following 15 min BDNF. Strong result.

      We thank the Reviewer.

      BDNF-induced pAKT signaling is greatly impaired, and pERK is somewhat impaired, in CRISPR LKO SHSY5Y cells. In mouse primaries, both AKT and Erk phosph is robustly increased and sustained over 60 minutes in WT and LKO. This might be initially less in LKO for Akt (hard to argue on a WB n of 3 with huge WT variability), regardless they are all roughly the same by 60 minutes and even look higher in LKO at 60. This seems like a big disconnect and suggests the impairment in the SHSy5Y cells might have more to do with the CRISPR process than the LRRK2. Were the cells sequenced for off-target CRISPR-induced modifications?  

      Following the Reviewer's suggestion, and as discussed in the public review section, we performed an off-target analysis. Specifically, we selected the first 8 putative off-targets exhibiting a CFD (Cutting Frequency Determination) off-target score >0.2. As shown in supplemental file 1, sequence disruption was observed only at the LRRK2 on-target site in LRRK2 KO SH-SY5Y cells, while the 8 off-target regions remained unchanged across the genotypes and relative to the reference sequence.
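
      The selection rule described above can be sketched as follows. This is a minimal illustration only: the site labels and scores are hypothetical, and ranking candidates by descending CFD score is an assumption about what "first 8" means, not a description of the tool actually used.

```python
from typing import NamedTuple


class OffTargetSite(NamedTuple):
    site_id: str      # hypothetical label, not a real genomic locus
    cfd_score: float  # CFD off-target score


def select_off_targets(sites, min_cfd=0.2, max_sites=8):
    """Return up to max_sites candidates with a CFD score above min_cfd.

    Assumption: candidates are ranked from highest to lowest score.
    """
    passing = [s for s in sites if s.cfd_score > min_cfd]
    passing.sort(key=lambda s: s.cfd_score, reverse=True)
    return passing[:max_sites]


# Hypothetical candidate list: scores 0.05, 0.10, ..., 0.60
candidates = [OffTargetSite(f"site_{i}", 0.05 * i) for i in range(1, 13)]
selected = select_off_targets(candidates)
for site in selected:
    print(site.site_id, round(site.cfd_score, 2))
```

      Each selected region would then be sequenced and compared with the reference, as reported in supplemental file 1.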

      No difference in the density of large PSD-95 puncta in dendrites of LKO primary relative to WT, and the small (10%) increase seen in WT after BDNF might be absent in LKO (it is not clear to me that this is absent in every culture rep, and the data is not highly convincing). This is also referred to as spinogenesis, which has not been quantified. Why not is confusing as they did use a GFP fill... 

      The Reviewer is right that spinogenesis is not the appropriate term for the process analyzed. We replaced “spinogenesis” with “morphological alteration of dendritic protrusions” or “synapse maturation”, which is correlated with the number of PSD95-positive puncta (El-Husseini et al., Science, 2000).

      There is a difference in the percentage of dendritic protrusions classified as filopodia to more being classified as thin spines in LKO striatal neurons at 1 month, which is not seen at any other age, The WT filopodia seems to drop and thin spine percent rise to be similar to LKO at 4 months. This is taken as evidence for delayed maturation in LKO, but the data suggest the opposite. These authors previously published decreased spine and increased filopodia density at P15 in LKO. Now they show that filopodia density is decreased and thin spine density increased at one month. How is that shift from increased to decreased filopodia density in LKO (faster than WT from a larger initial point) evidence of impaired maturation? Again this seems accelerated? 

      We agree with the Reviewer that the initial interpretation was indeed confusing. To adhere closely to our data and avoid overinterpretation, as also suggested by Reviewer 2, we revised the text and moved figure 5D to the supplementary materials. In essence, our data point to alterations in the structural properties of dendritic protrusions in young KO mice, specifically a reduction in their size (head width and neck height) and a decrease in postsynaptic density (PSD) length, as observed with TEM. These findings suggest that LRRK2 is involved in morphological processes during spine development.

      Shank3 and PSD95 mRNA transcript levels were reduced in the LKO midbrain, only shank3 was reduced in the striatum and only PSD was reduced in the cortex. No changes to mRNA of BDNF-related transcripts. None of these mRNA changes protein-validated. Drebrin protein (where is drebrin mRNA?) levels are reduced in LKO at 1&4 but not clearly at 18 months (seems the most robust result but doesn't correlate with other measures, which here is basically a transient increase (1m) in thin striatal spines).  

      As illustrated before, we performed qPCR for Dbn1 and found that its expression is significantly reduced in the cortex and midbrain and non-significantly reduced in the striatum (1-month-old mice, a different cohort from the one used for the other analyses in figure 5).

      24h BDNF increases the frequency of mEPSCs on hIPSC-derived cortical-like neurons, but not LKO, which is already high. There are no details of synapse number or anything for these cultures and compares 24h treatment. BDNF increases mEPSC frequency within minutes PMC3397209, and acute application while recording on cells may be much more informative (effects of BDNF directly, and no issues with cell-cell / culture variability). Calling mEPSC "spontaneous electrical activity" is not standard.  

      We thank the reviewer for this point. We provided information about synapse number (Bassoon/Homer colocalization) in supplementary figure S7. The lack of an mEPSC response in LRRK2 KO cultures is likely due to increased release probability, as the number of synapses does not differ between the two genotypes.

      The pattern of LRRK2 activation is very disconnected from that of BDNF signalling onto other kinases. Regarding pLRRK2, s935 is a non-autophosph site said to be required for LRRK2 enzymatic activity, that is mostly used in the field as a readout of successful LRRK2 inhibition, with some evidence that this site regulates LRRK2 subcellular localization (which might be more to do with whether or not it is p at 935 and therefore able to act as a kinase).

      The authors imply BDNF is activating LRRK2, but really should have looked at other sites, such as the autophospho site 1292 and 'known' LRRK2 substrates like T73 pRab10 (or other e.g., pRab12) as evidence of LRRK2 activation. One can easily argue that the initial increase in pLRRK2 at this site is less consequential than the observation that BDNF silences LRRK2 activity based on p935 being sustained to being reduced after 5 minutes, and well below the prestim levels... not that BDNF activates LRRK2. 

      As described above, we have collected new data showing that BDNF stimulation increases LRRK2 kinase activity toward its physiological substrates Rab10 and Rab8 (using a pan-phospho-Rab antibody) (Figure 1 and Figure S1). Additionally, we have commented extensively on the ceiling effect of pS935.

      BDNF does a LOT. What happens to network activity in the neural cultures with BDNF application? Should go up immediately. Would increasing neural activity (i.e., through depolarization, forskolin, disinhibition, or something else without BDNF) give a similar 20% increase in pS935 LRRK2? Can this be additive, or occluded? This would have major implications for the conclusions that BDNF and pLRRK2 are tightly linked (as the title suggests).  

      These are very valuable observations; however, they fall outside the scope and timeframe of this study. We agree that future research should focus on gaining a deeper mechanistic understanding of how LRRK2 regulates synaptic activity, including vesicle release probability and postsynaptic spine maturation, independently of BDNF.

      Figures 1A & H "Western blot analysis revealed a rapid (5 mins) and transient increase of Ser935 phosphorylation after BDNF treatment (Fig. 1B and 1C). Of interest, BDNF failed to stimulate Ser935 phosphorylation when neurons were pretreated with the LRRK2 inhibitor MLi-2" . The first thing that stands out is that the pLRRK2 in WB is not very clear at all (although we appreciate it is 'a pig' to work with, I'd hope some replicates are clearer); besides that, the 20% increase only at 5min post-BDNF stimulation seems like a much less profound change than the reduction from base at 60 and more at 180 minutes (where total LRRK2 protein is also going down?). That the blot at 60 minutes in H is representative of a 30% reduction seems off... makes me wonder about the background subtraction in quantification (for this there is much less pLRRK2 and more total LRRK2 than at 0 or 5). LRRK2 (especially) and pLRRK2 seem very sketchy in H. Also, total LRRK2 appears to increase in the SHSY5Y cell not the neurons, and this seems even clearer in 2 H. 

      To better visualize the dynamics of pS935 variation relative to time=0, we presented the data as the difference between t=0 and t=x. This clearly shows that pS935 goes below pre-stimulation levels, whereas pRab10 does not. The large difference in the initial stoichiometry of these two phosphorylations is extensively discussed above.
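
      The baseline-difference presentation can be illustrated with a short sketch. The values below are hypothetical normalized phospho/total ratios, not our measurements, and the sign convention (signal at t=x minus signal at t=0) is an assumption made for the example.

```python
def change_from_baseline(timecourse):
    """Subtract the t=0 value from every time point.

    timecourse maps minutes after BDNF to a normalized phospho/total ratio
    and must contain the key 0 (the pre-stimulation sample).
    """
    baseline = timecourse[0]
    return {t: value - baseline for t, value in timecourse.items()}


# Hypothetical ratios, chosen only to illustrate the two behaviours
ps935 = {0: 1.0, 5: 1.2, 30: 0.8, 60: 0.7}
prab10 = {0: 1.0, 5: 1.6, 30: 1.3, 60: 1.0}

ps935_delta = change_from_baseline(ps935)
prab10_delta = change_from_baseline(prab10)

# A negative difference means the signal fell below pre-stimulation levels
below_baseline = [t for t, d in ps935_delta.items() if d < 0]
print(below_baseline)
```

      With these illustrative numbers, only the pS935 series contains time points with a negative difference.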

      That MLi2 eliminates pLRRK2 (and seems to reduce LRRK2 protein?) isn't surprising, but a 90min pretreatment with MLi-2 should be compared to MLi-2's vehicle alone (MLi-2 is notoriously insoluble and the majority of diluents have bioactive effects like changing activity)... especially if concluding increased pLRRK2 in response to BDNF is a crucial point (when comparing against effects on other protein modifications such as pAKT). This highlights a second point... the changes to pERK and pAKT are huge following BDNF (nothing to massive quantities), whereas pLRRK2 increases are 20-50% at best. This suggests a very modest effect of BDNF on LRRK in neurons, compared to the other kinases. I worry this might be less consequential than claimed. Change in S1 is also unlikely to be significant... 

      These comments have been thoroughly addressed in the previous responses. Regarding fig. S1, we added an additional experiment (Figure S1C) in GFP-LRRK2 cells showing robust activation of LRRK2 (pS935, pRabs) at the timepoint of MS (15 min).

      "As the yields of endogenous LRRK2 purification were insufficient for AP-MS/MS analysis, we generated polyclonal SH-SY5Y cells stably expressing GFP-LRRK2 wild-type or GFP control (Supplementary Fig. 1)" . I am concerned that much is being assumed regarding 'synaptic function' from SHSY5Y cells... also overexpressing GFP-LRRK2 and looking at its binding after BDNF isn't synaptic function.  

      We appreciate the reviewer’s comment. We would like to clarify that the interactors enriched upon BDNF stimulation predominantly fall into semantic categories related to the synapse and actin cytoskeleton. While this does not imply that these interactors are exclusively synaptic, it suggests that this tightly interconnected network likely plays a role in synaptic function. This interpretation is supported by several lines of evidence: (1) previous studies have demonstrated the relevance of this compartment to LRRK2 function; (2) our new phosphoproteomics data from striatal lysate highlight enrichment of synaptic categories; and (3) analysis of the latest GWAS gene list (134 genes) also indicates significant enrichment of synapse-related categories. Taken together, these findings justify further investigation into the role of LRRK2 in synaptic biology, as discussed extensively in the manuscript’s discussion section.

      Figure 2A isn't alluded to in text and supplemental table 1 isn't about LRRK2 binding, but mEPSCs. 

      We have now cited Figure 2A in the text and added supplementary .xls table 1, which refers to the Excel list of genes whose interaction is modulated upon BDNF (uploaded in the supplemental material).

      We also added the extension .xls for supplementary tables 2 and 3.

      Figure 2A is useless without some hits being named, and the donut plots in B add nothing beyond a statement that "35% of 'genes' (shouldn't this be proteins?) among the total 207 LRRK2 interactors were SynGO annotated" might as well [just] be the sentence in the text. 

      We have now included the names of the most significant hits, including cytoskeletal and translation-related proteins, as well as known LRRK2 interactors. We decided to retain the donut plots, as we believe they simplify data interpretation for the reader, reducing the need to jump back and forth between the figures and the text.

      Validation of drebrin binding in 2H is great... although only one of 8 named hits; could be increased to include some of the others. A concern alludes to my previous point... there is no appreciable LRRK2 in these cells until GFP-LRRK2 is overexpressed; is this addressed in the MS? Conclusions would be much stronger if bidirectional coIP of these binding candidates were shown with endogenous (GFP-ve) LRRK2 (primaries or hIPSCs, brain tissue?) 

      To address the Reviewer’s concerns to the best of our abilities, we have added a blot in supplemental figure S1A showing how the expression levels of LRRK2 increase after RA differentiation. Moreover, we have included several new datasets that further strengthen the functional link between LRRK2 and drebrin, including qPCR of Dbn1 in one-month-old Lrrk2 KO brains, western blots of Lrrk2 and Rab in Dbn1 KO brains, and co-IP with the drebrin N- and C-terminal domains.

      Figures 3 A-C are not informative beyond the text and D could be useful if proteins were annotated. 

      To avoid overcrowding, proteins were annotated in A, and the same network structure is reported for synaptic and actin-related interactors.

      Figure 4. Is this now endogenous LRRK2 in the SHSY5Y cells? Again not much LRRK2 though, and no pLRRK shown. 

      We confirm that these are naïve SH-SY5Y cells differentiated with RA and LRRK2 is endogenous. We did not assess pS935 in this experiment, as the primary goal was to evaluate pAKT and pERK1/2 levels. To avoid signal saturation, we loaded less total protein (30 µg instead of the 80 µg typically required to detect pS935). pS935 levels were extensively assessed in Figure 1. This experimental detail has now been added in the material and methods section (page 18).

      In C (primary neurons) There is very little increase in pLRRK2 / LRRK2 at 5 mins, and any is much less profound a change than the reduction at 30 & 60 mins. I think this is interesting and may be a more substantial consequence of BDNF treatment than the small early increase. Any 5 min increase is gone by 30 and pLRRK2 is reduced after. This is a disconnect from the timing of all the other pProteins in this assay, yet pLRRK2 is supposed to be regulating the 'synaptic effects'? 

      The first part of the question has already been extensively addressed. Regarding the timing, one possibility is that LRRK2 is activated upstream of AKT and ERK1/2, a hypothesis supported by the reduced activation of AKT and ERK1/2 observed in LRRK2 KO cells, as discussed in the manuscript, and in MLi-2 treated cells (Author response image 2). Concerning the synaptic effects, it is well established that synaptic structural and functional plasticity occurs downstream of receptor activation and kinase signaling cascades. These changes can be mediated by both rapid mechanisms (e.g., mobilization of receptor-containing endosomes via the actin cytoskeleton) and slower processes involving gene transcription of immediate early genes (IEGs). Since structural and functional changes at the synapse generally manifest several hours after stimulation, we typically assessed synaptic activity and structure 24 hours post-stimulation.

      Akt Erk1&2 both go up rapidly after BDNF in WT, although Akt seems to come down with pLRRK2. If they aren't all the same Akt is probably the most different between LKO and WT but I am very concerned about an n=3 for wb, wb is semi-quantitative at best, and many more than three replicates should be assessed, especially if the argument is that the increases are quantitively different between WT v KO (huge variability in WT makes me think if this were done 10x it would all look same). Moreover, this isn't similar to the LKO primaries  "pulled pups" pooled presumably. 

      Despite some variability in the magnitude of the pAKT/pERK response in naïve SH-SY5Y cells, all three independent replicates consistently showed a reduced response in LRRK2 KO cells, yielding a highly significant result in the two-way ANOVA test. In contrast, the difference in response magnitude between WT and LRRK2 KO primary cultures was less pronounced, which justified repeating the experiments with n=9 replicates. We hope the Reviewer acknowledges the inherent variability often observed in western blot experiments, particularly when performed in a fully independent manner (different cultures and stimulations, independent blots).

      To further strengthen the conclusion that this effect is reproducible and dependent on LRRK2 kinase activity upstream of AKT and ERK, we probed the membranes in figure 1H with pAKT/total AKT and pERK/total ERK. All things considered and consistent with our hypothesis, MLi-2 significantly reduced BDNF-mediated AKT and ERK1/2 phosphorylation levels (Author response image 2). 

      Author response image 2.

      Western blot (same experiments as in figure 1) was performed using antibodies against phospho-Thr202/185 ERK1/2, total ERK1/2, phospho-Ser473 AKT and total AKT in retinoic acid-differentiated SH-SY5Y cells stimulated with 100 ng/mL BDNF for 0, 5, 30 and 60 mins. MLi-2 was used at 500 nM for 90 mins to inhibit LRRK2 kinase activity.

      G lack of KO effect seems to be skewed from one culture in the plot (grey). The scatter makes it hard to read, perhaps display the culture mean +/- BDNF with paired bars. The fact that one replicate may be changing things is suggested by the weirdly significant treatment effect and no genotype effect. Also, these are GFP-filled cells, the dendritic masks should be shown/explained, and I'm very surprised no one counted the number (or type?) of protrusions, especially as the text describes this assay (incorrectly) as spinogenesis... 

      As suggested by the Reviewer we have replotted the results as bar graphs. Regarding the number of protrusions, we initially counted the number of GFP+ puncta in the WT and did not find any difference (Author response image 3). Due to our imaging setup (confocal microscopy rather than super-resolution imaging and Imaris 3D reconstruction), we were unable to perform a fine morphometric analysis. However, this was not entirely unexpected, as BDNF is known to promote both the formation and maturation of dendritic spines. Therefore, we focused on quantifying PSD95+ puncta as a readout of mature postsynaptic compartments. While we acknowledge that we cannot definitively conclude that each PSD95+ punctum is synaptically connected to a presynaptic terminal, the data do indicate an increase in the number of PSD95+ structures following BDNF stimulation.

      Author response image 3.

      GFP+ puncta per unit of neurite length (µm) in DIV14 WT primary neurons, untreated or after 24 hours of BDNF treatment (100 ng/mL). No significant differences were observed (n=3).

      Figure 5. "Dendritic spine maturation is delayed in Lrrk2 knockout mice". The only significant change is at 1 month in KO which shows fewer filopodia and increased thin spines (50% vs wt). At 4 months the % of thin spines is increased to 60% in both... Filopodia also look like 4m in KO at 1m... How is that evidence for delayed maturation? If anything it suggests the KO spines are maturing faster. "the average neck height was 15% shorter and the average head width was 27% smaller, meaning that spines are smaller in Lrrk2 KO brains" - it seems odd to say this before saying that actually there are just MORE thin spines, the number of mature "mushroom' is same throughout, and the different percentage of thin comes from fewer filopodia. This central argument that maturation is delayed is not supported and could be backwards, at least according to this data. Similarly, the average PSD length is likely impacted by a preponderance of thin spines in KO... which if mature were fewer would make sense to say delayed KO maturation, but this isn't the case, it is the fewer filopodia (with no PSD) that change the numbers. See previous comments of the preceding manuscript. 

      We agree that thin spines, while often considered more immature, represent an intermediate stage in spine development. The data showing an increase in thin spines at 1 month in the KO mice, along with fewer filopodia, could suggest a faster stabilization of these spines, which might indeed be indicative of premature maturation rather than delayed maturation. This change in spine morphology may indicate that the dynamics of synaptic plasticity are affected. Regarding the PSD length, as the Reviewer pointed out, the increased presence of thin spines in KO might account for the observed changes in PSD measurements, as thin spines typically have smaller PSDs. This further reinforces the idea that the overall maturation process may be altered in the KO, but not necessarily delayed. 

      We rephrased the interpretation of these data and moved figure 5D to supplemental figure S4.

      "To establish whether loss of Lrrk2 in young mice causes a reduction in dendritic spines size by influencing BDNF-TrkB expression" - there is no evidence of this.  

      We agree and reorganized the text, removing this sentence.  

      Shank and PSD95 mRNA changes being shown without protein adds very little. Why is drebrin RNA not shown? Also should be several housekeeping RNAs, not one (RPL27)? 

      We measured Dbn1 mRNA, which shows a significant reduction in midbrain and cortex. Moreover, we have now normalized the transcript levels against the geometric mean of the relative abundance of three housekeeping genes (RPL27, actin, and GAPDH).
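
      As a minimal sketch of this normalization, the target transcript's relative quantity is divided by the geometric mean of the relative quantities of the three reference genes. All numbers below are hypothetical and serve only to show the arithmetic; the function names are illustrative and do not come from our analysis pipeline.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_expression(target_quantity, reference_quantities):
    """Target relative quantity divided by the geometric mean of the references."""
    return target_quantity / geometric_mean(reference_quantities)


# Hypothetical relative quantities for one sample
housekeeping = {"RPL27": 1.0, "actin": 2.0, "GAPDH": 4.0}
dbn1_quantity = 1.0

normalization_factor = geometric_mean(housekeeping.values())
dbn1_normalized = normalized_expression(dbn1_quantity, housekeeping.values())
print(round(normalization_factor, 3), round(dbn1_normalized, 3))
```

      Using the geometric rather than the arithmetic mean limits the influence of a single highly abundant reference gene on the normalization factor.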

      Drebrin levels being lower in KO seems to be the strongest result of the paper so far (shame no pLRRK2 or coIP of drebrin to back up the argument). DrebrinA KO mice have normal spines, what about haploinsufficient drebrin mice (LKO seem to have half drebrin, but only as youngsters?)

      As extensively explained in the public review, we used Dbn1 KO mouse brains and were able to show reduced Lrrk2 activity.

      Figure 6. hIPSC-derived cortical neurons. The WT 'cortical' neurons have a very low mEPSC frequency at 0.2Hz relative to KO. Is this because they are more or less mature? What is the EPSC frequency of these cells at 30 and 90 days for comparison? Also, it is very very hard to infer anything about mEPSC frequency in the absence of estimates of cell number and more importantly synapse number. Furthermore, where are the details of cell measures such as capacitance, resistance, and quality control e.g., Ra? Table s1 seems redundant here, besides suggesting that the amplitude is higher in KO at base. 

      We agree that the developmental trajectory of iPSC-derived neurons is critical to accurately interpreting synaptic function and plasticity. In response, we have included additional data now presented in the supplementary figure S7 and summarize key findings below:

      At DIV50, both WT and LRRK2 KO neurons exhibit low basal mEPSC activity (~0.5 Hz) and no response to 24 h BDNF stimulation (50 ng/mL).

      At DIV70, WT neurons show very low basal activity (~0.2 Hz), which increases ~7.5-fold upon BDNF treatment (1.5 Hz; p < 0.001), with no change in synapse number. KO neurons display elevated basal activity (~1 Hz), similar to BDNF-treated WT neurons, with no further increase upon BDNF exposure (~1.3 Hz) and no change in synapse number.

      At DIV90, BDNF has no significant effect in either WT or KO neurons, indicating a possible saturation of plastic responses. The lack of BDNF response at DIV90 may be due to endogenous BDNF production or culture-based saturation effects. While these factors warrant further investigation (e.g., ELISA, co-culture systems), they do not confound the key conclusions regarding the role of LRRK2 in synaptic development and plasticity:

      LRRK2 Enables BDNF-Responsive Synaptic Plasticity. In WT neurons, BDNF induces a significant increase in neurotransmitter release (mEPSC frequency) with no reduction in synapse number. This dissociation suggests BDNF promotes presynaptic functional potentiation. KO neurons fail to show changes in either synaptic function or structure in response to BDNF, indicating that LRRK2 is required for activity-dependent remodeling.

      LRRK2 Loss Accelerates Synaptic Maturation. At DIV70, KO neurons already exhibit high spontaneous synaptic activity equivalent to BDNF-stimulated WT neurons. This suggests that LRRK2 may act to suppress premature maturation and temporally gate BDNF responsiveness, aligning with the differences in maturation dynamics observed in KO mice (Figure 5).  
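
      The fold changes implied by the approximate DIV70 frequencies quoted above can be checked with simple arithmetic. The sketch below uses only the rounded values given in the text, not the underlying recordings.

```python
def fold_change(baseline_hz, treated_hz):
    """Ratio of the mEPSC frequency after treatment to the basal frequency."""
    return treated_hz / baseline_hz


# Approximate DIV70 mEPSC frequencies (Hz) quoted in the text
wt_basal, wt_bdnf = 0.2, 1.5
ko_basal, ko_bdnf = 1.0, 1.3

wt_fold = fold_change(wt_basal, wt_bdnf)  # about 7.5
ko_fold = fold_change(ko_basal, ko_bdnf)  # about 1.3
print(round(wt_fold, 2), round(ko_fold, 2))
```

      On these rounded values, KO basal activity already sits close to the BDNF-treated WT level, which leaves little room for a further increase.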

      As suggested by the reviewer, we reported the resistance and capacitance measurements for all DIV (Table 1, supplemental material). A reduction in capacitance was observed in WT neurons at DIV90, which may reflect changes in membrane complexity. However, this did not correlate with differences in synapse number and is unlikely to account for the observed differences in mEPSC frequency. To control for cell number between groups, the non-dividing cells were counted prior to plating (80k/cm2; see also methods) so that cell number was kept consistent.

      The presence of BDNF in WT seems to make them look like LKO, in the rest of the paper the suggestion is that the LKO lack a response to BDNF. Here it looks like it could be that BDNF signalling is saturated in LKO, or they are just very different at base and lack a response.

      Knowing which is important to the conclusions, and acute application (recording and BDNF wash-in) would be much more convincing.

      We agree with the Reviewer’s point that saturation of BDNF could influence the interpretation of the data if it were to occur. However, it is important to note that no BDNF is present in the media in basal control and KO neuronal culture conditions. This differs from other culture conditions and allows us to investigate the effects of BDNF treatment. Thus, the increased mEPSC frequency observed in KO neurons compared to WT neurons is determined only by the deletion of the gene and not by other extrinsic factors, which were kept consistent between the groups. We propose that the lack of response or change in mEPSC frequency in KO neurons is a compensatory mechanism due to the loss of LRRK2. Of note, LRRK2 as a “synaptic brake” has already been described (Beccano-Kelly et al., Hum Mol Genet, 2015). However, a comprehensive analysis of the underlying molecular mechanisms will require future studies beyond the scope of this paper.

      "The LRRK2 kinase substrates Rabs are not present in the list of significant phosphopeptides, likely due to the low stoichiometry and/or abundance" Likely due to the fact mass spec does not get anywhere near everything. 

      We removed this sentence in light of the new phosphoproteomic analysis.

      Figure 7 is pretty stand-alone, and not validated in any way, hard to justify its inclusion?  

      As explained above, we removed figure 7 and included the new phospho-MS analysis as part of figure 3.

      Writing throughout shows a very selective and shallow use of the literature.  

      We extensively reviewed the citations.

      "while Lrrk1 transcript in this region is relatively stable during development" The authors reference a very old paper that barely shows any LRRK1 mRNA, and no protein. Others have shown that LRRK1 is essentially not present postnatally PMC2233633. This isn't even an argument the authors need to make. 

      We thank the reviewer and included this more appropriate citation. 

      Reviewer #2 (Recommendations For The Authors): 

      Cyfip1 (Fig 3A) is part of the WAVE complex (page 13). 

      We thank the reviewer and specified it.

      The discussion could be more focused. 

      We extensively revised the discussion to keep it more focused.

      Note that we updated the GO ontology analyses to reflect the updated information present in g:Profiler.

      References.

      Nirujogi, R. S., Tonelli, F., Taylor, M., Lis, P., Zimprich, A., Sammler, E., & Alessi, D. R. (2021). Development of a multiplexed targeted mass spectrometry assay for LRRK2-phosphorylated Rabs and Ser910/Ser935 biomarker sites. The Biochemical journal, 478(2), 299–326. https://doi.org/10.1042/BCJ20200930

      Worth, D. C., Daly, C. N., Geraldo, S., Oozeer, F., & Gordon-Weeks, P. R. (2013). Drebrin contains a cryptic F-actin-bundling activity regulated by Cdk5 phosphorylation. The Journal of cell biology, 202(5), 793–806. https://doi.org/10.1083/jcb.201303005

      Shirao, T., Hanamura, K., Koganezawa, N., Ishizuka, Y., Yamazaki, H., & Sekino, Y. (2017). The role of drebrin in neurons. Journal of neurochemistry, 141(6), 819–834. https://doi.org/10.1111/jnc.13988

      Koganezawa, N., Hanamura, K., Sekino, Y., & Shirao, T. (2017). The role of drebrin in dendritic spines. Molecular and cellular neurosciences, 84, 85–92. https://doi.org/10.1016/j.mcn.2017.01.004

      Meixner, A., Boldt, K., Van Troys, M., Askenazi, M., Gloeckner, C. J., Bauer, M., Marto, J. A., Ampe, C., Kinkl, N., & Ueffing, M. (2011). A QUICK screen for Lrrk2 interaction partners--leucine-rich repeat kinase 2 is involved in actin cytoskeleton dynamics. Molecular & cellular proteomics: MCP, 10(1), M110.001172. https://doi.org/10.1074/mcp.M110.001172

      Parisiadou, L., & Cai, H. (2010). LRRK2 function on actin and microtubule dynamics in Parkinson disease. Communicative & integrative biology, 3(5), 396–400. https://doi.org/10.4161/cib.3.5.12286

      Chen, C., Masotti, M., Shepard, N., Promes, V., Tombesi, G., Arango, D., Manzoni, C., Greggio, E., Hilfiker, S., Kozorovitskiy, Y., & Parisiadou, L. (2024). LRRK2 mediates haloperidol-induced changes in indirect pathway striatal projection neurons. bioRxiv : the preprint server for biology, 2024.06.06.597594. https://doi.org/10.1101/2024.06.06.597594

      Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., Pritzel, A., Wong, L. H., Zielinski, M., Sargeant, T., Schneider, R. G., Senior, A. W., Jumper, J., Hassabis, D., Kohli, P., & Avsec, Ž. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (New York, N.Y.), 381(6664), eadg7492. https://doi.org/10.1126/science.adg7492

      Beaudoin, G. M., 3rd, Schofield, C. M., Nuwal, T., Zang, K., Ullian, E. M., Huang, B., & Reichardt, L. F. (2012). Afadin, a Ras/Rap effector that controls cadherin function, promotes spine and excitatory synapse density in the hippocampus. The Journal of neuroscience : the official journal of the Society for Neuroscience, 32(1), 99–110. https://doi.org/10.1523/JNEUROSCI.4565-11.2012

      Fernández, B., Chittoor-Vinod, V. G., Kluss, J. H., Kelly, K., Bryant, N., Nguyen, A. P. T., Bukhari, S. A., Smith, N., Lara Ordóñez, A. J., Fdez, E., Chartier-Harlin, M. C., Montine, T. J., Wilson, M. A., Moore, D. J., West, A. B., Cookson, M. R., Nichols, R. J., & Hilfiker, S. (2022). Evaluation of Current Methods to Detect Cellular Leucine-Rich Repeat Kinase 2 (LRRK2) Kinase Activity. Journal of Parkinson's disease, 12(5), 1423–1447. https://doi.org/10.3233/JPD-213128

      Cirnaru, M. D., Marte, A., Belluzzi, E., Russo, I., Gabrielli, M., Longo, F., Arcuri, L., Murru, L., Bubacco, L., Matteoli, M., Fedele, E., Sala, C., Passafaro, M., Morari, M., Greggio, E., Onofri, F., & Piccoli, G. (2014). LRRK2 kinase activity regulates synaptic vesicle trafficking and neurotransmitter release through modulation of LRRK2 macromolecular complex. Frontiers in molecular neuroscience, 7, 49. https://doi.org/10.3389/fnmol.2014.00049

      Belluzzi, E., Gonnelli, A., Cirnaru, M. D., Marte, A., Plotegher, N., Russo, I., Civiero, L., Cogo, S., Carrion, M. P., Franchin, C., Arrigoni, G., Beltramini, M., Bubacco, L., Onofri, F., Piccoli, G., & Greggio, E. (2016). LRRK2 phosphorylates pre-synaptic N-ethylmaleimide sensitive fusion (NSF) protein enhancing its ATPase activity and SNARE complex disassembling rate. Molecular neurodegeneration, 11, 1. https://doi.org/10.1186/s13024-015-0066-z

      Martin, E. R., Gandawijaya, J., & Oguro-Ando, A. (2022). A novel method for generating glutamatergic SH-SY5Y neuron-like cells utilizing B-27 supplement. Frontiers in pharmacology, 13, 943627. https://doi.org/10.3389/fphar.2022.943627

      Kovalevich, J., & Langford, D. (2013). Considerations for the use of SH-SY5Y neuroblastoma cells in neurobiology. Methods in molecular biology (Clifton, N.J.), 1078, 9–21. https://doi.org/10.1007/978-1-62703-640-5_2

      Drummond, N. J., Singh Dolt, K., Canham, M. A., Kilbride, P., Morris, G. J., & Kunath, T. (2020). Cryopreservation of Human Midbrain Dopaminergic Neural Progenitor Cells Poised for Neuronal Differentiation. Frontiers in cell and developmental biology, 8, 578907. https://doi.org/10.3389/fcell.2020.578907

      Tao, X., Finkbeiner, S., Arnold, D. B., Shaywitz, A. J., & Greenberg, M. E. (1998). Ca2+ influx regulates BDNF transcription by a CREB family transcription factor-dependent mechanism. Neuron, 20(4), 709–726. https://doi.org/10.1016/s0896-6273(00)81010-7

      El-Husseini, A. E., Schnell, E., Chetkovich, D. M., Nicoll, R. A., & Bredt, D. S. (2000). PSD95 involvement in maturation of excitatory synapses. Science (New York, N.Y.), 290(5495), 1364–1368.

      Glebov, O. O., Cox, S., Humphreys, L., & Burrone, J. (2016). Neuronal activity controls transsynaptic geometry. Scientific reports, 6, 22703. https://doi.org/10.1038/srep22703 (Erratum in: Scientific reports, 6, 26422. https://doi.org/10.1038/srep26422)

      Beccano-Kelly, D. A., Volta, M., Munsie, L. N., Paschall, S. A., Tatarnikov, I., Co, K., Chou, P., Cao, L. P., Bergeron, S., Mitchell, E., Han, H., Melrose, H. L., Tapia, L., Raymond, L. A., Farrer, M. J., & Milnerwood, A. J. (2015). LRRK2 overexpression alters glutamatergic presynaptic plasticity, striatal dopamine tone, postsynaptic signal transduction, motor activity and memory. Human molecular genetics, 24(5), 1336–1349. https://doi.org/10.1093/hmg/ddu543

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors use anatomical tracing and slice physiology to investigate the integration of thalamic (ATN) and retrosplenial cortical (RSC) signals in the dorsal presubiculum (PrS). This work will be of interest to the field, as the postsubiculum is thought to be a key region for integrating internal head direction representations with external landmarks. The main result is that ATN and RSC inputs drive the same L3 PrS neurons, which exhibit superlinear summation to near-coincident inputs. Moreover, this activity can induce bursting in L4 PrS neurons, which can pass the signals to LMN (perhaps gated by cholinergic input).

      Strengths:

      The slice physiology experiments are carefully done. The analyses are clear and convincing, and the figures and results are well-composed. Overall, these results will be a welcome addition to the field.

      We thank this reviewer for the positive comment on our work.

      Weaknesses:

      The conclusions about the circuit-level function of L3 PrS neurons sometimes outstrip the data, and their model of the integration of these inputs is unclear. I would recommend some revision of the introduction and discussion. I also had some minor comments about the experimental details and analysis.

      Specific major comments:

      (1) I found that the authors' claims sometimes outstrip their data, given that there were no in vivo recordings during behavior. For example, in the abstract, their results indicate "that layer 3 neurons can transmit a visually matched HD signal to medial entorhinal cortex", and in the conclusion they state "[...] cortical RSC projections that carry visual landmark information converge on layer 3 pyramidal cells of the dorsal presubiculum". However, they never measured the nature of the signals coming from ATN and RSC to L3 PrS (or signals sent to downstream regions). Their claim is somewhat reasonable with respect to ATN, where the majority of neurons encode HD, but neurons in RSC encode a vast array of spatial and non-spatial variables other than landmark information (e.g., head direction, egocentric boundaries, allocentric position, spatial context, task history to name a few), so making strong claims about the nature of the incoming signals is unwarranted.

      We agree of course that RSC does not only encode landmark information. We have clarified this point in the introduction (lines 69-70) and formulated our statements more carefully in the abstract (removed the word ‘landmark’ in line 17) and in the introduction (lines 82-83). In the discussion we explicitly state that ‘In our slice work we are blind to the exact nature of the signal that is carried by ATN and RSC axons’ (lines 522-523).

      (2) Related to the first point, the authors hint at, but never explain, how coincident firing of ATN and RSC inputs would help anchor HD signals to visual landmarks. Although the lesion data (Yoder et al. 2011 and 2015) support their claims, it would be helpful if the proposed circuit mechanism was stated explicitly (a schematic of their model would be helpful in understanding the logic). For example, how do neurons integrate the "right" sets of landmarks and HD signals to ensure stable anchoring? Moreover, it would be helpful to discuss alternative models of HD-to-landmark anchoring, including several studies that have proposed that the integration may (also?) occur in RSC (Page & Jeffery, 2018; Yan, Burgess, Bicanski, 2021; Sit & Goard, 2023). Currently, much of the Discussion simply summarizes the results of the study; this space could be better used in mapping the findings to the existing literature on the overarching question of how HD signals are anchored to landmarks.

      We agree with the reviewer on the importance of the question: how do neurons integrate the “right” sets of landmarks and HD signals to ensure stable anchoring? Based on our results we provide a schematic to illustrate possible scenarios, and we include it as a supplementary figure (Figure 1, to be included in the ms as Figure 7—figure supplement 2), as well as a new paragraph in the discussion section (lines 516-531). We point out that critical information on the convergence and divergence of functionally defined inputs is still lacking, both for principal cells and interneurons.

      Interestingly, recent evidence from functional ultrasound imaging and electrical single cell recording demonstrated that visual objects may refine head direction coding, specifically in the dorsal presubiculum (Siegenthaler et al. bioRxiv 2024.10.21.619417; doi: https://doi.org/10.1101/2024.10.21.619417). The increase in firing rate for HD cells whose preferred firing direction corresponds to a visual landmark could be supported by the supralinear summation of thalamic HD signals and retrosplenial input described in our study. We include this point in the discussion (line 460-462), and hope that our work will spur further investigations.

      Reviewer #2 (Public Review):

      Richevaux et al investigate how anterior thalamic (AD) and retrosplenial (RSC) inputs are integrated by single presubicular (PrS) layer 3 neurons. They show that these two inputs converge onto single PrS layer 3 principal cells. By performing dual-wavelength photostimulation of these two inputs in horizontal slices, the authors show that in most layer 3 cells, these inputs summate supra-linearly. They extend the experiments by focusing on putative layer 4 PrS neurons, and show that they do not receive direct anterior thalamic nor retrosplenial inputs; rather, they are (indirectly) driven to burst firing in response to strong activation of the PrS network.

      This is a valuable study, that investigates an important question - how visual landmark information (possibly mediated by retrosplenial inputs) converges and integrates with HD information (conveyed by the AD nucleus of the thalamus) within PrS circuitry. The data indicate that near-coincident activation of retrosplenial and thalamic inputs leads to non-linear integration in target layer 3 neurons, thereby offering a potential biological basis for landmark + HD binding.

      The main limitations relate to the anatomical annotation of 'putative' PrS L4 neurons, and to the presentation of retrosplenial/thalamic input modularity. Specifically, more evidence should be provided to convincingly demonstrate that the 'putative L4 neurons' of the PrS are not distal subicular neurons (as the authors' anatomy and physiology experiments seem to indicate). The modularity of thalamic and retrosplenial inputs could be better clarified in relation to the known PrS modularity.

      We thank the reviewer for their important feedback. We discuss what defines presubicular layer 4 in horizontal slices, cite relevant literature, and provide new and higher resolution images. See below for detailed responses to the reviewer’s comments, in the section ‘recommendations to authors’.

      Reviewer #3 (Public Review):

      Summary:

      The authors sought to determine, at the level of individual presubiculum pyramidal cells, how allocentric spatial information from the retrosplenial cortex was integrated with egocentric information from the anterior thalamic nuclei. Employing a dual opsin optogenetic approach with patch clamp electrophysiology, Richevaux and colleagues found that around three-quarters of layer 3 pyramidal cells in the presubiculum receive monosynaptic input from both brain regions. While some interesting questions remain (e.g. the role of inhibitory interneurons in gating the information flow through different layers of presubiculum), this paper provides valuable insights into the microcircuitry of this brain region and the role that it may play in spatial navigation.

      Strengths:

      One of the main strengths of this manuscript was that the dual opsin approach allowed the direct comparison of different inputs within an individual neuron, helping to control for what might otherwise have been an important source of variation. The experiments were well-executed and the data was rigorously analysed. The conclusions were appropriate to the experimental questions and were well-supported by the results. These data will help to inform in vivo experiments aimed at understanding the contribution of different brain regions in spatial navigation and could be valuable for computational modelling.

      Weaknesses:

      Some attempts were made to gain mechanistic insights into how inhibitory neurotransmission may affect processing in the presubiculum (e.g. Figure 5) but these experiments were a little underpowered and the analysis carried out could have been more comprehensively undertaken, as was done for other experiments in the manuscript.

      We agree that the role of interneurons for landmark anchoring through convergence in the presubiculum requires further investigation. In our latest work on the recruitment of VIP interneurons we begin to address this point in slices (Nassar et al., 2024, Neuroscience; doi: 10.1016/j.neuroscience.2024.09.032); more work in behaving animals will be needed.

      Reviewer #1 (Recommendations For The Authors):

      Full comments below. Beyond the (mostly minor) issues noted below, this is a very well-written paper and I look forward to seeing it in print.

      Major comments:

      (1) I found that the authors' claims sometimes outstrip their data, given that there were no in vivo recordings during behavior. For example, in the abstract, their results indicate "that layer 3 neurons can transmit a visually matched HD signal to medial entorhinal cortex", and in the conclusion they state "[...] cortical RSC projections that carry visual landmark information converge on layer 3 pyramidal cells of the dorsal presubiculum". However, they never measured the nature of the signals coming from ATN and RSC to L3 PrS (or signals sent to downstream regions). Their claim is somewhat reasonable with respect to ATN, where the majority of neurons encode HD, but neurons in RSC encode a vast array of spatial and non-spatial variables other than landmark information (e.g., head direction, egocentric boundaries, allocentric position, spatial context, task history to name a few), so making strong claims about the nature of the incoming signals is unwarranted.

      Our study was motivated by the seminal work of Yoder et al. (2011 and 2015), indicating that visual landmark information is processed in PoS and from there transmitted to the LMN. Based on that, and in the interest of readability, we may have used an oversimplified shorthand for the type of signal carried by RSC axons. There are numerous studies indicating a role for RSC in encoding visual landmark information (Auger et al., 2012; Jacob et al., 2017; Lozano et al., 2017; Fischer et al., 2020; Keshavarzi et al., 2022; Sit and Goard, 2023); we agree of course that this is certainly not the only variable that is represented. We have therefore changed the text to make this point clear:

      Abstract, line 17: removed the word ‘landmark’

      Introduction, line 69: added “...and supports an array of cognitive functions including memory, spatial and non-spatial context and navigation (Vann et al., 2009; Vedder et al., 2017). ”

      Introduction, line 82: changed “...designed to examine the convergence of visual landmark information, that is possibly integrated in the RSC, and vestibular based thalamic head direction signals”.

      Discussion, line 522-523: added “In our slice work we are blind to the exact nature of the signal that is carried by ATN and RSC axons.”

      (2) Related to the first point, the authors hint at, but never explain, how coincident firing of ATN and RSC inputs would help anchor HD signals to visual landmarks. Although the lesion data (Yoder et al., 2011 and 2015) support their claims, it would be helpful if the proposed circuit mechanism was stated explicitly (a schematic of their model would be helpful in understanding the logic). For example, how do neurons integrate the "right" sets of landmarks and HD signals to ensure stable anchoring? Moreover, it would be helpful to discuss alternative models of HD-to-landmark anchoring, including several studies that have proposed that the integration may (also?) occur in RSC (Page & Jeffery, 2018; Yan, Burgess, Bicanski, 2021; Sit & Goard, 2023). Currently, much of the Discussion simply summarizes the results of the study; this space could be better used in mapping the findings to the existing literature on the overarching question of how HD signals are anchored to landmarks.

      We suggest a physiological mechanism for inputs to be selectively integrated and amplified, based on temporal coincidence. Of course there are still many unknowns, including the divergence of connections from a single thalamic or retrosplenial input neuron. The anatomical connectivity of inputs will be critical, as well as the subcellular arrangement of synaptic contacts. Neuromodulation and changes in the balance of excitation and inhibition will need to be factored in. While it is premature to provide a comprehensive explanation for landmark anchoring of HD signals in PrS, our results have led us to include a schematic, to illustrate our thinking (Figure 1, see below).

      Do HD tuned inputs from thalamus converge on similarly tuned HD neurons only? Is divergence greater for the retrosplenial inputs? If so, thalamic input might pre-select a range of HD neurons, and converging RSC input might narrow down the precise HD neurons that become active (Figure 1). In the future, the use of activity dependent labeling strategies might help to tie together information on the tuning of pre-synaptic neurons, and their convergence or divergence onto functionally defined postsynaptic target cells. This critical information is still lacking, for principal cells, and also for interneurons. 

      Interneurons may have a key role in HD-to-landmark anchoring. SST interneurons support stability of HD signals (Simonnet et al., 2017) and VIP interneurons flexibly disinhibit the system (Nassar et al., 2024). Could disinhibition be a necessary condition to create a window of opportunity for updating the landmark anchoring of the attractor? Single PV interneurons might receive thalamic and retrosplenial inputs non-specifically. We need to distinguish the conditions for when the excitation-inhibition balance in pyramidal cells may become tipped towards excitation, and the case of coincident, co-tuned thalamic and retrosplenial input may be such a condition. Elucidating the principles of hardwiring of inputs, as for example, selective convergence, will be necessary. Moreover, neuromodulation and oscillations may be critical for temporal coordination and precise temporal matching of HD-to-landmark signals.

      We note that matching directional with visual landmark information based on temporal coincidence as described here does not require synaptic plasticity. Algorithms for dynamic control of cognitive maps without synaptic plasticity have been proposed (Whittington et al., 2025, Neuron): information may be stored in neural attractor activity, and the idea that working memory may rely on recurrent updates of neural activity might generalize to the HD system. We include these considerations in the discussion (line 497-501; 521-531) and hope that our work will spur further experimental investigations and modeling work.

      While the focus of our work has been on PrS, we agree that RSC also processes HD and landmark signals. Possibly the RSC registers a direction to a landmark rather than comparing it with the current HD (Sit & Goard, 2023). We suggest that this integrated information then reaches PrS. In contrast to RSC, PrS is uniquely positioned to update the signal in the LMN (Yoder et al., 2011), cf. discussion (lines 516-520).

      Minor comments:

      (1) Fig 1 - Supp 1: It appears there is a lot of input to PrS from higher visual regions, could this be a source of landmark signals?

      Yes, higher visual regions projecting to PrS may also be a source of landmark information, even if the visual signal is not integrated with HD at that stage (Sit & Goard 2023). The anatomical projection from the visual cortex was first described by Vogt & Miller (1983), but not studied on a functional level so far.

      (2) Fig 2F, G: Although the ATN and RSC measurements look quite similar, there are no stats included. The authors should use an explicit hypothesis test.

      We now compare the distributions of amplitudes and of latencies, using the Mann-Whitney U test. No significant difference between the two groups was found. Added in the figure legend: 2F, “Mann-Whitney U test revealed no significant difference (p = 0.95)”. 2G, “Mann-Whitney U test revealed no significant difference (p = 0.13)”.

      (3) Fig 2 - Supp 2A, C: Again, no statistical tests. This is particularly important for panel A, where the authors state that the latencies are similar but the populations appear to be different.

      Inputs from ATN and RSC have a similar ‘jitter’ (latency standard deviation) and ‘tau decay’. We added in the Fig 2 - Supp 2 figure legend: A, “Mann-Whitney U test revealed no significant difference (p = 0.26)”. C, “Mann-Whitney U test revealed no significant difference (p = 0.87)”.

      As a complementary measure for the reviewer, we performed the Kolmogorov-Smirnov test which confirmed that the populations’ distributions for ‘jitter’ were not significantly different, p = 0.1533.
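
      For readers who wish to run this kind of two-group comparison on their own measurements, the rank-based statistic can be sketched in a few lines. This is a minimal illustration only, not the analysis code used in the study: the function names are hypothetical, the p-value uses the large-sample normal approximation without tie or continuity correction, and the values in the usage note are made up.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U, p) where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    # Mean and standard deviation of U under the null hypothesis.
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p
```

      For example, `mann_whitney_u([1, 2, 3], [4, 5, 6])` gives U = 0 for these placeholder values. With samples this small, an exact test from a statistics package would be preferable to the normal approximation.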

      (4) Fig 4E, F: The statistics reporting is confusing, why are asterisks above the plots and hashmarks to the side?

      Asterisks refer to a comparison between ‘dual’ and ‘sum’ for each of the 5 stimulations in a Sidak multiple comparison test. Hashmarks refer to comparison of the nth stimulation to the 1st one within dual stimulation events (Friedman + Dunn’s multiple comparison test). We mention the two-way ANOVA p-value in the legend (Sum v Dual, for both Amplitude and Surface).
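
      The 'dual' versus 'sum' comparison can be illustrated with a short sketch. It assumes that 'sum' denotes the arithmetic sum of the responses to each input stimulated alone and 'dual' the response measured when both inputs are stimulated together. The function names and the numbers in the usage note are hypothetical, and the statistical tests named above are not reproduced here.

```python
def linear_sum(atn, rsc):
    """Arithmetic sum of the single-input responses, per stimulation."""
    return [a + r for a, r in zip(atn, rsc)]


def supralinearity_ratio(dual, atn, rsc):
    """Ratio of the measured dual response to the linear sum.

    Values above 1 indicate supralinear summation, values below 1
    indicate sublinear summation, for each stimulation in the train.
    """
    return [d / s for d, s in zip(dual, linear_sum(atn, rsc))]
```

      With made-up amplitudes, `supralinearity_ratio([3.0, 6.0], [1.0, 2.0], [1.0, 2.0])` returns `[1.5, 1.5]`, meaning that the dual response is 50% larger than the linear prediction at both stimulations.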

      (5) Fig 5C: I was confused by the 2*RSC manipulation. How do we know if there is amplification unless we know what the 2*RSC stim alone looks like?

      We now label the right panel in Fig 5C as “high light intensity” or “HLI”. Increasing the activation of Chrimson increases the amplitude of the summed EPSP, which now exceeds the threshold for amplification of synaptic events. Amplification refers to the plateau-like prolongation of the peak, most pronounced on the second EPSP, now indicated with an arrow. We also clarify this in the text (lines 309-310).

      (6) Fig 6D (supplement 1): Typo, "though" should be "through"

      Yes, corrected (line 1015).

      (7) Fig 6G (supplement 1): Typo, I believe this refers to the dotted area in panel F, not panel A.

      Yes, corrected (line 1021).

      (8) Fig 7: The effect of muscarine was qualitatively described in the Results, but there is no quantification and it is not shown in the Figure. The results should either be reported properly or removed from the Results.

      We have removed the last sentence in the Results.

      (9) Methods: The age and sex of the mice should be reported. Transgenic mouse line should be reported (along with stock number if applicable).

      We used C57BL6 mice with transgenic background (Ai14 mice, Jax no. 007914 reporter line) or C57BL6 wild type mice. This is now indicated in the Methods (lines 566-567).

      (10) Methods: If the viruses are only referred to with their plasmid number, then the capsid used for the viruses should be specified. For example, I believe the AAV-CAG-tomato virus used the retroAAV capsid, which is important to the experiment.

      Thank you for pointing this out. Indeed, the AAV-CAG-tdTom virus used the retroAAV capsid; this is now specified (line 575).

      (11) Data/code availability: I didn't see any sort of data/code availability statement, will the data and code be made publicly available?

      Data are stored on local servers at the SPPIN, Université Paris Cité, and are made available upon reasonable request. Code for intrinsic properties analysis is available on github (https://github.com/schoki0710/Intrinsic_Properties). This information is now included (line 717-720).

      (12) Very minor (and these might be a matter of opinion), but I believe "records" should be "recordings", and "viral constructions" should be "viral constructs".

      The text had benefited from proofreading by Richard Miles, who always preferred “records” to “recordings” in his writings. We choose to keep the current wording.

      Reviewer #2 (Recommendations For The Authors):

      Below are two major points that require clarification.

      (1) In the last set of experiments presented by the authors (Figs 6 onwards) they focus on 'putative L4' PrS cells. For several lines of evidence (outlined below), I am convinced that these neurons are not presubicular, but belong to the subiculum. I think this is a major point that requires substantial clarification, in order to avoid confusion in the field (see also suggestions on how to address this comment at the end of this section).

      Several lines of evidence support the interpretation that, what the authors call 'L4 PrS neurons', are distal subicular cells:

      (1.1) The anatomical location of the retrogradely-labelled cells (from mammillary bodies injections), as shown in Figs 6B, C, and Fig. 6_1B, very clearly indicates that they belong to the distal subiculum. The subicular-to-PrS boundary is a sharp anatomical boundary that follows exactly the curvature highlighted by the authors' red stainings. The authors could also use specific subicular/PrS markers to visualize this border more clearly - e.g. calbindin, Wfs-1, Zinc (though I believe this is not strictly necessary, since from the pattern of AD fibers, one can already draw very clear conclusions, see point 1.3 below).

      Our criteria to delimit the presubiculum are the following: First and foremost, we rely on the defining presence of antero-dorsal thalamic fibers that specifically target the presubiculum and not the neighbouring subiculum (Simonnet et al., 2017; Nassar et al., 2018; Simonnet and Fricker, 2018; Liu et al., 2021). This provides the precise outline of the presubicular superficial layers 1 to 3. It may have been confusing to the reviewer that our slicing angle gives horizontal sections. In fact, horizontal sections are well suited to identifying the layer structure of the PrS, based on DAPI staining and the variations in cell body size. The work by Ishihara and Fukuda (2016) illustrates in their Figure 12 that the presubicular layer 4 lies below the presubicular layer 3, and forms a continuation with the subiculum (Sub1). Their Figure 4 indicates with a dotted line the “generally accepted border between the (distal) subiculum and PreS”, and it runs from the proximal tip of superficial cells of the PrS toward the white matter, along the radial direction of the cortical tissue. We agree with this definition. Others have sliced coronally (Cembrowski et al., 2018), which renders a different visualization of the border region with the subiculum.

      Second, let us explain the procedure for positioning the patch electrode in electrophysiological experiments on horizontal presubicular slices. Louis Richevaux, the first author, who carried out the layer 4 cell recordings, took great care to stay very close (<50 µm) to the lower limit of the zone where the GFP labeled thalamic axons can be seen. He was extremely meticulous about the visualization under the microscope, using LED illumination, for targeting. The electrophysiological signature of layer 4 neurons with initial bursts (but not repeated bursting, in mice) is another criterion to confirm their identity (Huang et al., 2017). Post-hoc morphological staining revealed their apical dendrites, running toward the pia, sometimes crossing through layer 3, sometimes going around the proximal tip, avoiding the thalamic axons (Figure 6D). For example, the cell in Figure 6, suppl. 1 panel D, has an apical dendrite that runs through layer 3 and layer 1.

      Third, retrograde labeling following stereotaxic injection into the LMN is another criterion to define PrS layer 4. This approach is helpful for visualization, and is based on the defining axonal projection of layer 4 neurons (Yoder and Taube, 2011; Huang et al., 2017). Because it is technically challenging to restrict the stereotaxic injection to the LMN, the resultant labeling may not be limited to PrS layer 4. We cannot entirely exclude some overflow of retrograde tracers (B) or retrograde virus (C) to the neighboring MMN. This would then lead to co-labeling of the subiculum. For this reason, we agree that in the main Figure 6, panels B and C, the red labelled cell bodies likely also include subicular neurons, on the proximal side, in addition to L4 presubicular neurons. We now point out this caveat in the main text (lines 324-326) and in the methods (lines 591-592).

      (1.2) Consistent with their subicular location, neuronal morphologies of the 'putative L4 cells' are selectively constrained within the subicular boundaries, i.e. they do not cross to the neighboring PrS (maybe a minor exception in Figs. 6_1D2,3). By definition, a neuron whose morphology is contained within a structure belongs to that structure.

      From a functional point of view, for the HD system, the most important criterion for defining presubicular layer 4 neurons is their axonal projection to the LMN (Yoder and Taube 2011). From an electrophysiological standpoint, it is the capacity of layer 4 neurons to fire initial bursts (Simonnet et al., 2013; Huang et al., 2017). Anatomically, we note that the expectation that the apical dendrite should go straight up into layer 3 might not be a defining criterion in this curved and transitional periarchicortex. Presubicular layer 4 apical dendrites may cross through layer 3 and exit to the side, towards the subiculum (this is the red dendritic staining at the proximal end of the presubiculum, at the frontier with the subiculum, Figure 6C).

      (1.3) As acknowledged by the authors in the discussion (line 408): the PrS is classically defined by the innervation domain of AD fibers. As Figure 6B clearly indicates, the retrogradely-labelled cells ('putative L4') are convincingly outside the input domain of the AD; hence, they do not belong to the PrS.

      The reviewer is mistaken here: the deep layers 4 and 5/6 indeed do not lie in the zone innervated by the thalamic fibers (Simonnet et al., 2017; Nassar et al., 2018; Simonnet and Fricker, 2018) but still belong to the presubiculum. The presubicular deep layers are located below the superficial layers, next to, and in continuation of the subiculum. This is in agreement with work by Yoder and Taube 2011; Ishihara and Fukuda 2016; Boccara, … Witter, 2015; Peng et al., 2017 (Fig 2D); Honda et al., 2022 (marmoset, Fig 2A); Balsamo et al., 2022 (Figure 2B).

      (1.4) Along with the above comment: in my view, the optogenetic stimulation experiments are an additional confirmation that the 'putative L4 cells' are subicular neurons, since they do not receive AD inputs at all (hence, they are outside of the PrS); they are instead only indirectly driven upon strong excitation of the PrS. This indirect activation is likely to occur via PrS-to-Subiculum 'back-projections', the existence of which is documented in the literature and also nicely shown by the authors (see Figure 1_1 and line 109).

      See above. Only superficial layers 1-3 of the presubiculum receive direct AD input.

      (1.5) The electrophysiological properties of the 'putative L4 cells' are consistent with their subicular identity, i.e. they show a sag current and they are intrinsically bursty.

      Presubicular layer 4 cells also show bursting behaviour and a sag current (Simonnet et al., 2013; Huang et al., 2017).

      From the above considerations, and the data provided by the authors, I believe that the most parsimonious explanation is that these retrogradely-labelled neurons (from mammillary body injections), referred to by the authors as 'L4 PrS cells', are indeed pyramidal neurons from the distal subiculum.

      We agree that the retrograde labeling is likely not limited to the presubicular layer 4 cells, and we now indicate this in the text (line 324-326). However, the portion of retrogradely labeled neurons that is directly below the layer 3 should be considered as part of the presubiculum.

      I believe this is a fundamental issue that deserves clarification, in order to avoid confusion/misunderstandings in the field. Given the evidence provided, I believe that it would be inaccurate to call these cells 'L4 PrS neurons'. However, I acknowledge the fact that it might be difficult to convincingly and satisfactorily address this issue within the framework of a revision. For example, it is possible that these 'putative L4 cells' might be retrogradely-labelled from the Medial Mammillary Body (a major subicular target) since it is difficult to selectively restrict the injection to the LMN, unless a suitable driver line is used (if available). The authors should also consider the possibility of removing this subset of data (referring to putative L4), and instead focus on the rest of the story (referring to L3)- which I think by itself, still provides sufficient advance.

      We agree with the reviewer that it is difficult to provide a satisfactory answer. To some extent, the reviewer’s comments target the nomenclature of the subicular region. This transitional region between the hippocampus and the entorhinal cortex has been notoriously ill defined, and the criteria are somewhat arbitrary for determining exactly where to draw the line. Based on the thalamic projection, presubicular layers 1-3 can now be precisely outlined, thanks to the use of viral labeling. But the presubicular layer 4 had been considered to be cell-free in early works, and termed ‘lamina dissecans’ (Boccara 2010), as the limit between the superficial and deep layers. Then it became of great interest to us and to the field, when the PrS layer 4 cells were first identified as LMN projecting neurons (Yoder and Taube 2011). This unique back-projection to the upstream region of the HD system is functionally very important, closing the loop of the Papez circuit (mammillary bodies - thalamus - hippocampal structures).

      We note that the reviewer does not doubt our results, but rather questions the naming conventions. We therefore maintain our data. We agree that in the future a genetically defined mouse line would help to better pin down this specific neuronal population.

      We thank the reviewer for sharing their concerns and giving us the opportunity to clarify our experimental approach to target the presubicular layer 4. We hope that these explanations will be helpful to the readers of eLife as well.

      (2) The PrS anatomy could be better clarified, especially in relation to its modular organization (see e.g. Preston-Ferrer et al., 2016; Ray et al., 2017; Balsamo et al., 2022). The authors present horizontal slices, where cortical modularity is difficult to visualize and assess (tangential sections are typically used for this purpose, as in classical work from e.g. barrel cortex). I am not asking the authors to validate their observations in tangential sections, but just to be aware that cortical modules might not be immediately (or clearly) apparent, depending on the section orientation and thickness. The authors state that AD fibers were 'not homogeneously distributed' in L3 (line 135) and refer to 'patches of higher density in deep L3' (line 136). These statements are difficult to support unless more convincing anatomy is provided. I see some L3 inhomogeneity in the green channel in Fig. 1G (last two panels) and also in Fig. 1K, but this seems to be rather upper L3. I wonder how consistent the pattern is across different injections and at what dorsoventral levels this L3 modularity is observed (I think sagittal sections might be helpful). If validated, these observations could point to the existence of non-homogeneous AD innervation domains in L3 - hinting at possible heterogeneity among the L3 pyramidal cell targets. Notably, modularity in L2 and L1 is not referred to. The authors state that AD inputs 'avoid L2' (line 131) but this statement is not in line with recent work (cited above) and is also not in line with their anatomy data in Fig. 1G, where modularity is already quite apparent in L2 (i.e. there are territories avoided by the AD fibers in L2) and in L1 (see for example the last image in Fig. 1G). This is the case also for the RSC axons (Fig. 1H) where a patchy pattern is quite clear in L1 (see the last image in panel H). Higher-mag pictures might be helpful here.
These qualitative observations imply that AD and RSC axons probably bear a precise structural relationship relative to each other, and relative to the calbindin patch/matrix PrS organization that has been previously described. I am not asking the authors to address these aspects experimentally, since the main focus of their study is on L3, where RSC/AD inputs largely converge. Better anatomy pictures would be helpful, or at least a better integration of the authors' (qualitative) observations within the existing literature. Moreover, the authors' calbindin staining in Fig. 1K is not particularly informative. Subicular, PaS, MEC, and PrS borders should be annotated, and higher-resolution images could be provided. The authors should also check the staining: MEC appears to be blank but is known to strongly express calb1 in L2 (see 'island' by Kitamura et al., Ray et al., Science 2014; Ray et al., frontiers 2017). As additional validation for the staining: I would expect that the empty L2 patches in Figs. 1G (last two panels) would stain positive for Calbindin, as in previous work (Balsamo et al. 2022).

      We now provide a new figure showing the pattern of AD innervation in PrS superficial layers 1 to 3, with different dorso-ventral levels and higher magnification (Figure 2). Because our work was aimed at identifying connectivity between long-range inputs and presubicular neurons, we chose to work with horizontal sections that preserve well the majority of the apical dendrites of presubicular pyramidal neurons. We feel it is enriching for the presubicular literature to show the cytoarchitecture from different angles and to show patchiness in horizontal sections. The non-homogeneous AD innervation domains (‘microdomains’) in L3 were consistently observed across different injections in different animals.

      Author response image 1.

      Thalamic fiber innervation pattern. A, ventral, and B, dorsal horizontal section of the Presubiculum containing ATN axons expressing GFP. Patches of high density of ATN axonal ramifications in L3 are indicated as “ATN microdomains”. Layers 1, 2, 3, 4, 5/6 are indicated. C, High magnification image (63x optical section; different animal).

      We also provide a supplementary figure with images of horizontal sections of calbindin staining in PrS, with a larger crop, for the reviewer to check (Figure 3, see below). We thank the reviewer for pointing out recent studies using tangential sections. Our results agree with the previous observation that AD axons are found in calbindin negative territories (cf Fig 1K). Calbindin+ labeling is visible in the PrS layer 2 as well as in some patches in the MEC (Figure 3 panel A). Calbindin staining tends to not overlap with the territories of ATN axonal ramification. We indicate the inhomogeneities of anterior thalamic innervation that form “microdomains” of high density of green labeled fibers, located in layer 1 and layer 3 (Figure 3, Panel A, middle). Panel B shows another view of a more dorsal horizontal section of the PrS, with higher magnification, with a big Calbindin+ patch near the parasubiculum.

      The “ATN+ microdomains” possess a high density of axonal ramifications from ATN, and have been previously documented in the literature. They are consistently present. Our group had shown them in the article by Nassar et al., 2018, at different dorsoventral levels (Fig 1 C (dorsal) and 1D (ventral) PrS). See also Simonnet et al., 2017, Fig 2B, for an illustration of the typical variations in densities of thalamic fibers, and supplementary Figure 1D. Also Jiayan Liu et al., 2021 (Figure 2 and Fig 5) show these characteristic microzones of dense thalamic axonal ramifications, with more or less intense signals across layers 1, 2, and 3. While it is correct that thalamic axons can be seen to cross layer 2 to ramify in layer 1, we maintain that AD axons typically do not ramify in layer 2. We modify the text to say, “mostly” avoiding L2 (line 130).

      The reviewer is correct in pointing out that the 'patches of higher density in deep L3' are not only in the deep L3, as in the first panel in Fig 1G, but in the more dorsal sections they are also found in the upper L3. We change the text accordingly (line 135-136) and we provide the layer annotation in Figure 1G. We further agree with the reviewer that RSC axons also present a patchy innervation pattern. We add this observation in the text (line 144).

      It is as yet unclear whether anatomical microzones of dense ATN axon ramifications in L3 might fulfill the criteria of a functional modularity, as is the case for the calbindin patch/matrix PrS organization (Balsamo et al., 2022). As the reviewer points out, this will require more information on the precise structural relationship of AD and RSC axons relative to each other, as well as functional studies. Interestingly, we note a degree of variation in the amplitudes of oEPSC from different L3 neurons (Fig. 2F, discussion line 420; 428), which might be a reflection of the local anatomo-functional micro-organization.

      Minor points:

      (1) The pattern of retrograde labelling, or at least the way it is referred to in the results (lines 104ff), seems to imply some topography of AD-to-PrS projections. Is it the case? How consistent are these patterns across experiments, and individual injections? Was there variability in injection sites along the dorso-ventral and possibly antero-posterior PrS axes, which could account for a possibly topographical AD-to-PrS input pattern? It would be nice to see a DAPI signal in Fig. 1B since the AD stands out quite clearly in DAPI (Nissl) alone.

      Yes, we find a consistent topography for the AD-to-PrS projection, for similar injection sites in the presubiculum. The coordinates for retrograde labeling were, as indicated, -4.06 (AP), 2.00 (ML) and -2.15 mm (DV), so we cannot report on possible variations for different injection sites.

      (2) Fig. 2_2KM: this figure seems to show the only difference the authors found between AD and RS input properties. The authors could consider moving these data into main Fig. 2 (or exchanging them with some of the panels in F-O, which instead show no difference between AD and RSC). Asterisks/stats significance is not visible in M.

      For space reasons we leave the panels of Fig. 2_2KM in the supplementary section. We increased the size of the asterisk in M.

      (3) The data in Fig. 1_1 are quite interesting, since some of the PrS projection targets are 'non-canonical'. Maybe the authors could consider showing some injection sites, and some fluorescence images, in addition to the schematics. Maybe the authors could acknowledge that some of these projection targets are 'putative' unless independently verified by e.g. retrograde labeling. Unspecific white matter labelling and/or spillover is always a potential concern.

      We now include the image of the injection site for data in Fig. 1_1 as a supplementary Fig. 1_2. Figure 1_1 shows the retrogradely labeled upstream areas of Presubiculum.

      Author response image 2.

      Retrobeads were injected in the right Presubiculum.

      (4) The authors speculate that the near-coincident summation of RS + AD inputs in L3 cells could be a potential mechanism for the binding of visual + HD information in PrS. However, landmarks are learned, and learning typically implies long-term plasticity. As the authors acknowledge in the discussion (lines 493ff) GluR1 is not expressed in PrS cells. What alternative mechanisms could the authors envision? How could the landmark-update process occur in PrS, if it is not locally stored? RSC could also be involved (Jakob et al) as acknowledged in the introduction - the authors should keep this possibility open also in the discussion.

      A similar point has been raised by Reviewer 1; please see our answer to their point 2. Briefly, our results indicate that HD-to-landmark updating is a multi-step process. RSC may be one of the places where landmarks are learned. The subsequent temporal mapping of HD to landmark signals in PrS might be plasticity-free, as matching directional with visual landmark information based on temporal coincidence does not necessarily require synaptic plasticity. It seems likely that there is no local storage and no change in synaptic weights in PrS. The landmark-anchored HD signals reach LMN via L4 neurons, sculpting network dynamics across the Papez circuit. One possibility is that the trace of a landmark that matches HD may be stored as patterns of neural activity that could guide navigation (cf. El-Gaby et al., 2024, Nature). Clearly more work is needed to understand how the HD attractor is updated on a mechanistic level. Recent work in prefrontal cortex mentions “activity slots” and delineates algorithms for dynamic control of cognitive maps without synaptic plasticity (Whittington et al., 2025, Neuron): information may be stored in neural attractor activity, and the idea that working memory may rely on recurrent updates of neural activity might generalize to the HD system. We include these considerations in the discussion (line 499-503; 523-533) and also point to alternative models (line 518-522) including modeling work in the retrosplenial cortex.

      (5) The authors state that (lines 210ff) their cluster analysis 'provided no evidence for subpopulations of layer 3 cells (but see Balsamo et al., 2022)' implying an inconsistency; however, Balsamo et al also showed that the (in vivo) ephys properties of the two HD cell 'types' are virtually identical, which is in line with the 'homogeneity' of L3 ephys properties (in slice) in the authors' data. Regarding the possible heterogeneity of L3 cells: the authors report inhomogeneous AD innervation domains in L3 (see also main comment 2) and differences in input summation (some L3 cells integrate linearly, some supra-linearly; lines 272) which by itself might already imply some heterogeneity. I would therefore suggest rewording the statements to clarify what the lack of heterogeneity refers to.

      We agree. In line 212 we now state “cluster analysis (Figure 2D) provided no evidence for subpopulations of layer 3 cells in terms of intrinsic electrophysiological properties (see also Balsamo et al., 2022).”

      (6) n=6 co-recorded pairs are mentioned at line 348, but n=9 at line 366. Are these numbers referring to the same dataset? Please correct or clarify.

      Line 349 refers to a set of 6 co-recorded pairs (n=12 neurons) in double injected mice with Chronos injected in ATN and Chrimson in RSC (cf. Fig. 7E). The 9 pairs mentioned in line 367 refer to another type of experiment where we stimulated layer 3 neurons by depolarizing them to induce action potential firing while recording neighboring layer 4 neurons to assess connectivity. Line 367 now reads: “In n = 9 paired recordings, we did not detect functional synapses between layer 3 and layer 4 neurons.”

      Reviewer #3 (Recommendations For The Authors):

      Questions for the authors/points for addressing:

      I found that the slice electrophysiology experiments were not reported with sufficient detail. For example, in Figure 2, I am assuming that the voltage clamp experiments were carried out using the Cs-based recording solution, while the current clamp experiments were carried out using the K-Gluc intracellular solution. However, this is not explicitly stated and it is possible that all of these experiments were performed using the K-Gluc solution, which would give slightly odd EPSCs due to incomplete space/voltage clamp. Furthermore, the method states that gabazine was used to block GABA(A) receptor-mediated currents, but not when this occurred. Was GABAergic neurotransmission blocked for all measurements of EPSC magnitude/dynamics? If so, why not block GABA(B) receptors? If not blocking GABAergic transmission for measuring EPSCs, why not? This should be stated explicitly either way.

      The addition of drugs or a difference of solution is indicated in the figure legend and/or in the figure itself, as well as in the methods. We now state explicitly: “In a subset of experiments, the following drugs were used to modulate the responses to optogenetic stimulations; the presence of these drugs is indicated in the figure and figure legend, whenever applicable.” (line 632). A Cs-based internal solution and gabazine were used in Figure 5; this is now indicated in the Methods section (line 626). All other experiments were performed using K-Gluc as an internal solution and ACSF.

      Methods: The experiments involving animals are incompletely reported. For example, were both sexes used? The methods state "Experiments were performed on wild‐type and transgenic C57Bl6 mice" - what transgenic mice were used and why is this not reported in detail (strain, etc)? I would refer the authors to the ARRIVE guidelines for reporting in vivo experiments in a reproducible manner (https://arriveguidelines.org/).

      We now added this information in the methods section, subsection “Animals” (line 566-567). Animals of both sexes were used. The only transgenic mouse line used was the Ai14 reporter line (no phenotype), depending on the availability in our animal facility.

      For experiments comparing ATN and RSC inputs onto the same neuron (e.g. Figure 2 supplement 2 G - J), are the authors certain that the observed differences (e.g. rise time and paired-pulse facilitation on the ATN input) are due to differences in the synapses and not a result of different responses of the opsins? Refer to https://pubmed.ncbi.nlm.nih.gov/31822522/ from Jess Cardin's lab. This could easily be tested by switching which opsin is injected into which nucleus (a fair amount of extra work) or comparing the Chrimson synaptic responses with those evoked using Chronos on the same projection, as used in Figure 2 (quite easy as authors should already have the data).

      We actually did switch the opsins across the two injection sites. In Figure 2 - supplement 2G-J, the values linked by a dashed line result from recordings in the switched configuration with respect to the original configuration (in full lines, Chronos injected in RSC and Chrimson in ATN). The values from switched configuration followed the trend of the main configuration and were not statistically different (Mann-Whitney U test).

      Statistical reporting: While the number of cells is generally reported for experiments, the number of slices and animals is not. While slice ephys often treat cells as individual biological replicates, this is not entirely appropriate as it could be argued that multiple cells from a single animal are not independent samples (some sort of mixed effects model that accounts for animals as a random effect would be better). For the experiments in the manuscript, I don't think this is necessary, but it would certainly reassure the reader to report how many animals/slices each dataset came from. At a bare minimum, one would want any dataset to be taken from at least 3 animals from 2 different litters, regardless of how many cells are in there.

      Our slice electrophysiology experiments include data from 38 successfully injected animals: 14 animals injected in ATN, 20 animals injected in RSC, and 4 double injected animals. Typically, we recorded 1 to 3 cells per slice. We now include this information in the text or in the figure legends (line 159, 160, 297, 767, 826, 831, 832, 839, 845, 901, 941).
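
      One simple way to make the animal-level structure of such a dataset explicit is to report per-animal summaries alongside cell counts. The sketch below is illustrative only: it is not the mixed-effects model suggested by the reviewer, the function name `per_animal_means` is hypothetical, and the amplitudes are invented values rather than data from the study.

```python
from collections import defaultdict
from statistics import mean


def per_animal_means(records):
    """Collapse cell-level measurements to one mean value per animal.

    records: iterable of (animal_id, value) pairs, one pair per recorded cell.
    Returns a dict mapping animal_id -> mean of that animal's cells.
    """
    by_animal = defaultdict(list)
    for animal_id, value in records:
        by_animal[animal_id].append(value)
    return {animal: mean(values) for animal, values in by_animal.items()}


# Hypothetical EPSC amplitudes (pA); illustrative only, not data from the study.
cells = [
    ("mouse_A", 120.0), ("mouse_A", 100.0),
    ("mouse_B", 80.0),
    ("mouse_C", 150.0), ("mouse_C", 130.0), ("mouse_C", 140.0),
]

animal_means = per_animal_means(cells)
n_cells = len(cells)
n_animals = len(animal_means)

# Report both levels of replication so the reader can judge independence.
print(f"n = {n_cells} cells from N = {n_animals} animals")
for animal, value in sorted(animal_means.items()):
    print(animal, value)
```

      Reporting both the number of cells and the number of animals, as in the printed summary, lets the reader judge how independent the samples are.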

      For the optogenetic experiments looking at the summation of EPSPs (e.g. figure 4), I have two questions: why were EPSPs measured and not EPSCs? The latter would be expected to give a better readout of AMPA receptor-mediated synaptic currents. And secondly, why was 20 Hz stimulation used for these experiments? One might expect theta stimulation to be a more physiologically-relevant frequency of stimulation for comparing ATN and RSC inputs to single neurons, given the relevance with spatial navigation and that the paper's conclusions were based around the head direction system. Similarly, gamma stimulation may also have been informative. Did the authors try different frequencies of stimulation?

      Question 1. The current clamp configuration allows us to measure EPSP amplification/prolongation by NMDA or persistent Na currents (cf. Fricker and Miles 2000), which might contribute to supralinearity.

      Question 2. In a previous study from our group about the AD to PrS connection (Nassar et al., 2018), no significant difference was observed in the dynamics of EPSCs between stimulations at 10 Hz versus 30 Hz. We therefore chose 20 Hz. This value is in the range of HD cell firing (Taube 1995, 1998; Blair and Sharp 1995): peak firing rates are 18 to 24 spikes/sec in RSC and 41 spikes/sec in AD, although mean firing rates might be lower. In hindsight, we agree that it would have been useful to include 8 Hz or 40 Hz stimulations.

      The GABA(A) antagonist experiments in Figure 5 are interesting but I have concerns about the statistical power of these experiments - n of 3 is absolutely borderline for being able to draw meaningful conclusions, especially if this small sample of cells came from just 1 or 2 animals. The number of animals used should be stated and/or caution should be applied when considering the potential mechanisms of supralinear summation of EPSPs. It looks like the slight delay in RSC input EPSP relative to ATN that was in earlier figures is not present here - could this be the loss of feedforward inhibition?

      The current clamp experiments in the presence of QX314 and a Cs gluconate based internal solution were preceded by initial experiments using puff applications of glutamate to the recorded neurons (not shown). Results from those experiments had pointed towards a role for TTX resistant sodium currents and for NMDA receptor activation as a factor favoring the amplification and prolongation of glutamate induced events. They inspired the design of the dual wavelength stimulation experiments shown in Figure 5, and oriented our discussion of the results. We agree of course that more work is required to dissect the role of disinhibition for EPSP amplification. This is however beyond the present study.

      Concerning the EPSP onset delays following RSC input stimulation: in this set of experiments, we compensated for the notoriously longer delay to EPSP onset, following RSC axon stimulation, by shifting the photostimulation (red) of RSC fibers to -2 ms, relative to the onset of photostimulation of ATN fibers (blue). This experimental trick led to an improved alignment of the onset of the postsynaptic response, as shown in the figure below for the reviewer.

      Author response image 3.

      In these experiments, the onset of RSC photostimulation was advanced by 2 ms (to -2 ms relative to the onset of ATN photostimulation), in an attempt to better align the EPSP onset to the one evoked by ATN stimulation.

      We insert in the results a sentence to indicate that experiments illustrated in Figure 5 were performed in only a small sample of 3 cells that came from 2 mice (line 297), so caution should be applied. In the discussion we now phrase this more carefully: “From a small sample of cells it appears that EPSP amplification may be facilitated by a reduction in synaptic inhibition (n = 3; Figure 5)” (line 487).

      Figure 7: I appreciate the difficulties in making dual recordings from older animals, but no conclusion about the RSC input can legitimately be made with n=1.

      Agreed. We want to avoid any overinterpretation, and point out in the results section that the RSC stimulation data is from a single cell pair. The sentence now reads: “... layer 4 neurons occurred after firing in the layer 3 neuron, following ATN afferent stimuli, in 4 out of 5 cell pairs. We also observed this sequence when RSC input was activated, in one tested pair.” (line 347-349)

      Minor points:

      Line 104: 'within the two subnuclei that form the anterior thalamus' - the ATN actually has three subdivisions (AD, AV, AM) so this should state 'two of the three nuclei that form the anterior thalamus...'

      Corrected, line 103

      Line 125: should read "figure 1F" and not "figure 2F".

      Corrected, line 124

      Line 277-280: Why were two different posthoc tests used on the same data in Figures 3E & F?

      We used Sidak’s multiple comparisons test to compare Sum vs. Dual for each event (two different configurations at each time point; asterisks), and Friedman’s test with Dunn’s post hoc test to compare the nth EPSP amplitude to the first one for Dual events (same configuration across time points; hashmarks). We give two-way ANOVA results in the legend.
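
      For readers unfamiliar with the Sidak correction, a minimal sketch of the underlying adjustment is given here. It shows only the general formula for m comparisons, assuming independent tests; it does not reproduce the full post hoc procedure of any statistics package, and the function names and p-values are hypothetical.

```python
def sidak_alpha(alpha, m):
    """Per-comparison significance level that keeps the family-wise
    error rate at `alpha` across `m` independent comparisons."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def sidak_adjust(p_values):
    """Sidak-adjusted p-values: p_adj = 1 - (1 - p)^m, with m = len(p_values)."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical raw p-values, one per stimulus in a train of five events.
raw_p = [0.40, 0.03, 0.008, 0.004, 0.02]

threshold = sidak_alpha(0.05, len(raw_p))
adjusted = sidak_adjust(raw_p)

print(f"per-comparison threshold: {threshold:.5f}")
for raw, adj in zip(raw_p, adjusted):
    # Comparing raw p to the threshold is equivalent to comparing adj to 0.05.
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.4f}, significant = {adj < 0.05}")
```

      Either form can be used: comparing each raw p-value against the reduced threshold, or comparing each adjusted p-value against the original alpha, leads to the same decisions.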

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Major concerns:

      (1) Is the direct binding of MCAK to the microtubule cap important for its in vivo function?

      a. The authors claim that their "study provides mechanistic insights into understanding the end-binding mechanism of MCAK". I respectfully disagree. My concern is that the paper offers limited insights into the physiological significance of direct end-binding for MCAK activity, even in vitro. The authors estimate that in the absence of other proteins in vitro, ~95% of MCAK molecules arrive at the tip by direct binding in the presence of ~ physiological ATP concentration (1 mM). In cells, however, the major end-binding pathway may be mediated by EB, with the direct binding pathway contributing little to none. This is a reasonable concern because the apparent dissociation constant measured by the authors shows that MCAK binding to microtubules in the presence of ATP is very weak (69 uM). This concern should be addressed by 1) calculating relative contributions of direct and EB-dependent pathways based on the affinities measured in this and other published papers and estimated intracellular concentrations. Although there are many unknowns about these interactions in cells, a modeling-based analysis may be revealing. 2) the recapitulation of these pathways using purified proteins in vitro is also feasible. Ideally, some direct evidence should be provided, e.g. based on MCAK function-separating mutants (GDP-Pi tubulin binding vs. catalytic activity at the curled protofilaments) that contribution from the direct binding of MCAK to microtubule cap in EB presence is significant.

      We thank the reviewer for the thoughtful comments.

      (1) We think that the end-binding affinity of MCAK makes a significant contribution to its cellular functions. To elucidate this concept, we now use a simple model shown in Supplementary Appendix-2 (see pages 49-51, lines 1246-1316). In this model, we simplified MCAK and EB1 binding to microtubule ends by considering only these two proteins while neglecting other factors (e.g. XMAP215). Specifically, we considered two scenarios: one in which both proteins freely diffuse in the cytoplasm and another where MCAK is localized to specific cellular structures, such as the centrosome or centromere. Based on the modeling results, we argue that MCAK's functional impact at microtubule ends derives both from its intrinsic end-binding capacity and its ability to strengthen the EB1-mediated end association pathway.

      (2) We agree with the reviewer that MCAK exhibiting a lower end-binding affinity (69 µM) is indeed intriguing, as one might intuitively expect a stronger affinity, e.g. in the nanomolar range. Several factors may contribute to this observation. First, this could be partly due to the in vitro system employed, which may not perfectly replicate in vivo conditions, especially when considering cellular processes quantitatively. Variations in medium composition can significantly influence the binding state. For example, reducing salt concentration leads to a marked increase in MCAK’s binding affinity (Helenius et al., 2006; Maurer et al., 2011; McHugh et al., 2019). Additionally, while numerous binding events with short durations were detected, we excluded transient interactions from our analysis to facilitate quantification. This likely leads to an underestimation of the on-rate and, consequently, the binding affinity. Moreover, to minimize the interference of purification tags (His-tag), we ensured their complete removal during protein sample preparation. Previous studies reported that retaining the His-tag of MAPs affects the binding affinity to microtubules (Maurer et al., 2011; Zhu et al., 2009). Finally, a low affinity is not necessarily unexpected. Considering the microtubule end as a receptor with multiple binding sites for MCAK, the overall binding affinity is in the nanomolar range (260 nM). This does not necessarily contradict MCAK being a microtubule dynamics regulator as only a few MCAK molecules may suffice to induce microtubule catastrophe (as discussed on page 13, lines 408-441).

      (3) Ideally, we would search for mutants that specifically interfere with the binding of GDP-Pi-tubulin or the curled protofilaments. However, the mutant we tested significantly impacts the overall affinity of MCAK to microtubules (both end and lattice), making it challenging to isolate and discuss the function of MCAK with respect to the binding to GDP-Pi-tubulin alone. Additionally, we also think that the GDP-Pi-tubulin in the EB cap and the tubulin in the curved protofilaments may share structural similarities. For instance, the tubulin dimers in both states may be less compact compared to those in the lattice, which could explain why MCAK recognizes both simultaneously (Manka and Moores, 2018). However, this remains a conjecture, as there is currently no direct evidence to support it.
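
      To illustrate why the difference between the 69 µM value and the overall 260 nM value discussed in point (2) matters, the sketch below evaluates the standard single-site binding isotherm, θ = c / (c + Kd), for both dissociation constants. The free MCAK concentration of 0.1 µM is an assumed, purely illustrative value, and the function name `fraction_bound` is hypothetical; this is not the model of Supplementary Appendix-2.

```python
def fraction_bound(concentration_um, kd_um):
    """Equilibrium occupancy of a binding site: theta = c / (c + Kd).

    Both arguments are in micromolar.
    """
    if concentration_um < 0 or kd_um <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return concentration_um / (concentration_um + kd_um)


KD_WEAK_UM = 69.0      # end-binding affinity quoted in the response (69 uM)
KD_OVERALL_UM = 0.26   # overall affinity quoted in the response (260 nM)
ASSUMED_CONC_UM = 0.1  # hypothetical free MCAK concentration, illustration only

occupancy_weak = fraction_bound(ASSUMED_CONC_UM, KD_WEAK_UM)
occupancy_overall = fraction_bound(ASSUMED_CONC_UM, KD_OVERALL_UM)

print(f"occupancy at Kd = 69 uM:  {occupancy_weak:.4f}")
print(f"occupancy at Kd = 260 nM: {occupancy_overall:.4f}")
```

      Under these assumptions the two dissociation constants give very different occupancies at the same concentration, which is why the choice of reference (a single site versus the whole microtubule end) shapes the physiological interpretation.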

      b. As mentioned in the Discussion, preferential MCAK binding to tubulins near the MT tip may enhance MCAK targeting of terminal tubulins AFTER the MCAK has been "delivered" to the distal cap via the EB-dependent mechanism. This is a different targeting mechanism than the direct MCAK-binding. However, the measured binding affinity between MCAK and GMPCPP tubulins is so weak (69 uM), that this effect is also unlikely to have any impact because the binding events between MCAK and microtubule should be extremely rare. Without hard evidence, the arguments for this enhancement are very speculative.

      Please see our response to the comment No. 1. Additionally, we have revised our discussion to discuss the end-binding affinity of MCAK as well as its physiological relevance (please see page 13, lines 408-441; and see Supplementary Appendix-2 in pages 49-51, lines 1246-1316).

      (2) The authors do not provide sufficient justification and explanation for their investigation of the effects of different nucleotides in MCAK binding affinity. A clear summary of the nucleotide-dependent function of MCAK (introduction with references to prior affinity measurements and corresponding MCAK affinities), the justifications for this investigation, and what has been learned from using different nucleotides (discussion) should be provided. My take on these results is that by far the strongest effect on microtubule wall and tip binding is achieved by adding any adenosine, whereas differences between different nucleotides are relatively minor. Was this expected? What can be learned from the apparent similarity between ATP and AMPPNP effects in some assays (Fig 1E, 4C, etc) but not others (Fig 1D,F, etc)?

      We thank the reviewer for this suggestion. We have revised the manuscript accordingly, and below are the main points of our response.

      (1) The experiment investigating the effects of different nucleotides on MCAK binding affinity was inspired by the previous studies demonstrating that kinesin-13 interactions with microtubules are highly dependent on their adenosine-bound states. For example, kinesin-13s tightly bind microtubules and prefer to form protofilament curls or rings with tubulin in the AMPPNP state, whereas kinesin-13s are considered to move along the microtubule lattice via one-dimensional diffusion in the ADP·Pi state (Asenjo et al., 2013; Benoit et al., 2018; Friel and Howard, 2011; Helenius et al., 2006). Based on these observations, we wondered whether MCAK's adenosine-bound states might similarly affect its binding preference for growing microtubule ends. We have made the motivation clear in the revised manuscript (please see page 7, lines 199-209).

      (2) Our main finding regarding the effects of nucleotides is that MCAK shows differential end-binding affinity and preference based on its nucleotide state. First, MCAK shows the greatest preference for growing microtubule ends in the ATP state, supporting the idea that diffusive MCAK (MCAK·ATP) can directly bind to growing microtubule ends. Second, MCAK·ATP also demonstrates a binding preference for GTPγS microtubules and the ends of GMPCPP microtubules. The similar trends in binding preference suggest that the affinity for GDP·Pi-tubulin and GTP-tubulin likely underpins MCAK’s preference for growing microtubule ends. To clarify these points, we have added further discussions in the manuscript (please see page 8, lines 230-233; page 9, lines 258-270 and pages 13-14, lines 443-458).

      (3) It is not clear why the authors decided to use these specific mutant MCAK proteins to advance their arguments about the importance of direct tip binding. Both mutants are enzymatically inactive. Both show roughly similar tip interactions, with some (minor) differences. Without a clear understanding of what these mutants represent, the provided interpretations of the corresponding results are not convincing.

      We thank the reviewer for this comment. In the revised manuscript, we no longer draw conclusions about the importance of end-binding based on the mutant data. Instead, we think that the mutant data provide insights into the structural basis of the end-binding preference. Therefore, we have rewritten the results in this section to more accurately reflect these findings (please see page 10, lines 295-327).

      (4) GMPCPP microtubules are used in the current study to represent normal dynamic microtubule ends, based on some published studies. However, there is no consensus in the field regarding the structure of growing vs. GMPCPP-stabilized microtubule ends, which additionally may be sensitive to specific experimental conditions (buffers, temperature, age of microtubules, etc). To strengthen the authors' argument, Taxol-stabilized microtubules should be used as a control to test if the effects are specific. Additionally, the authors should consider the possibility that stronger MCAK binding to the ends of different types of microtubules may reflect MCAK-dependent depolymerization events on a very small scale (several tubulin rows). These nano-scale changes to tubulins and the microtubule end may lead to the accumulation of small tubulin-MCAK aggregates, as is seen with other MAPs and slowly depolymerizing microtubules. These effects for MCAK may also depend on specific nucleotides, further complicating the interpretation. This possibility should be addressed because it provides a different interpretation than presented in the manuscript.

      Regarding the two points raised here, our thoughts are as follows:

      (1) The end of GMPCPP-stabilized microtubules differs from that of growing microtubules, with the most obvious known difference being the absence of the region enriched in GDP-Pi-tubulin. We consider the end of GMPCPP microtubules as an analogue of the distal tip of growing microtubules, based on two key features: (1) curled protofilaments and (2) GMPCPP-tubulin, a close analogue of GTP-tubulin. Notably, both features are present at the ends of both GMPCPP-stabilized and growing microtubules. Moreover, we agree with the suggestion to use taxol-stabilized microtubules as a control. This would eliminate the second feature (absence of GTP-tubulin), allowing us to isolate the effect of the first feature. Therefore, we conducted this experiment, and our data showed that MCAK exhibits only a mild binding preference for the ends of taxol-stabilized microtubules, which is much less pronounced than for the ends of GMPCPP microtubules. This observation supports the idea that GMPCPP-stabilized ends closely resemble the growing ends of microtubules.

      (2) The reviewer suggested that stronger MCAK binding to the ends of different types of microtubules might reflect MCAK-dependent depolymerization events on a very small scale. This is an insightful possibility, which we had overlooked in the original manuscript. Fortunately, we performed the experiments at single-molecule concentrations. Upon reviewing the raw data, we found that under ATP conditions, the binding events of MCAK were not cumulative (see Author response image 1 below) and showed no evidence of local accumulation of MCAK-tubulin aggregates.

      Author response image 1.

      Representative kymograph showing GFP-MCAK binding at the ends and lattice of GMPCPP microtubules in the presence of 1 mM ATP (10 nM GFP-MCAK), corresponding to Fig. 5A. Arrow: end-binding of MCAK. Vertical bar: 1 s; horizontal bar: 2 μm.

      (5) It would be helpful if the authors provided microtubule polymerization rates and catastrophe frequencies for assays with dynamic microtubules and MCAK in the presence of different nucleotides. The video recordings of microtubules under these conditions are already available to the authors, so it should not be difficult to provide these quantifications. They may reveal that microtubule ends are different (or not) under the examined conditions. It would also help to increase the overall credibility of this study by providing data that are easy to compare between different labs.

      We thank the reviewer for this suggestion. In the revised manuscript, we have provided data on the growth rates, which are similar across the different nucleotide states (Fig. s1). However, due to the short duration of our recordings (usually 5 minutes, but with a high frame rate, 10 fps), we did not observe many catastrophe events, which prevented us from quantifying catastrophe frequency using the current dataset. Since we measured the binding kinetics of MCAK during the growing phase of microtubules, the similar growth rates and microtubule end morphologies suggest that the microtubule ends are comparable across the different conditions.

      Reviewer #1 (Recommendations For The Authors):

      a. Please provide more details about how the microtubule-bound molecules were selected for analysis (include a description of scripts, selection criteria, and filters, if any). Fig 1A arrows do not provide sufficient information.

      We first measured the fluorescence intensity of each binding event. A probability distribution of these intensities was then constructed and fitted with a Gaussian function. A binding event was considered to correspond to a single molecule if its intensity fell within μ±2σ of the distribution. The details of the single-molecule screening process are now provided in the revised manuscript (see page 17, lines 574-583).
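
      The μ±2σ screening rule described above can be illustrated with a minimal sketch. It is an assumption-laden stand-in rather than the analysis script used in the study: the sample mean and standard deviation replace the Gaussian fit to the intensity histogram, and the function names and example intensities are hypothetical.

```python
from statistics import mean, stdev


def single_molecule_window(intensities, n_sigma=2.0):
    """Return the (low, high) intensity window mu +/- n_sigma * sigma.

    Assumption: mu and sigma are taken from the sample mean and standard
    deviation instead of a Gaussian fit to the intensity histogram.
    """
    mu = mean(intensities)
    sigma = stdev(intensities)
    return mu - n_sigma * sigma, mu + n_sigma * sigma


def select_single_molecules(intensities, n_sigma=2.0):
    """Keep only the events whose intensity falls inside the window."""
    low, high = single_molecule_window(intensities, n_sigma)
    return [value for value in intensities if low <= value <= high]


# Hypothetical intensities (A.U.); the last value mimics a bright aggregate.
example = [280, 295, 300, 305, 310, 290, 315, 285, 300, 900]
kept = select_single_molecules(example)
```

      Because a bright outlier inflates the sample standard deviation, fitting the histogram (as described in the response) is likely to be more robust than this shortcut when aggregates are frequent.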

      b. Evidence that MCAK is dimeric in solution should be provided (gel filtration results, controls for Figs1A - bleaching, or comparison with single GFP fluorophore).

      In the revised manuscript, we provide the gel filtration results of purified MCAK and the other proteins used in this study. The elution volume of the peak for GFP-MCAK corresponded to a molecular weight range between 120 kDa (EB1-GFP dimer) and 260 kDa (XMAP215-GFP-his6), suggesting that GFP-MCAK exists as a dimer (~220 kDa) under the experimental conditions (please see Fig. s1 and page 5, lines 104-105). In addition, we measured the fluorescence intensity of both MCAK<sup>sN+M</sup> and MCAK. MCAK<sup>sN+M</sup> is a monomeric mutant that contains the neck domain and motor domain (Wang et al., 2012). The average intensity of MCAK<sup>sN+M</sup> is 196 A.U., about 65% of that of MCAK (300 A.U.). Together, these two measurements suggest that the purified MCAK used in this study exists as a dimer (see Fig. s1).

      c. Evidence that MCAK on microtubules represents single molecules should be provided (distribution of GFP brightness with controls - GFP imaged under identical conditions). Since assay buffers include detergent, which is not desirable, all controls should be done using the same assay conditions. The authors should rule out that their main results are detergent-sensitive.

      (1) Regarding whether MCAK on microtubules represents single molecules: please refer to our responses to the two points above.

      (2) To rule out an effect of tween-20 (0.0001%, v/v), we performed additional control experiments. The results showed that it has no significant effect on the microtubule-binding affinity of MCAK (see Author response image 2 below).

      Author response image 2.

      Tween-20 (0.0001%, v/v) has no significant effect on the microtubule-binding affinity of MCAK. (A) Representative projection images of GFP-MCAK (5 nM) binding to taxol-stabilized GDP microtubules in the presence of 1 mM AMPPNP with or without tween-20. The upper panel shows the results of the control experiments performed without MCAK. Scale bar: 5 μm. (B) Statistical quantification of the binding intensity of GFP-MCAK on GDP microtubules with or without tween-20 (53 microtubules from 3 assays and 70 microtubules from 3 assays, respectively). Data are presented as mean ± SEM. Statistical comparisons were performed using the two-tailed Mann-Whitney U test with Bonferroni correction; n.s., no significance.

      d. How did the authors plot single-molecule intensity distributions? I am confused as to why the intensity distribution for single molecules in Fig 1D and 2A looks so perfectly smooth, non-pixelated, and broader than expected for GFP wavelength. Please provide unprocessed original distributions, pixel size, and more details about how the distributions were processed.

      In the revised manuscript, we provided unprocessed original data in Fig. 1B and Fig. 2A. We thank the reviewer for pointing out this problem.

      e. Many quantifications are based on a limited number of microtubules and the number of molecules is not provided, starting from Fig 1D and down. Please provide detailed statistics and explain what is plotted (mean with SEM?) on each graph.

      We performed a thorough inspection of the manuscript and corrected the identified issues.

      f. Plots with averaged data should be supplemented with error bars and N should be provided in the legend. E.g. Fig 1C - average position of MT and peak positions.

      We agree with the reviewer. In the revised manuscript, we have made the changes accordingly (e.g. Fig. 2C).

      g. Detailed information should be provided about protein constructs used in this work including all tags. The use of truncated proteins or charged/bulky tags can modify protein-microtubule interactions.

      We agree with the reviewer. In the revised manuscript, we provide the information of all constructs (see Fig. s1 and the related descriptions in Methods, pages 15-16, lines 476-534).

      h. Line 515: We estimated that the accuracy of microtubule end tracking was ~6 nm by measuring the standard error of the distribution of the estimated error in the microtubule end position. - evidence should be provided using the conditions of this study, not the reference to the prior work by others.

      i. Line 520: "We estimated that the accuracy of the measured position was ~2 nm by measuring the standard error of the fitting peak location". Please provide evidence.

      Points h-i: we now provide detailed descriptions of how we estimated the tracking and measurement accuracy and error in our work. Please see pages 18-19, lines 626-645.

      j. Kymographs in Fig 5G are barely visible. Please provide single-channel greyscale images. What are the dim molecules diffusing on this microtubule?

      We have incorporated the changes suggested by the reviewer. We think that some of the dim signals may result from stochastic background noise, while others likely represent transient binding of MCAK. The exposure time in our experiments was approximately 0.05 seconds; if the binding duration were shorter than this, the signal would be lower (i.e. the “dim” signals). It is important to note that in this study, we selected binding events lasting at least 2 consecutive frames, meaning that transient binding events were not included. This point has been clarified in the Methods section (see page 17, lines 573-583).
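
      The selection of events lasting at least 2 consecutive frames can be sketched as follows. This is only an illustration under stated assumptions: the per-frame boolean trace, the function name, and the 0.1 s frame interval are hypothetical and are not taken from the study's analysis code.

```python
def find_binding_events(bound, min_frames=2):
    """Return (start, end) frame indices (end exclusive) for runs of True
    in `bound` that last at least `min_frames` consecutive frames."""
    events = []
    start = None
    for i, is_bound in enumerate(bound):
        if is_bound and start is None:
            start = i  # a new run begins
        elif not is_bound and start is not None:
            if i - start >= min_frames:
                events.append((start, i))
            start = None  # the run has ended
    # Close a run that is still open at the end of the trace.
    if start is not None and len(bound) - start >= min_frames:
        events.append((start, len(bound)))
    return events


FRAME_INTERVAL_S = 0.1  # hypothetical frame interval, for illustration only

# Hypothetical trace: the single bound frame at index 1 is discarded.
trace = [False, True, False, True, True, True, False, True, True]
events = find_binding_events(trace)
dwell_times = [(end - start) * FRAME_INTERVAL_S for start, end in events]
```

      With this rule, signals present in only one frame, such as the dim transient signals discussed above, never enter the dwell-time statistics.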

      k. Please provide a methods description for Fig 6. Did the buffer include 1 mM ATP? The presence of ATP would make these conditions more physiological. ATP concentration should be stated clearly in the main text or figure legend.

      The buffer contains ATP. In the revised manuscript, we have provided the methods for the experiments of microtubule dynamics assay, as well as the analysis of microtubule lifetimes and catastrophe frequency (see page 17, lines 561-572 and page 20, lines 685-690).

      l. Line 104: experiment was performed in BRB80 supplemented with 50 mM KCl and 1 mM ATP, providing a nearly physiological ion strength. Please provide a reference or add your calculations in Methods.

      We have provided references on page 5, lines 101-104 of our manuscript.

      m. What was the MCAK concentration in Figure 4? Did the microtubule shorten under any of these conditions?

      In these experiments, we used very low concentrations of MCAK together with taxol-stabilized microtubules, so no microtubule shortening was observed. The concentrations were: ATP, 10 nM GFP-MCAK; AMPPNP, 1 nM GFP-MCAK; ADP, 10 nM GFP-MCAK; APO state, 0.1 nM GFP-MCAK.

      Other criticism:

      Text improvements are recommended in the Discussion. For example, line 348: Fourth, the loss of the binding preference.. suggests that the binding preference .. is required for the optimal .. preference.

      We thank the reviewer for pointing this out. In the revised manuscript, we have thoroughly revised and reviewed the text.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, Chen et al. investigate the localization of microtubule kinesin-13 MCAK to the microtubule ends. MCAK is a prominent microtubule depolymerase whose molecular mechanisms of action have been extensively studied by a number of labs over the last ~twenty years. Here, the authors use single-molecule approaches to investigate the precise localization of MCAK on growing microtubules and conclude that MCAK preferentially binds to a GDP-Pi-tubulin portion of the microtubule end. The conclusions are speculative and not well substantiated by the data, making the impact of the study in its current form rather limited. Specifically, greater effort should be made to define the region of MCAK binding on microtubule ends, as well as its structural characteristics. Given that MCAK has been previously shown to effectively tip-track growing microtubule ends through an established interaction with EB proteins, the physiological relevance of the present study is unclear. Finally, the manuscript does not cite or properly discuss a number of relevant literature references, the results of which should be directly compared and contrasted to those presented here.

      We thank the reviewer for the comments. As these suggestions are more thoroughly expressed in the following comments for authors, we will provide the responses in the corresponding sections, as shown below.

      Reviewer #2 (Recommendations For The Authors):

      Significant concerns:

      (1) Establishing the precise localization of MCAK wrt microtubule end is highly non-trivial. More details should be provided, including substantial supplementary data. In particular, the authors claim ~6 nm accuracy in microtubule end positioning - this should be substantiated by data showing individual overlaid microtubule end intensity profiles as well as fits with standard deviations etc. Furthermore, to conclude that MCAK binds behind XMAP215, the authors should look at the localization of the two proteins simultaneously, on the same microtubule end. Notably, EB binding profiles are well known to exponentially decay along the microtubule lattice - this is not very apparent from the presented data. If MCAK's autonomous binding pattern matches that of EB, we should be seeing an exponentially-decaying localization for MCAK as well? However, averaged MCAK signals seem to only be fitted to Gaussian. Note that the EB binding region (i.e. position and size of the EB comet) can be substantially modulated by increasing the microtubule growth rate - this can be easily accomplished by increasing tubulin concentrations or the addition of XMAP215 (e.g. see Maurer et al. Cur Bio 2014). Thus to establish that MCAK on its own binds the same region as EB, experiments that directly modulate the size and the position of this region should be added.

      (1) We thank the reviewer for this comment. Regarding the accuracy in microtubule end positioning, we now provide more details, and please see pages 18-19, lines 625-645 in the revised manuscript.

      (2) Regarding the relative localization of XMAP215 and MCAK, we performed additional experiments to record their colocalizations simultaneously, on the same microtubule end. Our results showed that MCAK predominantly binds behind XMAP215, with 14.5% appearing within the XMAP215’s binding region. Please see Fig. 2.D-E and lines 184-197 in the revised manuscript.

      (3) Regarding the exponential decay of the EB1 signal along microtubules, we observed that the position probability distribution measured in the present study follows a Gaussian distribution, and the expected exponential decay was not apparent. Since the exponential decay is thought to result from the time delay between tubulin polymerization and GTP hydrolysis, slower polymerization is expected to reduce this latency (Maurer et al., 2014). In our experiments, the growth rate was relatively low (~0.7 μm/min), much slower than the rate observed in cells, where the comet-shaped EB1 signal is most pronounced. A previous study has shown that the exponential decay of EB1 is more pronounced at growth rates exceeding 3 μm/min in vitro (Maurer et al., 2014). Therefore, we think that the relatively slow growth may account for the observed non-exponential decay distribution of the EB1 signals. The same reason may also explain the distribution of MCAK.

      (4) We agree with the reviewer’s suggestion that altering microtubule growth rate is a valid and effective approach to regulate the EB cap length. However, the conclusion that MCAK binds to the EB region is supported by three lines of evidence: (1) the localization of MCAK at the ends of microtubules, (2) new experimental data showing that MCAK binds to the proximal end of the XMAP215 site, and (3) the tendency of MCAK to bind GTPγS microtubules, similar to EB1. Based on these findings, we did not pursue additional experiments to modify the length of the EB cap.

      (2) Even if MCAK indeed binds behind XMAP215, there is no evidence that this region is defined by the GDP-Pi nucleotide state; it could still be curved protofilaments. GTPyS is an analogue of GTP - to what extent GTPyS microtubules exactly mimic the GDP-Pi-tubulin state remains controversial. Furthermore, nucleotide sensing for EB is thought to be achieved through its binding at the interface of four tubulin dimers. However MCAK's binding site is distinct, and it has been shown to recognize intradimer tubulin curvature. Thus it is not clear how MCAK would sense the nucleotide state. On the other hand, there is mounting evidence that the morphology of the growing microtubule end can be highly variable, and that curved protofilaments may be protruding off the growing ends for tens of nanometers or more, previously observed both by EM as well as by fluorescence (e.g. Mcintosh, Moores, Chretien, Odde, Gardner, Akhmanova, Hancock, Zanic labs). Thus, to establish that MCAK indeed localizes along the closed lattice, EM approaches should be used.

      First, we conducted additional experiments that demonstrate MCAK indeed binds behind XMAP215, supporting the conclusion that MCAK interacts with the EB cap (please see Fig. 2 in the revised manuscript). Second, our argument that MCAK preferentially binds to GDP-Pi tubulin is based on two observations: (1) the binding regions of MCAK overlap with those of EB1, and (2) MCAK preferentially binds to GTPγS microtubules, which are considered a close analogue of GDP-Pi tubulin. Third, understanding the structural basis of how MCAK senses the nucleotide state of tubulin is beyond the scope of the present study. However, inspired by the reviewer’s suggestion, we looked into the structure of the MCAK-tubulin complex. The L2 loop of MCAK makes direct contact with the interdimer interface (Trofimova et al., 2018; Wang et al., 2017), which could provide a structural basis for recognizing the changes induced by GTP hydrolysis. While this remains a hypothesis, it is certainly a promising direction for future research. Fourth, we agree with the reviewer that an EM approach would be ideal for establishing that MCAK localizes along the closed lattice. However, this is not the focus of the current study. Instead, we argue that MCAK binds to the EB cap, where at least some lateral interactions are likely to have formed.

      (3) The physiological relevance of the study is rather questionable: MCAK has been previously established to be able to both diffuse along the microtubule lattice (e.g. Helenius et al.) as well as hitchhike on EBs (Gouveia et al.). Given the established localization of EBs to growing microtubule ends in cells, and apparently higher affinity of MCAK for EB vs. the microtubule end itself (although direct comparisons with the literature have not been reported here), the relevance of MCAK's autonomous binding to dynamic microtubule ends is dubious.

      We thank the reviewer for raising the importance of physiological relevance. Please refer to our response to comment No. 1 of Reviewer 1. Briefly, we think that the end-binding affinity of MCAK makes a significant contribution to its cellular functions. To illustrate this concept, we now use a simple model shown in Supplementary Appendix-2 (see pages 49-51, lines 1246-1316). In this model, we simplified MCAK and EB1 binding to microtubule ends by considering only these two proteins while neglecting other factors (e.g. XMAP215). Specifically, we considered two scenarios: one in which both proteins freely diffuse in the cytoplasm and another in which MCAK is localized to specific cellular structures, such as the centrosome or centromere. Based on the modeling results, we argue that MCAK's functional impact at microtubule ends derives both from its intrinsic end-binding capacity and its ability to strengthen the EB1-mediated end association pathway.

      (4) Finally, the study seriously lacks discussion of and comparison with the existing literature on this topic. There are major omissions in citing relevant literature, such as e.g. landmark study by Kinoshita et al. Science 2001. Several findings reported here directly contradict previous findings in the literature. Direct comparison with e.g. Gouveia et al findings, Helenius et al. findings, and others need to be included. For example, Gouveia et al reported that EB is necessary for MCAK plus-end-tracking in vitro (please see Figure 1 of their manuscript). The authors should discuss how they reconcile the differences in their findings when compared to this earlier study.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have updated the text and included comparative discussions of other relevant studies in the Discussion section. Specifically, we added comparisons with the research on XMAP215 on page 14, lines 459-472 (Barr and Gergely, 2008; Kinoshita et al., 2001; Tournebize et al., 2000). Additionally, we have compared our findings with those of Gouveia et al. and Helenius et al. regarding MCAK's preference for binding microtubule ends on page 6, lines 145-157 and page 13, lines 408-441, respectively (Gouveia et al., 2010; Helenius et al., 2006).

      Additional specific comments:

      Figure 1

      Gouveia et al. (Figure 1) reported that MCAK does not autonomously preferentially localize to growing tips. Specifically, Gouveia et al. found equal association rates of MCAK to both the lattice and the tip in the presence of EB3delT, an EB3 construct that does not directly interact with MCAK. How can these findings be reconciled with the results presented here?

      We are uncertain why there was no observed difference in the on-rates to the lattice and the end in the study by Gouveia et al. Even when considering only the known affinity of MCAK for curved protofilaments at the distal tip of growing microtubules, we would still expect to observe an end-binding preference. After carefully comparing the experimental conditions, we nevertheless identified some differences. First, we used a 160 nm tip size to calculate the on-rate (k<sub>on</sub>), whereas Gouveia et al. used a 450 nm tip. Using a longer tip size would naturally lead to a smaller k<sub>on</sub> value. Note that we chose 160 nm for several reasons: (i) a previous cryo-electron tomography study showed that the sheet structures of dynamic microtubule ends have an average length of around 180 nm (Guesdon et al., 2016); (ii) analysis of fluorescence signals at dynamic microtubule ends has demonstrated that the taper length at the microtubule end is less than 180 nm (Maurer et al., 2014); (iii) in the present study, we estimated that the length of MCAK's end-binding region is approximately 160 nm. Second, in Gouveia et al., single-molecule binding events were recorded in the presence of 75 nM EB3ΔT, which could potentially create a crowded environment at the tip, reducing MCAK binding. Third, as mentioned in our response to Reviewer 1, we took great care to minimize the interference from purification tags (e.g., His-tag) by ensuring their complete removal during protein preparation. Previous studies reported that retaining the His-tag of MAPs led to a significant increase in binding to microtubules (Maurer et al., 2011; Zhu et al., 2009). We believe that some of the factors mentioned above, or their combined effects, may account for the differences between these two observations.

      1C shows the decay of tubulin signal over several hundred nm - should show individual traces? How aligned? Doesn't this long decay suggest protruding protofilaments? (E.g. Odde/Gardner work).

      (1) In the revised manuscript, we now show individual traces (e.g. in Fig. 1B and Fig. 2A). The average trace for tubulin signal with standard deviation was shown in Fig. 2C.

      (2) The microtubule lattice was modeled as a Gaussian wall and its end as a half-Gaussian in every frame. We used the peak position of the half-Gaussian in each frame to align and average the microtubule end signals over the dwell time. The half-Gaussian peak of the averaged microtubule end was then used as the reference for measuring the intensity profile of each single-molecule binding event in every frame (see page 18, lines 607-624).
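
      The align-then-average step can be sketched as below. This is a simplified illustration, not the procedure used in the study: the position of the brightest pixel stands in for the fitted half-Gaussian peak, alignment is done at whole-pixel resolution, and the function name and example profiles are hypothetical.

```python
def align_and_average(profiles, half_width):
    """Shift each 1D intensity profile so that its reference peak sits at
    the centre of a window of 2 * half_width + 1 pixels, then average the
    windows pixel by pixel.

    Assumption: the brightest pixel is used as the reference position,
    in place of the peak of a fitted half-Gaussian.
    """
    windows = []
    for profile in profiles:
        peak = max(range(len(profile)), key=profile.__getitem__)
        lo, hi = peak - half_width, peak + half_width + 1
        if lo < 0 or hi > len(profile):
            continue  # skip frames whose window would leave the image
        windows.append(profile[lo:hi])
    if not windows:
        raise ValueError("no profile had a full window around its peak")
    count = len(windows)
    return [sum(w[k] for w in windows) / count
            for k in range(2 * half_width + 1)]


# Hypothetical per-frame profiles whose peak drifts by one pixel per frame.
frames = [
    [0, 1, 5, 1, 0, 0],
    [0, 0, 2, 6, 2, 0],
    [0, 0, 0, 1, 7, 1],
]
averaged = align_and_average(frames, half_width=1)
```

      Averaging after alignment keeps the end position sharp even though the microtubule grows during the dwell time; averaging the raw frames without alignment would smear the peak.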

      (3) We think that the decay of tubulin signal results from the convolution of the tapered end structure and the point spread function. In the revised manuscript, we have updated the Figures to provide unprocessed original data in Fig. 1B and Fig. 2A.

      Please show absolute numbers of measurements in 1C (rather than normalized distribution only).

      In the revised manuscript, we have included the raw data for both tubulin and MCAK signals as part of the methods description. In Fig. 1, using normalized values allows for the simultaneous representation of microtubule and protein signals on a unified graph.

      How do the results in 1D-G compare with the previous literature? Particularly comparison of on-rates between this study and the Gouveia et al? Assuming 1 um = 1625 dimers, it appears that in the presence of EB3, the on-rate of MCAK to the tips reported in Gouveia et al. is an order of magnitude higher than reported here in the absence of EB3 (4.3 x 10E-4 vs. 2 x 10E-5). If so, and given the robust presence of EB proteins at growing microtubule ends in cells, this would invalidate the potential physiological relevance of the current study. Note that the dwell times measured in Gouveia et al. are also longer than those measured here.

      Note that in Gouveia et al., the concentration of mCherry-EB3 was 75 nM, about 187.5 times higher than that of MCAK (0.4 nM). This concentration ratio does not necessarily hold in cells. Regarding the physiological relevance of the end-binding affinity of MCAK itself, please refer to our response to point No. 1 of Reviewer 1.
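
      The arithmetic behind this comparison can be written out as a short sketch. The numerical values are those quoted in the reviewer's comment and in this response; the helper function and variable names are hypothetical, and the units of the two on-rates are taken as stated by the reviewer rather than re-derived here.

```python
DIMERS_PER_UM = 1625  # conversion factor assumed in the reviewer's comment


def per_um_to_per_dimer(rate_per_um):
    """Convert a rate expressed per micrometre of microtubule into a rate
    per tubulin dimer, using the reviewer's conversion factor."""
    return rate_per_um / DIMERS_PER_UM


# Concentrations used in Gouveia et al., as quoted in this response.
eb3_nM = 75.0
mcak_nM = 0.4
concentration_ratio = eb3_nM / mcak_nM  # about 187.5

# On-rates quoted by the reviewer (with EB3 vs. without EB3).
k_on_with_eb3 = 4.3e-4
k_on_without_eb3 = 2e-5
fold_difference = k_on_with_eb3 / k_on_without_eb3  # about 21.5
```

      On these numbers, the roughly 20-fold difference in on-rate was measured with EB3 in nearly 190-fold excess over MCAK, which is why the concentration ratio matters when extrapolating to cells.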

      Notably, Helenius et al reported a diffusion constant for MCAK of 0.38 um^2/s, which is more than an order of magnitude higher than reported here. The authors should comment on this!

      In the revised manuscript, we have provided an explanation for the difference in diffusion coefficient. Please see page 6, lines 142-157. In short, low-salt conditions facilitate rapid diffusion of MCAK.

      Figure 2:

      This figure is critical and really depends on the analysis of the tubulin signal. Note significant variability in tubulin signal between presented examples in 2A. Also, while 2C looks qualitatively similar, there appears to be significant variability over the several hundred nm from the tip along the lattice. This is the crucial region; statistical significance testing should be presented. More detailed info, including SDs etc. is necessary.

      In the revised manuscript, we have provided raw data in Fig. 1B and Fig. 2A. Additionally, we have provided statistical analysis on the tubulin signals (Fig. 2C) and performed significance test. Please see page 5, lines 111-116 and page 7, lines 179-183 for detailed descriptions.

      Insights into the morphology of microtubule ends based on TIRF imaging have been previously gained in the literature, with reports of extended tip structures/protruding protofilaments (see e.g. Coombes et al. Cur Bio 2013, based on the methods of Demchouk et al. 2011). Such analysis should be performed here as well, if we are to conclude that nucleotide state alone, as opposed to the end morphology, specifies MCAK's tip localization.

      We appreciate the reviewer’s suggestion and agree that it provides a valid optical microscopy-based approach for estimating microtubule end morphology. However, this method did not establish a direct correlation between microtubule end morphology and tubulin nucleotide status. Therefore, we think that refining the measurement of microtubule end morphology will not necessarily improve our understanding of the tubulin nucleotide status at MCAK binding sites. Based on the available data in the present study, there are two main pieces of evidence supporting the idea that MCAK can sense tubulin nucleotide status: (1) the binding regions of MCAK and EB overlap significantly, and (2) MCAK shows a clear preference for binding to GTPγS microtubules, similar to EB1 (we provide a new control to support this, Fig. s4). Of course, we do not consider this to be a perfect set of evidence. As the reviewer has pointed out here and in other suggestions, future work should aim to further distinguish the nucleotide status of tubulin in the dynamic versus non-dynamic regions at the ends of microtubules, and to investigate the structural basis by which MCAK recognizes tubulin nucleotide status.

      EB comet profile should be clearly reproduced. MCAK should follow the comet profile.

      Please see our 3<sup>rd</sup> response to the point 1 of this reviewer.

      The conclusion that the MCAK binding region is larger than XMAP215 is not firm, based on the data presented. The authors state that 'the binding region of MCAK was longer than that of XMAP215'. What is the exact width of the region of the XMAP215 localization and how much longer is the MCAK end-binding region? Is this statistically significant?

      We have revised this part in the revised manuscript (page 6, lines 167-172). The position probability distributions of MCAK and XMAP215 were significantly different (K-S test, p< 10<sup>-5</sup>), and the binding region of MCAK (FWHM=185 nm) was significantly longer than that of XMAP215 (FWHM=123 nm).
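
      For readers who wish to reproduce this kind of comparison, a minimal sketch of the two quantities involved is given below. It assumes that the FWHM is derived from a Gaussian width, which this response does not state explicitly; it computes only the two-sample Kolmogorov-Smirnov statistic, not the p-value; and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def fwhm_from_sigma(sigma):
    """Full width at half maximum of a Gaussian with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


def sigma_from_fwhm(fwhm):
    """Inverse of fwhm_from_sigma."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


# Gaussian widths implied by the FWHM values reported above (nm).
sigma_mcak = sigma_from_fwhm(185.0)
sigma_xmap215 = sigma_from_fwhm(123.0)
```

      In practice, a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the p-value for two arrays of measured positions.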

      MCAK localization with AMPPNP should also be performed here. Even low concentrations of MCAK have been shown to induce microtubule catastrophe/end depolymerization. This will dramatically affect microtubule end morphology, and thus apparent positioning of MCAK at the end.

      In the end positioning experiment, we used a low concentration of MCAK (1 nM). Under this condition, microtubule dynamics remained unchanged, and the morphology of the microtubule ends was comparable across different conditions (with EB1, MCAK or XMAP215). Additionally, in the revised manuscript, we present a new experiment in which we recorded the localization of both MCAK and XMAP215 on the same microtubule. The results support the conclusion regarding their relative localization: most MCAK is found at the proximal end of the XMAP215 binding region, while approximately 15% of MCAK is located within the XMAP215 binding region. Please see Fig. 2D-E and page 7, lines 184-197 for the corresponding descriptions.

      Figure 3:

      For clearer presentation, projections showing two microtubule lattice types on the same image (in e.g. two different colors) should be shown first without MCAK, and then with MCAK.

      We thank the reviewer for this suggestion. We have adjusted the figure accordingly. Please see Fig. 4 in the revised manuscript.

      Please comment on absolute intensity values - scales seem to be incredibly variable.

      The fluorescence value presented here is the result of multiple images being summed. Therefore, the difference in absolute values is influenced not only by the binding affinity of MCAK in different states to microtubules, but also by the number of images used. In this analysis, we are not comparing MCAK in different states, but rather evaluating the binding ability of MCAK in the same state on different types of microtubules.
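
      A minimal Python sketch of this point, using hypothetical noise-free values rather than our data: the summed intensity grows with the number of frames, whereas the ratio between two lattice types measured in the same summed image does not. The function names and region labels are assumptions made for illustration.

```python
def summed_region_intensity(frames, region):
    """Sum the intensity of one region over all frames of a movie."""
    return sum(frame[region] for frame in frames)


def lattice_preference(frames, region_a, region_b):
    """Ratio of summed intensities on two lattice types in the same movie."""
    return (summed_region_intensity(frames, region_a)
            / summed_region_intensity(frames, region_b))


# Hypothetical per-frame signals (arbitrary units) on two lattice types.
frame = {"GTPgammaS": 3.0, "GDP": 1.5}
short_movie = [frame] * 50
long_movie = [frame] * 200
```

      Here the absolute sums differ fourfold between the two movies, yet `lattice_preference` returns the same ratio for both.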

      Given that the authors conclude that MCAK binding mimics that of EB, EB intensity measurements and ratios on different lattice substrates should be performed as a positive control.

      We performed additional experiments with EB1 and, in the revised manuscript, provide the data as a positive control (please see Fig. s4).

      Figure 4:

      MCAK-nucleotide dependence of GMPCPP microtubule-end binding has been previously established (see e.g. Helenius et al, others?) - what is new here? Need to discuss the literature. This would be more appropriate as a supplemental figure?

      In the present study, we reproduced the GMPCPP microtubule-end binding of MCAK in the AMPPNP state, as shown in several previous reports (Desai et al., 1999; Hertzer et al., 2006). Here, we also quantified the end to lattice binding preference, and our results showed that the nucleotide state-dependence shows the same trend as the binding preference of MCAK to the growing microtubule ends. Therefore, we prefer to keep this figure in the main text (Fig. 5).

      Figure 5:

      Please note that both MCAK mutants show an additional two orders of magnitude lower microtubule binding on-rates when compared to wt MCAK. This makes the analysis of preferential binding substrate for these mutants dubious.

      We agree with this point and have rewritten this part. Please see page 10, lines 295-327, in the revised manuscript.

      Figure 6:

      Combined effects of XMAP215 and XKCM1 (MCAK) have been previously explored in the landmark study by Kinoshita et al. Science 2001, which should be cited and discussed. Also note that Moriwaki et al. JCB 2016 explored the combined effects of XMAP215 and MCAK - which should be discussed here and compared to the current results.

      We agree with the reviewer. We have revised the discussion on this part. Please see page 11, lines 329-342 and page 14, lines 459-472 in the revised manuscript.

      Please report quantification for growth rate and lifetime.

      In the revised manuscript, we provide all these data. Please see pages 11-12, lines 343-374.

      To obtain any new quantitative information on the combined effects of the two proteins, at the very minimum, the authors should perform a titration in protein concentration.

      We agree with the reviewer on this point. In our pilot experiments, we performed titrations to determine the appropriate concentrations of MCAK and XMAP215. We selected 50 nM for XMAP215, as it clearly enhances the growth rate and exhibits a mild promoting effect on catastrophe, two key effects of XMAP215 reported in previous studies (Brouhard et al., 2008; Farmer et al., 2021). Reducing the XMAP215 concentration eliminates the catastrophe-promoting effect, while increasing it does not substantially enhance the growth rate further. For MCAK, we chose 20 nM, as it effectively promotes catastrophe; increasing the concentration beyond this point leads to no microtubule growth, at least in the MCAK-only condition. Without microtubule growth, it would be difficult to quantify the parameters of microtubule dynamics, hindering a clear comparison of the combined versus individual effects. Therefore, we think that the concentrations used in this study are appropriate and representative. In the revised manuscript, we make this point clearer (see page 11, lines 329-342).

      Finally, the writing could be improved for overall clarity.

      We thank the reviewer for pointing this out. In the revised manuscript, we conducted a thorough revision and review of the text.

      Reviewer #3 (Public Review):

      The authors revisit an old question of how MCAK goes to microtubule ends, partially answered by many groups over the years. The authors seem to have omitted the literature on MCAK in the past 10-15 years. The novelty is limited due to what has previously been done on the question. Previous work showed MCAK targets to microtubule plus-ends in cells through association with EB proteins and Kif18b (work from Wordeman, Medema, Walczak, Welburn, Akhmanova) but none of their work is cited.

      We thank the reviewer for the suggestion. Some of the referenced work has already been cited in our manuscript, such as studies on the interaction between MCAK and EB1. However, other relevant literature had not been properly cited. In the revised manuscript, we have added further discussion on this topic in the context of existing findings. Please refer to pages 3-4, lines 68-85, and page 13, lines 425-441.

      It is not obvious in the paper that these in vitro studies only reveal microtubule end targeting, rather than plus end targeting. MCAK diffuses on the lattice to both ends and its conformation and association with the lattice and ends has also been addressed by other groups-not cited here. I want to particularly highlight the work from Friel's lab where they identified a CDK phosphomimetic mutant close to helix4 which reduces the end preference of MCAK. This residue is very close to the one mutated in this study and is highly relevant because it is a site that is phosphorylated in vivo. This study and the mutant produced here suggest a charge-based recognition of the end of microtubules.

      Here the authors analyze this MCAK recognition of the lattice and microtubule ends, with different nucleotide states of MCAK and in the presence of different nucleotide states for the microtubule lattice. The main conclusion is that MCAK affinity for microtubules varies in the presence of different nucleotides (ATP and analogs) which was partially known already. How different nucleotide states of the microtubule lattice influence MCAK binding is novel. This information will be interesting to researchers working on the mechanism of motors and microtubules. However, there are some issues with some experiments. In the paper, the authors say they measure MCAK residency of growing end microtubules, but in the kymographs, the microtubules don't appear dynamic - in addition, in Figure 1A, MCAK is at microtubule ends and does not cause depolymerization. I would have expected to see depolymerization of the microtubule after MCAK targeting. The MCAK mutants are not well characterized. Do they still have ATPase activity? Are they folded? Can the authors also highlight T537 and discuss this?

      Finally, a few experiments are done with MCAK and XMAP215, after the authors say they have demonstrated the binding sites overlap. The data supporting this statement were not obvious and the conclusions that the effect of the two molecules are additive would argue against competing binding sites. Overall, while there are some interesting quantitative measurements of MCAK on microtubules - in particular in relation to the nucleotide state of the microtubule lattice - the insights into end-recognition are modest and do not address or discuss how it might happen in cells. Often the number of events is not recorded. Histograms with large SEM bars are presented, so it is hard to get a good idea of data distribution and robustness. Figures lack annotations. This compromises therefore their quantifications and conclusions. The discussion was hard to follow and needs streamlining, as well as putting their work in the context of what is known from other groups who produced work on this in the past few years.

      We thank the reviewer for the comments. Regarding the physiological relevance of the end-binding of MCAK itself, please refer to our response to the point No.1 of reviewer 1. Moreover, as we feel that other suggestions are more thoroughly expressed in the following comments for authors, we will provide the responses in the corresponding sections, as shown below.

      Reviewer #3 (Recommendations For The Authors):

      Why, on dynamic microtubules, is MCAK at microtubule plus ends and does not cause a catastrophe?

      At these concentrations (10 nM MCAK with 16 μM tubulin in Fig. 1; 1 nM MCAK with 12 μM tubulin in Fig. 2), MCAK has little effect on microtubule dynamics in our experiments. Using TIRFM, we were able to observe individual MCAK binding events. Based on these observations, we think that under the current experimental conditions, a single binding event of MCAK is insufficient to induce microtubule catastrophe; rather, catastrophe likely requires cumulative changes resulting from multiple binding events.

      Do the MCAK mutants still have ATPase activity?

      The ATPase activities of MCAK<sup>K525A</sup> and MCAK<sup>V298S</sup> are both reduced to about 1/3 of the wild-type (Fig. s6).

      The intensities of GFP are not all the same on the microtubule lattice (eg 1A). See blue and white arrowheads. The authors could be looking at multiple molecules of GFP-MCAK instead of single dimers. How do they account for this possibility?

      In the revised manuscript, we provide the gel filtration result of the purified MCAK, and the position of the peak corresponds to ~220 kDa, demonstrating that the purified MCAK in solution is dimeric (please see Fig.s1 and page 5, lines 101-103). We measured the fluorescence intensity of each binding event. A probability distribution of these intensities was then constructed and fitted with a Gaussian function. A binding event was considered to correspond to a single molecule if its intensity fell within μ±2σ of the distribution. The details of the single-molecule screening process are provided in the revised manuscript (see page 17, lines 574-583).

      In addition, we also measured the fluorescence intensity of both MCAK<sup>sN+M</sup> and MCAK. MCAK<sup>sN+M</sup> is a monomeric mutant that contains the neck domain and motor domain (Wang et al., 2012). The average intensity of MCAK<sup>sN+M</sup> is 196 A.U., about 65 % of that of MCAK (300 A.U.), suggesting that MCAK is a dimer (see Fig. s1). Moreover, we think that some of the dim signals may result from stochastic background noise, while others likely represent transient bindings of MCAK. The exposure time in our experiments was approximately 0.05 seconds; if the binding duration were shorter than this, the signal would be lower. It is important to note that in this study, we specifically selected binding events lasting at least 2 consecutive frames, meaning transient binding events were not included. This point has been clarified in the Methods section (see page 17, lines 568-569 and lines 574-583).
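
      The screening logic described in this response can be sketched as follows. This is a minimal illustration in Python, not the code used in the study: the function name and the event format are hypothetical, the Gaussian fit is approximated by the sample mean and standard deviation, and applying the duration filter before estimating μ and σ is an assumption.

```python
import statistics


def select_single_molecules(events, min_frames=2, n_sigma=2.0):
    """Return events that pass the duration and intensity criteria.

    events: list of (intensity, n_frames) tuples, one per binding event.
    """
    # Discard transient events lasting fewer than min_frames consecutive frames.
    lasting = [(inten, frames) for inten, frames in events if frames >= min_frames]

    # Stand-in for the Gaussian fit: sample mean and standard deviation.
    intensities = [inten for inten, _ in lasting]
    mu = statistics.fmean(intensities)
    sigma = statistics.stdev(intensities)

    # Keep events whose intensity lies within mu +/- n_sigma * sigma.
    low, high = mu - n_sigma * sigma, mu + n_sigma * sigma
    return [(inten, frames) for inten, frames in lasting if low <= inten <= high]
```

      With a real data set, fitting a Gaussian to the intensity histogram would be less sensitive to bright outliers than the sample statistics used here.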

      Could the authors provide a kymograph of an MT growing, in the presence of MCAK+AMPPNP? Can MCAK track the cap?

      Under single-molecule conditions, we observed a single MCAK molecule briefly binding to the end of the microtubule. However, we did not record if MCAK at high concentrations could track microtubule ends under AMPPNP conditions.

      In the experiments in Figure 6, the authors should also show the localization of MCAK and XMAP215 at microtubule plus ends in their kymographs to show the two molecules overlap.

      Regarding the relative localization of XMAP215 and MCAK, we conducted additional experiments to record their localization simultaneously at the same microtubule end. Our results show that MCAK predominantly binds behind XMAP215, with 14.5% of MCAK binding within the XMAP215 binding region. Please see Fig. 2D-E and page 7, lines 184-197 in the revised manuscript. However, we argue that the effects of XMAP215 and MCAK are additive, and their binding sites do not necessarily need to overlap for these effects to occur.

      The authors do not report what statistical tests are done in their graphs, and one concern is over error propagation of their data. Instead of bar graphs, showing the data points would be helpful.

      We have now shown all data points in the revised manuscript.

      MCAK+AMPPNP accumulates at microtubule ends. Appropriate quotes from previous work should be provided.

      We have made the revisions accordingly. Please see page 9, lines 273-276.

      Controls are missing. An SEC profile for all purified proteins should be presented. Also, the authors need to explain if they report the dimeric or monomeric concentration of MCAK, XMAP215, etc...

      We have provided the gel filtration results for all purified proteins in the revised manuscript (Fig.s1). Moreover, we now make it clear that the concentrations of MCAK and EB1 are monomeric concentrations. Please see the legend for Fig. 1, line 893 in the revised manuscript.

      Figure 1: the microtubules don't look dynamic at all. This is also why the authors can record MCAK at microtubule ends, because their structure is not changing.

      The microtubules are dynamic, but they may appear non-dynamic due to the relatively slow growth rate and the high frame rate at which we are recording. We propose that individual binding events of MCAK induce structural changes at the nanoscopic or molecular scale, which are not detectable using TIRFM.

      I recommend the authors measure the Kon and Koff for single GFP-MCAK mutant molecules and provide the information alongside their normalized and averaged binding intensities of GFP-MCAK in Fig 5. Showing data points instead of bar graphs would be better.

      (1) We measured k<sub>on</sub> and dwell time for the mutants at growing microtubule ends. However, we did not perform single-molecule tracking of MCAK binding on stabilized microtubules. This is mainly because the superimposed signal on the stable microtubules already indicates the changes in the mutants' binding affinity for different microtubule structures; moreover, the binding of the mutants is highly transient, making accurate single-molecule tracking and calculations difficult.

      (2) In the revised figure, we have included the data points in all plots.

      When discussing how Kinesin-13 interacts with the lattice, the authors should quote the papers that report the organization of full-length Kinesin-13 on tubulin heterodimers: Trofimova et al, 2018; McHugh et al 2019; Benoit et al, 2018. It would reinforce their model and account for the full-length protein, rather than just the motor domain.

      We thank the reviewer for the suggestion. In our manuscript, we have cited papers on full-length Kinesin-13 to discuss the interaction between MCAK and the curved structure of the microtubule end. Additionally, we have used the MCAK-tubulin crystal structure (PDB ID: 5MIO) in Fig. 6, as it depicts human MCAK, which is consistent with the protein used in our study. This structure illustrates the interaction sites between MCAK and the tubulin dimer, guiding our mutation studies on specific residues. Thus, we prefer to use this structure (PDB ID: 5MIO) in Fig. 6.

      Figure 5A. What type of model is this? A PDB code is mentioned. Is this from an X-ray structure? If so, mention it.

      We have now included the structural information in the figure legend (see page 37, line 1045).

      Figure 5B. It is not possible to distinguish the different microtubule lattices (GTPyS, GDP, and GMPCPP). The experiment needs to be better labelled.

      We thank the reviewer for this comment. We have now rearranged the figure for better clarity (see Fig. 6).

      "Figure 5D: what are the statistical tests? I don't understand " The statistical comparisons were made versus the corresponding value of 848 GFP-MCAK".

      We have made this point clearer in the revised manuscript (see page 38, lines 1078-1080).

      What is the "EB cap"? This needs explaining.

      We now provide an explanation; please see page 4, lines 87-89 in the revised manuscript.

      Work from Friel and co-workers showed MCAK T537E did not have depolymerizing activity and a reduced affinity for microtubule ends. The work of the authors should be discussed with respect to this previously published work.

      We thank the reviewer for this suggestion. In the revised manuscript, we have added discussions on this (see page 10, lines 303-307).

      The concentration of protein used in the assays is not always described.

      We have checked throughout the manuscript and made revisions accordingly.

      "Having revealed the novel binding sites of MCAK in dynamic microtubule ends " should be on "we wondered how MCAK may work ..with EB1". This is not addressed so should be removed. Instead, they can quote the work from Akhmanova's lab. Realistically this section should be rephrased as there are other plus-end targeting molecules that compete with MCAK, not just XMAP215 and EB1.

      We have rephrased this section as suggested by this reviewer to be more specific. Please see page 11, lines 329-342.

      What is AMPCPP?

      It should be “AMPPNP”. We have corrected this.

      Typos in Figure 5.

      Corrected

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      We thank the reviewer for his/her very positive comments.

      Reviewer #2 (Public review):

      We thank the reviewer for his/her positive evaluation. We plan to add RNAseq data of yeast wild-type and JDP mutant strains as more direct readout for the role of Apj1 in controlling Hsf1 activity. We agree with the reviewer that our study includes one major finding: the central role of Apj1 in controlling the attenuation phase of the heat shock response. In accordance with the reviewer we consider this finding highly relevant and interesting for a broad readership. We agree that additional studies are now necessary to mechanistically dissect how the diverse JDPs support Hsp70 in controlling Hsf1 activity. We believe that such analysis should be part of an independent study but we will indicate this aspect as part of an outlook in the discussion section of a revised manuscript.

      Reviewer #3 (Public review):

      We thank the reviewer for his/her suggestions. We agree that it is sometimes difficult to distinguish direct effects of JDP mutants on heat shock regulation from indirect ones, which can result from the accumulation of misfolded proteins that titrate Hsp70 capacity. We also agree that an in vitro reconstitution of Hsf1 displacement from DNA by Apj1/Hsp70 will be important, also to dissect Apj1 function mechanistically. We will add this point as outlook to the revised manuscript.

      Reviewer #1 (Recommendations for the authors): 

      (1) Can the authors submit the raw translatome data to a standard repository? Also, the data should be summarized in a supplemental Excel table. 

      We submitted the raw translatome data to the NCBI Gene Expression Omnibus and added the analyzed data sets (shown in Figures 1 and 5) as Supplementary Tables S4/S5 (Excel sheets). We additionally included RNAseq analysis of yeast WT and the set of JDP mutants grown at 25°C, complementing and confirming our former translatome analysis (new Figure 5, Figure Supplement 2). The respective raw transcriptome data were also deposited at the NCBI Gene Expression Omnibus, and the analyzed data are available as Supplementary Table S7.

      (2) MW indicators need to be added to the Western Blot figures. 

      We added molecular weight markers to the Western Blot figures.

      (3) Can the authors please include the sequences of the primers used in all the RT-qPCR experiments? They mention they are in the supplemental information, but I couldn't locate them. 

      We added the sequences of the RT-qPCR primers as Supplementary Table S4.

      (4) Given the clear mechanism proposed, it would be nice if the authors could provide a nice summary figure. 

      We followed the suggestion of the reviewer and illustrate our main finding as new Figure 7.

      Reviewer #2 (Recommendations for the authors): 

      (1) As mentioned above, a co-IP experiment between Hsf1 and Ssa1/2 in APJ1 and apj1∆ cells, utilizing Hsf1 alleles with and without the two known binding sites, would cement the assignment of Apj1 in the Hsf1 regulatory circuit. 

      We agree with the reviewer that Hsf1-Ssa1/2 pulldown experiments, as done by Pincus and colleagues (1), will further specify the role of Apj1 in targeting Hsp70 to Hsf1 during the attenuation phase of the heat shock response. We have made extensive attempts at such pulldown experiments to document dissociation of Ssa1/2 from Hsf1 upon heat shock in yeast wild-type cells. While we could specifically detect Ssa1/2 upon Hsf-HA1 pulldown, our results after heat shock were highly variable and inconclusive and did not allow us to probe for a role of Apj1 or the two known Ssa1/2 binding sites in the phase-specific targeting. We now discuss the potential roles of the two distinct Ssa1/2 binding sites for phase-specific regulation of Hsf1 activity in the revised manuscript (page 12, lines 17-21).

      (2) Experiments in Figure 3 nicely localize CHIP reactions with known HSEs. A final confirmatory experiment utilizing a mutated HSE (another classic experiment in the field) would cement this finding and validate the motif and reporter-based analysis. 

      We thank the reviewer for this meaningful suggestion. We performed a comparable experiment using the non-Hsf1-regulated gene BUD3, which lacks HSEs, as a reference. We engineered a counterpart, termed “BUD3 HS-UAS”, which bears inserted HSEs, derived from the native UAS of HSP82, within the BUD3 UAS. We show that BUD3<sup>+</sup> lacking HSEs is not occupied by Hsf1 and Apj1 under either non-stress or heat shock conditions, while BUD3-HSE is clearly occupied under both, paralleling Hsf1 and Apj1 occupancy of HSP82 (Figure 3E). We have renamed the engineered allele to “BUD3-HSE” to clarify the experimental design and output.

      (3) Page 8 - the ydj1-4xcga allele is introduced without explaining why it's needed, since ydj1∆ cells are viable. The authors should acknowledge the latter fact, then justify why the RQC depletion approach is preferred. Especially since the ydj1∆ mutant appears in Figure 5B. 

      ydj1∆ cells are viable, yet they grow extremely slowly at 25°C and hardly at all at 30°C, making them difficult to handle. The RQC-mediated depletion of Ydj1 in ydj1-4xcga cells allows for solid growth at 30°C, facilitating strain handling and analysis of Ydj1 function. Importantly, ydj1-4xcga cells are still temperature-sensitive and exhibit the same deregulation of the heat shock response upon combination with apj1∆ as observed for ydj1∆ cells. Thus, ydj1 knockout and knockdown cells do not differ in the relevant phenotypes reported here, and we performed most of the analysis with ydj1-4xcga cells due to their growth advantage. We added a corresponding explanation to the text (page 8, lines 13-14).

      (4) The authors raise the possibility that Sis1, Apj1, and Ydj1 may all be competing for access to Ssa1/2 at different phases of the HSR, and that access may be dictated by conformational changes in Hsf1. Given that there are at least two known Hsp70 binding sites that have negative regulatory activity in Hsf1, the possibility that domain-specific association governs the different roles should be considered. It is also unclear how the JDPs are associating with Hsf1 differentially if all binding is through Ssa1/2. 

      We thank the reviewer for the comment and will add the possibility of specific roles of the identified Hsp70 binding sites in regulating Hsf1 activity at the different phases of the heat shock response to the discussion section. Binding of Ssa1/2 to substrates (including Hsf1) is dependent on J-domain proteins (JDPs), which differ in substrate specificity. It is tempting to speculate that the distinct JDPs recognize different sites in Hsf1 and are responsible for mediating the specific binding of Ssa1/2 to either N- or C-terminal sites in Hsf1. Thus, the specific binding of a JDP to Hsf1 might dictate the binding of Ssa1/2 to either binding site. We discuss this aspect in the revised manuscript (page 12, lines 17-21).

      (5) Figure 6 - temperature sensitivity of hsf1 and ydj1 mutants has been linked to defects in the cell wall integrity pathway rather than general proteostasis collapse. This is easily tested via plating on osmotically supportive media (i.e., 1M sorbitol) and should be done throughout Figure 6 to properly interpret the results.

      Our data indicate proteostasis breakdown in ydj1 cells by showing strongly altered localization of Sis1-GFP, pointing to massive protein aggregation (Figure 6 – Figure Supplement 1D).

      We followed the suggestion of the reviewer and performed spot tests in the presence of 1 M sorbitol (see figure below). Sorbitol improves growth of ydj1-4xcga mutant cells at increased temperatures, in agreement with the remark of the reviewer. We do not think, however, that growth rescue by sorbitol points to specific defects of the ydj1 mutant in cell wall integrity. Sorbitol functions as a chemical chaperone and has been shown to have protective effects on cellular proteostasis and to rescue phenotypes of diverse point mutants in yeast cells by facilitating folding of the respective mutant proteins and suppressing their aggregation (2-4). Sorbitol can thus broadly restore proteostasis, which can also explain its effects on growth of ydj1 mutants at increased temperatures. Because the readout of the spot test with sorbitol is ambiguous, we prefer not to show it in the manuscript.

      Author response image 1.

      Serial dilutions of indicated yeast strains were spotted on YPD plates without and with 1 M sorbitol and incubated at indicated temperatures for 2 days.

      Reviewer #3 (Recommendations for the authors): 

      (1) Line 154: Can the authors, by analysis, offer an explanation for why HSR attenuation varies between genes for the sis1-4xcga strain? Is it, for example, a consequence of using a hypomorph rather than a knockout, an mRNA turnover issue, or that Hsf1 has different affinities for the HSEs in the promoters? 

      We used the sis1-4xcga knock-down strain because Sis1 is essential for yeast viability. The point raised by the reviewer is highly valid, and we thought extensively about the diverse consequences of Sis1 depletion on levels of, e.g., translated BTN2 (minor impact) and HSP104 (strong impact) mRNA. We have since performed transcriptome analysis and confirmed the specific impact of Sis1 depletion on HSP104 mRNA levels, while BTN2 mRNA levels remained much less affected (new Figure 5 - Figure Supplement 2A/B). We compared numbers and spacings of HSEs in the respective target genes but could not identify obvious differences. Hsf1 occupancy within the UAS region of both BTN2 and HSP104 is very comparable at three different time points of a 39°C heat shock (0, 5 and 120 min), arguing against different Hsf1 affinities for the respective HSEs (5). The molecular basis for the target-specific derepression upon Sis1 depletion thus remains to be explored. We added a corresponding comment to the revised version of the manuscript (page 12, lines 3-8).

      (2) Line 194: The analysis of ChIP-seq is not very elaborated in its presentation. How specific is this interaction? Can it be ruled out by analysis that it is simply the highly expressed genes after the HS that lead to Apj1 appearing there? More generally: Can the data in the main figure be presented to give a more unbiased genome-wide view of the results?

      Overall, we observed a low number of Apj1 binding events in the UAS of genes. The interaction of Apj1 with HSEs is specific, as we do not observe Apj1 binding to the UAS of well-expressed non-heat shock genes. Similarly, Apj1 does not bind to ARS504 (Figure S3 – Figure Supplement 1). We extended the description of our ChIP-seq analysis procedures leading to the identification of HSEs as Apj1 target sites to make the data analysis easier to understand. We additionally re-analysed the two Apj1 binding peaks that did not reveal an HSE in our original analysis. Using modified settings, we identified a slightly degenerate HSE in the promoter regions of both genes (TMA10, RIE1) and changed Figure 3C accordingly. Notably, TMA10 is a known target gene of Hsf1. The expanded analysis further documents the specificity of the Apj1 binding peaks.

      (3) Line 215. Figure 3. The clear anticorrelation is puzzling. Presumably, Apj1 binds Hsf1 as a substrate, and then a straight correlation is expected: When Hsf1 substrate levels decrease at the promoters, also Apj1 signal is predicted to decrease. What explanations could there be for this? Is it, for example, that Hsf1 is not always available as a substrate on every promoter, or is Apj1 tied up elsewhere in the cell/nucleus early after HS? 

      We propose that Apj1 binds HSE-bound Hsf1 only after clearance of nuclear inclusions, which form upon heat stress. Apj1 thereby couples the restoration of nuclear proteostasis to the attenuation of the heat shock response. This explains the delayed binding of Apj1 to HSEs (via Hsf1), while Hsf1 shows highest binding upon activation of the heat shock response (early timepoints). Notably, the binding efficiencies of Hsf1 and Apj1 (% input) differ substantially: we determine strong binding of Hsf1 five min post heat shock (30-40% of input), whereas at most 3-4% of the input is pulled down with Apj1 (60 min post heat shock) (Figure 3D). Even at this late timepoint, 10-20% of the input is pulled down with Hsf1. The different kinetics and pulldown efficiencies suggest that Apj1 displaces Hsf1 from HSEs; accordingly, Hsf1 stays bound to HSEs in apj1∆ cells (Figure 4). This activity of Apj1 explains the anti-correlation: increased targeting of Apj1 to HSE-bound Hsf1 will lower the absolute levels of HSE-bound Hsf1. What we observe in the ChIP experiment at the individual timepoints is a snapshot of this reaction. Accordingly, at the last timepoint analyzed (120 min after heat shock), we observe low binding of both Hsf1 and Apj1, as the heat shock response has been shut down.

      (4) Line 253: "Sis-depleted".  

      We have corrected the mistake.

      (5) Line 332: Fig. 6C SIS1 OE from pRS315. A YIP would have been better, 20% of the cells will typically not express a protein with a CEN/ARS of the pRS-series so the Sis1 overexpression phenotype may be underestimated and this may impact on the interpretation. 

      We agree with the reviewer that yeast integrating plasmids (YIP) represent the gold standard for complementation assays. We are not aware of a study showing that 20% of cells harboring pRS plasmids do not express the encoded protein. The results shown in Fig. 8C/D demonstrate that even strong overproduction of Sis1 cannot restore Hsf1 activity control. This interpretation would not be affected even if a certain percentage of these cells did not express Sis1. Nevertheless, we added a comment to the respective section pointing to the possibility that the Sis1 effect might be underestimated due to variations in Sis1 expression (page 11, lines 15-19).

      (6) Figure 1C. Since n=2, a more transparent way of showing the data is the individual data points. It is used elsewhere in the manuscript, and I recommend it. 

      We agree that showing individual data points can enhance transparency, particularly with small sample sizes. However, the log2 fold change (log2FC) values presented in Figure 1C and in other figures derived from ribosome profiling and RNAseq experiments were generated using the DESeq2 package. The DESeq2 pipeline is widely used for analyzing differential gene expression and is known for its statistical robustness. It performs differential expression analysis based on a model that incorporates normalization, dispersion estimation, and shrinkage of fold changes. The pipeline accounts for biological and technical variability and for batch effects, thereby improving the reliability of results. These log2FC values are not directly calculated from log-transformed normalized counts of individual samples but are instead estimated from a fitted model comparing group means. Therefore, DESeq2 log2FC values for individual replicates cannot be shown.
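
      The point about group-level estimates can be illustrated with a deliberately simplified sketch. It is not the DESeq2 model (which fits a negative binomial GLM and applies shrinkage); the function name and all counts below are hypothetical and serve only to show that a fold change comparing group means yields one value per gene and comparison, not one value per replicate.

```python
import math


def group_log2_fold_change(treated_counts, control_counts):
    """Return a single log2 fold change comparing two group means.

    Simplified illustration only: DESeq2 estimates log2FC from a fitted
    model with dispersion estimation and shrinkage, not from raw means.
    """
    mean_treated = sum(treated_counts) / len(treated_counts)
    mean_control = sum(control_counts) / len(control_counts)
    if mean_control == 0:
        raise ValueError("control mean is zero; fold change undefined")
    return math.log2(mean_treated / mean_control)


# Hypothetical normalized counts for one gene, two replicates per group.
treated = [210.0, 190.0]
control = [48.0, 52.0]

# One number summarizes the comparison; there is no per-replicate log2FC.
lfc = group_log2_fold_change(treated, control)
print(f"log2FC = {lfc:.2f}")
```

      In this toy setting the replicates contribute only through the group means, which is why individual replicate points have no counterpart on a log2FC axis.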

      (7) Figure 1D. Please add the number of minutes on the X-axis. Figure legend: "Cycloheximide" is capitalized.  

      We revised the figure and figure legend as recommended.

      (8) Several figure panels: Statistical tests and SD error bars for experiments performed in duplicates simply feel wrong for this reviewer. I do recognize that parts of the community are calculating, in essence, quasi-p-values using parametric methods for experiments with far too low sample numbers, but I recommend not doing so. In my opinion, better to show the two data points and interpret with caution.

      We followed the advice of the reviewer and removed statistical tests for experiments based on duplicates.

      References

      (1) Krakowiak, J., Zheng, X., Patel, N., Feder, Z. A., Anandhakumar, J., Valerius, K. et al. (2018) Hsf1 and Hsp70 constitute a two-component feedback loop that regulates the yeast heat shock response eLife 7,

      (2) Guiberson, N. G. L., Pineda, A., Abramov, D., Kharel, P., Carnazza, K. E., Wragg, R. T. et al. (2018) Mechanism-based rescue of Munc18-1 dysfunction in varied encephalopathies by chemical chaperones Nature communications 9, 3986

      (3) Singh, L. R., Chen, X., Kozich, V., and Kruger, W. D. (2007) Chemical chaperone rescue of mutant human cystathionine beta-synthase Mol Genet Metab 91, 335-342

      (4) Marathe, S., and Bose, T. (2024) Chemical chaperone - sorbitol corrects cohesion and translational defects in the Roberts mutant bioRxiv 10.1101/2024.09.04.610945

      (5) Pincus, D., Anandhakumar, J., Thiru, P., Guertin, M. J., Erkine, A. M., and Gross, D. S. (2018) Genetic and epigenetic determinants establish a continuum of Hsf1 occupancy and activity across the yeast genome Mol Biol Cell 29, 3168-3182

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Weaknesses: 

      The main weakness in this paper lies in the authors' reliance on a single model to derive conclusions on the role of local antigen during the acute phase of the response by comparing T cells in model antigen-vaccinia virus (VV-OVA) exposed skin to T cells in contralateral skin exposed to DNFB 5 days after the VV-OVA exposure. In this setting, antigen-independent factors may contribute to the difference in CD8+ T cell number and phenotype at the two sites. For example, it was recently shown that very early memory precursors (formed 2 days after exposure) are more efficient at seeding the epithelial TRM compartment than those recruited to skin at later times (Silva et al, Sci Immunol, 2023). DNFB-treated skin may therefore recruit precursors with reduced TRM potential. In addition, TRM-skewed circulating memory precursors have been identified (Kok et al, JEM, 2020), and perhaps VV-OVA exposed skin more readily recruits this subset compared to DNFB-exposed skin. Therefore, when the DNFB challenge is performed 5 days after vaccinia virus, the DNFB site may already be at a disadvantage in the recruitment of CD8+ T cells that can efficiently form TRM. In addition, CD8+ T cell-extrinsic mechanisms may be at play, such as differences in myeloid cell recruitment and differentiation or local cytokine and chemokine levels in VV-infected and DNFB-treated skin that could account for differences seen in TRM phenotype and function between these two sites. Although the authors do show that providing exogenous peptide antigen at the DNFB-site rescues their phenotype in relation to the VV-OVA site, the potential antigen-independent factors distinguishing these two sites remain unaddressed. 
In addition, there is a possibility that peptide treatment of DNFB-treated skin initiates a second phase of priming of new circulatory effectors in the local-draining lymph nodes that are then recruited to form TRM at the DNFB-site, and that the effect does not solely rely on TRM precursors at the DNFB-treated skin site at the time of peptide treatment. 

      Thank you for pointing out these potential caveats to our work. We have considered the possibility that late application of peptide or cell-extrinsic differences could affect the interpretation of our results. We would like to highlight that in our prior publication on this topic [1], we found that OT-1 responses in mice infected with VV-OVA and VV-N (irrelevant antigen) yielded the same responses as in our VV-OVA/DNFB models. In addition, in both our prior publication and our current manuscript, application of peptide to DNFB-painted sites results in T<sub>RM</sub> with a phenotype similar to those in the VV-OVA site. Thus, we are confident that it is the presence of cognate antigen in the skin that drives the augmented T<sub>RM</sub> fitness that we observe.

      Secondly, although the authors conclusively demonstrate that TGFBRIII is induced by TCR signals and required for conferring increased fitness to local-antigen-experienced CD8+ TRM compared to local antigen-inexperienced cells, this is done in only one experiment, albeit repeated 3 times. The data suggest that antigen encounter during TRM formation induces sustained TGFBRIII expression that persists during the antigen-independent memory phase. It remains unclear why only the antigen encounter in skin, but not already in the draining lymph nodes, induces sustained TGFBRIII expression. Further characterizing the dynamics of TGFBRIII expression on CD8+ T cells during priming in draining lymph nodes and over the course of TRM formation and persistence may shed more light on this question. Probing the role of this mechanism at other sites of TRM formation would also further strengthen their conclusions and enhance the significance of this finding. 

      This is an intriguing point. We do not understand why expression of TGFbR3 in T<sub>RM</sub> requires antigen encounter in the skin, given that T<sub>RM</sub> at all sites have clearly encountered antigen during priming in the LN. We speculate that durable TGFbR3 expression may require antigen encounter in the context of additional cues present in the periphery, or only once cells have committed to the T<sub>RM</sub> lineage. A more detailed characterization of the dynamics of TGFbR3 expression in multiple tissues would be informative and represents a promising future direction for this project. We note that performing these experiments robustly would likely require a reporter mouse.

      Reviewer #2 (Public review): 

      Weaknesses: 

      Overall, the authors' conclusions are well supported, although there are some instances where additional controls, experiments, or clarifications would add rigor. The conclusions regarding skin-localized TCR signaling leading to increased skin CD8+ TRM proliferation in-situ and increased TGFBR3 expression would be strengthened by assessing skin CD8+ TRM proliferation and TGFBR3 expression in models of high versus low avidity topical OVA-peptide exposure.

      Thank you for these helpful suggestions. We did not attempt these experiments because we were concerned that, given the relatively modest expansion differences observed with the APL, resolving differences in TGFbR3 and BrdU would prove unreliable. However, this is something that we could attempt as we continue working on this project.

      The authors could further increase the novelty of the paper by exploring whether TGFBR3 is regulated at the RNA or protein level. To this end, they could perform analysis of their single-cell RNA sequencing data (Figure 1), comparing Tgfbr3 mRNA in DNFB versus VV-treated skin. 

      As discussed above, a more detailed analysis of TGFbR3 regulation is of great interest.  These experiments would likely require the creation of additional tools (e.g. a reporter mouse) to provide robust data.  However, as suggested, we have re-analyzed our scRNAseq looking for expression of Tgfbr3. Pseudobulk analysis of cells isolated from VV or DNFB sites suggests that Tgfbr3 appears to be elevated in antigen-experienced TRM at steady-state (Author response image 1).

      Author response image 1.

      Pseudobulk analysis by average gene expression of Tgfbr3 in cells isolated from either VV or DNFB treated flanks, divided by the average gene expression of Tgfbr3 in naïve CD8 T cells from the same dataset.
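
      As a minimal sketch of the ratio described in this legend, assuming per-cell normalized expression values are available as plain lists: the function name and all numbers below are hypothetical and are not values from our dataset.

```python
def pseudobulk_ratio(group_expression, reference_expression):
    """Average per-cell expression in a group divided by the average
    per-cell expression in a reference population (here, naive CD8 T cells).
    """
    mean_group = sum(group_expression) / len(group_expression)
    mean_reference = sum(reference_expression) / len(reference_expression)
    if mean_reference == 0:
        raise ValueError("reference mean is zero; ratio undefined")
    return mean_group / mean_reference


# Hypothetical per-cell Tgfbr3 expression values (arbitrary units).
vv_cells = [0.0, 2.0, 1.0, 1.0]
dnfb_cells = [0.0, 1.0, 0.0, 1.0]
naive_cells = [0.0, 0.5, 0.5, 0.0]

vv_ratio = pseudobulk_ratio(vv_cells, naive_cells)
dnfb_ratio = pseudobulk_ratio(dnfb_cells, naive_cells)
print(f"VV / naive: {vv_ratio:.1f}, DNFB / naive: {dnfb_ratio:.1f}")
```

      Averaging across cells before taking the ratio reduces the influence of dropout in individual cells, at the cost of hiding cell-to-cell heterogeneity.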

      For clarity, when discussing antigen exposure throughout the paper, it would be helpful for the authors to be more precise that they are referring to the antigen in the skin rather than in the draining lymph node. A more explicit summary of some of the lab's previous work focused on CD8+ TRM and the role of TGFb would also help readers better contextualize this work within the existing literature on which it builds. 

      We appreciate this feedback, and we have clarified this in the text.

      For rigor, it would be helpful where possible to pair flow cytometry quantification with the existing imaging data.

      Thank you for these suggestions. For quantification of T<sub>RM</sub> numbers by flow cytometry, we have previously demonstrated as much as a 36-fold decrease in cell count compared with numbers directly visualized by immunofluorescence [1]. Thus, for enumeration of T<sub>RM</sub> we rely primarily on direct IF visualization and use flow cytometry primarily for phenotyping.

      Additional controls, namely enumerating TRM in the opposite, untreated flank skin of VV-only-treated mice and the treated flank skin of DNFB-only treated mice, would help contextualize the results seen in dually-treated mice in Figure 2.

      Without a source of inflammation (e.g. VV infection or DNFB) we see very few T<sub>RM</sub> in untreated skin. A representative image is provided (Author response image 2). A single DNFB stimulation does not recruit any CD8+ T cells to the skin without prior sensitization [2].

      Author response image 2.

      Representative images of epidermal whole mounts of VV treated flank skin, and an untreated site from the same mouse isolated on day 50 post infection and stained for CD8a.

      In figure legends, we suggest clearly reporting unpaired T tests comparing relevant metrics within VV or DNFB-treated groups (for example, VV-OVA PBS vs VV-OVA FTY720 in Figure 3F).

      Thank you for this suggestion.  The figure legends have been amended.

      Finally, quantifying right and left skin draining lymph node CD8+ T cell numbers would clarify the skin specificity and cell trafficking dynamics of the authors' model. 

      We quantified the numbers of CD8 T cells in left and right skin draining lymph nodes by flow cytometry in mice at day 50 post VV infection and DNFB pull. We observe similar numbers of cells at both sites (Author response image 3).

      Author response image 3.

      Quantification of total number of CD8+ T cells in left and right inguinal lymph nodes. Each symbol represents paired data from the same individual animal, and this is representative of 3 separate experiments.

      Reviewer #1 (Recommendations for the authors): 

      (1) Figures 1D and S1C demonstrate that 80-90 % of TRM at both VV and DNFB sites express CD103+. In contrast, the sequencing data suggests the TRM at the VV site has much higher Itgae expression. Also, clusters 3 and 4, which express significantly more Itgae than all other clusters, together comprise only ~30% of CD8+ T cells at the VV-infected skin site. How can these discrepancies between transcript and protein expression be explained? 

      Thank you for these excellent comments. T<sub>RM</sub> at both VV and DNFB sites appear to express similarly high levels of CD103 protein, both in the OT-I system as we previously published [1] and in a polyclonal system using tetramers. We attribute the lower penetrance of Itgae expression in the scRNAseq data to a lack of sensitivity, which is common with this modality. The relatively increased expression of Itgae in clusters 3 and 4 is interesting and may suggest increased Itgae production/stability. However, in the absence of any effect on protein expression, we chose not to focus on these mRNA differences.

      (2) For the experiments in Figure 3D, in order to exclude a contribution from circulating memory cells, FTY720 should have been administered during the duration of, not prior to, the initiation of the recall response. The effect of FTY720 wears off quickly, so the current experimental setting likely allows for circulating cells to enter the skin. This concern is mitigated by the results of anti-Thy1.1 mAb treatment, but documenting the experiment as in Figure D will likely be confusing to readers. 

      Thank you for this comment. We relied on the literature indicating that the half-life of FTY720 in blood is longer than 6 days [3-5]. However, on reviewing this again, there are other reports suggesting a shorter half-life. Thank you for pointing out this potential caveat. As mentioned above, we do not think this affects the interpretation of our data, as similar results were obtained with anti-Thy1.1.

      (3) Similar to what is described in the weaknesses section, the data on TGFBRIII expression is lacking. When is TGFBRIII induced? In the LN during primary activation and it is then sustained by a secondary antigen exposure at the peripheral target tissue site? Or is it only induced in the peripheral tissue, and there is interesting biology to uncover in regard to how it is induced by the TCR only after secondary exposure, etc.? 

      Thank you for these comments. As discussed above, a more detailed analysis of TGFbR3 regulation is of great interest.  These experiments would likely require the creation of additional tools (e.g. a reporter mouse) to provide robust data and are part of our future directions.

      (4) As described in the weakness section, there could be TCR-independent differences between the VV-OVA and DNFB sites that lead to phenotypic changes in the TRMs that are formed there, both CD8+ T cell-intrinsic (kinetics; with regard to time after initial priming) and extrinsic (microenvironmental differences due to the nature of the challenge, recruited cell types, cytokines, chemokines, etc.). Since the authors report the use of both VV and VV-ova, we recommend an experimental strategy that controls for this by challenging one site with VV and another with VV-OVA concomitantly, followed by repeating the key experiments reported in this manuscript. 

      As discussed above, we have previously published a very similar experiment using VV-OVA and VV-N infection on opposite flanks [1].

      (5) In Figure 6J please indicate means and provide more of the statistics comparing the groups (such as comparing VV-WT vehicle to VV-KO vehicle etc.), and potentially display on a linear scale as with all of the other figures looking at cells/mm2 to help convince the reader of the conclusions and support the secondary findings mentioned in the text such as "Notably, numbers of Tgfbr3ΔCD8 TRM in cohorts treated with vehicle remained at normal levels indicating that loss of TGFβRIII does not affect TRM epidermal residence in the steady state" despite it looking like there is a decrease when looking at the graph. 

      We appreciate the feedback on the readability of this figure; we have updated Figure 6J to a linear scale and added additional statistics to the figure legend. The difference between Tgfbr3<sup>WT</sup> and Tgfbr3<sup>∆CD8</sup> at steady state is an excellent point, and we agree that there could be a trend towards reduction in the huNGFR+ T<sub>RM</sub> across both groups, even without CWHM12 administration. We did not see statistically significant reductions in steady-state Tgfbr3<sup>∆CD8</sup> T<sub>RM</sub>, but the slight reduction in both VV-OVA and DNFB treated flanks suggests that TGFβRIII may play a role in steady-state maintenance of all T<sub>RM</sub>. Perhaps with more sensitive tools to better visualize TGFβRIII expression, we could identify stepwise upregulation of TGFβRIII depending on TCR signal strength, possibly starting in the lymph node. We have also amended our description of this figure in the text to allow for the possibility that a low amount of TGFβRIII, below the level of detection, could play a role in steady-state maintenance of both local antigen-experienced and bystander T<sub>RM</sub>.

      Minor points: 

      (1) In describing Figure 4B, the term "doublets" for pairs of connected dividing cells is confusing. 

      Thank you for this comment, the term has been revised to “dividing cells” in the text and figure.

      (2) Figure legend 4F: BrdU is not "expressed" . 

      Very true, it has been changed to “incorporation”.

      (3) Do CreERT2 and/or huNGFR expressed by transferred OT-I cells act as foreign antigens in C57BL/6 mice, potentially causing elimination of circulating memory cells? If that were the case, this would not necessarily confound the read-out of TRM persistence studied here, since skin TRM are likely protected from at least antibody-mediated deletion and their numbers are not maintained by recruitment of circulating cells at steady-state. However, it would be useful to be aware of this potential limitation of this and similar models. 

      Thank you for raising the important technical concern.  In our prior work [1] and this work, we monitor the levels of transferred OT-I cells in the blood over time.  We have not observed rejection of huNGFR+ cells.  We also note that others using the same system have also not observed rejection [6].

      (4) In Figure 6J, means or medians should be indicated 

      This has been updated in Figure 6J.

      (5) Using the term "antigen-experienced" to specifically refer to TRM at the VV site could be confusing, since those at the DNFB site are also Ag-experienced (in the LN draining the VV skin site). 

      We agree that it is a challenging term, as all T<sub>RM</sub> are memory cells. That is why, in the text, we refer to T<sub>RM</sub> isolated from the VV site as “local antigen experienced T<sub>RM</sub>”, to distinguish them from bystanders that did not experience local antigen.

      (6) The Title essentially restates what was already reported in the authors' prior study. If the data supporting the TGFBRIII-mediated mechanism is studied in more depth, maybe adding this aspect to the title may be useful? 

      Thank you for this suggestion.  I think the current title is probably most suitable for the current manuscript but we are willing to change it should the editors support an alternative title.

      Reviewer #2 (Recommendations for the authors): 

      (1) Definition of bystander CD8+ TRM: The first paragraph of the introduction defines CD8+ TRM. To improve the clarity of this definition, we suggest being explicit that bystander TRM experience cognate antigen in the SDLNs but, in contrast to other TRM, do not experience cognate antigen in the skin. 

      Thank you, we have clarified this in the text.

      (2) Consider softening the language when comparing the efficiency of CD8+ recruitment of the skin between DNFB and VV-treated flanks. For example, substitute "equal efficiency" with "comparable efficiency" since it is difficult to directly compare the extent of inflammation between viral and hapten-based treatments. 

      We have adjusted this terminology throughout the paper.

      (3) Throughout figure legends, we appreciate the indication of the number of experimental repeats performed. We suggest, either through statistics or supplemental figures, demonstrating the degree of variability between experiments to aid readers in understanding the reproducibility of results. 

      Thank you for this suggestion.  In key figures we show data from individual mice across multiple experiments. Thus, inter-experiment variability is captured in our figures.  

      (4) Figure 1: 

      a) Add control mice treated with either vaccinia virus or DNFB and harvest back skin at day 52 to demonstrate baseline levels of polyclonal and B8R tetramer-positive CD8s in the epidermis. These controls would clarify the background CD8+ expansion that might occur in DNFB-treated mice in the absence of vaccinia virus. 

      This point was addressed above.

      b) Figure 1: It would be helpful to see the %Tet+ population specifically in the CD103+ population, recognizing that the majority of the CD8+ from the skin are CD103+. 

      We did look only at CD103+ CD8 T cells from the skin for our tetramer analysis, so this has been clarified in the figure legend.

      c) Provide a UMAP, very similar to 1H, where CD8+ T cells, vaccinia virus, and DNFB-treated flanks are overlaid.

      Thank you for this suggestion.  A UMAP combining aspects of 1G (cell types from the whole ImmgenT dataset) with 1H (our data) results in a figure that is very difficult to interpret.  Thus, we have separated cell types across the entire ImmgenT data set (e.g. CD8+ T cells) and our data into 2 separate panels.

      d) 1D: left flow plot has numbered axis while the right flow plot does not. 

      Thank you, this has been fixed.

      (5) Figure 2: 

      a) In the figure legend, define what is meant by the grey line present in Figures 2C and 2D. 

      This has been updated in the figure legend.

      b) Edit the Y axis of 2C and 2D to specify the TRM signature score. 

      This has been updated in the figure.

      c) Include panel 1D from 1S into Figure 2 to help clarify for the reader what genes are expressed in the 0 - 5 clusters.

      We appreciate the feedback, but we found the heatmap made the figure look too busy, so we feel comfortable keeping it available within supplemental figure 1.

      d) In body of text explicitly discuss that the TRM module used to calculate a signature score was created using virus infection modules (HSV, LCMV and influenza) and thus some of the transcriptional similarity between the authors vaccinia virus treated CD8+ TRM and the TRM module might be due to viral infection rather than TRM status.

      Thank you for this comment.  We have now emphasized this point in the text.

      (6) Figure 3: 

      a) If there are leftover tissue sections, it would be optimal to show specific staining for CD103. We recognize that this data has been previously published by the lab, but it would be ideal to show it once in this paper. 

      Unfortunately, we do not have leftover tissue sections, so we are unable to measure CD103 by I.F. in these experiments.

      b) If you did collect skin draining lymph nodes in the Thy1.1 depletion model, it would be nice to see flow data showing the depletion effects in the skin draining lymph nodes in addition to the blood. 

      Unfortunately, we did not collect the skin draining lymph nodes, and do not have that data for the relevant experiments.

      c) Figure 3 F & G: Perform a T-test comparing vaccinia virus PBS to FTY720 and isotype to anti-Thy1.1 within the same treatment group. Showing no significance with these two comparisons would strengthen the authors' claims. Statistics can be described in legend. 

      We have included this analysis in the figure legend.

      (7) Figure 4: 

      a) It would be helpful to have the CD69+/CD103+ population in this model discussed/defined more. The CD69 expression seen in 4E is lower than the reviewers would've predicted, and it would be interesting to see CD103 expression as well.

      We have found that CD103 is generally a stronger marker for T<sub>RM</sub> in the skin by flow cytometry, as CD69 staining is somewhat less robust in the colors we have chosen. By way of example, we present the gating we did upstream in that experiment, gated previously on live CD45+CD3+CD8+ events (Author response image 4).

      Author response image 4.

      Representative flow cytometric plots showing CD69 and CD103 expression in gated live CD45+CD8+CD90.1+ cells isolated from VV-OVA or DNFB treated flanks.

      (8) Figure 5: 

      a) Define APL and its purpose in both the body of text and the figure legend. 

      We have clarified this in the text and the figure legend.

      b) Using in-vivo BrdU, compare proliferation between high avidity N4 and low avidity Y3 OVA-peptide at the primary recall timepoint. 

      We considered this, but due to the lack of sensitivity of the BrdU incorporation and the relatively subtle phenotype of the Y3, we did not think the assay would be sensitive enough to identify differences.

      (9) Figure 6: 

      a) Compare TGFBR3 expression in CD8+ T cells from mice receiving high avidity N4 versus low avidity Y3 OVA-peptide at the primary recall timepoint. 

      This point was discussed above.

      b) Either 1) examine TGFBR3 mRNA expression in VV vs DNFB skin from scRNA-seq dataset or 2) perform a qPCR on epidermal CD8+ T cells from mice receiving high avidity N4 versus low avidity Y3 at the primary recall timepoint. This would help distinguish whether TGFBR3 regulation occurs at the mRNA versus protein level. 

      This point has been discussed above.

      c) Figure 6A: Not required, but it seems like the TGFBR3 gate could be shifted to the right a bit. 

      The gates were set using FMO.

      d) Figure 6C: What comparison is the asterisk indicating significance referring to?

      It is the Dunnett’s test comparing VV-OVA to DNFB and untreated skin, the figure has been amended to clarify this point.

      e) Figure 6: To increase the rigor of the claim that CWHM12 is creating a TGFb limiting condition, the authors could either 1) perform an ELISA or cell-based assay measuring active TGFb, 2) recapitulate results of 6J using monoclonal antibody against avb6 as done in Hirai et al., 2021, Immunity., or 3) examine Tgfbr3 mRNA expression in your single cell RNAseq data, comparing cluster 0 and cluster 3.

      We are pleased to have the opportunity to show Tgfbr3 mRNA expression, which is presented above in Author response image 1.

      (10) Material and methods: 

      Specify how the localization of the back skin used for imaging was made consistent between the right and left flanks. 

      We have updated this methodology in the text.

      Literature Cited

      (1) Hirai, T., et al., Competition for Active TGFβ Cytokine Allows for Selective Retention of Antigen-Specific Tissue- Resident Memory T Cells in the Epidermal Niche. Immunity, 2021. 54(1): p. 84-98.e5.

      (2) Manresa, M.C., Animal Models of Contact Dermatitis: 2,4-Dinitrofluorobenzene-Induced Contact Hypersensitivity, in Animal Models of Allergic Disease: Methods and Protocols, K. Nagamoto-Combs, Editor. 2021, Springer US: New York, NY. p. 87-100.

      (3) Müller, H.C., et al., The Sphingosine-1 Phosphate receptor agonist FTY720 dose dependently affected endothelial integrity in vitro and aggravated ventilator-induced lung injury in mice. Pulmonary Pharmacology & Therapeutics, 2011. 24(4): p. 377-385.

      (4) Nofer, J.-R., et al., FTY720, a Synthetic Sphingosine 1 Phosphate Analogue, Inhibits Development of Atherosclerosis in Low-Density Lipoprotein Receptor–Deficient Mice. Circulation, 2007. 115(4): p. 501-508.

      (5) Brinkmann, V., et al., Fingolimod (FTY720): discovery and development of an oral drug to treat multiple sclerosis. Nat Rev Drug Discov, 2010. 9(11): p. 883-97.

      (6) Andrews, L.P., et al., A Cre-driven allele-conditioning line to interrogate CD4<sup>+</sup> conventional T cells. Immunity, 2021. 54(10): p. 2209-2217.e6.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Cho et al. present a comprehensive and multidimensional analysis of glutamine metabolism in the regulation of B cell differentiation and function during immune responses. They further demonstrate how glutamine metabolism interacts with glucose uptake and utilization to modulate key intracellular processes. The manuscript is clearly written, and the experimental approaches are informative and well-executed. The authors provide a detailed mechanistic understanding through the use of both in vivo and in vitro models. The conclusions are well supported by the data, and the findings are novel and impactful. I have only a few, mostly minor, concerns related to data presentation and the rationale for certain experimental choices.

      Detailed Comments:

      (1) In Figure 1b, it is unclear whether total B cells or follicular B cells were used in the assay. Additionally, the in vitro class-switch recombination and plasma cell differentiation experiments were conducted without BCR stimulation, which makes the system appear overly artificial and limits physiological relevance. Although the effects of glutamine concentration on the measured parameters are evident, the results cannot be confidently interpreted as true plasma cell generation or IgG1 class switching under these conditions. The authors should moderate these claims or provide stronger justification for the chosen differentiation strategy. Incorporating a parallel assay with anti-BCR stimulation would improve the rigor and interpretability of these findings. 

      We will edit the manuscript to be more explicit that total splenic B cells were used in this set-up figure and the rest of the paper. In addition, we will try to perform new experiments to improve this "set-up figure" (and add old and new data for Supplemental Figure presentation). Specifically, we will increase the range of conditions tested (e.g., styles of stimulating proliferation and differentiation) to foster an increased sense of generality. We plan to compare mitogenic stimulation with anti-CD40 to anti-IgM and to anti-IgM + anti-CD40, all with BAFF, IL-4, and IL-5, bearing in mind excellent work from Aiba et al, Immunity 2006; 24: 259-268, and similar papers. We also will try to present some representative flow cytometric profiles (presumably in new Supplemental Figure panels).

      To be transparent and add to a more open public discussion (using the virtues of this forum), the senior author and colleagues would caution about whether any in vitro conditions exist that warrant complete confidence. That is the reason for proceeding to immunization experiments in vivo. That is not said to cast doubt on our own in vitro data: there are some experiments (such as those of Fig. 1a-c and associated Supplemental Fig. 1) that can only be done in vitro or are better done that way (e.g., because of rapid uptake of early apoptotic B cells in vivo).

      For instance: Well-respected papers use the CD40LB and NB21.2D9 systems to activate B cells and generate plasma cells. Those appear to be BCR-independent and, unfortunately, we found that they cannot be used with amino acid deprivation or these inhibitors due to effects on the engineered stroma-like cells. In considering BCR engagement, Reth has published salient points about signaling and concentrations of the Ab, the upshot being that this means of activating mitogenesis and plasma cell differentiation (when the B cells are costimulated via CD40 or TLR (4 or 7/8)) is probably more than a bit artificial. Moreover, although Aiba et al, Immunity 2006; 24: 259-268 is a laudable exception, one rarely finds papers using BAFF in the cultures, despite the strong evidence that it is an essential part of the equation of B cell regulation in vivo and a cytokine that modulates BCR signaling.

      (2) In Figure 1c, the DMK alone condition is not presented. This hinders readers' ability to properly assess the glutaminolysis dependency of the cells for the measured readouts. Also, CD138+ in developing PCs goes hand in hand with decreased B220 expression. A representative FACS plot showing the gating strategy for the in vitro PCs should be added as a supplementary figure. Similarly, division number (going all the way to #7) may be tricky to gate and interpret. A representative FACS plot showing the separation of B cells according to their division numbers and a subsequent gating of CD138 or IgG1 in these gates would be ideal for demonstrating the authors' ability to distinguish these populations effectively.

      We agree that exact placement of divisions in deconvolution by FlowJo is more fraught than might be thought from the presentations in many or most papers. For the revision, we will try to add one or several representative FACS plot(s) with old and new data to provide the gating on CTV fluorescence, bearing these points in mind when extending the experiments from ~7 years ago (Fig. 1b, c). With the representative examples of the old data pasted in here, we will aver, however, that using divisions 0-6, and ≥7 was reasonable.

      Ditto for DMK with normal glutamine. However, in the spirit of eLife transparency lacking in many other journals, this comparison is more fraught than the referee comment would make things seem. The concentration tolerated by cells is highly dependent on the medium and glutamine concentration, and perhaps on rates of glutaminolysis (due to its generation of ammonia). In practice, we find that DMK becomes more toxic to B cells unless glutamine is low or glutaminolysis is restricted. Thus, the concentration of DMK that is tolerated and used in Fig. 1b, c can become toxic to the B cells when using the higher levels of glutamine in typical culture media (2 mM or more) - at which point the "normal conditions + DMK" "control" involves the surviving cells in conditions with far greater cell death and less population expansion than the "low glutamine + DMK" condition. Overall, we appreciate the suggestion to show more DMK data and will work to do so for the earlier proliferation data (shown above) and the new experiments.

      Author response image 1.

       

      (3) A brief explanation should be provided for the exclusive use of IgG1 as the readout in class-switching assays, given that naïve B cells are capable of switching to multiple isotypes. Clarifying why IgG1 was preferentially selected would aid in the interpretation of the results.

      We will edit the text to be more explicit and harmonize in light of the referee's suggestion that we focus the presentation of serologic data on IgG1 in the immunization experiments.

      [IgG1 provides the strongest signal and hence better signal/noise both in vitro and with the alum-based immunizations that are avatars for the adjuvant used in the majority of protein-based vaccines for humans.]

      (4) The immunization experiments presented in Figures 1 and 2 are well designed, and the data are comprehensively presented. However, to prevent potential misinterpretation, it should be clarified that the observed differences between NP and OVA immunizations cannot be attributed solely to the chemical nature of the antigens - hapten versus protein. A more significant distinction lies in the route of administration (intraperitoneal vs. intranasal) and the resulting anatomical compartment of the immune response (systemic vs. lung-restricted). This context should be explicitly stated to avoid overinterpretation of the comparative findings.

      We agree with the referee and will edit the text accordingly. Certainly, the difference in how the anti-ova response is elicited compared to the anti-NP response in the same mice or with a bit different an immunization regimen might be another factor - or the major factor - that could contribute towards explaining why glutaminolysis was important after ovalbumin inhalations (used because emergence of anti-ova Ab / ASCs is suppressed by the NP hapten after NP-ova immunization) but not needed for the anti-NP response unless Slc2a1 or Mpc2 also was inactivated. Thank you for prompting addition of this caveat.

      Nevertheless, it seems fair to note that in Figures 1 and 2, the ASCs and Ab are being analyzed for NP and ova in the same mice, albeit with the NP-specific components not being driven by the inhalations of ovalbumin. With that in mind, when one compares the IgG1 anti-NP ASC and Ab to those for IgG1 anti-ovalbumin (ASC in bone marrow; Ab), the ovalbumin-specific response was reduced whereas the anti-NP response was not.

      (5) NP immunization is known to be an inducer of an IgG1-dominant Th2-type immune response in mice. IgG2c is not a major player unless a nanoparticle delivery system is used. However, the authors arbitrarily included IgG2c in their assays in Figures 2 and 3. This may be confusing for the readers. The authors should either justify the IgG2c-mediated analyses or remove them from the main figures. (It can be added as supplemental information with proper justification). 

      We will rearrange the Figure panels to move the IgM and IgG2c data to Supplemental Figures.

      For purposes of public discourse, we note that the data of previous Figure 3(c, g) show a very strong NP-specific IgG2c response that seems to contradict the concept that IgG2c responses necessarily are weak in this setting; we also note the important role of IgG2c (in mice; IgG1 in humans) in controlling or clearing various pathogens as well as in autoimmunity. So from the standpoint of providing a better sense of generality to the loss-of-function effects, we continue to think that these measurements are quite important. That said, the main text has many figure panels and as the review notes, the class switching and in vitro ASC generation were done with IL-4 / IgG1-promoting conditions. If possible, we will try to assay in vitro class switching with IFN-γ rather than IL-4 but there may not be enough resources (time before lab closure; money).

      [As a collegial aside, we speculate that a greater or lesser IgG2c anti-NP response may arise due to different preparations of NP-carrier obtained from the vendor (Biosearch) having different amounts of TLR (e.g., TLR4) ligand. In any case, the points of presenting the IgG2c (and IgM) data were to push against the limiting boundaries of convention (which risks perpetuating a narrow view of potential outcomes) and make the breadth of results more apparent to readers.]

      (6) Similarly, in affinity maturation analyses, including IgM is somewhat uncommon. I do not see any point in showing high affinity (NP2/NP20) IgMs (Figure 3d), since that data probably does not mean much.

      As noted in the reply immediately preceding this one, we appreciate this suggestion from the reviewer and will move the IgM and IgG2c to Supplemental status.

      Nonetheless, in collegial discourse we disagree a bit with the referee in light of our data as well as of work that (to our minds) leads one to question why inclusion of affinity maturation of IgM is so uncommon - as the referee accurately notes. Of course a defect in the capacity to class-switch is highly deleterious in patients but that is not the same as concluding that recall IgM or its affinity is of little consequence.

      In some of the pioneering work back in the 1980's, Bothwell showed that NP-carrier immunization generated hybridomas producing IgM Ab with extensive SHM (~11% of the 18 lineages; ~ 1/3 of the IgM hybridomas) [PMID: 8487778]. IgM B cells appear to move into GC, and there is at least a reasonable published basis for the view that there are GC-derived IgM (unswitched) memory B cells (MBC) that would be more likely, upon recall activation, to differentiate into ASCs. [As an example, albeit with the Jenkins lab anti-rPE response, Taylor, Pape, and Jenkins generated quantitative estimates of the numbers of Ag-specific IgM<sup>+</sup> vs switched MBC that were GC-derived (or not) [PMID: 22370719]. While they emphasized that ~90% of IgM<sup>+</sup> MBC appeared to be GC-independent, their data also indicated that ~1/2 of all GC-derived MBC were IgM<sup>+</sup> rather than switched (their Fig. 8, B vs C; also 8E, which includes alum-PE).] And while we immensely respect the referee, we are perhaps less confident that IgM or high-affinity Ag-specific IgM doesn't mean that much, if only because of evidence that localized Ab compete for Ag and may thus influence selective processes [PMCID: PMC2747358; PMID: 15953185; PMID: 23420879; PMID: 27270306].

      (7) Following on my comment for the PC generation in Figure 1 (see above), in Figure 4, a strategy that relies solely on CD40L stimulation is performed. This is highly artificial for the PC generation and needs to be justified, or more physiologically relevant PC generation strategies involving anti-BCR, CD40L, and various cytokines should be shown. 

      In line with our response to point (1), we plan and will try to self-fund testing BCR-stimulated B cells (anti-CD40 to anti-IgM and to anti-IgM + anti-CD40, all with BAFF, IL-4, and IL-5).

      (8) The effects of CB839 and UK5099 on cell viability are not shown. Including viability data under these treatment conditions would be a valuable addition to the supplementary materials, as it would help readers more accurately interpret the functional outcomes observed in the study. 

      We will add to the supplemental figures to present data that provide cues as to relative viability / survival under the experimental conditions used. [FSC X SSC as well as 7AAD or Ghost dye panels; we also hope to generate new data that include further experiments scoring annexin V staining.]

      (9) It is not clear how the RNA seq analysis in Figure 4h was generated. The experimental strategy and the setup need to be better explained.

      The revised manuscript will include more information (at minimum in the Methods, Legend), and we apologize that in this and a few other instances sufficiency of detail was sacrificed on the altar of brevity.

      [Adding a brief synopsis for any reader before the final version of record, given the many months it will take to generate new data, thoroughly revise the manuscript, etc.:

      In three temporally and biologically independent experiments, cultures were harvested 3.5 days after splenic B cells were purified and cultured as in the experiments of Fig. 4a-e. Total cellular RNA prepared from the twelve samples (three replicates for each of four conditions - DMSO vehicle control, CB839, UK5099, and CB839 + UK5099) was analyzed by RNA-seq. The RNA-seq data were initially processed using the pipeline described in the Methods. For panels g & h of Fig. 4, DESeq2 was then used to quantify and compare read counts in the three CB839 + UK5099 samples relative to the three independent vehicle controls and to identify all genes for which the comparison yielded P<0.05. All such genes for which the difference was 'statistically significant' (i.e., P<0.05) were entered into the Immgen tool and thereby mapped to the B lineage subsets shown in the figure panels (i.e., g, h). In (g), these are displayed using one format, whereas (h) uses the 'heatmap' tool in MyGeneSet.]
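
      For readers who find a concrete form helpful, the gene-selection step described in the synopsis can be sketched roughly as below. This is only an illustrative sketch under stated assumptions, not the pipeline used for the paper: the record type `DEResult`, the helper `select_significant`, and the gene names and numbers are all hypothetical, and the table is assumed to have been exported already from the differential-expression step.

```python
# Illustrative sketch only: the gene names and values are invented placeholders,
# not data from the study, and the field names are assumptions.
from typing import NamedTuple


class DEResult(NamedTuple):
    gene: str
    log2_fold_change: float  # CB839 + UK5099 relative to vehicle control
    p_value: float


def select_significant(results, alpha=0.05):
    """Keep genes with p < alpha, ordered by log2 fold change (ascending)."""
    hits = [r for r in results if r.p_value < alpha]
    return sorted(hits, key=lambda r: r.log2_fold_change)


example = [
    DEResult("GeneA", -1.8, 0.001),
    DEResult("GeneB", 0.2, 0.40),
    DEResult("GeneC", 1.1, 0.03),
    DEResult("GeneD", -0.5, 0.05),  # exactly 0.05, so not below the cutoff
]

# The resulting list of gene names is what would be pasted into a
# gene-set mapping tool in this hypothetical workflow.
gene_list = [r.gene for r in select_significant(example)]
print(gene_list)
```

      The strict inequality mirrors the "P<0.05" criterion stated above; whether raw or adjusted P values are appropriate is a separate choice that this sketch does not address.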

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, the authors investigate the functional requirements for glutamine and glutaminolysis in antibody responses. The authors first demonstrate that the concentrations of glutamine in lymph nodes are substantially lower than in plasma, and that at these levels, glutamine is limiting for plasma cell differentiation in vitro. The authors go on to use genetic mouse models in which B cells are deficient in glutaminase 1 (Gls), the glucose transporter Slc2a1, and/or mitochondrial pyruvate carrier 2 (Mpc2) to test the importance of these pathways in vivo. 

      Interestingly, deficiency of Gls alone showed clear antibody defects when ovalbumin was used as the immunogen, but not the hapten NP. For the latter response, defects in antibody titers and affinity were observed only when both Gls and either Mpc2 or Slc2a1 were deleted. These latter findings form the basis of the synthetic auxotrophy conclusion. The authors go on to test these conclusions further using in vitro differentiations, Seahorse assays, pharmacological inhibitors, and targeted quantification of specific metabolites and amino acids. Finally, the authors document reduced STAT3 and STAT1 phosphorylation in response to IL-21 and interferon (both type 1 and 2), respectively, when both glutaminolysis and mitochondrial pyruvate metabolism are prevented. 

      Strengths:

      (1) The main strength of the manuscript is the overall breadth of experiments performed. Orthogonal experiments are performed using genetic models, pharmacological inhibitors, in vitro assays, and in vivo experiments to support the claims. Multiple antigens are used as test immunogens--this is particularly important given the differing results. 

      (2) B cell metabolism is an area of interest but understudied relative to other cell types in the immune system. 

      (3) The importance of metabolic flexibility and caution when interpreting negative results is made clear from this study.

      Weaknesses:

      (1) All of the in vivo studies were done in the context of boosters at 3 weeks and recall responses 1 week later. This makes specific results difficult to interpret. Primary responses, including germinal centers, are still ongoing at 3 weeks after the initial immunization. Thus, untangling what proportion of the defects are due to problems in the primary vs. memory response is difficult.

      (2) Along these lines, the defects shown in Figure 3h-i may not be due to the authors' interpretation that Gls and Mpc2 are required for efficient plasma cell differentiation from memory B cells. This interpretation would only be correct if the absence of Gls/Mpc2 leads to preferential recruitment of low-affinity memory B cells into secondary plasma cells. The more likely interpretation is that ongoing primary germinal centers are negatively impacted by Gls and Mpc2 deficiency, and this, in turn, leads to reduced affinities of serum antibodies.

      We provisionally plan to edit the wording of the conclusion a bit, adding a possibility we consider unlikely, so as to avoid concluding outright that MBCs bearing switched BCRs are affected once reactivated. We also will perform a new experiment to investigate, but unfortunately time before lab closure has been and remains our enemy both for performance and multiple replication of the work presented in Figure 3, panels h & i, and the related Supplemental Data (Supplemental Fig. 3a-j). Unfortunately, it will not be possible to do a memory experiment with recall immunization out at 8 weeks. Despite the grant funding running out and institutional belt-tightening, however, we'll try to perform a new head-to-head comparison of 4 wk post-immunization with and without the boost at three weeks.

      The intriguing concern (points 1 & 2) provides a springboard for consideration of generalizations and simplifications. Germinal center durability is not at all monolithic, and instead is quite variable**. The premise (cognitive bias, perhaps?) in the interpretation is that in our previous work we found few if any GC B cells - NP-APC-binding or otherwise - above the background (non-immunized controls) three weeks after immunization with NP-ovalbumin in alum. Recognizing that their immunizations were not NP-carrier in alum, we note for the readers and referee that Fig. 1 of the Taylor, Pape, & Jenkins paper considered above [PMID: 22370719] reported 10-fold more Ag-specific MBCs than GC B cells at day 29 post-immunization (the point at which the boost / recall challenge was performed in our Figure 3h, i).

      Viewed from that perspective, the surmise of the comment is that a major contribution to the differences in both all-affinity and high-affinity anti-NP IgG1 shown in Fig. 3i derives from the immunization at 4 wk stimulating GC B cells we cannot find as opposed to memory B cells. However, it is true that in the literature (especially with the experimentally different approach of transferring BCR-transgenic / knock-in versions of an NP-biased BCR) there may be meaningful pools of IgG1 and IgG2c GC B cells. Alternatively, our current reagents for immunizations may have become better at maintaining GC than those in the past - which we will try to test.

      The issue and question also relate to rates of output of plasma cells or rises in the serum concentrations of class-switched Ab. To this point, our prior experiences agree with the long-published data of the Kurosaki lab in Figure 3c of the Aiba et al paper noted above (Immunity, 2006) (and other such time courses). Readers can note that the IgG1 anti-NP response (alum adjuvant, as in our work) hits its plateau at 2 wk, and did not increase further from 2 to 3 wk. In other words, GC are on the decline and Ab production has reached its plateau by the time of the 2nd immunization in Fig. 3h.

      Assuming we understand the comment and line of reasoning correctly, we also lean towards disagreeing with the statement "This interpretation would only be correct if the absence of Gls/Mpc2 leads to preferential recruitment of low-affinity memory B cells into secondary plasma cells." Our evidence shows that both low-affinity as well as high-affinity anti-NP Ab (IgG1) went down as a result of combined gene-inactivation after the peak primary response (Fig. 3i). Recent papers show that affinity maturation is attributable to greater proliferation of plasmablasts with high-affinity BCR. Accordingly, the findings with loss of GLS and MPC function are quite consistent with the interpretation that much of the response after the second immunization draws on MBC differentiation into plasmablasts and then plasma cells, where the proliferative advantage of high-affinity cells is blunted by the impaired metabolism. The provisional plan, however, is to note the alternative, if less likely, interpretation proposed by the review.

      ** In some contexts, of course, especially certain viral infections or vaccination with lipid nanoparticles carrying modified mRNA, germinal centers are far more persistent; also, in humans even the seasonal flu vaccine **

      (3) The gating strategies for germinal centers and memory B cells in Supplemental Figure 2 are problematic, especially given that these data are used to claim only modest and/or statistically insignificant differences in these populations when Gls and Mpc2 are ablated. Neither strategy shows distinct flow cytometric populations, and it does not seem that the quantification focuses on antigen-specific cells.

      We will enhance these aspects of the presentation, using old and hopefully new data, but note for readers that many many other papers in the best journals show plots in which the separation of, say, GC-Tfh from overall Tfh is based on cut-off within what essentially is a continuous spectrum of emission as adjusted or compensated by the cytometer (spectral or conventional).

      Perhaps incorrectly, we omitted presenting data that included the results with NP-APC-staining - in part because within the GC B cell gate the frequencies of NP-binding events (GCB cells) were similar in double-knockout samples and controls. In practice, that would mean that the metabolic requirement applied about equally to NP+ and the total population. We will try to rectify this point in the revision.

      (4) Along these lines, the conclusions in Figure 6a-d may need to be tempered if the analysis was done on polyclonal, rather than antigen-specific cells. Alum induces a heavily type 2-biased response and is not known to induce much of an interferon signature. The authors' observations might be explained by the inclusion of other ongoing GCs unrelated to the immunization. 

      We will make sure the text is clear that the in vitro experiments do not represent GC B cells and that the RNA-seq data were not an Ag (SRBC)-specific subset.

      We also will try to work in a schematic along with expanding the Legends to make it more readily clear that the RNA-seq data (and hence the GSEA) involved immunizations with SRBC (not the alum / NP system which - it may be noted - in these experiments actually generated a robust IgG2c (type 1-driven) response along with the type 2-enhanced IgG1 response).

      Reviewer #3 (Public review): 

      Summary: 

      In their manuscript, the authors investigate how glutaminolysis (GLS) and mitochondrial pyruvate import (MPC2) jointly shape B cell fate and the humoral immune response. Using inducible knockout systems and metabolic inhibitors, they uncover a "synthetic auxotrophy": When GLS activity/glutaminolysis is lost together with either GLUT1-mediated glucose uptake or MPC2, B cells fail to upregulate mitochondrial respiration, IL-21/STAT3 and IFN/STAT1 signaling is impaired, and the plasma cell output and antigen-specific antibody titers drop significantly. This work thus demonstrates the promotion of plasma cell differentiation and cytokine signaling through parallel activation of two metabolic pathways. The dataset is technically comprehensive and conceptually novel, but some aspects leave the in vivo and translational significance uncertain.

      Strengths:

      (1) Conceptual novelty: the study goes beyond single-enzyme deletions to reveal conditional metabolic vulnerabilities and fate-deciding mechanisms in B cells.

      (2) Mechanistic depth: the study uncovers a novel "metabolic bottleneck" that impairs mitochondrial respiration and elevates ROS, and directly ties these changes to cytokine-receptor signaling. This is both mechanistically compelling and potentially clinically relevant.

      (3) Breadth of models and methods: inducible genetics, pharmacology, metabolomics, seahorse assay, ELISpot/ELISA, RNA-seq, two immunization models.

      (4) Potential clinical angle: the synergy of CB839 with UK5099 and/or hydroxychloroquine hints at a druggable pathway targeting autoantibody-driven diseases.

      We agree and thank the referee for the positive comments and this succinct summary of what we view as contributions of the paper.

      Weaknesses: 

      (1) Physiological relevance of "synthetic auxotrophy"

      The manuscript demonstrates that GLS loss is only crippling when glucose influx or mitochondrial pyruvate import is concurrently reduced, which the authors name "synthetic auxotrophy". I think it would help readers to clarify the terminology more and add a concise definition of "synthetic auxotrophy" versus "synthetic lethality" early in the manuscript and justify its relevance for B cells.

      We will edit the Abstract, Introduction, and Discussion to try to do better on this score. Conscious of how expansive the prose and data are even in the original submission, we appear to have taken some shortcuts that we will try to rectify. Thank you for highlighting this need to improve on a key concept!

      That said, we punctiliously & perhaps pedantically encourage readers to be completely accurate, in that under one condition of immunization GLS loss substantially reduced the anti-ovalbumin response (Fig. 1, Fig. 2a-c). And for this provisional response, we will expand a bit on the notion that synthetic auxotrophy represents effects on differentiation that appear to go beyond and not simply to be selective death, even though decreased population expansion is observed and one cannot exclude some contribution of enhanced death in vivo. Finally, we will note that this comment of the review raises interesting semantic questions about what represents "physiological relevance" but leave it at that.

      While the overall findings, especially the subset specificity and the clinical implications, are generally interesting, the "synthetic auxotrophy" condition feels a little engineered.

      One can readily say that CAR-T cells are 'a little engineered' so it is a matter of balancing this perspective of the referee against the strengths they highlight in points 1, 2, and 4. In any case, we will probably try to expand and be more explicit in the Discussion of the revised manuscript.

      In brief, even were the money not all gone, we would not believe that expanding the heft of this already rather large manuscript and set of data would be appropriate. As matters stand, a basic new insight about metabolic flexibility and its limits leads to evidence of a way to reduce generation of Ab and a novel impairment of STAT transcription factor induction by several cytokine receptors. The vulnerability that could be tested in later work on B cell-dependent autoimmunity includes the capacity to test a compound that already has been to or through FDA phase II in patients together with an FDA-approved standard-of-care agent.

      Put a different way, the point is that a basic curiosity to understand why decreasing glucose influx did not have an even more profound effect than what was observed, combined with curiosity as to why glutaminolysis was dispensable in relatively standard vaccine-like models of immunize / boost, provided a springboard to identification of new vulnerabilities. As above, we appreciate being made aware that this point merits being made more explicit in the Discussion of the edited version.

      Therefore, the findings strongly raise the question of the likelihood of such a "double hit" in vivo and whether there are conditions, disease states, or drug regimens that would realistically generate such a "bottleneck".

      Hence, the authors should document or at least discuss whether GC or inflamed niches naturally show simultaneous downregulation/lack of glutamine and/or pyruvate. The authors should also aim to provide evidence that infections (e.g., influenza), hypoxia, treatments (e.g., rapamycin), or inflammatory diseases like lupus co-limit these pathways. 

      Again, we appreciate some 'licensing' to be more expansive and explicit, and will try to balance editing in such points against undue tedium or tendentiously speculative length in the Discussion. In particular, we will note that a clear, simple implication of the work is to highlight an imperative to test CB839 in lupus patients already on hydroxychloroquine as standard-of-care, and to suggest development of UK5099 (already tested many times in mouse models of cancer) to complement glutaminase inhibition. 

      As backdrop, we note that the failure to advance imaging mass spectrometry to the capacity to quantify relative or absolute (via nano-DESI) concentrations of nutrients in localized interstitia is a critical gap in the entire field. Techniques that sample the interstitial fluid of tumor masses or in our case LN as a work-around have yielded evidence that there can be meaningful limitations of glucose and glutamine, but it needs to be acknowledged that such findings may be very model-specific and, as can be the case with cutting-edge science, are not without controversy. That said, yes, we had found that hypoxia reduced glutamine uptake but given the norms of focused, tidy packages only reported on leucine in an earlier paper [PMID: 27501247; PMCID: PMC5161594].

      It would hence also be beneficial to test the CB839 + UK5099/HCQ combinations in a short, proof-of-concept treatment in vivo, e.g., shortly before and after the booster immunization or in an autoimmune model. Likewise, it may also be insightful to discuss potential effects of existing treatments (especially CB839, HCQ) on human memory B cell or PC pools.

      We certainly agree that the suggestions offered in this comment are important next steps and the right approach to test if the findings reported here translate toward the treatment of autoimmune diseases that involve B cells, interferons, and pathophysiology mediated by auto-Ab. As practical points, performance and replication of such studies would take more time than the year allotted for return of a revised manuscript to eLife and in any case neither funds nor a lab remain to do these important studies. 

      Concrete evidence for our concurrence was embodied in a grant application to NIH that was essential for keeping a lab and doing any such studies. [We note, as a suggestion to others, that an essential component of such studies would be to test the effects of these compounds on B cells from patients and mice with autoimmunity]. Perhaps unfortunately for SLE patients, the review panelists did not agree about the importance of such studies. However, it can be hoped that the patent-holder of CB839 (and perhaps other companies developing glutaminase inhibitors) will see this peer-reviewed pre-print and the public dialogue, and recognize how positive results might open a valuable contribution to mitigation of diseases such as SLE.

      (2) Cell survival versus differentiation phenotype

      Claims that the phenotypes (e.g., reduced PC numbers) are "independent of death" and are not merely the result of artificial cell stress would benefit from Annexin-V/active-caspase 3 analyses of GC B cells and plasmablasts. Please also show viability curves for inhibitor-treated cells.

      This comment leads us to see that the wording on this point may have been overly terse in the interests of brevity, and thereby open to some misunderstanding. Accordingly, we will expand out the text of the Abstract and elsewhere in the manuscript, to be more clear. In addition, we will add in some data on the point, hopefully including some results of new experiments.

      To clarify in this public context, it is not that an increase in death (along with the reported decrease in cell cycling) can be or is excluded - and in fact it likely exists in vitro. The point is that beyond any such increase, and taking into account division number (since there is evidence that PC differentiation and output numbers involve a 'division-counting' mechanism), the frequencies of CD138+ cells and of ASCs among the viable cells are lower, as is the level of Prdm1-encoded mRNA even before the big increase in CD138+ cells in the population. 

      (3) Subset specificity of the metabolic phenotype

      Could the metabolic differences, mitochondrial ROS, and membrane-potential changes shown for activated pan-B cells (Figure 5) also be demonstrated ex vivo for KO mouse-derived GC B cells and plasma cells? This would also be insightful to investigate following NP-immunization (e.g., NP+ GC B cells 10 days after NP-OVA immunization).

      We agree that such data could be nice and add to the comprehensiveness of the work. We will try to scrounge the resources (time; money; human) to test this roughly as indicated. That said, we would note that the frequencies and hence numbers of NP+ GC B cells are so low that even in the flow cytometer we suspect there will not be enough "events" to rely on the results with DCFDA in the tiny sub-sub-subset. It also bears noting that reliable flow cytometric identification of the small NP-specific plasmablast/plasma cell subset amidst the overall population, little of which arose from immunization or after deletion of the floxed segments in B cells, would potentially be misleading.

      (4) Memory B cell gating strategy

      I am not fully convinced that the memory-B-cell gate in Supplementary Figure 2d is appropriate. The legend implies the population is defined simply as CD19+GL7-CD38+ (or CD19+CD38++?), with no further restriction to NP-binding cells. Such a gate could also capture naïve or recently activated B cells. From the descriptions in the figure and the figure legend, it is hard to verify that the events plotted truly represent memory B cells. Please clarify the full gating hierarchy and, ideally, restrict the MBC gate to NP+CD19+GL7-CD38+ B cells (or add additional markers such as CD80 and CD273). Generally, the manuscript would benefit from a more transparent presentation of gating strategies.

      We will further expand the supplemental data displays to include more of the gating and analytic scheme, and hope to be able to have performed new experiments and analyses (including additional markers) that could mitigate the concern noted here. In addition, we will include flow data from the non-immunized control mice that had been analyzed concurrently in the experiments illustrated in this Figure.

      Although it should be noted that the labeling indicated that the gating included the important criterion that cells be IgD- (Supplemental Fig. 2b), which excludes the vast majority of naive B cells, in principle marginal zone (MZ) B cells might fall within this gate. However, the MZ B population is unlikely to explain the differences shown in Supplemental Fig. 2b-d.

      (5) Deletion efficiency - [The] mRNA data show residual GLS/MPC2 transcripts (Supplementary Figure 8). Please quantify deletion efficiency in GC B cells and plasmablasts.

      Even were there resources to do this, the degree of reduction in target mRNA (Gls; Mpc2) renders this question superfluous.

      Are there likely to be some cells with only one, or even neither, allele converted from fl to Δ? Yes, but they would be a minor subset in light of the magnitude of mRNA reduction, in contrast to our published observations with Slc2a1. As to plasmablasts and plasma cells, the pre-existing populations make such an analysis misleading, while the number of such cells recoverable with antigen capture techniques is so low as to make both RNA and genomic DNA analyses questionable.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This paper investigates the control signals that drive event model updating during continuous experience. The authors apply predictions from previously published computational models to fMRI data acquired while participants watched naturalistic video stimuli. They first examine the time course of BOLD pattern changes around human-annotated event boundaries, revealing pattern changes preceding the boundary in anterior temporal and then parietal regions, followed by pattern stabilization across many regions. The authors then analyze time courses around boundaries generated by a model that updates event models based on prediction error and another that uses prediction uncertainty. These analyses reveal overlapping but partially distinct dynamics for each boundary type, suggesting that both signals may contribute to event segmentation processes in the brain.

      Strengths:

      (1) The question addressed by this paper is of high interest to researchers working on event cognition, perception, and memory. There has been considerable debate about what kinds of signals drive event boundaries, and this paper directly engages with that debate by comparing prediction error and prediction uncertainty as candidate control signals.

      (2) The authors use computational models that explain significant variance in human boundary judgments, and they report the variance explained clearly in the paper.

      (3) The authors' method of using computational models to generate predictions about when event model updating should occur is a valuable mechanistic alternative to methods like HMM or GSBS, which are data-driven.

      (4) The paper utilizes an analysis framework that characterizes how multivariate BOLD pattern dissimilarity evolves before and after boundaries. This approach offers an advance over previous work focused on just the boundary or post-boundary points.

      We appreciate this reviewer’s recognition of the significance of this research problem, and of the value of the approach taken by this paper.

      Weaknesses:

      (1) While the paper raises the possibility that both prediction error and uncertainty could serve as control signals, it does not offer a strong theoretical rationale for why the brain would benefit from multiple (empirically correlated) signals. What distinct advantages do these signals provide? This may be discussed in the authors' prior modeling work, but is left too implicit in this paper.

      We added a brief discussion in the introduction highlighting the complementary advantages of prediction error and prediction uncertainty, and cited prior theoretical work that elaborates on this point. Specifically, we now note that prediction error can act as a reactive trigger, signaling when the current event model is no longer sufficient (Zacks et al., 2007). In contrast, prediction uncertainty is framed as proactive, allowing the system to prepare for upcoming changes even before they occur (Baldwin & Kosie, 2021; Kuperberg, 2021). Together, this makes clearer why these two signals could each provide complementary benefits for effective event model updating.

      "One potential signal to control event model updating is prediction error—the difference between the system’s prediction and what actually occurs. A transient increase in prediction error is a valid indicator that the current model no longer adequately captures the current activity. Event Segmentation Theory (EST; Zacks et al., 2007) proposes that event models are updated when prediction error increases beyond a threshold, indicating that the current model no longer adequately captures ongoing activity. A related but computationally distinct proposal is that prediction uncertainty (also termed "unpredictability"), in addition to error, serves as the control signal (Baldwin & Kosie, 2021). The advantage of relying on prediction uncertainty to detect event boundaries is that it is inherently proactive: the cognitive system can start looking for cues about what might come next before the next event starts (Baldwin & Kosie, 2021; Kuperberg, 2021)."

      (2) Boundaries derived from prediction error and uncertainty are correlated for the naturalistic stimuli. This raises some concerns about how well their distinct contributions to brain activity can be separated. The authors should consider whether they can leverage timepoints where the models make different predictions to make a stronger case for brain regions that are responsive to one vs the other.

      We addressed this concern by adding an analysis that explicitly tests the unique contributions of prediction error– and prediction uncertainty–driven boundaries to neural pattern shifts. In the revised manuscript, we describe how we fit a combined FIR model that included both boundary types as predictors and then compared this model against versions with only one predictor. This allowed us to identify the variance explained by each boundary type over and above the other. The results revealed two partially dissociable sets of brain regions sensitive to error- versus uncertainty-driven boundaries (see Figure S1), strengthening our argument that these signals make distinct contributions.
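
      The model comparison described above can be sketched as follows. This is a minimal illustration on simulated data, not the authors' pipeline: the function names (`fir_design`, `gaussian_loglik`, `likelihood_ratio_test`), the lag window, the boundary positions, and the effect sizes are all assumptions, and the sketch uses ordinary least squares with independent Gaussian errors, which ignores the temporal autocorrelation of real fMRI time series. The intercept plays the role of the baseline: the expected dissimilarity far from any boundary.

```python
import numpy as np
from scipy import stats


def fir_design(n_timepoints, boundaries, lags):
    """Indicator columns: X[t, j] = 1 when timepoint t lies lags[j] samples from a boundary."""
    X = np.zeros((n_timepoints, len(lags)))
    for j, lag in enumerate(lags):
        for b in boundaries:
            if 0 <= b + lag < n_timepoints:
                X[b + lag, j] = 1.0
    return X


def gaussian_loglik(y, X):
    """Maximized Gaussian log-likelihood of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n  # maximum-likelihood estimate of the residual variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def likelihood_ratio_test(y, X_reduced, X_full):
    """Chi-square test of whether the extra columns of X_full improve the fit.

    X_reduced must be nested in X_full (its columns are a subset of X_full's).
    """
    lr = 2 * (gaussian_loglik(y, X_full) - gaussian_loglik(y, X_reduced))
    df = X_full.shape[1] - X_reduced.shape[1]
    return lr, df, stats.chi2.sf(lr, df)


# Simulated data (hypothetical): two boundary sets that share half of their boundaries.
rng = np.random.default_rng(0)
n, lags = 600, np.arange(-2, 3)
error_b = np.arange(20, 580, 25)
uncertainty_b = np.concatenate([error_b[::2], np.arange(32, 580, 50)])

ones = np.ones((n, 1))
X_err = fir_design(n, error_b, lags)
X_unc = fir_design(n, uncertainty_b, lags)

# Pattern dissimilarity = baseline + boundary-locked deviations + noise.
y = (0.5
     + X_err @ np.array([0.0, 0.05, 0.2, 0.1, 0.0])
     + X_unc @ np.array([0.1, 0.15, 0.05, 0.0, 0.0])
     + rng.normal(0, 0.05, n))

X_full = np.hstack([ones, X_err, X_unc])                 # combined FIR
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]    # beta_full[0] is the baseline

# Combined vs. error-only: unique contribution of uncertainty boundaries.
lr_unc, df_unc, p_unc = likelihood_ratio_test(y, np.hstack([ones, X_err]), X_full)
# Combined vs. uncertainty-only: unique contribution of error boundaries.
lr_err, df_err, p_err = likelihood_ratio_test(y, np.hstack([ones, X_unc]), X_full)
```

      Because each reduced model is nested in the combined model, the likelihood ratio statistic isolates the variance that one boundary type explains over and above the other, which is the quantity at issue when the two boundary types are correlated.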

      "To account for the correlation between uncertainty-driven boundaries and error-driven boundaries, we also fitted a FIR model that predicts pattern dissimilarity from both types of boundaries (combined FIR) for each parcel. Then, we performed two likelihood ratio tests: combined FIR to error FIR, which measures the unique contribution of uncertainty boundaries to pattern dissimilarity, and combined FIR to uncertainty FIR, which measures the unique contribution of error boundaries to pattern dissimilarity. The analysis also revealed two dissociable sets of brain regions associated with each boundary type (see Figure S1)."

      (3) The authors refer to a baseline measure of pattern dissimilarity, which their dissimilarity measure of interest is relative to, but it's not clear how this baseline is computed. Since the interpretation of increases or decreases in dissimilarity depends on this reference point, more clarity is needed.

      We clarified how the FIR baseline is estimated in the methods section. Specifically, we now explain that the FIR coefficients should be interpreted relative to a reference level, which reflects the expected dissimilarity when timepoints are far from an event boundary. This makes it clear what serves as the comparison point for observed increases or decreases in dissimilarity.

      "The coefficients from the FIR model indicate changes relative to baseline, which can be conceptualized as the expected value when far from the boundary."

      (4) The authors report an average event length of ~20 seconds, and they also look at +20 and -20 seconds around each event boundary. Thus, it's unclear how often pre- and post-boundary timepoints are part of adjacent events. This complicates the interpretations of the reported time courses.

      This is related to Reviewer 2's comment and is addressed below.

      (5) The authors describe a sequence of neural pattern shifts during each type of boundary, but offer little setup of what pattern shifts we might expect or why. They also offer little discussion of what cognitive processes these shifts might reflect. The paper would benefit from a more thorough setup for the neural results and a discussion that comments on how the results inform our understanding of what these brain regions contribute to event models.

      We thank the reviewer for this advice on how better to set the context for the different potential outcomes of the study. We expanded both the introduction and discussion to better set up expectations for neural pattern shifts and to interpret what these shifts may reflect. In the introduction, we now describe prior findings showing that sensory regions tend to update more quickly than higher-order multimodal regions (Baldassano et al., 2017; Geerligs et al., 2021, 2022), and we highlight that it remains unclear whether higher-order updates precede or follow those in lower-order regions. We also note that our analytic approach is well-suited to address this open question. In the discussion, we then interpret our results in light of this framework. Specifically, we describe how we observed early shifts in higher-order areas such as anterior temporal and prefrontal cortex, followed by shifts in parietal and dorsal attention regions closer to event boundaries. This pattern runs counter to the traditional bottom-up temporal hierarchy view and instead supports a model of top-down updating, where high-level representations are updated first and subsequently influence lower-level processing (Friston, 2005; Kuperberg, 2021). To make this interpretation concrete, we added an example: in a narrative where a goal is reached midway—for instance, a mystery solved before the story formally ends—higher-order regions may update the event representation at that point, and this updated model then cascades down to shape processing in lower-level regions. Finally, we note that the widespread stabilization of neural patterns after boundaries may signal the establishment of a new event model.

      Excerpt from Introduction:

      “More recently, multivariate approaches have provided insights into neural representations during event segmentation. One prominent approach uses hidden Markov models (HMMs) to detect moments when the brain switches from one stable activity pattern to another (Baldassano et al., 2017) during movie viewing; these periods of relative stability were referred to as "neural states" to distinguish them from subjectively perceived events. Sensory regions like visual and auditory cortex showed faster transitions between neural states. Multi-modal regions like the posterior medial cortex, angular gyrus, and intraparietal sulcus showed slower neural state shifts, and these shifts aligned with subjectively reported event boundaries. Geerligs et al. (2021, 2022) employed a different analytical approach called Greedy State Boundary Search (GSBS) to identify neural state boundaries. Their findings echoed the HMM results: short-lived neural states were observed in early sensory areas (visual, auditory, and somatosensory cortex), while longer-lasting states appeared in multi-modal regions, including the angular gyrus, posterior middle/inferior temporal cortex, precuneus, anterior temporal pole, and anterior insula. Particularly prolonged states were found in higher-order regions such as lateral and medial prefrontal cortex...

      The previous evidence about evoked responses at event boundaries indicates that these are dynamic phenomena evolving over many seconds, with different brain areas showing different dynamics (Ben-Yakov & Henson, 2018; Burunat et al., 2024; Kurby & Zacks, 2018; Speer et al., 2007; Zacks, 2010). Less is known about the dynamics of pattern shifts at event boundaries, because the HMM and GSBS analysis methods do not directly provide moment-by-moment measures of pattern shifts. For example, one question is whether shifts in higher-order regions precede or follow shifts in lower-level regions. Both the spatial and temporal aspects of evoked responses and pattern shifts at event boundaries have the potential to provide evidence about potential control processes for event model updating.”

      Excerpt from Discussion:

      “We first characterized the neural signatures of human event segmentation by examining both univariate activity changes and multivariate pattern changes around subjectively identified event boundaries. Using multivariate pattern dissimilarity, we observed a structured progression of neural reconfiguration surrounding human-identified event boundaries. The largest pattern shifts were observed near event boundaries (~4.5 s before) in dorsal attention and parietal regions; these correspond with regions identified by Geerligs et al. (2022) as shifting their patterns on an intermediate timescale. We also observed smaller pattern shifts roughly 12 seconds prior to event boundaries in higher-order regions within anterior temporal cortex and prefrontal cortex, and these are slow-changing regions identified by Geerligs et al. (2022). This is puzzling. One prevalent proposal, based on the idea of a cortical hierarchy of increasing temporal receptive windows (TRWs), suggests that higher-order regions should update representations after lower-order regions do (Chang et al., 2021). In this view, areas with shorter TRWs (e.g., word-level processors) pass information upward, where it is integrated into progressively larger narrative units (phrases, sentences, events). This proposal predicts neural shifts in higher-order regions to follow those in lower-order regions. By contrast, our findings indicate the opposite sequence. Our findings suggest that the brain might engage in top-down event representation updating, with changes in coarser-grain representations propagating downward to influence finer-grain representations (Friston, 2005; Kuperberg, 2021).
For example, in a narrative where the main goal is achieved midway—such as a detective solving a mystery before the story formally ends—higher-order regions might update the overarching event representation at that point, and this updated model could then cascade down to reconfigure how lower-level regions process the remaining sensory and contextual details. In the period after a boundary (around +12 seconds), we found widespread stabilization of neural patterns across the brain, suggesting the establishment of a new event model. Future work could focus on understanding the mechanisms behind the temporal progression of neural pattern changes around event boundaries.”

      Reviewer #2 (Public review):

      Summary:

      Tan et al. examined how multivoxel patterns shift in time windows surrounding event boundaries caused by both prediction errors and prediction uncertainty. They observed that some regions of the brain show earlier pattern shifts than others, followed by periods of increased stability. The authors combine their recent computational model to estimate event boundaries that are based on prediction error vs. uncertainty and use this to examine the moment-to-moment dynamics of pattern changes. I believe this is a meaningful contribution that will be of interest to memory, attention, and complex cognition research.

      Strengths:

      The authors have shown exceptional transparency in terms of sharing their data, code, and stimuli, which is beneficial to the field for future examinations and to the reproduction of findings. The manuscript is well written with clear figures. The study starts from a strong theoretical background to understand how the brain represents events and has used a well-curated set of stimuli. Overall, the authors extend the event segmentation theory beyond prediction error to include prediction uncertainty, which is an important theoretical shift that has implications in episodic memory encoding, the use of semantic and schematic knowledge, and attentional processing.

      We thank the reviewer for their support for our use of open science practices, and for their appreciation of the importance of incorporating prediction uncertainty into models of event comprehension.

      Weaknesses:

      The data presented is limited to the cortex, and subcortical contributions would be interesting to explore. Further, the temporal window around event boundaries of 20 seconds is approximately the length of the average event (21.4 seconds), and many of the observed pattern effects occur relatively distal from event boundaries themselves, which makes the link to the theoretical background challenging. Finally, while multivariate pattern shifts were examined at event boundaries related to either prediction error or prediction uncertainty, there was no exploration of univariate activity differences between these two different types of boundaries, which would be valuable.

      The fact that we observed neural pattern shifts well before boundaries was indeed unexpected, and we now offer a more extensive interpretation in the discussion section. Specifically, we added text noting that shifts emerged in higher-order anterior temporal and prefrontal regions roughly 12 seconds before boundaries, whereas shifts occurred in lower-level dorsal attention and parietal regions closer to boundaries. This sequence contrasts with the traditional bottom-up temporal hierarchy view and instead suggests a possible top-down updating mechanism, in which higher-order representations reorganize first and propagate changes to lower-level areas (Friston, 2005; Kuperberg, 2021). (See excerpt for Reviewer 1’s comment #5.)

      With respect to univariate activity, we did not find strong differences between error-driven and uncertainty-driven boundaries. This makes the multivariate analyses particularly informative for detecting differences in neural pattern dynamics. To support further exploration, we have also shared the temporal progression of univariate BOLD responses on OpenNeuro for interested researchers.

      Reviewer #3 (Public review):

      Summary:

      The aim of this study was to investigate the temporal progression of the neural response to event boundaries in relation to uncertainty and error. Specifically, the authors asked (1) how neural activity changes before and after event boundaries, (2) if uncertainty and error both contribute to explaining the occurrence of event boundaries, and (3) if uncertainty and error have unique contributions to explaining the temporal progression of neural activity.

      Strengths:

      One strength of this paper is that it builds on an already validated computational model. It relies on straightforward and interpretable analysis techniques to answer the main question, with a smart combination of pattern similarity metrics and FIR. This combination of methods may also be an inspiration to other researchers in the field working on similar questions. The paper is well written and easy to follow. The paper convincingly shows that (1) there is a temporal progression of neural activity change before and after an event boundary, and (2) event boundaries are predicted best by the combination of uncertainty and error signals.

      We thank the reviewer for their thoughtful and supportive comments, particularly regarding the use of the computational model and the analysis approaches.

      Weaknesses:

      (1) The current analysis of the neural data does not convincingly show that uncertainty and prediction error both contribute to the neural responses. As both terms are modelled in separate FIR models, it may be that the responses we see for both are mostly driven by shared variance. Given that the correlation between the two is very high (r=0.49), this seems likely. The strong overlap in the neural responses elicited by both, as shown in Figure 6, also suggests that what we see may mainly be shared variance. To improve the interpretability of these effects, I think it is essential to know whether uncertainty and error explain similar or unique parts of the variance. The observation that they have distinct temporal profiles is suggestive of some dissociation, but not as convincing as adding them both to a single model.

      We appreciate this point. It is closely related to Reviewer 1's comment 2; please refer to our response above.

      (2) The results for uncertainty and error show that uncertainty has strong effects before or at boundary onset, while error is related to more stabilization after boundary onset. This makes me wonder about the temporal contribution of each of these. Could it be the case that increases in uncertainty are early indicators of a boundary, and errors tend to occur later?

      We share the intuition that increases in uncertainty are early indicators of a boundary and that errors tend to occur later. If that were the case, we would expect a lag between prediction uncertainty and prediction error. We examined the lagged correlation between prediction uncertainty and prediction error, and the optimal lag was 0 for both the uncertainty-driven and error-driven models. This indicates that prediction uncertainty and prediction error rise at the same time.
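
      A lagged-correlation check of this kind can be sketched as below. This is an illustrative sketch only: the function names `lagged_correlation` and `best_lag`, the maximum lag, and the simulated signals are assumptions, and the response does not state which correlation measure or lag range was actually used.

```python
import numpy as np


def lagged_correlation(x, y, max_lag):
    """Pearson r between x[t] and y[t + lag] for every lag in [-max_lag, max_lag]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        out[lag] = np.corrcoef(a, b)[0, 1]
    return out


def best_lag(x, y, max_lag):
    """Lag with the highest correlation; a positive value means y trails x."""
    r = lagged_correlation(x, y, max_lag)
    return max(r, key=r.get)


# Simulated signals (hypothetical): 'delayed' is 'signal' shifted by three samples.
rng = np.random.default_rng(0)
signal = rng.normal(size=300)
delayed = np.roll(signal, 3)

lag_same = best_lag(signal, signal, max_lag=5)      # identical signals
lag_shifted = best_lag(signal, delayed, max_lag=5)  # second signal trails the first
```

      If uncertainty systematically led error, the best lag for the pair (uncertainty, error) would be positive under this convention; a best lag of 0 corresponds to the simultaneous rise reported above.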

      Author response image 1.

      (3) Given that there is a 24-second period during which the neural responses are shaped by event boundaries, it would be important to know more about the average distance between boundaries and the variability of this distance. This will help establish whether the FIR model can properly capture a return to baseline.

      We have added details about the distribution of event lengths. Specifically, we now report that the mean length of subjectively identified events was 21.4 seconds (median 22.2 s, SD 16.1 s). For model-derived boundaries, the average event lengths were 28.96 seconds for the uncertainty-driven model and 24.7 seconds for the error-driven model.

      "For each activity, a separate group of 30 participants had previously segmented each movie to identify fine-grained event boundaries (Bezdek et al., 2022). The mean event length was 21.4 s (median 22.2 s, SD 16.1 s). Mean event lengths for the uncertainty-driven and error-driven models were 28.96 s and 24.7 s, respectively."

      (4) Given that there is an early onset and long-lasting response of the brain to these event boundaries, I wonder what causes this. Is it the case that uncertainty or errors already increase at 12 seconds before the boundaries occur? Or are there other markers in the movie that the brain can use to foreshadow an event boundary? And if uncertainty or errors do increase already 12 seconds before an event boundary, do you see a similar neural response at moments with similar levels of error or uncertainty, which are not followed by a boundary? This would reveal whether the neural activity patterns are specific to event boundaries or whether these are general markers of error and uncertainty.

      We appreciate this point; it is similar to reviewer 2’s comment 2. Please see our response to that comment above.

      (5) It is known that different brain regions have different delays of their BOLD response. Could these delays contribute to the propagation of the neural activity across different brain areas in this study?

      Our analyses use ±20 s FIR windows, and the key effects we report include shifts ~12 s before boundaries in higher-order cortex and ~4.5 s before boundaries in dorsal attention/parietal areas. Region-dependent differences in BOLD delay are much smaller (~1–2 s) than the temporal structure we observe (Taylor et al., 2018), making it unlikely that HRF lag alone explains our multi-second, region-specific progression.

      (6) In the FIR plots, timepoints -12, 0, and 12 are shown. These long intervals preclude an understanding of the full temporal progression of these effects.

      Because of page-length constraints, we did not include all timepoints. We uploaded an animation of all timepoints to OpenNeuro for interested researchers.

      References

      Taylor, A. J., Kim, J. H., & Ress, D. (2018). Characterization of the hemodynamic response function across the majority of human cerebral cortex. NeuroImage, 173, 322–331. https://doi.org/10.1016/j.neuroimage.2018.02.061

    1. people almost overwork themselves, and they feel this compulsion and duty to the degree that sometimes I think they sometimes ruin their lives.

      Oh, don't turn this around. This is true. It is true for pianists in Spain, for athletic swimmers in Oklahoma, and for football players in Argentina. Child exploitation is a thing. Perhaps not self-overwork, but this is what society asks of them.

      There's this thing in progressivism, where we should be more rested and have more leisure. I sorta agree insofar as it is sustainable, but if the right-wing politicians exploit themselves, we must keep up too. We can't fight a tank with a sunflower for now, I think. We can't fight populism with 3-hour talks.

      So no, I deny there exists, at least more than anecdotally, this fog of mystery surrounding Korean players. Indeed, many in the lower ranks may admire and find a new interest in their culture thanks to popular figures like that.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study examined the changes in ATL GABA levels induced by cTBS and its relationship with BOLD signal changes and performance in a semantic task. The findings suggest that the increase in ATL GABA levels induced by cTBS is associated with a decrease in BOLD signal. The relationship between ATL GABA levels and semantic task performance is nonlinear, and more specifically, the authors propose that the relationship is an inverted U-shaped relationship.

      Strengths:

      The findings of the research regarding the increase of GABA and decrease of BOLD caused by cTBS, as well as the correlation between the two, appear to be reliable. This should be valuable for understanding the biological effects of cTBS.

      Weakness:

      I am pleased to see the authors' feedback on my previous questions and suggestions, and I believe the additional data analysis they have added is helpful. Here are my reserved concerns and newly discovered issues.

      (1) Regarding the Inverted U-Shaped Curve

      In the revised manuscript, the authors have accepted some of my suggestions and conducted further analysis, which is now presented in Figure 3B. These results provide partial support for the authors' hypothesis. However, I still believe that the data from this study hardly convincingly support an inverted U-shaped distribution relationship.

      The authors stated in their response, "it is challenging to determine the optimal level of ATL GABA," but I think this is achievable. From Figures 4C and 4D, the ATL GABA levels corresponding to the peak of the inverted U-shaped curve fall between 85 and 90. In my understanding, this can be considered as the optimal level of ATL GABA estimated based on the existing data and the inverted U-shaped curve relationship. However, in the latter half of the inverted U-shaped curve, there are quite few data points, and such a small number of data points hardly provides reliable support for the quantitative relationship in the latter half of the curve. I suggest that the authors should at least explicitly acknowledge this and be cautious in drawing conclusions. I also suggest that the authors consider fitting the data with more types of non-linear relationships, such as a ceiling effect (a combination of a slope and a horizontal line), or a logarithmic curve.

      We appreciate R1’s comments. Inverted U-shaped relationships are well-established in neuroscience, particularly in the context of neurotransmitter concentrations (e.g., dopamine, acetylcholine, noradrenaline) and their influence on cognitive functions such as working memory and cognitive control (Aston-Jones & Cohen, 2005; Cools & D'Esposito, 2011; Vijayraghavan et al., 2007; He & Zempel, 2013). Recently, Ferri et al. (2017) demonstrated an inverted U-shaped relationship between excitation-inhibition balance (EIB: the ratio of Glx and GABA) and multisensory integration, showing that both excessive and insufficient inhibition negatively impact functionality. Given that GABA is the brain’s primary inhibitory neurotransmitter, our findings suggest that ATL GABA may play a similar regulatory role in semantic memory function.

      While our statistical modelling approach demonstrated that the inverted U-shaped function was the best-fitting model for our current data in explaining the relationship between ATL GABA and semantic memory, we acknowledge the limitation of having fewer data points in the latter half (right side) of the curve, where excessive ATL GABA levels are associated with poorer semantic performance. Following R1’s suggestion, we have explicitly acknowledged this limitation in the revised manuscript and exercised caution in our discussion.

      Discussion, p.17, line 408

      "However, our findings should be interpreted with caution due to the limitation of having fewer data points in the latter half (right side) of the inverted U-shaped curve. Future studies incorporating GABA agonists could help further validate and refine these findings."

      Following R1’s latter suggestion, we tested a logarithmic curve model. The results showed significant relationships between ATL GABA and semantic performance (R² = 0.544, p < 0.001) and between cTBS-induced changes in ATL GABA and semantic performance (R² = 0.202, p < 0.001). However, the quadratic (inverted U-shaped) model explained more variance than the logarithmic model, as indicated by a higher R² and lower BIC. Model comparisons further confirmed that the inverted U-shaped model provided the best fit for both ATL GABA in relation to semantic performance (Fig. 4C) and cTBS-induced ATL GABA changes in relation to semantic function (Fig. 4D).
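
      The kind of comparison described here, in which competing curve shapes are ranked by R² and BIC, can be sketched as follows. This is a hedged illustration on simulated values, not a re-analysis: the helper names `fit_ols` and `compare_models`, the simulated GABA range, and the noise level are assumptions, and the response does not state the exact fitting procedure. BIC is computed in its Gaussian least-squares form, up to an additive constant shared by all models.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit; returns (R^2, BIC) with BIC up to a model-independent constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    n, k = X.shape
    r2 = 1.0 - rss / float(((y - y.mean()) ** 2).sum())
    bic = n * np.log(rss / n) + k * np.log(n)
    return r2, bic


def compare_models(x, y):
    """Fit linear, logarithmic, and quadratic models of y on x (x must be positive)."""
    ones = np.ones_like(x)
    designs = {
        "linear": np.column_stack([ones, x]),
        "logarithmic": np.column_stack([ones, np.log(x)]),
        "quadratic": np.column_stack([ones, x, x ** 2]),
    }
    return {name: fit_ols(X, y) for name, X in designs.items()}


# Simulated inverted-U data (hypothetical values, arbitrary units).
rng = np.random.default_rng(1)
gaba = np.linspace(60, 110, 40)
performance = 10 - 0.01 * (gaba - 87) ** 2 + rng.normal(0, 0.3, gaba.size)

results = compare_models(gaba, performance)
best_by_bic = min(results, key=lambda name: results[name][1])
```

      The quadratic model pays a BIC penalty for its extra parameter, so it is preferred only when the improvement in fit outweighs that penalty. With few observations on the descending side of the curve, as the reviewer notes, that improvement rests on a small number of points.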

      Author response table 1.

      (2) In Figure 2F, the authors demonstrated a strong practice effect in this study, which to some extent offsets the decrease in behavioral performance caused by cTBS. Therefore, I recommend that the authors give sufficient consideration to the practice effect in the data analysis.

      One issue is the impact of the practice effect on the classification of responders and non-responders. Currently, most participants are classified as non-responders, suggesting that the majority of the population may not respond to the cTBS used in this study. This greatly challenges the generalizability of the experimental conclusions. However, the emergence of so many non-responders is likely due to the prominent practice effect, which offsets part of the experimental effect. If the practice effect is excluded, the number of responders may increase. The authors might estimate the practice effect based on the vertex stimulation condition and reclassify participants after excluding the influence of the practice effect.

      Another issue is that considering the significant practice effect, the analysis in Figure 4D, which mixes pre- and post-test data, may not be reliable.

      We appreciate Reviewer 1’s thoughtful comments regarding the practice effect and its potential impact on our findings. Our previous analysis revealed a strong practice effect on reaction time (RT), with participants performing tasks faster in the POST session, regardless of task condition (Fig. S3). Given our hypothesis that inhibitory ATL cTBS would disrupt semantic task performance, we accounted for this by using inverse efficiency (IE), which combines accuracy and RT. This analysis demonstrated that ATL cTBS disrupted semantic task performance compared to both control stimulation (vertex) and control tasks, despite the practice effect (i.e., faster RT in the POST session), thereby supporting our hypothesis. These findings may suggest that the effects of ATL cTBS were more subtly reflected in semantic task accuracy rather than RT.
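      As a brief illustration of why IE can reveal a disruption that RT alone hides, the sketch below uses the standard definition of inverse efficiency (mean RT divided by proportion correct). The function name and the participant values are hypothetical and are not taken from the study data.

```python
def inverse_efficiency(mean_rt_ms, accuracy):
    """Inverse efficiency: mean RT divided by proportion correct (higher = worse)."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be a proportion in (0, 1]")
    return mean_rt_ms / accuracy

# Hypothetical participant: faster after practice, but less accurate.
ie_pre = inverse_efficiency(mean_rt_ms=1000.0, accuracy=0.90)
ie_post = inverse_efficiency(mean_rt_ms=950.0, accuracy=0.80)

print(round(ie_pre, 1), round(ie_post, 1))  # 1111.1 1187.5
```

      In this made-up case RT improves by 50 ms, yet IE worsens because the drop in accuracy outweighs the speed-up.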

      Regarding inter-individual variability in response to rTMS/TBS, prior studies have shown that 50–70% of participants are non-responders, who either do not respond or respond in an unexpected manner (Goldsworthy et al., 2014; Hamada et al., 2013; Hinder et al., 2014; Lopez-Alonso et al., 2014; Maeda et al., 2000a; Müller-Dahlhaus et al., 2008). Our previous study (Jung et al., 2022), which used the same semantic task and cTBS protocol, was the first to explore TBS-responsiveness variability in semantic memory; 12 out of 20 participants (60%) were classified as responders. The proportion of responders and non-responders in the current study aligns with these previous findings, suggesting that this variability is expected in TBS research.

      However, we acknowledge R1’s concern that the strong practice effect may have influenced responder classification. To address this, we estimated the practice effect using the vertex stimulation condition and reclassified participants accordingly by adjusting ATL stimulation performance (IE) relative to vertex stimulation performance (IE). This reclassification identified nine responders (an increase of two), aligning with the typical responder proportion (52%) reported in the TBS literature. Overall, we replicated the previous findings with improved statistical robustness.
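      One way such a practice-adjusted reclassification could be implemented is sketched below. The exact adjustment formula is not given in this response, so the difference-of-differences rule, the function names, and the IE values are all assumptions for illustration only.

```python
def practice_adjusted_change(atl_pre, atl_post, vertex_pre, vertex_post):
    """Change in IE after ATL cTBS minus the change after vertex cTBS.

    The vertex change is treated as the practice effect (assumed rule).
    Positive values mean performance after ATL stimulation was worse than
    practice alone would predict, since a higher IE is poorer performance.
    """
    return (atl_post - atl_pre) - (vertex_post - vertex_pre)

def classify(adjusted_change):
    """Label a participant from the practice-adjusted IE change."""
    return "responder" if adjusted_change > 0 else "non-responder"

# Hypothetical IE values (ms) for two participants.
participants = {
    "P01": dict(atl_pre=1100, atl_post=1080, vertex_pre=1090, vertex_post=1000),
    "P02": dict(atl_pre=1050, atl_post=960, vertex_pre=1040, vertex_post=955),
}

labels = {pid: classify(practice_adjusted_change(**v)) for pid, v in participants.items()}
print(labels)
```

      Here the hypothetical P01 improves slightly after ATL stimulation in raw terms, but far less than after vertex stimulation, so the adjusted rule labels P01 a responder. This is the kind of case the reviewer suggested a practice effect could mask.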

      A 2×2×2 ANOVA was conducted with task (semantic vs. control) and session (PRE vs. POST) as within-subject factors, and group (responders vs. non-responders) as a between-subject factor. The analysis revealed a significant interaction between session and group (F<sub>1, 15</sub> = 10.367, p = 0.006), a marginally significant interaction between session and task (F<sub>1, 15</sub> = 4.370, p = 0.054), and a significant 3-way interaction between session, task, and group (F<sub>1, 15</sub> = 7.580, p = 0.015). Post hoc t-tests showed a significant group difference in semantic task performance following ATL stimulation (t = 2.349, p = 0.033). Post hoc paired t-tests demonstrated that responders exhibited poorer semantic task performance following ATL cTBS (t = -5.281, p < 0.001), whereas non-responders showed a significant improvement (t = 3.206, p = 0.007) (see Figure 3A).

      Notably, no differences were observed between responders and non-responders in the control task performance across pre- and post-stimulation sessions, confirming that the practice effect was successfully controlled (Figure 3B).

      We performed a 2×2 ANOVA with session (pre vs. post) as a within-subject factor and group (responders vs. non-responders) as a between-subject factor to examine group effects on ATL GABA levels. The results revealed a significant main effect of session (F<sub>1, 14</sub> = 39.906, p < 0.001) and group (F<sub>1, 14</sub> = 9.677, p = 0.008). Post hoc paired t-tests on ATL GABA levels showed a significant increase in regional ATL GABA levels following ATL stimulation for both responders (t = -3.885, p = 0.002) and non-responders (t = -4.831, p = 0.001). Furthermore, we replicated our previous finding that baseline GABA levels were significantly higher in responders compared to non-responders (t = 2.816, p = 0.007) (Figure 3C). This pattern persisted in the post-stimulation session (t = 2.555, p = 0.011) (Figure 3C).

      Accordingly, we have revised the Methods and Materials (p 26, line 619), Results (p11, line 233-261), and Figure 3.

      (3) The analysis in Figure 3A has a double dipping issue. Suppose we generate 100 pairs of random numbers as pre- and post-test scores, and then group the data based on whether the scores decrease or increase; the pre-test scores of the group with decreased scores will have a very high probability of being higher than those of the group with increased scores. Therefore, the findings in Figure 3A seem to be meaningless.
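      The reviewer's thought experiment can be reproduced with a short simulation. The sketch below draws 100 independent pre/post pairs, groups them by whether the score decreased or increased, and compares the groups' pre-test means. The seed, the standard normal distribution, and the variable names are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

# 100 independent (pre, post) score pairs with no true effect of any kind.
pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]

# Group by the direction of change, as in a responder/non-responder split.
decreased = [pre for pre, post in pairs if post < pre]
increased = [pre for pre, post in pairs if post >= pre]

def mean(values):
    return sum(values) / len(values)

gap = mean(decreased) - mean(increased)
print(f"n decreased = {len(decreased)}, n increased = {len(increased)}")
print(f"pre-test mean difference (decreased - increased) = {gap:.2f}")
```

      Because grouping on the sign of the change selects on the pre-test score itself, the "decreased" group shows a higher pre-test mean even though the data are pure noise.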

      We agree with R1’s comment. Figure 3A is intended to illustrate interindividual patterns of responsiveness, while Figure 3B, which incorporates the new analyses, demonstrates that these results hold after accounting for practice effects.

      (4) The authors use IE as a behavioral measure in some analyses and use accuracy in others. I recommend that the authors adopt a consistent behavioral measure.

      We appreciate Reviewer 1’s suggestion. In examining the relationship between ATL GABA and semantic task performance, we have found that only semantic accuracy, and not reaction time (RT) or inverse efficiency (IE), shows a significant positive correlation and regression with ATL GABA levels and semantic task-induced ATL activation, both in our previous study (Jung et al., 2017) and in the current study. ATL GABA levels were not correlated with semantic RT (Jung et al., 2017: r = 0.34, p = 0.14; current study: r = 0.26, p = 0.14). Likewise, ATL GABA levels were not significantly correlated with semantic IE in either study (Jung et al., 2017: r = 0.13, p = 0.62; current study: r = 0.22, p = 0.44). Consistent with this, we found no significant linear or non-linear relationship between ATL GABA levels and RT (linear function: R<sup>2</sup> = 0.21, p = 0.45; quadratic function: R<sup>2</sup> = 0.17, p = 0.21) or between ATL GABA levels and IE (linear function: R<sup>2</sup> = 0.24, p = 0.07; quadratic function: R<sup>2</sup> = 2.24, p = 0.12).

      The absence of a meaningful relationship between ATL GABA and semantic RT or IE may be due to the following reasons: 1) RT is primarily associated with premotor and motor activation during semantic processing rather than ATL activation; 2) ATL GABA is likely to play a key role in refining distributed semantic representations through lateral inhibition, which sharpens the activated representation (Jung et al., 2017; Liu et al., 2011; Isaacson & Scanziani, 2011). This sharpening process may contribute to more accurate semantic performance (Jung et al., 2017). In our semantic task, for example, when encountering a camel (Fig. 1B), multiple semantic features (e.g., animal, brown, desert, sand) are activated. To correctly identify the most relevant concept (cactus), irrelevant associations (tree) must be suppressed, a process that likely relies on inhibitory mechanisms. Given this theoretical framework, we have used accuracy as the primary measure of semantic performance to elucidate ATL GABA function.

      Reviewer #2 (Public review):

      Summary:

      The authors combined inhibitory neurostimulation (continuous theta-burst stimulation, cTBS) with subsequent MRI measurements to investigate the impact of inhibition of the left anterior temporal lobe (ATL) on task-related activity and performance during a semantic task and link stimulation-induced changes to the neurochemical level by including MR spectroscopy (MRS). cTBS effects in the ATL were compared with a control site in the vertex. The authors found that relative to stimulation of the vertex, cTBS significantly increased the local GABA concentration in the ATL. cTBS also decreased task-related semantic activity in the ATL and potentially delayed semantic task performance by hindering a practice effect from pre to post. Finally, pooled data with their previous MRS study suggest an inverted u-shape between GABA concentration and behavioral performance. These results help to better understand the neuromodulatory effects of non-invasive brain stimulation on task performance.

      Strengths:

      Multimodal assessment of neurostimulation effects on the behavioral, neurochemical, and neural levels. In particular, the link between GABA modulation and behavior is timely and potentially interesting.

      Weaknesses:

      The analyses are not sound. Some of the effects are very weak and not all conclusions are supported by the data since some of the comparisons are not justified. There is some redundancy with a previous paper by the same authors, so the novelty and contribution to the field are overall limited. A network approach might help here.

      Reviewer #3 (Public review):

      Summary:

      The authors used cTBS TMS, magnetic resonance spectroscopy (MRS), and functional magnetic resonance imaging (fMRI) as the main methods of investigation. Their data show that cTBS modulates GABA concentration and task-dependent BOLD in the ATL, whereby greater GABA increase following ATL cTBS showed greater reductions in BOLD changes in ATL. This effect was also reflected in the performance of the behavioural task response times, which did not subsume to practice effects after ATL cTBS as opposed to the associated control site and control task. This is in line with their first hypothesis. The data further indicate that regional GABA concentrations in the ATL play a crucial role in semantic memory because individuals with higher (but not excessive) GABA concentrations in the ATLs performed better on the semantic task. This is in line with their second prediction. Finally, the authors conducted additional analyses to explore the mechanistic link between ATL inhibitory GABAergic action and semantic task performance. They show that this link is best captured by an inverted U-shaped function as a result of a quadratic linear regression model. Fitting this model to their data indicates that increasing GABA levels led to better task performance as long as they were not excessively low or excessively high. This was first tested as a relationship between GABA levels in the ATL and semantic task performance; then the same analyses were performed on the pre and post-cTBS TMS stimulation data, showing the same pattern. These results are in line with the conclusions of the authors.

      Comments on revisions:

      The authors have comprehensively addressed my comments from the first round of review, and I consider most of their answers and the steps they have taken satisfactory. Their insights prompted me to reflect further on my own knowledge and thinking regarding the ATL function.

      I do, however, have an additional and hopefully constructive comment regarding the point made about the study focusing on the left instead of bilateral ATL. I appreciate the methodological complexities and the pragmatic reasons underlying this decision. Nevertheless, briefly incorporating the justification for this decision into the manuscript would have been beneficial for clarity and completeness. The presented argument follows an interesting logic; however, despite strong previous evidence supporting it, the approach remains based on an assumption. Given that the authors now provide the group-level fMRI results captured more comprehensively in Supplementary Figure 2, where the bilateral pattern of fMRI activation can be observed in the current data, the authors could have strengthened their argument by asserting that the activation related to the given semantic association task in this data was bilateral. This would imply that the TMS effects and associated changes in GABA should be similar for both sites. Furthermore, it is worth noting the approach taken by Pobric et al. (2007, PNAS), who stimulated a site located 10 mm posterior to the tip of the left temporal pole along the middle temporal gyrus (MTG) and not the bilateral ATL.

      We appreciate the reviewer’s constructive comment regarding the focus on the left ATL rather than bilateral ATL in our study. Accordingly, we have added the following paragraph in the Supplementary Information.

      “Justification of target site selection and cTBS effects

      Evidence suggests that bilateral ATL systems contribute to semantic representation (for a review, see Lambon Ralph., 2017). Consistent with this, our semantic task induced bilateral ATL activation (Fig. S2). Thus, stimulating both left and right ATL could provide a more comprehensive understanding of cTBS effects and its GABAergic function.

      Previous rTMS studies have applied inhibitory stimulation to the left vs. right ATL, demonstrating that stimulation at either site significantly disrupted semantic task performance (Pobric et al., 2007, PNAS; Pobric et al., 2010, Neuropsychologia; Lambon Ralph et al., 2009, Cerebral Cortex). Importantly, these studies reported no significant difference in rTMS effects between left and right ATL stimulation, suggesting that stimulating either hemisphere produces comparable effects on semantic processing. In the current study, we combined cTBS with multimodal imaging to investigate its effects on the ATL. Given our study design constraints (including the need for a control site, control task, and control stimulation) and limitations in scanning time, we selected the left ATL as the target region. This choice also aligned with the MRS voxel placement used in our previous study (Jung et al., 2017), allowing us to combine datasets and further investigate GABAergic function in the ATL. Accordingly, cTBS was applied to the peak coordinate of the left ventromedial ATL (MNI -36, -15, -30) as identified by previous fMRI studies (Binney et al., 2010; Visser et al., 2012).

      Given that TMS pulses typically penetrate 2–4 cm, we acknowledge the challenge of reaching deeper ventromedial ATL regions. However, our findings indicate that cTBS effectively modulated ATL function, as evidenced by reduced task-induced regional activity, increased ATL GABA concentrations, and poorer semantic performance, confirming that TMS pulses successfully influenced the target region. To further validate these effects, we conducted an ROI analysis centred on the ventromedial ATL (MNI -36, -15, -30), which revealed a significant reduction in ATL activity during semantic processing following ATL stimulation (t = -2.43, p = 0.014) (Fig. S7). This confirms that cTBS successfully modulated ATL activity at the intended target coordinate.”

      We appreciate R3's comment regarding the approach taken by Pobric et al. (2007, PNAS), who stimulated a site 10 mm posterior to the tip of the left temporal pole along the middle temporal gyrus (MTG). This approach has been explicitly discussed in our previous papers and reviews (e.g., Lambon Ralph, 2014, Proc. Royal Society B). Our earlier use of lateral ATL stimulation at this location (Pobric et al. 2007; Lambon Ralph et al. 2009; Pobric et al. 2010) was based on its alignment with the broader ATL region commonly atrophied in semantic dementia (cf. Binney et al., 2010 for a direct comparison of SD atrophy, fMRI data and the TMS region). Since these original ATL TMS investigations, a series of distortion-corrected or distortion-avoiding fMRI studies (e.g., Binney et al 2010; Visser et al, various, Hoffman et al., various; Jackson et al., 2015) have demonstrated graded activation differences across the ATL. While weaker activation is present at the original lateral ATL (MTG) stimulation site, the peak activation is maximal in the ventromedial ATL—a finding that was also observed in the current study. Accordingly, we selected the ventromedial ATL as our target site for stimulation.

      Following these points, we have revised the manuscript in the Methods and Materials.

      Transcranial magnetic stimulation p23, line 525-532,

      “Previous rTMS studies targeted a lateral ATL site 10 mm posterior to the temporal pole on the middle temporal gyrus (MTG) (Pobric et al. 2007; Lambon Ralph et al. 2009; Pobric et al. 2010), aligning with the broader ATL region typically atrophied in semantic dementia (Binney et al. 2010). However, distortion-corrected fMRI studies (Binney et al. 2010; Visser et al. 2012) have revealed graded activation differences across the ATL, with peak activation in the ventromedial ATL. Based on these findings, we selected the target site in the left ATL (MNI -36, -15, -30) from prior distortion-corrected fMRI studies (Binney et al. 2010; Visser et al. 2012) that employed the same tasks as our study (for further details, see the Supplementary Information).”

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      The authors have responded to all my comments and I found most of the responses reasonable and sufficient. However, I have one remaining point: I pointed out before that the scope of this paper is somewhat narrow and asked for a network analysis. I found the response to my question somewhat puzzling since the authors write:

      "However, it is important to note that we did not find any significant correlations between ATL GABA changes and cTBS-induced changes in the functional connectivity. Consequently, we are currently preparing another paper that specifically addresses the network-level changes induced by ATL cTBS."

      I don't understand the logic here. Even in the absence of significant correlations between ATL GABA changes and cTBS-induced changes in connectivity, it would be interesting to know how baseline connectivity is correlated with the induced changes. I am not sure if it is adequate to squeeze another paper out of the dataset instead of reporting it here as suggested.

      We apologise that our previous response was not clear. To examine cTBS-induced network-level changes, we conducted ROI analyses targeting key semantic regions, including the bilateral ATL, inferior frontal gyrus (IFG), and posterior middle temporal gyrus (pMTG), as well as Psychophysiological Interactions (PPI) using the left ATL as a seed region. The ROI analysis revealed that ATL stimulation significantly decreased task-induced activity in the left ATL (target region) while increasing activity in the right ATL and left IFG. PPI analyses showed that ATL stimulation enhanced connectivity between the left ATL and the right ATL (both ventromedial and lateral ATL), bilateral IFG, and bilateral pMTG, suggesting that ATL stimulation modulates a bilateral semantic network.

      Building on these findings, we conducted Dynamic Causal Modeling (DCM) to estimate and infer interactions among predefined brain regions across different experimental conditions (Friston et al., 2003). The bilateral ventromedial ATL, lateral ATL, IFG, and pMTG were defined as network nodes with mutual connections. Our model examined cTBS effects at the left ATL under both baseline (intrinsic) and semantic task (modulatory) conditions, estimating 56 intrinsic parameters for baseline connectivity and testing 16 different modulatory models to assess cTBS-induced connectivity changes during semantic processing. Here, we briefly summarize the key DCM analysis results: 1) ATL cTBS significantly altered effective connectivity between the left and right lateral and ventromedial ATL in both intrinsic and modulatory conditions; 2) cTBS increased modulatory connectivity from the right to the left ATL compared to vertex stimulation.

      Given the complexity and depth of these findings, we believe that a dedicated paper focusing on the network-level effects of ATL cTBS is necessary to provide a more comprehensive and detailed analysis, which extends beyond the scope of the current study. It should be noted that no significant relationship was found between ATL GABA levels and ATL connectivity in either the PPI or the DCM analyses.

      Reviewer #3 (Recommendations for the authors):

      In response to my comment about the ATL activation being rather medial in the fMRI data and my concern about the TMS pulse perhaps not reaching this site, the authors offer an excellent solution to demonstrate TMS effects to such a medial ATL coordinate. I think that the analyses and figures they provide as a response to this comment and a brief explanation of this result should be incorporated into supplementary materials for methodologically oriented readers. Also, perhaps it would be beneficial to discuss that the effect of TMS on vATL remains a matter of further research to see not just if but also how TMS pulse reaches target coordinates, given the problematic anatomical location of the region.

      We appreciate R3’s suggestion. Please see our reply above.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      Cell metabolism exhibits a well-known behavior in fast-growing cells, which employ seemingly wasteful fermentation to generate energy even in the presence of sufficient environmental oxygen. This phenomenon is known as Overflow Metabolism or the Warburg effect in cancer. It is present in a wide range of organisms, from bacteria and fungi to mammalian cells.

      In this work, starting with a metabolic network for Escherichia coli based on sets of carbon sources, and using a corresponding coarse-grained model, the author applies some well-based approximations from the literature and algebraic manipulations. These are used to successfully explain the origins of Overflow Metabolism, both qualitatively and quantitatively, by comparing the results with E. coli experimental data.

      By modeling the proteome energy efficiencies for respiration and fermentation, the study shows that these parameters are dependent on the carbon source quality constants K_i (p.115 and 116). It is demonstrated that as the environment becomes richer, the optimal solution for proteome energy efficiency shifts from respiration to fermentation. This shift occurs at a critical parameter value K_A(C).

      This counterintuitive result qualitatively explains Overflow Metabolism.

      Quantitative agreement is achieved through the analysis of the heterogeneity of the metabolic status within a cell population. By introducing heterogeneity, the critical growth rate is assumed to follow a Gaussian distribution over the cell population, resulting in accordance with experimental data for E. coli. Overflow metabolism is explained by considering optimal protein allocation and cell heterogeneity.

      The obtained model is extensively tested through perturbations: 1) Introduction of overexpression of useless proteins; 2) Studying energy dissipation; 3) Analysis of the impact of translation inhibition with different sub-lethal doses of chloramphenicol on Escherichia coli; 4) Alteration of nutrient categories of carbon sources using pyruvate. All model perturbations results are corroborated by E. coli experimental results.

      Strengths:

      In this work, the author effectively uses modeling techniques typical of Physics to address complex problems in Biology, demonstrating the potential of interdisciplinary approaches to yield novel insights. The use of Escherichia coli as a model organism ensures that the assumptions and approximations are well-supported in existing literature. The model is convincingly constructed and aligns well with experimental data, lending credibility to the findings. In this version, the extension of results from bacteria to yeast and cancer is substantiated by a literature base, suggesting that these findings may have broad implications for understanding diverse biological systems.

      We appreciate the reviewer’s exceptionally positive comments. The manuscript has been significantly improved thanks to the reviewer’s insightful suggestions.

      Weaknesses:

      The author explores the generalization of their results from bacteria to cancer cells and yeast, adapting the metabolic network and coarse-grained model accordingly. In the previous version this generalization was not completely supported by references and data from the literature. This drawback, however, has been addressed in the current version, where the author discusses the generalization in much more detail and gives references supporting it.

      We appreciate the reviewer’s recognition of our revisions and the insightful suggestions provided in the previous round, which have greatly strengthened our manuscript.

      Reviewer #2 (Public Review):

      In this version of the manuscript, the author clarified many details and rewrote some sections. This substantially improved the readability of the paper. I also recognize that the author spent substantial effort in the Appendix to answer potential questions.

      We thank the reviewer for the positive comments and the suggestions to improve our manuscript.

      Unfortunately, I am not currently convinced by the theory proposed in this paper. In the next section, I will first recap the logic of the author and explain why I am not convinced. Although the theory fits many experimental results, other theories on overflow metabolism are also supported by experiments. Hence, I do not think based on experimental data we could rule in or rule out different theories.

      We thank the reviewer for both the critical and constructive comments. 

      Regarding the comments on the comparison between theoretical and experimental results, we would like to first emphasize that no prior theory has resolved the conflict arising from the proteome efficiencies measured in E. coli and eukaryotic cells. Specifically, prevalent explanations (Basan et al., Nature 528, 99–104 (2015); Chen and Nielsen, PNAS 116, 17592–17597 (2019)) hold that overflow metabolism results from proteome efficiency in fermentation consistently being higher than that in respiration. While it was observed in E. coli that proteome efficiency in fermentation exceeds that in respiration when cells were cultured in lactose at saturated concentrations (Basan et al., Nature 528, 99-104 (2015)), more recent findings (Shen et al., Nature Chemical Biology 20, 1123–1132 (2024)) show that the measured proteome efficiency in respiration is actually higher than in fermentation for many yeast and cancer cells, despite the presence of aerobic glycolytic fermentation flux. To the best of our knowledge, no prior theory has explained these contradictory experimental results. Notably, our theory resolves this conflict and quantitatively explains both sets of experimental observations (Basan et al., Nature 528, 99-104 (2015); Shen et al., Nature Chemical Biology 20, 1123–1132 (2024)) by incorporating cell heterogeneity and optimizing cell growth rate through protein allocation. 

      Furthermore, rather than merely fitting the experimental results, as explained in Appendices 6.2, 8.1-8.2 and summarized in Appendix-tables 1-3, nearly all model parameters important for our theoretical predictions for E. coli were derived from in vivo and in vitro biochemical data reported in the experimental literature. For comparisons between model predictions and experimental results for yeast and cancer cells (Shen et al., Nature Chemical Biology 20, 1123–1132 (2024)), we intentionally derived Eq. 6 to ensure an unbiased comparison.

      Finally, in response to the reviewer’s suggestion, we have revised the expressions in our manuscript to present the differences between our theory and previous theories in a more modest style. 

      Recap: To explain the origin of overflow metabolism, the author uses the following logic:

      (1) There is a substantial variability of single-cell growth rate

      (2) The flux (J_r^E) and (J_f^E) are coupled with growth rate by Eq. 3

      (3) Since growth rate varies from cell to cell, the fluxes (J_r^E) and (J_f^E) also vary

      (4) The variabilities of the above fluxes create a threshold-analog relation, and hence overflow metabolism.

      We thank the reviewer for the clear summary. We apologize for not explaining some points clearly enough in the previous version of our manuscript, which may have led to misunderstandings. We have now revised the relevant content in the manuscript to clarify our reasoning. Specifically, we have applied the following logic in our explanation:

      (a) The solution for the optimal growth strategy of a cell under a given nutrient condition is a binary choice between respiration and fermentation, driven by comparing their proteome efficiencies (ε<sub>r</sub> and ε<sub>f</sub> ).

      (b) Under nutrient-poor conditions, the nutrient quality (κ<sub>A</sub>) is low, resulting in the proteome efficiency of respiration being higher than that of fermentation (i.e., ε<sub>r</sub> > ε<sub>f</sub>), so the cell exclusively uses respiration.  

      (c) In rich media (with high κ<sub>A</sub>), the proteome efficiency of fermentation increases more rapidly and surpasses that of respiration (i.e., ε<sub>f</sub> > ε<sub>r</sub> ), hence the cell switches to fermentation.  

      (d) Heterogeneity is introduced: variability in the κ<sub>cat</sub> of catalytic enzymes from cell to cell. This leads to heterogeneity (variability) in ε<sub>r</sub> and ε<sub>f</sub> within a population of cells under the same nutrient condition.  

      (e) The critical value of nutrient quality for the switching point (, where ε<sub>r</sub>= ε<sub>f</sub> ) changes from a single point to a distribution due to cell heterogeneity. This results in a distribution of the critical growth rate λ<sub>C</sub> (defined as ) within the cell population.

      (f) The change in culturing conditions (with a highly diverse range of κ<sub>A</sub>) and heterogeneity in the critical growth rate λ<sub>C</sub> (a distribution of values) result in the threshold-analog relation of overflow metabolism at the cell population level.

      Steps (a)-(c) were applied to qualitatively explain the origin of overflow metabolism, while steps (d)-(f) were further used to quantitatively explain the threshold-analog relation observed in the data on overflow metabolism.
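      Steps (d)-(f) can be illustrated with a minimal sketch. It assumes that each cell switches digitally to fermentation once the growth rate exceeds its own critical growth rate λ<sub>C</sub>, and that λ<sub>C</sub> is Gaussian-distributed across the population. The mean and standard deviation below are hypothetical, and the code shows only the fraction of cells past the switching point, not the flux equations of the manuscript.

```python
import math

def fraction_fermenting(growth_rate, mean_critical, sd_critical):
    """Fraction of cells whose critical growth rate lies below `growth_rate`.

    This is the Gaussian cumulative distribution function, evaluated with
    the error function from the standard library.
    """
    z = (growth_rate - mean_critical) / (sd_critical * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

# Hypothetical population parameters for the critical growth rate (per hour).
MEAN_LC, SD_LC = 0.8, 0.1

rates = [0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
curve = [fraction_fermenting(r, MEAN_LC, SD_LC) for r in rates]

for r, f in zip(rates, curve):
    print(f"growth rate {r:.1f} /h -> fraction fermenting {f:.3f}")
```

      With no heterogeneity (a standard deviation approaching zero) the curve collapses to a step, which is the digital response; a finite spread smooths it into a graded rise above a threshold at the population level.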

      Regarding the reviewer’s recap, which seems to have involved some misunderstandings, we first emphasize that the major change in cell growth rate for the threshold-analog relation of overflow metabolism—particularly as it pertains to logic steps (1), (3) and (4)—is driven by the highly varied range of nutrient quality (κ<sub>A</sub>) in the culturing conditions, rather than by heterogeneity between cells. For the batch culture data, the nutrient type of the carbon source differs significantly (e.g., Fig.1 in Basan et al., Nature 528, 99-104 (2015), wild-type strains). In contrast, for the chemostat data, the concentration of the carbon source varies greatly due to the highly varied dilution rate (e.g., Table 7 in Holms, FEMS Microbiology Reviews 19, 85-116 (1996)). Both of these factors related to nutrient conditions are the major causes of the changes in cell growth rate in the threshold-analog relation. 

      Second, Eq. 3, as mentioned in logic step (2), represents a constraint between the fluxes (J_r^E and J_f^E) and the growth rate (λ) for a single nutrient condition (with a given value of κ<sub>A</sub> ideally) rather than for varied nutrient conditions. For a single cell in each nutrient condition, the optimal growth strategy is binary, between respiration and fermentation. 

      Finally, for the threshold-analog relation of overflow metabolism, the switch from respiration to fermentation is caused by the increased nutrient quality in the culturing conditions, rather than by cell heterogeneity as indicated in logic step (4). Upon nutrient upshifts, the proteome efficiency of fermentation surpasses that of respiration, causing the optimal growth strategy for the cell to switch from respiration to fermentation. The role of cell heterogeneity is to transform the growth rate-dependent fermentation flux in overflow metabolism from a digital response to a threshold-analog relation under varying nutrient conditions.

      My opinion:

      Logic steps (2) and (3) have caveats. The variability of growth rate has large components of cellular and external noise. Therefore, variability of growth rate is far from 100% correlated with variability of the fluxes (J_r^E) and (J_f^E) at the single-cell level. Single-cell growth rate is a complex, multivariate function of (J_r^E) and (J_f^E) but also of many other variables. My feeling is that the correlation could be too low to support the logic here.

      One example: ribosomal concentration is known to be an important factor of growth rate in bulk culture. However, the "growth law" from bulk culture cannot directly translate into the growth law at the single-cell level [Ref1,2]. This is likely because other factors (such as cell aging and multi-stability of cellular states) are involved.

      Therefore, I think using Eq.3 to invert the distribution of growth rate into the distribution of (J_r^E) and (J_f^E) is inapplicable, due to the potentially low correlation at the single-cell level. It may show partial correlations, but these may not be strong enough to support the claim and create fermentation at the macroscopic scale.

      Overall, if we track the logic flow, this theory implies that overflow metabolism originates from cell-to-cell variability of the k_cat of catalytic enzymes. That is, the authors propose that overflow metabolism happens macroscopically as if it were some "aberrant activation of the fermentation pathway" at the single-cell level, due to some unknown partial correlation from growth rate variability.

      We thank the reviewer for raising these questions and for the insights. We apologize for any lack of clarity in the previous version of our manuscript that may have caused misunderstandings. We have revised the manuscript to address all points, and below are our responses to the questions, some of which seem to involve misunderstandings. 

      First, in our theory, the qualitative behavior of overflow metabolism, where cells use respiration under nutrient-poor conditions (low growth rate) and fermentation in rich media (high growth rate), does not arise from variability between cells, as the reviewer seems to have interpreted. Instead, it originates from growth optimization through optimal protein allocation under significantly different nutrient conditions. Specifically, the proteome efficiency of fermentation is lower than that of respiration (i.e. ε<sub>f</sub> < ε<sub>r</sub>) under nutrient-poor conditions, making respiration the optimal strategy in this case. However, in rich media, the proteome efficiency of fermentation surpasses that of respiration (i.e. ε<sub>f</sub> > ε<sub>r</sub>), leading the cell to switch to fermentation for growth optimization. To implement the optimal strategy, as clarified in the revised manuscript and discussed in Appendix 2.4, a cell should sense and compare the proteome efficiencies between respiration and fermentation, choosing the pathway with the higher efficiency, rather than sensing the growth rate, which can fluctuate due to stochasticity. Regarding the role of cell heterogeneity in overflow metabolism, as discussed in our previous response, it is twofold: first, it quantitatively illustrates the threshold-analog response of growth rate-dependent fermentation flux, which would otherwise be a digital response without heterogeneity during growth optimization; second, it enables us to resolve the paradox in proteome efficiencies observed in E. coli and eukaryotic cells, as raised by Shen et al. (Shen et al., Nature Chemical Biology 20, 1123–1132 (2024)).

      Second, regarding logic step (2) in the recap, the reviewer thought we had coupled the growth rate (λ) with the respiration and fermentation fluxes (J_r^E and J_f^E) through Eq. 3, and used Eq. 3 to invert the distribution of growth rate into the distribution of respiration and fermentation fluxes. We need to clarify that Eq. 3 represents the constraint between the fluxes and the growth rate under a single nutrient condition, rather than describing the relation between growth rate and the fluxes (J_r^E and J_f^E) under varied nutrient conditions. In a given nutrient condition (with a fixed value of κ<sub>A</sub>), without considering optimal protein allocation, the cell growth rate varies with the fluxes according to Eq.3 by adjusting the proteome allocation between respiration and fermentation (ϕ<sub>r</sub> and ϕ<sub>f</sub>). However, once growth optimization is applied, the optimal protein allocation strategy for a cell is limited to either pure respiration (with ϕ<sub>f</sub> = 0 and J_f^E = 0) or pure fermentation (with ϕ<sub>r</sub> = 0 and J_r^E = 0), depending on the nutrient condition (or the value of κ<sub>A</sub>). Furthermore, under varying nutrient conditions (with different values of κ<sub>A</sub>), both proteome efficiencies of respiration and fermentation (ε<sub>r</sub> and ε<sub>f</sub>) change with nutrient quality κ<sub>A</sub> (see Eq. 4). Thus, Eq. 3 does not describe the relation between growth rate (λ) and the fluxes (J_r^E and J_f^E) under nutrient variations.

      Thirdly, regarding reviewer’s concerns on logic step (3) in the recap, as well as the example where ribosome concentration does not correlate well with cell growth rate at the single-cell level, we fully agree with reviewer that, due to factors such as stochasticity and cell cycle status, the growth rate fluctuates constantly for each cell. Consequently, it would not be fully correlated with cell parameters such as ribosome concentration or respiration/fermentation flux. We apologize for our oversight in not discussing suboptimal growth conditions in the previous version of the manuscript. In response, we have added a paragraph to the discussion section and a new Appendix 2.4, titled “Dependence of the model on optimization principles,” to address these issues in detail. Specifically, recent experimental studies (Dai et al., Nature microbiology 2, 16231 (2017); Li et al., Nature microbiology 3, 939–947 (2018)) show that the inactive portion of ribosomes (i.e., ribosomes not bound to mRNAs) can vary under different culturing conditions. The reviewer also pointed out that ribosome concentration does not correlate well with cell growth rate at single-cell level. In this regard, we have cited Pavlou et al. (Pavlou et al., Nature Communications 16, 285 (2025)) instead of the references provided by the reviewer (Ref1 and Ref2), with our rationale outlined in the final section of the author response. These findings (Dai et al, (2017); Li et al., (2018); Pavlou et al., (2025)) suggest that ribosome allocation may be suboptimal under many culturing conditions, likely as cells prepare for potential environmental changes (Li et al., Nature microbiology 3, 939–947 (2018)). However, since our model's predictions regarding the binary choice between respiration and fermentation are based solely on comparing proteome efficiency between these two pathways, the optimal growth principle in our model can be relaxed. 
Specifically, efficient protein allocation is required only for enzymes rather than ribosomes, allowing our model to remain applicable under suboptimal growth conditions. Furthermore, protein allocation via the ribosome occurs at the single-cell level rather than at the population level. The strong linear correlation between ribosomal concentration and growth rate at the population level under nutrient variations suggests that each cell optimizes its protein allocation individually. Therefore, the principle of growth optimization still applies to individual cells, although factors like stochasticity, preparation for nutrient variation, and differences in cell cycle stages may complicate this relationship, resulting in only a rough linear correlation between ribosome concentration and growth rate at the single-cell level (with R<sup>2</sup> = 0.64 reported in Pavlou et al., (2025)).

      Lastly, regarding the reviewer's concerns about the heterogeneity of fermentation and respiration at the macroscopic scale, we first clarify in the second paragraph of this response that the primary driving force for cells to switch from respiration to fermentation in the context of overflow metabolism is the increased nutrient quality under varying culturing conditions, which causes the proteome efficiency of fermentation to surpass that of respiration. Under nutrient-poor conditions, our model predicts that all cells use respiration, and therefore no heterogeneity for the phenotype of respiration and fermentation arises in these conditions. However, in a richer medium, particularly one that does not provide optimal conditions but allows for an intermediate growth rate, our model predicts that some cells opt for fermentation while others continue with respiration due to cell heterogeneity (with ε<sub>f</sub> > ε<sub>r</sub> for some cells engaging in fermentation and ε<sub>r</sub> > ε<sub>f</sub> for the other cells engaging in respiration within the same medium). Both of these predictions have been validated in isogenic single-cell experiments with E. coli (Nikolic et al., BMC Microbiology 13, 258 (2013)) and S. cerevisiae (Bagamery et al., Current Biology 30, 4563–4578 (2020)). The single-cell experiments by Nikolic et al. with E. coli in a rich medium of intermediate growth rate clearly show a bimodal distribution in the expression of genes related to overflow metabolism (see Fig. 5 in Nikolic et al., BMC Microbiology 13, 258 (2013)), where one subpopulation suggests purely fermentation, while the other suggests purely respiration. In contrast, in a medium with lower nutrient concentration (and consequently lower nutrient quality), only the respiring population exists (see Fig. 5 in Nikolic et al., BMC Microbiology 13, 258 (2013)). These experimental results from E. coli (Nikolic et al., BMC Microbiology 13, 258 (2013)) are fully consistent with our model predictions. Similarly, the single-cell experiments with S. cerevisiae by Bagamery et al. clearly identified two subpopulations of cells with respect to fermentation and respiration in a rich medium, which also align well with our model predictions regarding heterogeneity in fermentation and respiration within a cell population in the same medium.
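      As a toy illustration of the argument above (a binary per-cell choice that becomes a graded population response once cells differ), the following sketch computes the fraction of cells for which ε<sub>f</sub> > ε<sub>r</sub>. It assumes, purely for illustration, that the two efficiencies are independent and normally distributed across cells; the function name, the Gaussian assumption and all numerical values are hypothetical and are not taken from the manuscript or from Eq. 4.

```python
import math


def fermenting_fraction(mean_eps_f, mean_eps_r, sd_eps_f, sd_eps_r):
    """Fraction of cells with eps_f > eps_r.

    Hypothetical assumption: eps_f and eps_r are independent Gaussians
    across the population, so their difference is Gaussian as well.
    """
    sd_diff = math.sqrt(sd_eps_f ** 2 + sd_eps_r ** 2)
    if sd_diff == 0.0:
        # No heterogeneity: every cell makes the same (digital) choice.
        return 1.0 if mean_eps_f > mean_eps_r else 0.0
    z = (mean_eps_f - mean_eps_r) / sd_diff
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative sweep: mean fermentation efficiency rises with nutrient quality
# while mean respiration efficiency is held at 1.0 (arbitrary units).
for mean_f in (0.6, 0.9, 1.0, 1.1, 1.4):
    frac = fermenting_fraction(mean_f, 1.0, 0.1, 0.1)
    print(f"mean eps_f = {mean_f:.1f} -> fermenting fraction = {frac:.3f}")
```

      With zero spread the function returns only 0 or 1 (the digital response); with non-zero spread the fermenting fraction rises smoothly through intermediate values, which is the qualitative behavior the response attributes to heterogeneity.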

      Compared with other theories, this theory does not involve any regulatory mechanism and can be regarded as a "neutral theory". I am looking forward to seeing single-cell experiments in the future that provide evidence for this theory.

      We thank the reviewer for raising these questions and for the valuable insights. Regarding the regulatory mechanism, we have now added a paragraph in the discussion section of our manuscript and Appendix 2.4 to address this point. Specifically, our model predicts that a cell can implement the optimal strategy by directly sensing and comparing the proteome efficiencies of respiration and fermentation, choosing the pathway with the higher efficiency. At the gene regulatory level, a growing body of evidence suggests that the cAMP-CRP system plays an important role in sensing and executing the optimal strategy between respiration and fermentation (Basan et al., Nature 528, 99-104 (2015); Towbin et al., Nature Communications 8, 14123 (2017); Valgepea et al., BMC Systems Biology 4, 166 (2010); Wehrens et al., Cell Reports 42, 113284 (2023)). However, it has also been suggested that the cAMP-CRP system alone is insufficient, and additional regulators may need to be identified to fully elucidate this mechanism (Basan et al., Nature 528, 99-104 (2015); Valgepea et al., BMC Systems Biology 4, 166 (2010)). 

      Regarding the single-cell experiments that provide evidence for this theory, we have shown in the previous paragraphs of this response that the heterogeneity between respiration and fermentation, as predicted by our model for isogenic cells within the same culturing condition, has been fully validated by single-cell experiments with E. coli (Fig. 5 from Nikolic et al., BMC Microbiology 13, 258 (2013)) and S. cerevisiae (Fig. 1 and the graphical abstract from Bagamery et al., Current Biology 30, 4563–4578 (2020)). We have now revised the discussion section of our manuscript to make this point clearer.

      [Ref1] https://www.biorxiv.org/content/10.1101/2024.04.19.590370v2

      [Ref2] https://www.biorxiv.org/content/10.1101/2024.10.08.617237v2

      We thank the reviewer for providing insightful references. Regarding the two specific references, Ref1 directly addresses the deviation in the linear relationship between growth rate and ribosome concentration (“growth law”) at the single-cell level. However, since the authors of Ref1 determined the rRNA abundance in each cell by aligning sequencing reads to the genome, this method inevitably introduces a substantial amount of measurement noise. As a result, we chose not to cite or discuss this preprint in our manuscript. Ref2 appears to pertain to a different topic, which we suspect may be a copy/paste error. Based on the reviewer’s description and the references in Ref1, we believe the correct Ref2 should be Pavlou et al., Nature Communications 16, 285 (2025) (with the biorxiv preprint link: https://www.biorxiv.org/content/10.1101/2024.04.26.591328v1). In this reference, it is stated that the relationship between ribosome concentration and growth rate only roughly aligns with the “growth law” at the single-cell level (with R<sup>2</sup> = 0.64), exhibiting a certain degree of deviation. We have now cited and incorporated the findings of Pavlou et al. (Pavlou et al., Nature Communications 16, 285 (2025)) in both the discussion section of our manuscript and Appendix 2.4. Overall, we agree with Pavlou et al.’s experimental results, which suggest that ribosome concentration does not exhibit a strong linear correlation with cell growth rate at the single-cell level. However, we remain somewhat uncertain about the extent of this deviation, as Pavlou et al.’s experimental setup involved alternating nutrients between acetate and glucose, and the lapse of five generations may not have been long enough for the growth to be considered balanced. 
Furthermore, as observed in Supplementary Movie 1 of Pavlou et al., some of the experimental cells appeared to experience growth limitations due to squeezing pressure from the pipe wall of the mother machine, which could further increase the deviation from the “growth law” at the single-cell level.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I have no specific comments for the authors related to this last version of the paper. I believe the authors have properly improved the previous version of the manuscript.

      Response: We thank the reviewer for the highly positive comments and for recognizing the improvements made in the revised version of our manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their thorough re-evaluation of our revised manuscript. Addressing final issues they raised has improved the manuscript further. We sincerely appreciate the detailed explanations that the reviewers provided in the "recommendations for authors" section. This comprehensive feedback helped us identify the sources of ambiguity within the analysis descriptions and in the discussion where we interpreted the results. Below, you will find our responses to the specific comments and recommendations.

      Reviewer #1 (Recommendations):

      (1) I find that the manuscript has improved significantly from the last version, especially in terms of making explicit the assumptions of this work and competing models. I think the response letter makes a good case that the existence of other research makes it more likely that oscillators are at play in the study at hand (though the authors might consider incorporating this argumentation a bit more into the paper too). Furthermore, the authors' response that the harmonic analysis is valid even when including x=y because standard correlation analysis were not significant is a helpful response. The key issue that remains for me is that I have confusions about the additional analyses prompted by my review to a point where I find it hard to evaluate how and whether they demonstrate entrainment or not. 

      First, I don't fully understand Figure 2B and how it confirms the Arnold tongue slice prediction. In the response letter the authors write: "...indicating that accuracy increased towards the preferred rate at fast rates and decreased as the stimulus rate diverged from the preferred rate at slow rates". The figure shows that, but also more. The green line (IOI < preferred rate) indeed increases toward the preferred rate (which is IOI = 0 on the x-axis; as I get it), but then it continues to go up in accuracy even after the preferred rate. And for the blue line, performance also continues to go up beyond preferred rate. Wouldn't the Arnold tongue and thus entrainment prediction be that accuracy goes down again after the preferred rate has passed? That is to say, shouldn't the pattern look like this (https://cdn.elifesciences.org/public-review-media/90735/v3/GPlt38F.png) which with linear regression should turn to a line with a slope of 0?

      This was my confusion at first, but then I thought longer about how e.g. the blue line is predicted only using trials with IOI larger than the preferred rate. If that is so, then shouldn't the plot look like this? (https://cdn.elifesciences.org/public-review-media/90735/v3/SmU6X73.png). But if those are the only data and the rest of the regression line is extrapolation, why does the regression error vary in the extrapolated region? It would be helpful if the authors could clarify this plot a bit better. Ideally, they might want to include the average datapoints so it becomes easier to understand what is being fitted. As a side note, colours blue/green have a different meaning in 2B than 2D and E, which might be confusing. 

      We thank the reviewer for their recommendation to clarify the additional analyses we ran in the previous revision to assess whether accuracy systematically increased toward the preferred rate estimate. We realized that the description of the regression analysis led to misunderstandings. In particular, we think that the reviewer interpreted (1) our analysis as linear regression (based on the request to plot raw data rather than fits), whereas, in fact, we used logistic regression, and (2) the regression lines in Figure 2B as being plotted against raw IOI values, whereas, in fact, they were plotted against z-scored IOI values (from trials where stimulus IOI was faster than an individual’s preferred rate, IOI < preferred rate, in green; and from trials where stimulus IOI was slower than an individual’s preferred rate, IOI > preferred rate, in blue), as the x-axis label indicated. We are happy to have the opportunity to clarify these points in the manuscript. We have also revised Figure 2B, which was admittedly somewhat opaque, to show the “Arnold tongue slice” more clearly.

      The logic for using (1) logistic regression with (2) z-scored IOI values as the predictor is as follows. Since the response variable in this analysis, accuracy, was binary (correct response = 1, incorrect response = 0), we used logistic regression. The goal was to quantify an across-subjects effect (an increase in accuracy toward the preferred rate), so we aggregated datasets across all participants into the model. The crucial point here is that each participant had a different preferred rate estimate. Let’s say participant A had an estimate at IOI = 400 ms, and participant B had an estimate at IOI = 600 ms. The trials where IOI was faster than participant A’s estimate would then range from 200 ms to 398 ms, and those that were slower would range from 402 ms to 998 ms. For participant B, the situation would be different: trials where IOI was faster than their estimate would range from 200 ms to 598 ms, and slower trials would range from 602 ms to 998 ms. For a fair analysis that assesses the accuracy increase regardless of a participant’s actual preferred rate, we normalized these IOI values (faster or slower than the preferred rate). Z-score normalization is a common method of normalizing predictors in regression models, and was especially important here since we were aggregating predictors across participants and the predictor ranges varied across participants. Z-scoring ensured that the scale of the sample (which differs between participants A and B in this example) was comparable across the datasets. This is also important for the interpretation of Figure 2B. Since z-scoring involves mean subtraction, the zero point on the z-scaled IOI axis corresponds to the mean of the sample prior to normalization (for the faster-than-preferred trials, 299 ms for participant A and 399 ms for participant B) and not to the preferred rate estimate. We have now revised Figure 2B in a way that we think makes this much clearer.
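      The participant A / participant B example can be made concrete with a short sketch. It is written in Python rather than the MATLAB used for the actual analyses, and it assumes an IOI grid from 200 ms to 998 ms in 2 ms steps (inferred from the ranges quoted above, not stated explicitly); the function names are hypothetical.

```python
import statistics


def zscore(values):
    """Z-score a list using the sample standard deviation (n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def split_and_zscore(iois, preferred_ioi):
    """Split IOIs into faster/slower than the preferred rate, z-score each part.

    Returns the two z-scored lists and the pre-normalization means, which
    are the raw IOIs that land at zero on each z-scored axis.
    """
    faster = [x for x in iois if x < preferred_ioi]
    slower = [x for x in iois if x > preferred_ioi]
    return zscore(faster), zscore(slower), statistics.fmean(faster), statistics.fmean(slower)


# Assumed stimulus grid: 200, 202, ..., 998 ms.
iois = list(range(200, 1000, 2))

for label, preferred in (("A", 400), ("B", 600)):
    z_fast, z_slow, mean_fast, mean_slow = split_and_zscore(iois, preferred)
    print(f"participant {label}: z = 0 on the faster axis is {mean_fast:.0f} ms, "
          f"on the slower axis {mean_slow:.0f} ms")
```

      Under these assumptions the zero point of the faster-than-preferred axis is 299 ms for participant A and 399 ms for participant B, so zero on the z-scored axis marks the middle of each participant's range, not the preferred rate itself.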

      The manuscript text includes clarification that the analyses included logistic regression and stimulus IOI was z-scored: 

      “In addition to estimating the preferred rate as stimulus rates with peak performance, we investigated whether accuracy increased as a function of detuning, namely, the difference between stimulus rate and preferred rate, as predicted by the entrainment models (Large, 1994; McAuley, 1995; Jones, 2018). We tested this prediction by assessing the slopes of mixed-effects logistic regression models, where accuracy was regressed on the IOI condition, separately for stimulus rates that were faster or slower than an individual’s preferred rate estimate. To do so, we first z-scored IOIs that were faster and slower than the participant’s preferred rate estimates, separately to render IOI scales comparable across participants.” (p. 7)

      While thinking through the reviewer’s comment, we realized we could improve this analysis by fitting mixed effects models separately to sessions’ data. In these models, fixed effects were z-scored IOI and ‘detuning direction’ (i.e., whether IOI was faster or slower than the participant’s preferred rate estimate). To control for variability across participants in the predicted interaction between z-scored IOI and direction, this interaction was added as a random effect. 

      “Ideally, they might want to include the average datapoints so it becomes easier to understand what is being fitted.”

      Although we agree with the reviewer that including average datapoints in a figure in addition to model predictions usually illustrates what is being fitted better than the fits alone, this does not work well for logistic regression, since the dependent variable is binary. To better illustrate single-participant data, we instead fitted logistic models to each participant’s single-session datasets, separately for trials where IOI was faster than the preferred rate and for trials where it was slower, with z-scored IOI predicting accuracy in each case. From these single-participant models, we obtained slope values, which we refer to as ‘relative detuning slopes’, for each condition and session type. This analysis allowed us to illustrate the effect of relative detuning on accuracy for each participant. Figure 2B now shows each participant’s best-fit lines from each detuning direction condition and session.

      Since we now had relative detuning slopes for each individual (which we did not before), we took advantage of this to assess the relationship between oscillator flexibility and the oscillator’s behavior in different detuning situations (how strongly leaving the preferred rate hurt accuracy, as a proxy for the width of the Arnold tongue slice). Theoretically, flexible oscillators should be able to synchronize to a wide range of rates, not suffering in conditions where detuning is large (Pikovsky et al., 2003). Conversely, synchronization of inflexible oscillators should depend strongly on detuning. To test whether our flexibility measure predicted this dependence on detuning, which is a different angle on oscillator flexibility, we first averaged each participant’s detuning slopes across detuning directions (after sign-flipping one of them). Then, we assessed the correlation between the average detuning slopes and flexibility estimates, separately from conditions where |-𝚫IOI| or |+𝚫IOI| predicted accuracy. The results revealed significant negative correlations (Fig. 2F), suggesting that performance of individuals with less flexible oscillators suffered more as detuning increased. Note that flexibility estimates quantified how much accuracy decreased as a function of trial-to-trial changes in stimulus rate (±𝚫IOI). Thus, these results show that oscillators that were robust to changes in stimulus rate were also less dependent on detuning and so were able to synchronize across a wide range of stimulus rates. We are excited to be able to provide this extra validation of predictions made by entrainment models.

      To revise the manuscript with the updated analysis on detuning:

      • We added the descriptions of the analyses to the Experiment 1 Methods section.

      Calculation of detuning slopes and their averaging procedure are in Preferred rate estimates:

      “In addition to estimating the preferred rate as stimulus rates with peak performance, we investigated whether accuracy increased as a function of detuning, namely, the difference between stimulus rate and preferred rate, as predicted by the entrainment models (Large, 1994; McAuley, 1995; Jones, 2018). We tested this prediction by assessing the slopes of mixed-effects logistic regression models, where accuracy was regressed on the IOI condition, separately for stimulus rates that were faster or slower than an individual’s preferred rate estimate. To do so, we first z-scored IOIs that were faster and slower than the participant’s preferred rate estimates, separately to render IOI scales comparable across participants. The detuning direction (i.e., whether stimulus IOI was faster or slower than the preferred rate estimate) was coded categorically. Accuracy (binary) was predicted by these variables (z-scored IOI, detuning direction), and their interaction. The model was fitted separately to datasets from random-order and linear-order sessions, using the fitglme function in MATLAB. Fixed effects were z-scored IOI and detuning direction and random effect was their interaction. We expected a systematic increase in performance toward the preferred rate, which would result in a significant interaction between stimulus rate and detuning direction. To decompose the significant interaction and to visualize the effects of detuning, we fitted separate models to each participant’s single-session datasets, and obtained slopes from each direction condition, hereafter denoted as the ‘relative-detuning slope’. We treated relative-detuning slope as an index of the magnitude of relative detuning effects on accuracy. We then evaluated these models, using the glmval function in MATLAB to obtain predicted accuracy values for each participant and session.
To visualize the relative-detuning curves, we averaged the predicted accuracies across participants within each session, separately for each direction condition (faster or slower than the preferred rate). To obtain a single value of relative-detuning magnitude for each participant, we averaged relative detuning slopes across direction conditions. However, since slopes from IOI > preferred rate conditions quantified an accuracy decrease as a function of detuning, we sign-flipped these slopes before averaging. The resulting average relative detuning slopes, obtained from each participant’s single-session datasets, quantified how much the accuracy increase towards preferred rate was dependent on, in other words, sensitive to, relative detuning.” (p. 7-8)
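      The single-participant step described in the quoted Methods text (fit a logistic model per direction condition, take the slope, sign-flip the slower-than-preferred slope, then average) can be sketched as follows. This is a minimal pure-Python stand-in, not the authors' MATLAB code: it fits an ordinary single-predictor logistic regression by Newton-Raphson rather than calling fitglme, and the function names and toy data are hypothetical.

```python
import math


def fit_logistic(x, y, n_iter=50):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p             # gradient w.r.t. intercept
            g1 += (yi - p) * xi      # gradient w.r.t. slope
            h00 += w                 # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break                    # singular information matrix; stop updating
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def average_relative_detuning_slope(slope_faster, slope_slower):
    """Average the two direction slopes after sign-flipping the slower one.

    For IOI > preferred rate the slope is expected to be negative (accuracy
    falls as detuning grows), so it is negated before averaging.
    """
    return (slope_faster + (-slope_slower)) / 2.0


# Toy single-participant data (hypothetical): z-scored IOI and binary accuracy.
z_ioi = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
acc_faster = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]   # accuracy rises toward preferred rate
acc_slower = list(reversed(acc_faster))        # accuracy falls away from it

_, slope_faster = fit_logistic(z_ioi, acc_faster)
_, slope_slower = fit_logistic(z_ioi, acc_slower)
print(average_relative_detuning_slope(slope_faster, slope_slower))
```

      A larger average value in this sketch corresponds to a steeper dependence of accuracy on relative detuning for that participant.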

      • We added the information on the correlation analyses between average detuning slopes in Flexibility estimates.

      “We further tested the relationship between the flexibility estimates (𝛽 from models where |-𝚫IOI| or |+𝚫IOI| predicted accuracy) and average detuning slopes (see Preferred rate estimates) from random-order sessions. We predicted that flexible oscillators (larger 𝛽) would be less severely affected by detuning, and thus have smaller detuning slopes. Conversely, inflexible oscillators (smaller 𝛽) should have more difficulty in adapting to a large range of stimulus rates, and their adaptive abilities should be constrained around the preferred rate, as indexed by steeper relative detuning slopes.” (p. 8)

      • We provided the results in Experiment 1 Results section.

      “Logistic models assessing a systematic increase in accuracy toward the preferred rate estimate in each session type revealed significant main effects of IOI (linear-order session: 𝛽 = 0.264, p < .001; random-order session: 𝛽 = 0.175, p < .001), and significant interactions between IOI and direction (linear-order session: 𝛽 = -0.444, p < .001; random-order session: 𝛽 = -0.364, p < .001), indicating that accuracy increased as fast rates slowed toward the preferred rate (positive slopes) and decreased again as slow rates slowed further past the preferred rate (negative slopes), regardless of the session type. Fig. 2B illustrates the preferred rate estimation method for an example participant’s dataset and shows the predicted accuracy values from models fitted to each participant’s single-session datasets. Note that the main effect and interaction were obtained from mixed effects models that included aggregated datasets from all participants, whereas the slopes quantifying the accuracy increase as a function of detuning (i.e., relative detuning slopes) were from models fitted to single-participant datasets.” (p. 9-10)

      “We tested the relationship between the flexibility estimates and single-participant relative detuning slopes from random-order sessions (Fig. 2B). The results revealed negative correlations between the relative detuning slopes and flexibility estimates, both with 𝛽 (r(23) = -0.529, p = 0.007) from models where |-𝚫IOI| predicted accuracy (adapting to speeding-up trials), and 𝛽 (r(23) = -0.580, p = 0.002) from models where |+𝚫IOI| predicted accuracy (adapting to slowing-down trials). That is, the performance of individuals with less flexible oscillators suffered more as detuning increased. These results are shown in Fig. 2F.” (p. 10)
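      For readers less familiar with the r(df) notation, the sketch below computes a Pearson correlation from paired values and the associated degrees of freedom, df = n - 2. If r(23) follows this convention, it would correspond to 25 participants; that reading is an inference from the notation, not a number stated in this response. The function names and toy values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_df(n):
    """Degrees of freedom reported as r(df) for a Pearson correlation."""
    return n - 2


# Toy values (hypothetical): flexibility estimates and average detuning slopes.
flexibility = [-0.9, -0.6, -0.4, -0.2, -0.1]
detuning_slope = [1.1, 0.9, 0.6, 0.5, 0.2]
print(pearson_r(flexibility, detuning_slope), correlation_df(len(flexibility)))
```

      A negative coefficient here means that participants with larger (more flexible) 𝛽 values have shallower detuning slopes, which is the direction of the relationship reported in the quoted Results text.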

      • We modified Figure 2. In Figure 2B, there are now separate subfigures with the z-scored IOI faster (left) or slower (right) than the preferred rate predicting accuracy. We illustrated the correlations between average relative detuning slopes and flexibility estimates in Figure 2F. 

      Author response image 1.

      Main findings of Experiment 1. A Left: Each circle represents a single participant’s preferred rate estimate from the random-order session (x axis) and linear-order session (y axis). The histograms along the top and right of the plot show the distributions of estimates for each session type. The dotted and dashed lines respectively represent 1:2 and 2:1 ratio between the axes, and the solid line represents one-to-one correspondence. Right: permutation test results. The distribution of summed residuals (distance of data points to the closest y=x, y=2*x and y=x/2 lines) of shuffled data over 1000 iterations, and the summed residual from original data (dashed line) that fell below .008 of the permutation distribution. B Top: Illustration of the preferred rate estimation method from an example participant’s linear-order session dataset. Estimates were the stimulus rates (IOI) where smoothed accuracy (orange line) was maximum (arrow). The dotted lines originating from the IOI axis delineate the stimulus rates that were faster (left, IOI < preferred rate) and slower (right, IOI > preferred rate) than the preferred rate estimate and expand those separate axes, the values of which were Z-scored for the relative-detuning analysis. Bottom: Predicted accuracy, calculated from single-participant models where accuracy in random-order (purple) and linear-order (orange) sessions was predicted by z-scored IOIs that were faster than a participant’s preferred rate estimate (left), and by those that were slower (right). Thin lines show predicted accuracy from single-participant models, solid lines show the averages across participants and the shaded areas represent standard error of the mean. Predicted accuracy is maximal at the preferred rate and decreases as a function of detuning. C Average accuracy from random-order (left, purple) and linear-order (right, orange) sessions. Each circle represents a participant’s average accuracy. D Flexibility estimates. 
Each circle represents an individual’s slope (𝛽) obtained from logistic models, fitted separately to conditions where |-𝚫IOI| (left, green) or |+𝚫IOI| (right, blue) predicted accuracy, with greater values (arrow’s direction) indicating better oscillator flexibility. The means of the distributions of 𝛽 from both conditions were smaller than zero (dashed line), indicating a negative effect of between-trial absolute rate change on accuracy. E Participants’ average bias from |-𝚫IOI| (green), and |+𝚫IOI| (blue) conditions in random-order (left) and linear-order (right) sessions. Negative bias indicates underestimation of the comparison intervals, positive bias indicates the opposite. Box plots in C-E show median (black vertical line), 25th and 75th percentiles (box edges) and extreme datapoints (whiskers). In C and E, empty circles show outlier values that remained after data cleaning procedures. F Correlations between participants’ average relative detuning slopes, indexing the steepness of the increase in accuracy towards the preferred rate estimate (from panel B), and flexibility estimates from |-𝚫IOI| (top, green), and |+𝚫IOI| (bottom, blue) conditions (from panel D). Solid black lines represent the best-fit line, dashed lines represent 95% confidence intervals.

      • We discussed the results in the General Discussion and emphasized that only entrainment models, and not timekeeper models, predict a relationship between detuning and accuracy that is amplified by the oscillator’s inflexibility: “we observed systematic increases in task accuracy (Experiment 1) toward the best-performance rates (i.e., preferred rate estimates), with the steepness of this increase being closely related to the effects of rate change (i.e., oscillator flexibility). Two interdependent properties of an underlying system together modulating an individual’s timing responses show strong support for the entrainment approach” (p. 24).

      “As a side note, colours blue/green have a different meaning in 2B than 2D and E, which might be confusing.” 

      Upon the reviewer’s recommendation, we changed the color scale across Figure 2, such that colors refer to the same set of conditions across all panels. 

      (2) Second, I don't understand the additional harmonic relationship analyses in the appendix, and I suspect other readers will not either. As with the previous point, it is not my view that the analyses are faulty or inadequate, it is rather that the lack of clarity makes it challenging to evaluate whether they support an entrainment model or not. 

      We decided to remove the analysis that was based on a circular approach, and we have clarified the analysis that was based on a modular approach by giving example cases: 

      “We first calculated how much the slower estimate (larger IOI value) diverts, proportionally from the faster estimate (smaller IOI value) or its multiples (i.e., harmonics) by normalizing the estimates from both sessions by the faster estimate. The outcome measure was the modulus of the slower, with respect to the faster estimate, divided by the faster estimate, described as mod(max(X), min(X))/min(X) where X = [session1_estimate session2_estimate]. An example case would be a preferred rate estimate of IOI = 603 ms from the linear-order session and an estimate of IOI = 295 ms from the random-order session. In this case, the slower estimate (603 ms) diverts from the multiple of the faster estimate (295*2 = 590 ms) by 13 ms, a proportional deviation of 4% of the faster estimate (295 ms). The outcome measure in this example is calculated as mod(603,295)/295 = 0.04.” (Supplementary Information, p. 2)
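
      The outcome measure quoted above can be reproduced with a short sketch. The function name and argument names are illustrative assumptions, not taken from the manuscript; only the formula mod(max(X), min(X))/min(X) and the example values (603 ms and 295 ms) come from the text.

```python
def harmonic_deviation(estimate_a_ms, estimate_b_ms):
    """Proportional deviation of the slower estimate (larger IOI) from the
    closest multiple of the faster estimate that does not exceed it."""
    faster = min(estimate_a_ms, estimate_b_ms)
    slower = max(estimate_a_ms, estimate_b_ms)
    # The remainder after removing whole multiples of the faster estimate,
    # expressed as a proportion of the faster estimate.
    return (slower % faster) / faster


# Example from the text: 603 ms (linear-order) and 295 ms (random-order).
deviation = harmonic_deviation(603, 295)
print(round(deviation, 2))  # 13 / 295 rounds to 0.04
```

      Because the modulus is taken relative to the multiple at or below the slower estimate, a slower estimate that falls just short of a multiple (for example 585 ms against 295 ms) yields a value close to 1 rather than close to 0 under this formula.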

      Crucially, the ability of oscillators to respond to harmonically-related stimulus rates is a main distinction between entrainment and interval (timekeeper) models. In the current study, we found that each participant’s best-performance rates, the preferred rate estimates, had harmonic relationships. The additional analyses further showed that these harmonic relationships were not due to chance. This finding speaks against the interval (timekeeper) approaches and is maximally compatible with the entrainment framework. 

      Here are a number of questions I would like to list to sketch my confusion: 

      • The authors write: "We first normalized each participant's estimates by rescaling the slower estimate with respect to the faster one and converting the values to radians". Does slower estimate mean: "task accuracy in those trials in which IOI was slower than a participant's preferred frequency"? 

      Preferred rate estimates were stimulus rates (IOI) with best performance, as described in Experiment 1 Methods section. 

      “We conceptualized individuals' preferred rates as the stimulus rates where duration-discrimination accuracy was highest. To estimate preferred rate on an individual basis, we smoothed response accuracy across the stimulus-rate (IOI) dimension for each session type, using the smoothdata function in Matlab. Estimates of preferred rate were taken as the smoothed IOI that yielded maximum accuracy” (p. 7). 

      The estimation method and the resulting estimate for an example participant were provided in Figure 2B. The updated figure in the current revision shows this illustration only for the linear-order session. 

      “Estimates were the stimulus rates (IOI) where smoothed accuracy (orange line) was maximum (arrow)” (Figure caption, p. 9).
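
      The estimation procedure described above (smooth accuracy across IOI, then take the IOI at the maximum) can be sketched as follows. The manuscript used Matlab's smoothdata; the centered moving mean, the window size, and the data below are assumptions chosen only for illustration.

```python
def moving_average(values, window=5):
    """Centered moving mean; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def preferred_rate(iois_ms, accuracy, window=5):
    """Return the IOI at which smoothed accuracy is highest."""
    smoothed = moving_average(accuracy, window)
    best_index = max(range(len(smoothed)), key=smoothed.__getitem__)
    return iois_ms[best_index]


# Hypothetical single-participant data: accuracy per stimulus rate (IOI in ms).
iois = [200, 300, 400, 500, 600, 700, 800]
acc = [0.55, 0.60, 0.72, 0.80, 0.74, 0.65, 0.58]
print(preferred_rate(iois, acc, window=3))  # 500
```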

      • "We reasoned that values with integer-ratio relationships should correspond to the same phase on a unit circle". What is values here; IOI, or accuracy values for certain IOIs? And why should this correspond to the same phase? 

      We removed the analysis on integer-ratio relationships that was based on a circular approach that the reviewer is referring to here. We clarified the analysis that was based on a modular approach and avoided using the term ‘values’ without specifying what values corresponded to.

      • Does "integer-ratio relationships" have to do with the y=x, y=x*2 and y=x/2 relationships of the other analyses?

      Integer-ratio relationships indeed refer to y=x, y=x*2 and y=x/2 relationships. For example, if a number y is double of another number x (y = x*2), these values have an integer-ratio relationship, since 2 is an integer. This holds true also for the case where y = x/2 since x = y*2. 
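
      A permutation test on these integer-ratio relationships, of the kind summarized in the figure caption (summed residuals to the closest of the y=x, y=2*x and y=x/2 lines, compared against shuffled data), might look like the sketch below. Using vertical distance as the residual, the fixed random seed, and the data are assumptions; the manuscript may have computed the distance differently.

```python
import random


def summed_residual(x_values, y_values):
    """Sum over participants of the vertical distance from each point
    to the closest of the lines y = x, y = 2x and y = x/2."""
    total = 0.0
    for x, y in zip(x_values, y_values):
        total += min(abs(y - x), abs(y - 2 * x), abs(y - x / 2))
    return total


def permutation_p_value(x_values, y_values, n_iterations=1000, seed=0):
    """Proportion of shuffled datasets whose summed residual is at least
    as small as the one observed in the original pairing."""
    rng = random.Random(seed)
    observed = summed_residual(x_values, y_values)
    shuffled_y = list(y_values)
    count = 0
    for _ in range(n_iterations):
        rng.shuffle(shuffled_y)  # break the within-participant pairing
        if summed_residual(x_values, shuffled_y) <= observed:
            count += 1
    return count / n_iterations


# Hypothetical preferred rate estimates (IOI in ms) from two sessions.
random_order = [300, 420, 510, 640, 275, 350]
linear_order = [600, 210, 515, 318, 545, 352]
print(permutation_p_value(random_order, linear_order))
```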

      • Supplementary Figure S2c shows a distribution of median divergences resulting from the modular approach. The p-value is 0.004 but the dashed line appears to be at a much higher percentile of the distribution. I find this hard to understand. 

      We thank the reviewer for a detailed inspection of all figures and information in the manuscript. The reviewer’s comment led us to realize that this figure had an error. We updated the figure in Supplementary Information (Supplementary Figure S2). 

      Reviewer #2 (Public Review):

      To get a better understanding of the mechanisms underlying the behavioral observations, it would have been useful to compare the observed pattern of results with simulations done with existing biophysical models. However, this point is addressed if the current study is read along with this other publication of the same research group: Kaya, E., & Henry, M. J. (2024, February 5). Modeling rhythm perception and temporal adaptation: top-down influences on a gradually decaying oscillator. https://doi.org/10.31234/osf.io/q9uvr 

      We agree with the reviewer that the mechanisms underlying behavioral responses can be better understood by modeling approaches. We thank the reviewer for acknowledging our computational modeling study that addressed this concern. 

      Reviewer #2 (Recommendations):

      I very much appreciate the thorough work done by the authors in assessing all reviewers' concerns. In this new version they clearly state the assumptions to be tested by their experiments, added extra analyses further strengthening the conclusions and point the reader to a neurocomputational model compatible with the current observations. 

      I only regret that the authors misunderstood the take home message of our Essay (Doelling & Assaneo 2021). Despite this being obviously out of the scope of the current work, I would like to take this opportunity to clarify this point. In that paper, we adopted a Stuart-Landau model not to determine how an oscillator should behave, but as an example to show that some behaviors usually used to prove or refute an underlying "oscillator like" mechanism can be falsified. We obviously acknowledge that some of the examples presented in that work are attainable by specific biophysical models, as explicitly stated in the essay: "There may well be certain conditions, equations, or parameters under which some of these commonly held beliefs are true. In that case, the authors who put forth these claims must clearly state what these conditions are to clarify exactly what hypotheses are being tested." 

      This work did not mean to delineate what an oscillator is (or is not), but to stress the importance of explicitly introducing biophysical models to be tested instead of relying on vague definitions sometimes reflecting the researchers' own beliefs. The take home message that we wanted to deliver to the reader appears explicitly in the last paragraph of that essay: "We believe that rather than concerning ourselves with supporting or refuting neural oscillators, a more useful framework would be to focus our attention on the specific neural dynamics we hope to explain and to develop candidate quantitative models that are constrained by these dynamics. Furthermore, such models should be able to predict future recordings or be falsified by them. That is to say that it should no longer be sufficient to claim that a particular mechanism is or is not an oscillator but instead to choose specific dynamical systems to test. In so doing, we expect to overcome our looping debate and to ultimately develop-by means of testing many model types in many different experimental conditions-a fundamental understanding of cognitive processes and the general organization of neural behavior." 

      We appreciate the reviewer’s clarification of the take-home message from Doelling and Assaneo (2021). We concur with the assertions made in this essay, particularly regarding the benefits of employing computational modeling approaches. Such methodologies provide a nuanced and well-structured foundation for theoretical predictions, thereby minimizing the potential for reductionist interpretations of behavioral or neural data.

      In addition, we would like to underscore the significance of delineating the level of analysis when investigating the mechanisms underlying behavioral or neural observations. The current study or Kaya & Henry (2024) involved no electrophysiological measures. Thus, we would argue that the appropriate level of analysis across our studies concerns the theoretical mechanisms rather than how these mechanisms are implemented on the neural (physical) level. In both studies, we aimed to explore or approximate the theoretical oscillator that guides dynamic attention rather than the neural dynamics underlying these theoretical processes. That is, theoretical (attentional) entrainment may not necessarily correspond to neural entrainment, and differentiating these levels could be informative about the parallels and differences between these levels. 

      References

      Doelling, K. B., & Assaneo, M. F. (2021). Neural oscillations are a start toward understanding brain activity rather than the end. PLoS Biol, 19(5), e3001234. https://doi.org/10.1371/journal.pbio.3001234

      Jones, M. R. (2018). Time will tell: A theory of dynamic attending. Oxford University Press. 

      Kaya, E., & Henry, M. J. (2024). Modeling rhythm perception and temporal adaptation: top-down influences on a gradually decaying oscillator. PsyArXiv. https://doi.org/10.31234/osf.io/q9uvr 

      Large, E. W. (1994). Dynamic representation of musical structure. The Ohio State University. 

      McAuley, J. D. (1995). Perception of time as phase: Toward an adaptive-oscillator model of rhythmic pattern processing. Indiana University Bloomington. 

      Pikovsky, A., Rosenblum, M., & Kurths, J. (2003). Synchronization: A Universal Concept in Nonlinear Sciences. Cambridge University Press.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) You claim transdiagnostic phenotypes are temporally stable -- since they're relatively new constructs, do we know how stable? In what order?  

      This is an important question. We have added two recent references to support this claim on page 1 and cite these studies in the references on pages 25 and 28:

      “Using factor analysis, temporally stable (see Fox et al., 2023a; Sookud, Martin, Gillan, & Wise, 2024), transdiagnostic phenotypes can be extracted from extensive symptom datasets (Wise, Robinson, & Gillan, 2023).”

      Fox, C. A., McDonogh, A., Donegan, K. R., Teckentrup, V., Crossen, R. J., Hanlon, A. K., … Gillan, C. M. (2024). Reliable, rapid, and remote measurement of metacognitive bias. Scientific Reports, 14(1), 14941. https://doi.org/10.1038/s41598-024-64900-0

      Sookud, S., Martin, I., Gillan, C., & Wise, T. (2024, September 5). Impaired goal-directed planning in transdiagnostic compulsivity is explained by uncertainty about learned task structure. https://doi.org/10.31234/osf.io/zp6vk

      More specifically, Sookud and colleagues found the intraclass correlation coefficient (ICC) for both factors to be high after a 3- or 12-month period (ICC<sub>AD_3</sub> = 0.87; ICC<sub>AD_12</sub> = 0.87; ICC<sub>CIT_3</sub> = 0.81; ICC<sub>CIT_12</sub> = 0.76; see Tables S41 and S50 in Sookud et al., 2024).

      (2) On hypotheses of the study: 

      I didn't understand the logic behind the hypothesis relating TDx Compulsivity -> Metacognition -> Reminder-setting.

      It seems that (a) Compulsivity relates to overconfidence, which should predict less reminder-setting.

      (b) Compulsivity has an impaired link between metacognition and action, breaking the B->C link in the mediation described above in (a). What would this then imply about how Compulsivity is related to reminder-setting?

      "In the context of our study, a Metacognitive Control Mechanism would be reflected in a disrupted relationship between confidence levels and their tendency to set reminders." What exactly does this predict - a lack of a correlation between confidence and reminder-setting, specifically in high-compulsive subjects?

      "Lastly, there could be a direct link between compulsivity and reminder-usage, independent of any metacognitive influence. We refer to this as the Direct Mechanism." Why, though, would this theoretically be the case? 

      "We initially hypothesised to find support for the Metacognitive Control Mechanism and that highly compulsive individuals would offload more". 

      The latter part here, "highly compulsive individuals would offload more" is I think the exact opposite prediction of the Metacognitive control mechanism hypothesis (compulsive individuals offload less). How could you possibly have tried to find support, then, for both? 

      Is the hypothesis that compulsivity positively predicts reminder setting the "direct mechanism" - if so, please clarify that, and if not, it should be added as a distinct mechanism, and additionally, the direct mechanism should be specified. 

      There's more delineation of specific hypotheses (8 with caveats) in Methods. 

      "We furthermore also tested this hypothesis but predicted raw confidence (percentage of circles participants predicted they would remember; H6b and H8b respectively)," What is the reference of "this hypothesis" given that right before this sentence two hypotheses are mentioned?  To keep this all organized, it would be good to simply have a table with hypotheses listed clearly. 

      We agree with the reviewer that there is room to improve the clarity of how our hypotheses are presented. The confusion likely arises from the fact that, since we first planned and preregistered our study, several new pieces of work have emerged, which might have led us to question some of our initial hypotheses. We have taken great care to present the hypotheses as they were preregistered, while also considering the current state of the literature and organizing them in a logical flow to make them more digestible for the reader. We have clarified this point on page 4:

      “Back when we preregistered our hypotheses only a limited number of studies about confidence and transdiagnostic CIT were available. This resulted in us hypothesising to find support for the Metacognitive Control Mechanism and that highly compulsive individuals would offload more due to an increased need for checkpoints.”

      The biggest improvement we believe comes from our new Table 1, which we have included in the Methods section in response to the reviewer’s suggestion (pp. 21-22):

      “We preregistered 8 hypotheses (see Table 1), half of which were sanity checks (H1-H4) aimed to establish whether our task would generally lead to the same patterns as previous studies using a similar task (as reviewed in Gilbert et al., 2023).”

      We furthermore foreshadowed more explicitly how we would test the Metacognitive Control Mechanism in the Introduction section on page 4, as requested by the reviewer:

      “In the context of our study, a Metacognitive Control Mechanism would be reflected in a disrupted relationship between confidence levels and their tendency to set reminders (i.e., the interaction between the bias to be over- or underconfident and transdiagnostic CIT in a regression model predicting a bias to set reminders).”
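
      One way to write out the interaction test described in this passage is sketched below. The symbols are illustrative labels rather than notation from the manuscript, and the covariates (for example age, gender, and educational attainment) are omitted for brevity.

```latex
\[
\text{ReminderBias}_i = \beta_0
  + \beta_1\,\text{MetaBias}_i
  + \beta_2\,\text{CIT}_i
  + \beta_3\,(\text{MetaBias}_i \times \text{CIT}_i)
  + \varepsilon_i
\]
```

      Under this reading, a Metacognitive Control Mechanism would correspond to a nonzero interaction coefficient $\beta_3$, meaning that the link between metacognitive bias and reminder bias changes with the CIT score.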

      To avoid any confusion regarding the term ‘direct’ in the ‘Direct Mechanism’, we now explicitly clarify on page 4 that it refers to any non-metacognitive influences. Additionally, we had already emphasized in the Discussion section the need for future studies to specify these influences more directly.

      Page 4: “We refer to this as the Direct Mechanism and it constitutes any possible influences that affect reminder setting in highly-compulsive CIT participants outside of metacognitive mechanisms, such as perfectionism and the wish to control the task without external aids.”

      The reviewer was correct in pointing out that, in the Methods section, we incorrectly referred to ‘this hypothesis’ when we actually meant both of the previously mentioned hypotheses. We have corrected this on page 23:

      “We furthermore also tested these hypotheses but predicted raw confidence (percentage of circles participants predicted they would remember; H6b and H8b respectively), as well as extending the main model with the scores from the cognitive ability test (ICAR5) as an additional covariate (H6c and H8c respectively).”

      Finally, upon revisiting our Results section, we noticed that we had not made it sufficiently clear that hypothesis H6a was preregistered as non-directional. We have now clarified this on page 9:

      “We predicted that the metacognitive bias would correlate negatively with AD (Hypothesis 8a; more anxious-depressed individuals tend to be underconfident). For CIT, we preregistered a non-directional, significant link with metacognitive bias (Hypothesis H6a). We found support for both hypotheses, both for AD, β = -0.22, SE = 0.04, t = -5.00, p < 0.001, as well as CIT, β = 0.15, SE = 0.05, t = 3.30, p = 0.001, controlling for age, gender, and educational attainment (Figure 3; see also Table S1). Note that for CIT this effect was positive, more compulsive individuals tend to be overconfident.”

      (3) You say special circles are red, blue, or pink. Then, in the figure, the colors are cyan, orange, and magenta. These should be homogenized. 

      Apologies, this was not clear on our screens. We have corrected this now but used the labels “blue”, “orange” and “magenta” as our shade of blue is much darker than cyan:

      Page 16: “These circles flashed in a colour (blue, orange, or magenta) when they first appear on screen before fading to yellow.”

      (4) The task is not clearly described with respect to forced choice. From my understanding, "forced choice" was implicitly delivered by a "computer choosing for them". You should indicate in the graphic that this is what forced choice means in the graphic and description more clearly. 

      This is an excellent point. On pages 17 and 18 we now include a slightly changed Figure 6, which includes improved table row names and cell shading to indicate the choice people gave. Hopefully this clarifies what “forced choice” means.

      (5) If I have point (4) right, then a potential issue arises in your design. Namely, if a participant has a bias to use or not use reminders, they will experience more or less prediction errors during their forced choice. This kind of prediction error could introduce different mood impacts on subsequent performance, altering their accuracy. This will have an asymmetric effect on the different forced phases (ie forced reminders or not). For this reason, I think it would be worthwhile to run a version of the experiment, if feasible, where you simply remove choice prior to revealing the condition. For example, have a block of choices where people can "see how well you do with reminders" -- this removes expectation and PE effects. 

      [See also this point from the weaknesses listed in the public comments:]

      Although I think this design and study are very helpful for the field, I felt that a feature of the design might reduce the task's sensitivity to measuring dispositional tendencies to engage cognitive offloading. In particular, the design introduces prediction errors that could induce learning and interfere with natural tendencies to deploy reminder-setting behavior. These PEs comprise whether a given selected strategy will or will not be allowed to be engaged. We know individuals with compulsivity can learn even when instructed not to learn (e.g., Sharp, Dolan, and Eldar, 2021, Psychological Medicine), and that more generally, they have trouble with structure knowledge (e.g., Seow et al.; Fradkin et al.), and thus might be sensitive to these PEs. Thus, a dispositional tendency to set reminders might be differentially impacted for those with compulsivity after an NPE, where they want to set a reminder, but aren't allowed to. After such an NPE, they may be more inclined to avoid setting reminders. Those with compulsivity likely have superstitious beliefs about how checking behaviors lead to a resolution of catastrophes, which might in part originate from inferring structure in the presence of noise or from purely irrelevant sources of information for a given decision problem. 

      It would be good to know if such learning effects exist if they're modulated by PE (you can imagine PEs are higher if you are more incentivized - e.g., 9 points as opposed to only 3 points - to use reminders, and you are told you cannot use them), and if this learning effect confounds the relationship between compulsivity and reminder-setting.

      We would like to thank the reviewer for providing this interesting perspective on our task. If we understand correctly, the situation most at risk for such effects occurs when participants choose to use a reminder. Not receiving a reminder in the following trial can be seen as a negative prediction error (PE), whereas receiving one would represent the control condition (zero PE). Therefore, we focused on these two conditions in our analysis.

      We indeed found that participants had a slightly higher tendency to choose reminders again after trials where they successfully requested them compared to after trials where they were not allowed reminders (difference = 4.4%). This effect was statistically significant, t(465) = 2.3, p = 0.024. However, it is important to note that other studies from our lab have reported a general, non-specific response ‘stickiness,’ where participants often simply repeat the same strategy in the next trial (Scarampi & Gilbert, 2020), which could have contributed to this pattern.

      When we used CIT to predict this effect in a simple linear regression model, we did not find a significant effect (β = -0.05, SE = 0.05, t = -1.13, p = 0.26).

      To further investigate this and potentially uncover an effect masked by the influence of the points participants could win in a given trial, we re-ran the model using a logistic mixed-effects regression model. This model predicted the upcoming trial’s choice (reminder or no reminder) from the presence of a negative prediction error in the current trial (dummy variable), the z-transformed number of points on offer, and the z-transformed CIT score (between-subject covariate), as well as the interaction of CIT and negative PE. In this model, we replicated the previous ‘stickiness’ effect, with a negative influence of a negative PE on the upcoming choice, β = -0.24, SE = 0.07, z = -3.44, p < 0.001. In other words, when a negative PE was encountered in the current trial, participants were less likely to choose reminders in the next trial. Additionally, there was a significant negative influence of points offered on the upcoming choice, β = -0.28, SE = 0.03, z = -8.82, p < 0.001. While this might seem counterintuitive, it could be due to a contrast effect: after being offered high rewards with reminders, participants might be deterred from using the reminder strategy in consecutive trials where lower rewards are likely to be offered, simply due to the bounded reward scale. CIT showed a small negative effect on upcoming reminder choice, β = -0.06, SE = 0.04, z = -1.69, p = 0.09, indicating that participants scoring higher on the CIT factor tended to be less likely to choose reminders, thus replicating one of the central findings of our study. It is unclear why this effect was not statistically significant, but this is likely due to the limited data on which the model was based (see below). Finally, and most importantly, the interaction between the current trial’s condition (negative PE or zero PE) and CIT was not significant, contrary to the reviewer’s hypothesis, β = 0.04, SE = 0.07, z = 0.57, p = 0.57.

      It should also be noted that this exploratory analysis is based on a limited number of data points: on average, participants had 2.5 trials (min = 0; max = 4) with a negative PE and 6.7 trials (min = 0; max = 12) with zero PE. There were more zero PE trials simply because, to maximise the number of trials included in this analysis, each participant’s 8 choice-only trials were included, and on those trials the participant always got what they requested (the trial then ended prematurely). Because not all cells in the analysed design were filled, only 466 out of 600 participants could be included in the analysis. This may have caused the fit of the mixed model to be singular.

      In summary, given that these results are based on a limited number of data points, some models did not fit without issues, and no evidence was found to support the hypotheses, we suggest not including this exploratory analysis in the manuscript. However, if we have misunderstood the reviewer and should conduct a different analysis, we are happy to reconsider.

      Unfortunately, conducting an additional study without the forced-choice element is not feasible, as this would create imbalances in trial numbers for the design. The advantage of the current, condensed task is the result of several careful pilot studies that have optimized the task’s psychometric properties.

      Scarampi, C., & Gilbert, S. J. (2020). The effect of recent reminder setting on subsequent strategy and performance in a prospective memory task. Memory, 28(5), 677–691. https://doi.org/10.1080/09658211.2020.1764974

      (6) One can imagine that a process goes on in this task where a person must estimate their own efficacy in each condition. Thus, individuals with more forced-choice experience prior to choosing for themselves might have more informed choice. Presumably, this is handled by your large N and randomization, but could be worth looking into. 

      We would like to thank the reviewer for pointing this out, as we had not previously considered this aspect of our task. However, we believe it is not the experience with forced trials per se, but rather the frequency with which participants experience both strategies (reminder vs. no reminder), that could influence their ability to make more informed choices. To address this, we calculated the proportion of reminder trials during the first half of the task (excluding choice-only trials, where the reminder strategy was not actually experienced). We hypothesized that the absolute distance of this ‘informedness’ parameter should correlate positively with the absolute reminder bias at the end of the task, with participants who experienced both conditions equally by the midpoint of the task being less biased towards or away from reminders. However, this was not the case, r = 0.05, p = 0.21.
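
      The ‘informedness’ check described above can be sketched as follows. Taking 0.5 (equal exposure to both strategies) as the reference point for the absolute distance is an assumption based on the description, and all names and data are hypothetical.

```python
from math import sqrt


def informedness_imbalance(first_half_used_reminder):
    """Absolute distance of the proportion of reminder trials in the first
    half of the task from 0.5 (assumed to mean equal exposure)."""
    proportion = sum(first_half_used_reminder) / len(first_half_used_reminder)
    return abs(proportion - 0.5)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: per participant, whether each first-half trial used a
# reminder, and the absolute reminder bias at the end of the task.
first_half = [
    [True, False, True, False],
    [True, True, True, False],
    [False, False, False, False],
    [True, False, False, False],
]
abs_reminder_bias = [0.4, 1.1, 0.9, 1.6]

imbalance = [informedness_imbalance(trials) for trials in first_half]
print(pearson_r(imbalance, abs_reminder_bias))
```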

      Given the lengthy and complex nature of our preregistered analysis, we prefer not to include this exploratory analysis in the manuscript.

      (7) Is the Actual indifference calculated from all choices? I believe so, given they don't know until after their choice whether it's forced or not, but good to make this clear. 

      Indeed, we use all available choice data to calculate the AIP. We now make this clear in two places in the main text:

      Page 5: “The ‘actual indifference point’ was the point at which they were actually indifferent, based on all of their decisions.”

      Page 6: “Please note that all choices were used to calculate the AIP, as participants only found out whether or not they would use a reminder after the decision was made.”

      (8) Related to 7, I believe this implies that the objective and actual indifference points are not entirely independent, given the latter contains the former. 

      Yes, the OIP and AIP were indeed calculated in part from events that happened within the same trials. However, since these events are non-overlapping (e.g., the choice from trial 6 contributes to the AIP but the accuracy measured several seconds later from that trial contributes to the OIP) and since our design dictates whether or not reminders can be used on those trials in question (by randomly assigning them to the forced internal/forced external condition) this could not induce circularity.

      (9) I thought perfectionism might be a trait that could explain findings and it was nice to see convergence in thinking once I reached the conclusion. Along these lines, I was thinking that perhaps perfectionism has a curvilinear relationship with compulsivity (this is an intuition; I'm not sure if it's backed up empirically). If it's really perfectionism, do you see that, at the extreme end of compulsivity, there's more reminder-setting? I.e., did you try to model this relationship using a nonlinear function? You might get clues simply by visual inspection. 

      It is interesting to note that the reviewer reached a similar interpretation of our results. We considered this question during our analysis and conducted an additional exploratory analysis to examine how CIT quantile relates to reminder bias (see Author response image 1). Each circle reflects a participant. As shown, no clear nonlinearities are evident, which challenges this interpretation. We believe that adding this to the already lengthy manuscript may not be necessary, but we are of course happy to reconsider if Reviewer 1 disagrees.

      Author response image 1.

      (10) [From the weaknesses listed in the public comments.] A more subtle point, I think this study can be more said to be an exploration than a deductive test of a particular model -> hypothesis -> experiment. Typically, when we test a hypothesis, we contrast it with competing models. Here, the tests were two-sided because multiple models, with mutually exclusive predictions (over-use or under-use of reminders) were tested. Moreover, it's unclear exactly how to make sense of what is called the direct mechanism, which is supported by partial (as opposed to complete) mediation.

      The reviewer’s observation is accurate; some aspects of our study did take on a more exploratory nature, despite having preregistered hypotheses. This was partly due to the novelty of our research questions. We appreciate this feedback and will use it to refine our approach in future studies, aiming for more deductive testing.

      Reviewer #2:

      (1) Regarding the lack of relationship between AD and reminder setting, this result is in line with a recent study by Mohr et al. (2023: https://osf.io/preprints/psyarxiv/vc7ye) investigating relationships between the same transdiagnostic symptom dimensions, confidence bias and another confidence-related behaviour: information seeking. Despite showing trial-by-trial under-confidence on a perceptual decision task, participants high in AD did not seek information any more than low AD participants. Hence, the under-confidence in AD had no knock-on effect on downstream information-seeking behaviour. I think it is interesting that converging evidence from your study and the Mohr et al. (2023) study suggests that high AD participants do not use the opportunity to increase their confidence (i.e., through reminder setting or information seeking). This may be because they do not believe that doing so will be effective or because they lack the motivation (i.e., through anhedonia and/or apathy) to do so.

      This is indeed an interesting parallel and we would like to thank the reviewer for pointing out this recently published study, which we had unfortunately missed. We have included it in the Discussion section, extending our sub-section on the missing downstream effects of the AD factor, and have listed it in the references on page 27.

      Page 14: “Our findings align with those reported in a recent study by Mohr, Ince, and Benwell (2024). The authors observed that while high-AD participants were underconfident in a perceptual task, this underconfidence did not lead to increased information-seeking behaviour. Future research should explore whether this is due to their pessimism regarding the effectiveness of confidence-modulated strategies (i.e., setting reminders or seeking information) or whether it stems from apathy. Another possibility is that the relevant downstream effects of anxiety were not measured in our study and instead may lie in reminder-checking behaviours.”

      Mohr, G., Ince, R.A.A. & Benwell, C.S.Y. Information search under uncertainty across transdiagnostic psychopathology and healthy ageing. Transl Psychiatry 14, 353 (2024). https://doi.org/10.1038/s41398-024-03065-w

      (2) Fox et al 2023 are cited twice at the same point in the second paragraph of the intro. Not sure if this is a typo or if these are two separate studies? 

      Those are indeed two different studies and should have been formatted as such. We have corrected this mistake in the following places and furthermore also corrected one of the references as the study has recently been published:

      P. 2 (top): “Previous research links transdiagnostic compulsivity to impairments in metacognition, defined as thinking about one’s own thoughts, encompassing a broad spectrum of self-reflective signals, such as feelings of confidence (e.g., Rouault, Seow, Gillan & Fleming, 2018; Seow & Gillan, 2020; Benwell, Mohr, Wallberg, Kouadio, & Ince, 2022; Fox et al., 2023a; Fox et al., 2023b; Hoven, Luigjes, Denys, Rouault, van Holst, 2023a).”

      P. 2 (bottom): “More specifically, individuals characterized by transdiagnostic compulsivity have been consistently found to exhibit overconfidence (Rouault, Seow, Gillan & Fleming, 2018; Seow & Gillan, 2020; Benwell, Mohr, Wallberg, Kouadio, & Ince, 2022; Fox et al., 2023a; Fox et al., 2023b; Hoven et al., 2023a).”

      P. 4: “Prior evidence exists for overconfidence in compulsivity (Rouault et al., 2018; Seow & Gillan, 2020; Benwell et al., 2022; Fox et al., 2023a; Fox et al., 2023b; Hoven et al., 2023a), which would therefore result in fewer reminders.”

      P. 23: “Though we did not preregister a direction for this effect, in the light of recent findings it has now become clear that compulsivity would most likely be linked to overconfidence (Rouault et al., 2018; Seow & Gillan, 2020; Benwell et al., 2022; Fox et al., 2023a; Fox et al., 2023b; Hoven et al., 2023a).”

      P. 24: “Fox, C. A., Lee, C. T., Hanlon, A. K., Seow, T. X. F., Lynch, K., Harty, S., … Gillan, C. M. (2023a). An observational treatment study of metacognition in anxious-depression. ELife, 12, 1–17. https://doi.org/10.7554/eLife.87193”

      P. 24: “Fox, C. A., McDonogh, A., Donegan, K. R., Teckentrup, V., Crossen, R. J., Hanlon, A. K., … Gillan, C. M. (2024). Reliable, rapid, and remote measurement of metacognitive bias. Scientific Reports, 14(1), 14941. https://doi.org/10.1038/s41598-024-64900-0”

      (3) Typo in the Figure 1 caption: "The preregistered exclusion criteria for the for the accuracies with....".  

      Thank you so much for pointing this out. We have changed the sentence in the caption of Figure 1 to read “The preregistered exclusion criteria for the accuracies with or without reminder are indicated as horizontal dotted lines (10% and 70% respectively).”

      Typo in the Figure 5 caption: "Standardised regression coefficients are given for each pat".

      Thank you so much for pointing this out to us. We have corrected the typo, and the sentence in the caption of Figure 5 now reads “Standardised regression coefficients are given for each path.”

      [From the weaknesses listed in the public comments.] Participants only performed a single task so it remains unclear if the observed effects would generalise to reminder-setting in other cognitive domains.

      We appreciate the reviewer’s concern regarding the use of a single cognitive task in our study, which is indeed a common limitation in many cognitive neuroscience studies. The cognitive factors underlying offloading decisions are still under active debate. Notably, a previous study found that intention fulfilment in an earlier version of our task correlates with real-world behaviour, lending validity to our paradigm by linking it to realistic outcomes (Gilbert, 2015). Additionally, recent unpublished work (Grinschgl, 2024) has shown a correlation between offloading across two lab tasks, though a null effect was reported in another study with a smaller sample size by the same team (Meyerhoff et al., 2021), likely due to insufficient power. In summary, we agree that future research should replicate these findings with alternative tasks to enhance robustness.

      Gilbert, S. J. (2015). Strategic offloading of delayed intentions into the external environment. Quarterly Journal of Experimental Psychology, 68(5), 971–992. https://doi.org/10.1080/17470218.2014.972963

      Grinschgl, S. (2024). Cognitive Offloading in the lab and in daily life. 2nd Cognitive Offloading Meeting. [Talk]

      Meyerhoff, H. S., Grinschgl, S., Papenmeier, F., & Gilbert, S. J. (2021). Individual differences in cognitive offloading: a comparison of intention offloading, pattern copy, and short-term memory capacity. Cognitive Research: Principles and Implications, 6(1), 34. https://doi.org/10.1186/s41235-021-00298-x

      (6) [From the weaknesses listed in the public comments.] The sample consisted of participants recruited from the general population. Future studies should investigate whether the effects observed extend to individuals with the highest levels of symptoms (including clinical samples). 

      We agree that transdiagnostic research should ideally include clinical samples to determine, for instance, whether the subclinical variation commonly studied in transdiagnostic work differs qualitatively from clinical presentations. However, this approach poses challenges, as transdiagnostic studies typically require large sample sizes, and recruiting clinical participants can be more difficult. With advancements in online sampling platforms, such as Prolific, achieving better availability and targeting may make this more feasible in the future. We intend to monitor these developments closely and contribute to such studies whenever possible.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable contribution to cardiac arrhythmia research by demonstrating long noncoding RNA Dachshund homolog 1 (lncDACH1) tunes sodium channel functional expression and affects cardiac action potential conduction and rhythms. Whereas the evidence for functional impact of lncDACH1 expression on cardiac sodium currents and rhythms is convincing, biochemical experiments addressing the mechanism of changes in sodium channel expression and subcellular localization are incomplete.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors show that a long noncoding RNA, lncDACH1, inhibits sodium currents in cardiomyocytes by binding to and altering the localization of dystrophin. The authors use a number of methodologies to demonstrate that lncDACH1 binds to dystrophin and disrupts its localization to the membrane, which in turn downregulates NaV1.5 currents. Knockdown of lncDACH1 upregulates NaV1.5 currents. Furthermore, in heart failure, lncDACH1 is shown to be upregulated, which suggests that this mechanism may have pathophysiological relevance.

      Strengths:

      (1) This study presents a novel mechanism of Na channel regulation which may be pathophysiologically important.

      (2) The experiments are comprehensive and systematically evaluate the physiological importance of lncDACH1.

      Weaknesses:

      (1) What is indicated by the cytoplasmic level of NaV1.5, a transmembrane protein? The methods do not provide details regarding how this was determined. Do the authors mean NaV1.5 retained in various intracellular organelles?

      Thank you for the good suggestion. Our study showed that Nav1.5 is transferred to the cell membrane by the scaffold protein dystrophin under the regulation of LncDACH1, but not all cytoplasmic Nav1.5 is transferred to the cell membrane. The cytoplasmic level of Nav1.5 therefore represents the Nav1.5 protein that is not transferred to the cell membrane but remains in the cytoplasm, and in the various organelles within it, when Nav1.5 is regulated by LncDACH1.

      (2) What is the negative control in Fig. 2b, Fig. 4b, Fig. 6e, Fig. 7c? The maximum current amplitude in these seem quite different. -40 pA/pF in some, -30 pA/pF in others and this value seems to be different than in CMs from WT mice (<-20 pA/pF). Is there an explanation for what causes this variability between experiments and/or increase with transfection of the negative control? This is important since the effect of lncDACH1 is less than 50% reduction and these could fall in the range depending on the amplitude of the negative control.

      Thank you for the insightful comment. The negative controls in Fig. 2b, Fig. 4b, and Fig. 6e are primary cardiomyocytes transfected with empty plasmids. The negative control in Fig. 7c is cardiomyocytes from wild-type mice injected with control virus. When cells are prepared for patch-clamp experiments, the transfection efficiency of the reagent varies between batches of cells, as does cell size, and these factors ultimately lead to differences between CMs.

      (3) NaV1.5 staining in Fig. 1E is difficult to visualize and to separate from lncDACH1. Is it possible to pseudocolor differently so that all three channels can be visualized/distinguished more robustly?

      Thank you for the good suggestion. We have re-colored the original image so that the three channels can be distinguished.

      Author response image 1.

      (4) The authors use shRNA to knockdown lncDACH1 levels. It would be helpful to have a scrambled ShRNA control.

      Thank you for the insightful comment. The control group we used was in fact the scrambled shRNA, but we labeled it as NC in the article, which may have caused the confusion.

      (5) Is there any measurement on the baseline levels of LncDACH1 in wild-type mice? It seems quite low and yet is a substantial increase in NaV1.5 currents upon knocking down LncDACH1. By comparison, the level of LncDACH1 seems to be massively upregulated in TAC models. Have the authors measured NaV1.5 currents in these cells? Furthermore, does LncDACH1 knockdown evoke a larger increase in NaV1.5 currents?

      Thank you for the insightful comment.

      (1) The baseline protein levels of LncDACH1 in wild-type mice and LncDACH1-CKO mice have been verified in a previously published article (Figure 3) (Hypertension. 2019;74:00-00. DOI: 10.1161/HYPERTENSIONAHA.119.12998).

      Author response image 2.

      (2) We did not measure the Nav1.5 currents in cardiomyocytes of the TAC model mice in this article, but in another published paper we found that the Nav1.5 current in TAC model mice was markedly lower than that in wild-type mice (Figure 4) (Gene Ther. 2023 Feb;30(1-2):142-149. DOI: 10.1038/s41434-022-00348-z).

      Author response image 3.

      This is consistent with the results in this article: LncDACH1 levels are significantly upregulated in the TAC model, and in the LncDACH1-TG group the Nav1.5 current is significantly reduced after LncDACH1 upregulation (Figure 3).

      Author response image 4.

      (6) What do error bars denote in all bar graphs, and also in the current voltage relationships?

      Thank you for the good comment. All error bars represent the SEM around the mean (data are shown as mean ± SEM). The SEM is the standard deviation divided by the square root of the sample size, so it reflects the precision of the estimated group mean rather than the spread of the individual values.
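
      For readers who want to reproduce error bars of this kind, mean ± SEM can be computed as sketched below. This is an illustrative example only: the function name `mean_sem` and the current-density values are hypothetical and are not taken from the manuscript.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, SEM), where SEM = sample SD / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return mean, sd / math.sqrt(n)


# Hypothetical peak current densities (pA/pF) from five cells.
peak_current_density = [-38.2, -41.5, -36.9, -40.1, -39.3]
mean, sem = mean_sem(peak_current_density)
print(f"{mean:.2f} ± {sem:.2f} pA/pF")
```

      Because the SEM shrinks as the number of cells grows, error bars based on it say how precisely the mean is estimated, not how variable the individual cells are.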

      Reviewer #2 (Public Review):

      This manuscript by Xue et al. describes the effects of a long noncoding RNA, lncDACH1, on the localization of Nav channel expression, the magnitude of INa, and arrhythmia susceptibility in the mouse heart. Because lncDACH1 was previously reported to bind and disrupt membrane expression of dystrophin, which in turn is required for proper Nav1.5 localization, much of the findings are inferred through the lens of dystrophin alterations.

      The results report that cardiomyocyte-specific transgenic overexpression of lncDACH1 reduces INa in isolated cardiomyocytes; measurements in whole heart show a corresponding reduction in conduction velocity and enhanced susceptibility to arrhythmia. The effect on INa was confirmed in isolated WT mouse cardiomyocytes infected with a lncDACH1 adenoviral construct. Importantly, reducing lncDACH1 expression via either a cardiomyocyte-specific knockout or using shRNA had the opposite effect: INa was increased in isolated cells, as was conduction velocity in heart. Experiments were also conducted with a fragment of lncDACH1 identified by its conservation with other mammalian species. Overexpression of this fragment resulted in reduced INa and greater proarrhythmic behavior. Alteration of expression was confirmed by qPCR.

      The mechanism by which lncDACH1 exerts its effects on INa was explored by measuring protein levels from cell fractions and immunofluorescence localization in cells. In general, overexpression was reported to reduce Nav1.5 and dystrophin levels and knockout or knockdown increased them.

      Thank you for summarizing our work, and thank you very much for your appreciation of our work.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, the authors report the first evidence of Nav1.5 regulation by a long noncoding RNA, LncRNA-DACH1, and suggest its implication in the reduction in sodium current observed in heart failure. Since no direct interaction is observed between Nav1.5 and the LncRNA, they propose that the regulation is via dystrophin and targeting of Nav1.5 to the plasma membrane.

      Strengths:

      (1) First evidence of Nav1.5 regulation by a long noncoding RNA.

      (2) Implication of LncRNA-DACH1 in heart failure and mechanisms of arrhythmias.

      (3) Demonstration of LncRNA-DACH1 binding to dystrophin.

      (4) Potential rescuing of dystrophin and Nav1.5 strategy.

      Thank you very much for your appreciation of our work.

      Weaknesses:

      (1) Main concern is that the authors do not provide evidence of how LncRNA-DACH1 regulates Nav1.5 protein level. The decrease in total Nav1.5 protein by about 50% seems to be the main consequence of the LncRNA on Nav1.5, but no mechanistic information is provided as to how this occurs.

      Thank you for the insightful comment.

      (1) The mechanism proposed in the article, as stated in the discussion at the end of the article, is that LncDACH1 binds to dystrophin and thus inhibits membrane trafficking of Nav1.5. Dystrophin is a well-characterized Nav1.5 partner protein. It indirectly interacts with Nav1.5 via syntrophin, which binds with the C-terminus of dystrophin and with the SIV motif on the C-terminus of Nav1.5 (Circ Res. 2006;99:407-414. doi: 10.1161/01.RES.0000237466.13252.5e) (Circulation. 2014;130:147-160. doi: 10.1161/CIRCULATIONAHA.113.007852).

      We also performed pulldown and RNA immunoprecipitation experiments to verify this (Figure 1).

      Author response image 5.

      (2) We then found that overexpression of lncDACH1 increased the ubiquitination of Nav1.5, which explains the downregulation of total Nav1.5 protein (Online Supplementary Figure 12).

      Author response image 6.

      (3) Lastly, we found that lncDACH1 failed to pull down Nav1.5 and anti-Nav1.5 did not precipitate lncDACH1 (Supplementary Fig. 1).

      Author response image 7.

      These data indicate that lncDACH1 does not interact with Nav1.5 directly; it participates in the regulation of Nav1.5 by binding to dystrophin. Cytoplasmic Nav1.5 that fails to reach the plasma membrane may be quickly recognized and then degraded by ubiquitination enzymes.

      (2) The fact that total Nav1.5 protein is reduced by 50%, which is similar to the reduction in membrane Nav1.5, calls into question the authors' main conclusion implicating dystrophin in the reduced Nav1.5 targeting. The reduction in membrane Nav1.5 could simply be due to the reduction in total protein.

      Thank you for the insightful comment. We do not rule out the possibility that the reduction in membrane Nav1.5 may be due to the reduction in total protein, but we do not think this is the main mechanism. Our data indicate that both the membrane and total protein levels of Nav1.5 were reduced by 50%. However, cytoplasmic Nav1.5 was increased in the hearts of lncDACH1-TG mice relative to WT controls, rather than reduced like the membrane and total protein (Figure 1).

      Author response image 8.

      Therefore, we think the main mechanism is the one described in the discussion at the end of the article: LncDACH1 binds to dystrophin and thus inhibits membrane trafficking of Nav1.5.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) In Fig. 6E the error bars are only in one direction for cF-lncDACH1. It seems that this error overlaps for NC and cF-lncDACH1 at several voltages, yet it is marked as statistically significant. Also in Fig. 7C, what statistical test was used? Do the authors account for multiple comparisons?

      Thank you for the insightful comment.

      (1) We have recalculated the two sets of data and confirmed that the difference between NC and cF-lncDACH1 in Fig. 6E is indeed statistically significant. The overlap in the figure may be only visually apparent.

      (2) The data in Fig. 7C are expressed as mean ± SEM. Statistical analysis was performed using unpaired Student’s t test or One-Way Analysis of Variance (ANOVA) followed by Tukey’s post-hoc analysis.

      (2) line 57, "The Western blot" remove "The"

      Sorry for the mistake. We have corrected it.

      (3) line 61, "The opposite data were collected" It is unclear what is meant by opposite.

      Sorry for the mistake. We have corrected it.

      (4) Lines 137-140. This sentence is complex, I would simplify as two sentences.

      Sorry for the mistake. We have corrected it.

      (5) Line 150, "We firstly validated" should be "we first validated"

      Sorry for the mistake. We have corrected it.

      (6) Line 181, "Consistently, the membrane" Is this statement meant to indicate that the experiments yielded a consistent results or that this statement is consistent with the previous one? In either case, this sentence should be reworded for clarification.

      Sorry for the mistake. We have corrected it.

      (7) Line 223, "In consistent, the ex vivo" I am not sure what In consistent means here.

      Thank you for the good suggestion. We mean that the ex vivo results are consistent with the in vivo results. We have revised the sentence to make this clearer.

      (8) Line 285. "a bunch of studies" could be rephrased as "multiple studies"

      Sorry for the mistake. We have corrected it.

      (9) Line 299 "produced no influence" Do you mean produced no change?

      Thank you for the good suggestion. As you put it, we mean that it produced no change.

      (10) Line 325 "is to interact with the molecules" no need for "the molecules"

      Sorry for the mistake. We have corrected it.

      (11) lines 332-335. This sentence is very confusing.

      Thank you for the insightful comment. We have corrected it.

      (12) Lines 341-342. It is unnecessary to claim primacy here.

      Thank you for the good suggestion. We have removed this sentence.

      (13) Line 373. "Sodium channel remodeling is commonly occured in" perhaps rephrase as occurs commonly

      Thank you for the insightful comment. We have corrected it.

      Reviewer #2 (Recommendations For The Authors):

      Critique

      (1) Aside from some issues with presentation noted below, these data provide convincing evidence of a link between lncDACH1 and Na channel function. The identification of a lncDACH1 segment conserved among mammalian species is compelling. The observation that lncDACH1 is increased in a heart failure model provides a plausible hypothesis for disease mechanism.

      Thank you very much for your appreciation of our work.

      (2) Has a causal link between dystrophin and Na channel surface expression been made, or is it an argument based on correlation? Is it possible to rule out a direct effect of lncDACH1 on Na channel expression? A bit more discussion of the limitations of the study would help here.

      Thank you for the insightful comment.

      (1) Dystrophin is a well-characterized Nav1.5 partner protein. It indirectly interacts with Nav1.5 via syntrophin, which binds with the C-terminus of dystrophin and with the SIV motif on the C-terminus of Nav1.5 (Circ Res. 2006;99:407-414. doi: 10.1161/01.RES.0000237466.13252.5e) (Circulation. 2014;130:147-160. doi: 10.1161/CIRCULATIONAHA.113.007852).

      Author response image 9.

      (2) We performed pulldown and RNA immunoprecipitation experiments. The data showed that lncDACH1 failed to pull down Nav1.5 and anti-Nav1.5 did not precipitate lncDACH1 (Online Supplementary Figure 11). These data indicate that lncDACH1 does not interact with Nav1.5 directly (Supplementary Fig. 1).

      Author response image 10.

      (3) What normalization procedures were used for qPCR quantification? I could not find these.

      Thank you for the good suggestion. The expression levels of mRNA were calculated using the comparative cycle threshold (Ct) method (2^−ΔΔCt). Each data point was normalized to ACTIN as an internal control in each sample. The final results are expressed as fold changes by normalizing the data to the values from control subjects. We have added the normalization procedures to the methods section of the article.
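
      The comparative Ct calculation described above can be sketched as follows. This is a minimal illustration assuming roughly 100% amplification efficiency for both the target and the reference gene; the function name and the Ct values are hypothetical and do not come from the manuscript.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression by the comparative Ct method, 2 ** -(ddCt)."""
    # dCt: target Ct normalized to the reference gene (e.g., ACTIN) within each sample.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # ddCt: the sample's dCt relative to the control group's dCt.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Hypothetical Ct values: target reaches threshold 2 cycles earlier than in
# the control, while the reference gene is unchanged.
print(fold_change(ct_target_sample=24.0, ct_ref_sample=18.0,
                  ct_target_control=26.0, ct_ref_control=18.0))
```

      In practice the control ΔCt is usually the mean ΔCt of the control group, so that control samples average a fold change close to 1.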

      (4) In general, I found the IF to be unconvincing - first, because the reported effects were not very apparent to me, but more importantly, because only exemplars were shown without quantification of a larger sample size.

      Thank you for the good suggestion. Accordingly, we quantified the immunostaining data. The data have been included in Supplementary Figures 2-16. The sample size is given in each caption.

      Author response image 11.

      Fluorescence intensity of lncDACH1, dystrophin and Nav1.5 in isolated cardiomyocytes of lncDACH1-TG mice. a,b, Membrane levels of dystrophin (dys) and Nav1.5. N=9 for dys. N=8 for Nav1.5. P<0.05 versus WT group. c,d, Cytoplasm levels of dystrophin and Nav1.5. N=9. P<0.05 versus WT group. e, Fluorescence in situ hybridization (FISH) images of LncDACH1. N=10. *P<0.05 versus WT group. P-values were determined by unpaired t test.

      Author response image 12.

      Fluorescence intensity of dystrophin and Nav1.5 in cultured neonatal cardiomyocytes overexpressing lncDACH1. a,b, Membrane levels of dystrophin and Nav1.5. N=9. P<0.05 versus NC group. c,d, Cytoplasm levels of dystrophin and Nav1.5. N=9 for dys. N=12 for Nav1.5. P<0.05 versus NC group. P-values were determined by unpaired t test.

      Author response image 13.

      Fluorescence intensity of lncDACH1, dystrophin and Nav1.5 in isolated cardiomyocytes of lncDACH1-cKO mice. a,b, Membrane levels of dystrophin (dys) and Nav1.5. N=12 for dys. N=8 for Nav1.5. P<0.05 versus WT group. c,d, Distribution of cytoplasm levels of dystrophin and Nav1.5. N=12. P<0.05 versus WT group. e, Fluorescence in situ hybridization (FISH) images of LncDACH1 expression. N=8. *P<0.05 versus WT group. P-values were determined by unpaired t test.

      Author response image 14.

      Fluorescence intensity of dystrophin and Nav1.5 in cultured neonatal cardiomyocytes after knockdown of lncDACH1. a,b, Distribution of membrane levels of dystrophin and Nav1.5. N=11 for dys. N=8 for Nav1.5. P<0.05 versus NC group. c,d, Distribution of cytoplasm levels of dystrophin and Nav1.5. N=12 for dys. N=9 for Nav1.5. P<0.05 versus NC group. P-values were determined by unpaired t test.

      Author response image 15.

      Fluorescence intensity of dystrophin and Nav1.5 in isolated cardiomyocytes overexpressing cF-lncDACH1. a,b, Membrane levels of dystrophin (dys) and Nav1.5. N=9 for dys. N=7 for Nav1.5. P<0.05 versus NC group. c,d, Cytoplasm levels of dystrophin and Nav1.5. N=6 for dys. N=7 for Nav1.5. P<0.05 versus NC group. P-values were determined by unpaired t test.

      Author response image 16.

      Fluorescence intensity of dystrophin and Nav1.5 in cultured neonatal cardiomyocytes overexpressing cF-lncDACH1. a,b, Membrane levels of dystrophin and Nav1.5. N=10 for dys. N=11 for Nav1.5. P<0.05 versus NC group. c,d, Cytoplasm levels of dystrophin and Nav1.5. N=7 for dys. N=6 for Nav1.5. P<0.05 versus NC group. P-values were determined by unpaired t test.

      Author response image 17.

      Fluorescence intensity of Nav1.5 in human iPS differentiated cardiomyocytes overexpressing cF-lncDACH1. a, Membrane levels of Nav1.5. N=8 for Nav1.5. P<0.05 versus NC group. b, Cytoplasm levels of Nav1.5. N=10 for Nav1.5. P<0.05 versus NC group. P-values were determined by unpaired t test.

      (5) More information on how the fractionation kit works would be helpful. How are membrane v. cytoplasm fractions identified?

      a. I presume the ER is part of the membrane fraction? When Nav1.5 is found in the cytoplasmic fraction, what subcompartment is it in - the proteasome?

      b. In the middle panel of A - is the dystrophin signal visible on the WB for WT? I assume the selected exemplar is the best of the blots and so this raises concerns. Much is riding on the confidence with which the fractions report "membrane" v "cytoplasm."

      Thank you for the insightful comment.

      (1) How the fractionation kit works:

      The kit uses centrifuge column technology to obtain plasma membrane structures with native activity and minimal cross-contamination from organelles, without the need for an ultracentrifuge, and the fractions can be used for a variety of downstream assays. Separation principle: cells/tissues are sensitized by Buffer A and pass through the centrifuge column under centrifugation at 16,000 × g, which cuts the cell membrane and ruptures the cells. The four components (nucleus, cytoplasm, organelles, and plasma membrane) are then obtained sequentially through differential centrifugation and density centrifugation and can be used for downstream detection.

      Author response image 18.

      (2) How are membrane v. cytoplasm fractions identified:

      Membrane proteins and cytosolic proteins were isolated with the kit. The internal controls chosen for the western blot experiments were N-cadherin for the membrane protein and β-Actin for the cytosolic protein.

      Most importantly, when we incubated the N-cadherin primary antibody with the PVDF membrane carrying the cytosolic protein, or the primary antibody for the cytosolic control β-Actin with the PVDF membrane carrying the membrane protein, no protein bands were obtained in the scan results.

      Author response image 19.

      (6) More detail in Results, figures, and figure legends will assist the reader.

      a. In Fig. 5, it would be helpful to label sinus rhythm vs. arrhythmia segments.

      Thank you for the good suggestion. We've marked the sinus rhythm and arrhythmia segments with arrows.

      Author response image 20.

      b. Please explain in the figure legend what the red bars in 5A are

      Thank you for the insightful comment. We've added the explanation to the figure legend. The red lines in the ECG traces indicate VT duration.

      c. In 5C, what the durations pertain to.

      Thank you for the good suggestion. 720ms-760ms refers to the duration of one action potential, with 720ms being the peak of one action potential and 760ms being the peak of another action potential. The interval duration is not fixed; in this article, we used 10ms as the interval for counting the phase singularities from the consecutive phase maps, because the shorter the interval, the larger the sample size and the more convincing the data.

      d. In the text, please define "breaking points" and explain what the physiological underpinning is. Define "phase singularity."

      Thank you for the insightful comment. Cardiac excitation can be viewed as an electrical wave, with a wavefront corresponding to the action potential upstroke (phase 0) and a waveback corresponding to rapid repolarization (phase 3). Under normal circumstances, cardiac conduction is composed of a sequence of well-ordered action potentials, and when a wave propagates through cardiac tissue, the wavefront and waveback never touch. In the optical mapping results, different colors represent different phases. When arrhythmias occur, due to factors such as reentry, the activation contour meets the refractory contour and the wave breaks up, initiating a new spiral reentry. In the optical mapping phase maps, the colors representing the different phases (including depolarization and repolarization) then converge to form a vortex, and the center of the vortex is defined as the phase singularity.
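
      The definition of a phase singularity given above can be made concrete with a short sketch. It uses the commonly described topological-charge criterion: a point is a singularity when the phase winds by ±2π around a small closed loop of neighboring pixels. This is an illustration under assumptions, not the authors' analysis pipeline; the function names and the synthetic spiral phase map are hypothetical.

```python
import math


def wrap(angle):
    """Wrap a phase difference into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def find_phase_singularities(phase):
    """Return (row, col, charge) for each 2x2 pixel loop with nonzero winding.

    `phase` is a 2D list of phase values in radians. The charge is +1 or -1
    depending on the rotation direction (chirality) of the vortex.
    """
    hits = []
    for r in range(len(phase) - 1):
        for c in range(len(phase[0]) - 1):
            # Walk the four corners of the pixel loop in a fixed order.
            loop = [phase[r][c], phase[r][c + 1],
                    phase[r + 1][c + 1], phase[r + 1][c]]
            winding = sum(wrap(loop[(k + 1) % 4] - loop[k]) for k in range(4))
            charge = round(winding / (2 * math.pi))
            if charge != 0:
                hits.append((r, c, charge))
    return hits


# Synthetic phase map: a single vortex centered between pixels (2, 2) and (3, 3).
spiral = [[math.atan2(r - 2.5, c - 2.5) for c in range(6)] for r in range(6)]
print(find_phase_singularities(spiral))
```

      Counting the entries returned for each phase map, frame by frame, gives the number of phase singularities over time.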

      (7) In reflecting on why enhanced INa is not proarrhythmic, it is noted that the kinetics are not altered. I agree that is key, but perhaps the consequence could be better articulated. Because lncDACH1 does not alter Nav1.5 gating, the late Na current may not be enhanced to the same effect as observed with LQT gain-of-function Nav1.5 mutations, in which APD prolongation is attributed to gating defects that increase late Na current.

      Thank you for the good suggestion. Your explanation is very brilliant and important for this article. We have revised the discussion section of the article and added these explanations to it.

      Reviewer #3 (Recommendations For The Authors):

      (1) Experiments to specifically address the reduction in total Nav1.5 protein should be included.

      Thank you for the insightful comment. We examined the ubiquitination of Nav1.5. We found that overexpression of lncDACH1 increased the ubiquitination of Nav1.5, which explains the downregulation of total Nav1.5 protein (Online Supplementary Figure 12).

      Author response image 21.

      (2) Experiments to convincingly demonstrate that LncRNA-DACH1 regulates Nav1.5 targeting via dystrophin are missing. As it is, total reduction in Nav1.5 seems to be the explanation as to why there is a decrease in membrane Nav1.5.

      Thank you for the insightful comment. We performed pulldown and RNA immunoprecipitation experiments. The data showed that lncDACH1 can pull down dystrophin (Figure 1), but failed to pull down Nav1.5, and anti-Nav1.5 did not precipitate lncDACH1 (Supplementary Fig. 1). These data indicate that lncDACH1 does not interact with Nav1.5 directly. It participates in the regulation of Nav1.5 by binding to dystrophin.

      Author response image 22.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study focuses on the role of GABA in semantic memory and its neuroplasticity. The researchers stimulated the left ATL and control site (vertex) using cTBS, measured changes in GABA before and after stimulation using MRS, and measured changes in BOLD signals during semantic and control tasks using fMRI. They analyzed the effects of stimulation on GABA, BOLD, and behavioral data, as well as the correlation between GABA changes and BOLD changes caused by the stimulation. The authors also analyzed the relationship between individual differences in GABA levels and behavioral performance in the semantic task. They found that cTBS stimulation led to increased GABA levels and decreased BOLD activity in the ATL, and these two changes were highly correlated. However, cTBS stimulation did not significantly change participants' behavioral performance on the semantic task, although behavioral changes in the control task were found after stimulation. Individual levels of GABA were significantly correlated with individuals' accuracy on the semantic task, and the inverted U-shaped (quadratic) function provides a better fit than the linear relationship. The authors argued that the results support the view that GABAergic inhibition can sharpen activated distributed semantic representations. They also claimed that the results revealed, for the first time, a non-linear, inverted-U-shape relationship between GABA levels in the ATL and semantic function, by explaining individual differences in semantic task performance and cTBS responsiveness.

      Strengths:

      The findings of the research regarding the increase of GABA and decrease of BOLD caused by cTBS, as well as the correlation between the two, appear to be reliable. This should be valuable for understanding the biological effects of cTBS.

      We appreciated R1’s positive evaluation of our manuscript.

      Weaknesses:

      Regarding the behavioral effects of GABA on semantic tasks, especially its impact on neuroplasticity, the results presented in the article are inadequate to support the claims made by the authors. There are three aspects of results related to this: 1) the effects of cTBS stimulation on behavior, 2) the positive correlation between GABA levels and semantic task accuracy, and 3) the nonlinear relationship between GABA levels and semantic task accuracy. Among these three pieces of evidence, the clearest one is the positive correlation between GABA levels and semantic task accuracy. However, it is important to note that this correlation already exists before the stimulation, and there are no results supporting that it can be modulated by the stimulation. In fact, cTBS significantly increases GABA levels but does not significantly improve performance on semantic tasks. According to the authors' interpretation of the results in Table 1, cTBS stimulation may have masked the practice effects that were supposed to occur. In other words, the stimulation decreased rather than enhanced participants' behavioral performance on the semantic task.

      The stimulation effect on behavioral performance could potentially be explained by the nonlinear relationship between GABA and performance on semantic tasks proposed by the authors. However, the current results are also insufficient to support the authors' hypothesis of an inverted U-shaped curve. Firstly, in Figure 3C and Figure 3D, the last one-third of the inverted U-shaped curve does not have any data points. In other words, as the GABA level increases the accuracy of the behavior first rises and then remains at a high level. This pattern of results may be due to the ceiling effect of the behavioral task's accuracy, rather than an inverted U-shaped ATL GABA function in semantic memory. Second, the article does not provide sufficient evidence to support the existence of an optimal level of GABA in the ATL. Fortunately, this can be tested with additional data analysis. The authors can estimate, based on pre-stimulus data from individuals, the optimal level of GABA for semantic functioning. They can then examine two expectations: first, participants with pre-stimulus GABA levels below the optimal level should show improved behavioral performance after stimulation-induced GABA elevation; second, participants with pre-stimulus GABA levels above the optimal level should exhibit a decline in behavioral performance after stimulation-induced GABA elevation. Alternatively, the authors can categorize participants into groups based on whether their behavioral performance improves or declines after stimulation, and compare the pre- and post-stimulus GABA levels between the two groups. If the improvement group shows significantly lower pre-stimulus GABA levels compared to the decline group, and both groups exhibit an increase in GABA levels after stimulation, this would also provide some support for the authors' hypothesis.

      Another issue in this study is the confounding of stimulation effects and practice effects. According to the results, there is a significant improvement in performance after the stimulation, at least in the control task, which the authors suggest may reflect a practice effect. The authors argue that the results in Table 1 suggest a similar practice effect in the semantic task, but it is masked by the stimulation of the ATL. However, since no significant effects were found in the ANOVA analysis of the semantic task, it is actually difficult to draw a conclusion. This potential confound increases the risk in data analysis and interpretation. Specifically, for Figure 3D, if practice effects are taken into account, the data before and after the stimulation should not be analyzed together.

      We thank R1 for the thoughtful comments. Due to the limited dataset, it is challenging to determine the optimal level of ATL GABA. Here, we re-grouped the participants into responders and non-responders to address the issues R1 raised. It is important to note that we applied cTBS over the ATL, an inhibitory protocol, which decreases cortical excitability within the target region and semantic task performance (Chiou et al., 2014; Jung and Lambon Ralph, 2016). Therefore, responders and non-responders were classified according to their semantic performance changes after the ATL stimulation: subjects showing a decrease in task performance after ATL cTBS compared to the baseline were defined as responders, whereas subjects showing no change or an increase in their task performance after ATL cTBS were defined as non-responders. Here, we used the inverse efficiency (IE) score (RT / (1 - the proportion of errors)) as the measure of individual semantic task performance, combining accuracy and RT. Accordingly, we had 7 responders and 10 non-responders.
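
The scoring and grouping rule described above can be sketched as follows. This is an illustrative sketch only: the function names and the example numbers are assumptions, not the study's data or code. It uses the stated definition IE = RT / (1 - proportion of errors), where a higher IE means poorer performance, and labels a participant a responder when IE increases after ATL cTBS.

```python
def inverse_efficiency(mean_rt, error_rate):
    """Inverse efficiency: mean RT divided by the proportion of correct responses."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return mean_rt / (1.0 - error_rate)

def classify(pre_ie, post_ie):
    """Responder: performance worsens (IE rises) after ATL cTBS; otherwise non-responder."""
    return "responder" if post_ie > pre_ie else "non-responder"

# Hypothetical participant: slower and less accurate after stimulation.
pre = inverse_efficiency(mean_rt=900.0, error_rate=0.10)
post = inverse_efficiency(mean_rt=950.0, error_rate=0.14)
print(classify(pre, post))
```

Under this rule, participants with no change or an improvement fall into the non-responder group, which matches the definition given in the response.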

      Recently, we demonstrated that the pre-stimulation neurochemical profile of the ATL was associated with cTBS responsiveness on semantic processing (Jung et al., 2022). Specifically, the baseline GABA and Glx levels in the ATL predicted cTBS-induced semantic task performance changes: individuals with higher GABA and lower Glx in the ATL showed bigger inhibitory effects and were responders, whose semantic task performance decreased after ATL stimulation. Importantly, the baseline semantic task performance was significantly better in responders compared to non-responders. Thus, we expected that responders would show better semantic task performance along with higher ATL GABA levels in their pre-stimulation session relative to non-responders. We performed planned t-tests to examine the differences in task performance and ATL GABA levels in the pre-stimulation session. The results revealed that responders had lower IE (better task performance, t = -1.756, p = 0.050) and higher ATL GABA levels (t = 2.779, p = 0.006) in the pre-stimulation session (Figure 3).

      In addition, we performed planned paired t-tests to investigate the cTBS effects on semantic task performance and regional ATL GABA levels within each group (responders and non-responders). Responders showed a significant increase in IE (poorer performance, t = -1.937, p = 0.050) and in ATL GABA levels (t = -2.203, p = 0.035) after ATL cTBS. Non-responders showed decreased IE (better performance, t = 2.872, p = 0.009) and increased GABA levels in the ATL (t = -3.912, p = 0.001) after the ATL stimulation. The results are summarised in Figure 3.

      It should be noted that there was no difference between the responders and non-responders in the control task performance at the pre-stimulation session. Both groups showed better performance after the ATL stimulation – practice effects (Author response image 1 below).

      Author response image 1.

      As we expected, our results replicated the previous findings (Jung et al., 2022) that responders who showed the inhibitory effects on semantic task performance after the ATL stimulation had higher GABA levels in the ATL than non-responders at their baseline, the pre-stimulation session. Importantly, cTBS increased ATL GABA levels in both responders and non-responders. These findings support our hypothesis – the inverted U-shaped ATL GABA function for cTBS response (Figure 4B). cTBS over the ATL resulted in the inhibition of semantic task performance among individuals initially characterized by higher concentrations of GABA in the ATL, indicative of better baseline semantic capacity. Conversely, the impact of cTBS on individuals with lower semantic ability and relatively lower GABA levels in the ATL was either negligible or exhibited a facilitatory effect. This study posits that individuals with elevated GABA levels in the ATL tend to be more responsive to cTBS, displaying inhibitory effects on semantic task performance (responders). On the contrary, those with lower GABA concentrations and reduced semantic ability were less likely to respond or even demonstrated facilitatory effects following ATL cTBS (non-responders). Moreover, our findings suggest the critical role of the baseline neurochemical profile in individual responsiveness to cTBS in the context of semantic memory. This highlights substantial variability among individuals in terms of semantic memory and its plasticity induced by cTBS.

      Our analyses with responders and non-responders have highlighted significant inter-individual variability in both pre- and post-ATL stimulation sessions, including behavioural outcomes and ATL GABA levels. Responders showed distinctive neurochemical profiles in the ATL, associating with their task performance and responsiveness to cTBS in semantic memory. Our findings suggest that responders may possess an optimal level of ATL GABA conducive to efficient semantic processing. This results in enhanced semantic task performance and increased responsiveness to cTBS, leading to inhibitory effects on semantic processing following an inverted U-shaped function. On the contrary, non-responders, characterized by relatively lower ATL GABA levels, exhibited poorer semantic task performance compared to responders at the baseline. The cTBS-induced increase in GABA may contribute to their subsequent improvement in semantic performance. These results substantiate our hypothesis regarding the inverted U-shape function of ATL GABA and its relationship with semantic behaviour.

      To address the confounding of stimulation effects and practice effects in the behavioural data, we used the IE and computed cTBS-induced performance changes (POST-PRE). Employing a 2 x 2 ANOVA with stimulation (ATL vs. Vertex) and task (Semantic vs. Control) as within-subject factors, we found a significant task effect (F<sub>1, 15</sub> = 6.656, p = 0.021) and a marginally significant interaction between stimulation and task (F<sub>1, 15</sub> = 4.064, p = 0.061). Post hoc paired t-tests demonstrated that ATL stimulation significantly decreased semantic task performance (positive IE change) compared to both vertex stimulation (t = 1.905, p = 0.038) and the control task (t = 2.814, p = 0.006). Facilitatory effects (negative IE change) were observed in the control stimulation and control task. Please see Author response image 2 below. Thus, we believe that ATL cTBS induced task-specific inhibitory effects in semantic processing.
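
For readers who want to see how such a 2 x 2 within-subject analysis works mechanically, here is a minimal sketch under stated assumptions. It is not the authors' analysis code. It relies on the fact that, when both within-subject factors have two levels, each effect is a single-degree-of-freedom contrast, so its F(1, n-1) equals the squared one-sample t statistic of the per-subject contrast scores. Inputs are assumed to be per-subject POST-PRE IE changes for the four cells; all names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def contrast_f(scores):
    """F statistic and degrees of freedom for a 1-df within-subject contrast (F = t squared)."""
    n = len(scores)
    t = mean(scores) / (stdev(scores) / sqrt(n))
    return t * t, (1, n - 1)

def rm_anova_2x2(atl_sem, atl_ctl, vtx_sem, vtx_ctl):
    """Main effects and interaction for a 2 (stimulation) x 2 (task) within-subject design."""
    cells = list(zip(atl_sem, atl_ctl, vtx_sem, vtx_ctl))
    # Per-subject contrast scores: average over the other factor for main effects.
    stim = [(a_s + a_c) / 2 - (v_s + v_c) / 2 for a_s, a_c, v_s, v_c in cells]
    task = [(a_s + v_s) / 2 - (a_c + v_c) / 2 for a_s, a_c, v_s, v_c in cells]
    # Interaction: difference of differences.
    inter = [(a_s - a_c) - (v_s - v_c) for a_s, a_c, v_s, v_c in cells]
    return {
        "stimulation": contrast_f(stim),
        "task": contrast_f(task),
        "interaction": contrast_f(inter),
    }

# Hypothetical IE changes (ms) for four participants.
result = rm_anova_2x2(
    atl_sem=[40.0, 25.0, 60.0, 10.0],
    atl_ctl=[-30.0, -10.0, -20.0, -35.0],
    vtx_sem=[-15.0, 5.0, -25.0, -10.0],
    vtx_ctl=[-20.0, -30.0, -5.0, -25.0],
)
print(result["interaction"])
```

With 16 participants in such an analysis the degrees of freedom would be (1, 15), the form reported above. Converting F to a p-value requires the F distribution, which is left out here to keep the sketch dependency-free.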

      Author response image 2.

      Accordingly, we have revised the Methods and Materials (p25, line 589), Results (p8, line 188; p9-11, lines 202-248), Discussion (p19, line 441) and Figures (Fig. 2-3 & all Supplementary Figures).

      Reviewer #2 (Public Review):

      Summary:

      The authors combined inhibitory neurostimulation (continuous theta-burst stimulation, cTBS) with subsequent MRI measurements to investigate the impact of inhibition of the left anterior temporal lobe (ATL) on task-related activity and performance during a semantic task and link stimulation-induced changes to the neurochemical level by including MR spectroscopy (MRS). cTBS effects in the ATL were compared with a control site in the vertex. The authors found that relative to stimulation of the vertex, cTBS significantly increased the local GABA concentration in the ATL. cTBS also decreased task-related semantic activity in the ATL and potentially delayed semantic task performance by hindering a practice effect from pre to post. Finally, pooled data from their previous MRS study suggest an inverted U-shape between GABA concentration and behavioral performance. These results help to better understand the neuromodulatory effects of non-invasive brain stimulation on task performance.

      Strengths:

      Multimodal assessment of neurostimulation effects on the behavioral, neurochemical, and neural levels. In particular, the link between GABA modulation and behavior is timely and potentially interesting.

      We appreciated R2’s positive evaluation of our manuscript.

      Weaknesses:

      The analyses are not sound. Some of the effects are very weak and not all conclusions are supported by the data since some of the comparisons are not justified. There is some redundancy with a previous paper by the same authors, so the novelty and contribution to the field are overall limited. A network approach might help here.

      Thank you for your thoughtful critique. We have taken your comments into careful consideration and have made efforts to address them.

      We acknowledge the limitations regarding the strength of some effects and the potential lack of justification for certain conclusions drawn from the data. In response, we have reviewed our analyses and performed new analyses to address the behavioural discrepancies and strengthened the justifications for our conclusions.

      Regarding the redundancy with a previous paper by the same authors, we understand your concern about the novelty and contribution to the field. We aim to clarify the unique contributions of our current study compared to our previous work. The main novelty lies in uncovering the neurochemical mechanisms behind cTBS-induced neuroplasticity in semantic representation and establishing a non-linear relationship between ATL GABA levels and semantic representation. Our previous work primarily demonstrated the linear relationship between ATL GABA levels and semantic processing. In the current study, we aimed to address two key objectives: 1) investigate the role of GABA in the ATL in short-term neuroplasticity in semantic representation, and 2) explore a biologically more plausible function between ATL GABA levels and semantic function using a larger sample size by combining data from two studies.

      Additionally, we appreciate your suggestion regarding a network approach. We have explored the relationship between ATL GABA and cTBS-induced functional connectivity changes in our new analysis. However, there was no significant relationship between them. In the current study, our decision to focus on the mechanistic link between ATL GABA, task-induced activity, and individual semantic task performance reflects our intention to provide a detailed exploration of the role of GABA in the ATL and semantic neuroplasticity.

      We have addressed the specific weaknesses raised by Reviewer #2 in detail in our response to 'Reviewer #2 Recommendations For The Authors'.

      Reviewer #3 (Public Review):

      Summary:

      The authors used cTBS TMS, magnetic resonance spectroscopy (MRS), and functional magnetic resonance imaging (fMRI) as the main methods of investigation. Their data show that cTBS modulates GABA concentration and task-dependent BOLD in the ATL, whereby greater GABA increase following ATL cTBS showed greater reductions in BOLD changes in ATL. This effect was also reflected in the performance of the behavioural task response times, which did not subsume to practice effects after ATL cTBS as opposed to the associated control site and control task. This is in line with their first hypothesis. The data further indicates that regional GABA concentrations in the ATL play a crucial role in semantic memory because individuals with higher (but not excessive) GABA concentrations in the ATLs performed better on the semantic task. This is in line with their second prediction. Finally, the authors conducted additional analyses to explore the mechanistic link between ATL inhibitory GABAergic action and semantic task performance. They show that this link is best captured by an inverted U-shaped function as a result of a quadratic linear regression model. Fitting this model to their data indicates that increasing GABA levels led to better task performance as long as they were not excessively low or excessively high. This was first tested as a relationship between GABA levels in the ATL and semantic task performance; then the same analyses were performed on the pre and post-cTBS TMS stimulation data, showing the same pattern. These results are in line with the conclusions of the authors.

      Strengths:

      I thoroughly enjoyed reading the manuscript and appreciate its contribution to the field of the role of the ATL in semantic processing, especially given the efforts to overcome the immense challenges of investigating ATL function by neuroscientific methods such as MRS, fMRI & TMS. The main strengths are summarised as follows:

      • The work is methodologically rigorous and dwells on complex and complementary multimethod approaches implemented to inform about ATL function in semantic memory as reflected in changes in regional GABA concentrations. Although the authors previously demonstrated a negative relationship between increased GABA levels and BOLD signal changes during semantic processing, the unique contribution of this work lies within evidence on the effects of cTBS TMS over the ATL given by direct observations of GABA concentration changes and further exploring inter-individual variability in ATL neuroplasticity and consequent semantic task performance.

      • Another major asset of the present study is implementing a quadratic regression model to provide insights into the non-linear relationship between inhibitory GABAergic activity within the ATLs and semantic cognition, which improves with increasing GABA levels but only as long as GABA levels are not extremely high or low. Based on this finding, the authors further pinpoint the role of inter-individual differences in GABA levels and cTBS TMS responsiveness, which is a novel explanation not previously considered (according to my best knowledge) in research investigating the effect of TMS on ATLs.

      • There are also many examples of good research practice throughout the manuscript, such as the explicitly stated exploratory analyses, calculation of TMS electric fields, using ATL optimised dual echo fMRI, links to open source resources, and a part of data replicates a previous study by Jung et al. (2017).

      We appreciated R3’s very positive evaluation of our manuscript.

      Weaknesses:

      • Research on the role of neurotransmitters in semantic memory is still very rare and therefore the manuscript would benefit from more context on how GABA contributes to individual differences in cognition/behaviour and more justification on why the focus is on semantic memory. A recommendation to the authors is to highlight and explain in more depth the particular gaps in evidence in this regard.

      This is an excellent suggestion. Accordingly, we have revised our introduction, highlighting the role of GABA on individual differences in cognition and behaviour and research gap in this field.

      Introduction p3, line 77   

      “Research has revealed a link between variability in the levels of GABA in the human brain and  individual differences in cognitive behaviour (for a review, see 5). Specifically, GABA levels in the sensorimotor cortex were found to predict individual performance in the related tasks: higher GABA levels were correlated with a slower reaction time in simple motor tasks (12) as well as improved motor control (13) and sensory discrimination (14, 15). Visual cortex GABA concentrations were positively correlated with a stronger orientation illusion (16), a prolonged binocular rivalry (17), while displaying a negative correlation with motion suppression (17). Individuals with greater frontal GABA concentrations demonstrated enhanced working memory capacity (18, 19). Studies on learning have reported the importance of GABAergic changes in the motor cortex for motor and perceptual learning: individuals showing bigger decreases in local GABA concentration can facilitate this plasticity more effectively (12, 20-22). However, the relationship between GABAergic inhibition and higher cognition in humans remains unclear. The aim of the study was to investigate the role of GABA in relation to human higher cognition – semantic memory and its neuroplasticity at individual level.”

      • The focus across the experiments is on the left ATL; how do the authors justify this decision? Highlighting the justification for this methodological decision will be important, especially given that a substantial body of evidence suggests that the ATL should be involved in semantics bilaterally (e.g. Hoffman & Lambon Ralph, 2018; Lambon Ralph et al., 2009; Rice et al., 2017; Rice, Hoffman, et al., 2015; Rice, Ralph, et al., 2015; Visser et al., 2010).

      This is an important point, for which we thank R3. Supporting the bilateral ATL systems in semantic representation, previous rTMS studies delivered inhibitory rTMS to the left and right ATL, and stimulation of either ATL significantly decreased semantic task performance (Pobric et al., 2007 PNAS; 2010 Neuropsychologia; Lambon Ralph et al., 2009 Cerebral Cortex). Importantly, there was no significant difference in rTMS effects between left and right ATL stimulation. Therefore, we assume that either left or right ATL stimulation could produce similar, intended rTMS effects on semantic processing. In the current study, we combined cTBS with multimodal imaging to examine the cTBS effects in the ATL. Due to the design of the study (having a control site, control task, and control stimulation) and the limitation of scanning time, we could have only one target region for the stimulation and chose the left ATL, which was the same MRS VOI as in our previous study (Jung et al., 2017). This enabled us to combine the datasets to explore GABAergic function in the ATL.

      • When describing the results, (Pg. 11; lines 233-243), the authors first show that the higher the BOLD signal intensity in ATL as a response to the semantic task, the lower the GABA concentration. Then, they state that individuals with higher GABA concentrations in the ATL perform the semantic task better. Although it becomes clearer with the exploratory analysis described later, at this point, the results seem rather contradictory and make the reader question the following: if increased GABA leads to less task-induced ATL activation, why at this point increased GABA also leads to facilitating and not inhibiting semantic task performance? It would be beneficial to acknowledge this contradiction and explain how the following analyses will address this discrepancy.

      We apologise that our description was not clear. As R1 also commented on this issue, we re-analysed the behavioural results and demonstrated inter-individual variability in response to cTBS (please see the reply to R1 above).

      • There is an inconsistency in reporting behavioural outcomes from the performance on the semantic task. While experiment 1 (cTBS modulates regional GABA concentrations and task-related BOLD signal changes in the ATL) reports the effects of cTBS TMS on response times, experiment 2 (Regional GABA concentrations in the ATL play a crucial role in semantic memory) and experiment 3 (The inverted U-shaped function of ATL GABA concentration in semantic processing) report results on accuracy. For full transparency, the manuscript would benefit from reporting all results (either in the main text or supplementary materials) and providing further explanations on why only one or the other outcome is sensitive to the experimental manipulations across the three experiments.

      Regarding the inconsistency of behavioural outcomes, first, there were inter-individual differences in our behavioural data (see the Figure below). Our new analyses revealed that there were responders and non-responders in terms of cTBS responsiveness (please see the reply to R1 above. It should be noted that the classification of responders and non-responders was identical when we used semantic task accuracy). In addition, RT was confounded by practice effects (faster in the post-stimulation sessions), except for the ATL-post session. Second, we only found a significant relationship between semantic task accuracy and ATL GABA concentrations in both the previous (Jung et al., 2017) and the current study. ATL GABA levels were not correlated with semantic RT (Jung et al., 2017: r = 0.34, p = 0.14, current study: r = 0.26, p = 0.14). It should be noted that there were no significant correlations between ATL GABA levels and semantic inverse efficiency (IE) in either study (Jung et al., 2017: r = 0.13, p = 0.62, current study: r = 0.22, p = 0.44). As a result, we found no significant linear or non-linear relationship between ATL GABA levels and RT (linear function: R<sup>2</sup> = 0.21, p = 0.45, quadratic function: R<sup>2</sup> = 0.17, p = 0.21) or between ATL GABA levels and IE (linear function: R<sup>2</sup> = 0.24, p = 0.07, quadratic function: R<sup>2</sup> = 2.24, p = 0.12). Thus, our data suggest that GABAergic action in the ATL may sharpen activated distributed semantic representations through lateral inhibition, leading to more accurate semantic performance (Isaacson & Scanziani, 2011; Jung et al., 2017).
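
The comparison of linear and quadratic functions referred to above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: it fits a polynomial by ordinary least squares through the normal equations and reports R². The function names and the toy data are assumptions chosen to show an inverted U.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def polyfit_r2(xs, ys, degree):
    """Least-squares polynomial fit; returns (coefficients low-to-high order, R squared)."""
    powers = range(degree + 1)
    xtx = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    beta = solve(xtx, xty)
    pred = [sum(b * x ** i for i, b in enumerate(beta)) for x in xs]
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return beta, 1.0 - ss_res / ss_tot

# Toy inverted-U data (hypothetical, centred predictor): performance peaks at x = 0.
gaba = [-2.0, -1.0, 0.0, 1.0, 2.0]
accuracy = [0.0, 3.0, 4.0, 3.0, 0.0]
_, r2_linear = polyfit_r2(gaba, accuracy, 1)
coeffs, r2_quadratic = polyfit_r2(gaba, accuracy, 2)
print(r2_linear, r2_quadratic, coeffs)
```

A negative quadratic coefficient indicates an inverted U. Because the quadratic model contains the linear model as a special case, its in-sample R² cannot be lower, so a formal comparison would usually rely on an F-test for the added term or an information criterion rather than on R² alone.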

      We agree with R3’s suggestion to report all results. The results of the control task and control stimulation are included in the Supplementary Information (Figures S1, S4-5).

      Overall, the most notable impact of this work is the contribution to a better understanding of individual differences in semantic behaviour and the potential to guide therapeutic interventions to restore semantic abilities in neurological populations. While I appreciate that this is certainly the case, I would be curious to read more about how this could be achieved.

      Thank you once again to R3 for the positive evaluation of our study. We acknowledge your interest in understanding the practical implications of our findings. It is crucial to highlight the substantial variability in the effectiveness of rTMS and TBS protocols among individuals. Previous studies in healthy subjects have reported response rates ranging from 40% to 70% in the motor cortex, and in patients, the remission rate for rTMS treatment in treatment-resistant depression is around 29%. Presently, the common practice in rTMS treatment is to apply the same protocol uniformly to all patients.

      Our study demonstrated that 40% of individuals in our sample were classified as responders to ATL cTBS. Notably, we observed differences in ATL GABA levels before stimulation between responders and non-responders. Responders exhibited higher baseline ATL GABA levels, along with better semantic performance at the baseline (as mentioned in our response to R1). This suggests that establishing the optimal level of ATL GABA by assessing baseline GABA levels before stimulation could enable the tailoring of an ideal protocol for each individual, thereby enhancing their semantic capability. To achieve this, more data is needed to delineate the proposed inverted U-shaped function of ATL GABA in semantic memory.

      Our ongoing efforts involve collecting additional data from both healthy aging and dementia cohorts using the same protocol. Additionally, future pharmacological studies aim to modulate GABA, providing a deeper understanding of the individual variations in semantic function. These initiatives contribute to the potential development of personalized therapeutic interventions for individuals with semantic impairments.

      Reviewer #1 (Recommendations For The Authors):

      My major suggestion is to include an analysis regarding the "existence of an optimal GABA level". This would be the most direct test for the authors' hypothesis on the relationship between GABA and semantic memory and its neuroplasticity. Please refer to the public review section for details.

      Here are some other suggestions and questions.

      (1) The sample size of this study is relatively small. Although the sample size was estimated, a small sample size can bring risks to the generalizability of the results to the population. How did the author consider this risk? Is it necessary to increase the sample size?

      We agree with R1’s comments. However, the average sample size in healthy individuals was 17.5 in TMS studies on language function (number of studies = 26; for a review, see Qu et al., 2022 Frontiers in Human Neuroscience), 18.3 in studies employing rTMS and fMRI in the language domain (number of studies = 8; for a review, see Hartwigsen & Volz, 2021 NeuroImage), and 20.8 in studies combining TMS with MRS (number of studies = 11; for a review, see Cuypers & Marsman, 2021 NeuroImage). Notably, the only two studies utilizing rTMS, fMRI, and MRS had sample sizes of N = 7 (Grohn et al., 2019 Frontiers in Neuroscience) and N = 16 (Rafique & Steeves, 2020 Brain and Behavior). With 19 participants, our sample size aligns closely with studies employing similar approaches and surpasses those employing the same methodology.

      As a result of the changes in a scanner and the relocation of the authors to different institutes, it is impossible to increase the sample size for this study.

      (2) How did the authors control practice effects? How many practice trials were arranged before the experiment? Did you avoid the repetition of stimuli in tasks before and after the stimuli?

      At the beginning of the experiment, participants performed a practice session (20 trials) for each task outside of the scanner. Stimuli were not repeated between the pre- and post-stimulation sessions.

      (3) In Figures 2D and E, does the vertical axis of the BOLD signal refer to the semantic task itself or the difference between the semantic and control tasks? Could you provide the respective patterns of the BOLD signal before and after the stimuli in the semantic and control tasks in a figure?

      We apologise that the axis labels of Figure 2 were not clear. In Fig. 2D-E, the BOLD signal changes refer to the semantic task itself. Accordingly, we have revised Fig. 2.

      (4) Figure 1A shows that MRS ATL always comes before MRS Vertex. Was the order of them counterbalanced across participants?

      The order of MRS acquisition was not counterbalanced across participants.

      (5) I am confused by the statement "Our results provide strong evidence that regional GABA levels increase following inhibitory cTBS in the human associative cortex, specifically in the ATL, a representational semantic hub. Notably, the observed increase was specific to the ATL and semantic processing, as it was not observed in the control region (vertex) and not associated with control processing (visuospatial processing)". GABA levels are obtained in the MRS, and this stage does not involve any behavioral tasks. Why do the authors state that the increase in GABA levels was specific to semantic processing and was not associated with control processing?

      Following R1’s suggestion, we have re-analysed behavioural data and showed cTBS-induced suppression in semantic task performance after ATL stimulation only (please, see the reply above). There were no cTBS effects in the control task performance, control site (vertex) and no correlations between the ATL GABA levels and control task performance. The Table was added to the Supplementary Information as Table S3.

      (6) In Figure 3, the relationship between GABA levels in the ATL and performance on semantic tasks is presented. What is the relationship between GABA levels at the control site and performance on semantic tasks? Should a graph be provided to illustrate this?

      As the vertex was not involved in semantic processing (no activation during semantic processing), we did not originally perform the analysis between vertex GABA levels and semantic task performance. Following R3’s suggestion, we performed a linear regression between vertex GABA levels and semantic task performance in the pre-stimulation session, accounting for GM volume, age, and sex. As we expected, there was no significant relationship between them (R<sup>2</sup> = 0.279, p = 0.962).

      (7) The author claims that GABA can sharpen distributed semantic representations. However, even though there is a positive correlation between GABA levels and semantic performance, there is no direct evidence supporting the inference that this correlation is achieved through sharpening distributed semantic representations. How did the author come to this conclusion? Are there any other possibilities?

      We showed that pre-stimulation ATL GABA concentrations were ‘negatively’ correlated with task-induced regional activity in the ATL and ‘positively’ correlated with semantic task performance. In our semantic task, recognizing an item such as a camel (Fig. 1) activates all related information in the semantic representation (e.g., mammal, desert, oasis, nomad, humps, etc.). To respond accurately to the task (a cactus), it becomes essential to suppress irrelevant meanings through an inhibitory mechanism. Therefore, the inhibitory processing linked to ATL GABA levels may contribute to more efficient processing in this task.

      Animal studies have proposed a related hypothesis in the context of the close interplay between activation and inhibition in sensorimotor cortices (Isaacson & Scanziani, 2011). Liu et al. (2011, Neuron) demonstrated that the rise of excitatory glutamate in the visual cortex is followed by an increase of inhibitory GABA in response to visual stimuli. Tight coupling of these paired excitatory-inhibitory functions results in a sharpening of the activated representation (for a review, see Isaacson & Scanziani, 2011, Neuron, How Inhibition Shapes Cortical Activity). In humans, Kolasinski et al. (2017, Current Biology) revealed that higher sensorimotor GABA levels are associated with more selective cortical tuning measured with fMRI, which in turn is associated with enhanced perception (better tactile discrimination). They claimed that the relationship between inhibition and cortical tuning could result from GABAergic signalling, shaping the selective response profiles of neurons in the primary sensory regions of the brain. This process is crucial for the topographic organization (task-induced fMRI activation in the sensorimotor cortex) vital to sensory perception.

      Building on these findings, we suggest a similar mechanism may operate in higher-order association cortices, including the ATL semantic hub. This suggests a process that leads to more sharply defined semantic representations associated with more selective task-induced activation in the ATL and, consequently, more accurate semantic performance (Jung et al., 2017).

      Reviewer #2 (Recommendations For The Authors):

      Major issues:

      (1) It wasn't completely clear what the novel aspect of this study relative to their previous one on GABAergic modulation in semantic memory issue, this should be clarified. If I understand correctly, the main difference from the previous study is that this study considers the TMS-induced modulation of GABA?

      We apologise that the novelty of the study was not clear. The main novelty lies in uncovering the neurochemical mechanisms behind cTBS-induced neuroplasticity in semantic representation and establishing a non-linear relationship between ATL GABA levels and semantic representation. Our previous work first demonstrated the linear relationship between ATL GABA levels and semantic processing. In the current study, we aimed to address two key objectives: 1) investigate the role of GABA in the ATL in short-term neuroplasticity in semantic representation, and 2) explore a biologically more plausible function between ATL GABA levels and semantic function using a larger sample size by combining data from two studies.

      The first part of the experiment in this study mirrored our previous work, involving multimodal imaging during the pre-stimulation session. We conducted the same analysis as in our previous study to replicate the findings in a different cohort. Subsequently, we combined the data from both studies to examine the potential inverted U-shape function between ATL GABA levels and semantic function/neuroplasticity.

      Accordingly, we have revised the Introduction by adding the following sentences.

      “The study aimed to investigate the neural mechanisms underlying cTBS-induced neuroplasticity in semantic memory by linking cortical neurochemical profiles, task-induced regional activity, and variability in semantic memory capability within the ATL.”

      “Furthermore, to address and explore the relationship between regional GABA levels in the ATL and semantic memory function, we combined data from our previous study (Jung et al., 2017) with the current study’s data.”

      (2) I found the scope of the study very narrow. I guess everyone agrees that TMS induces network effects, but the authors selectively focus on the modulation in the ATL. This is unfortunate since semantic memory requires the interaction between several brain regions and a network perspective might add some novel aspect to this study which has a strong overlap with their previous one. I am aware that MRS can only measure pre-defined voxels but even these changes could be related to stimulation-induced effects on task-related activity at the whole brain level.

      We appreciate R2's thoughtful comments and acknowledge the concern about the perceived narrow scope of the study. We agree with the notion that cTBS induces network-level changes. In our investigation, we did observe cTBS over the ATL influencing task-induced regional activity in other semantic regions and functional connectivity within the semantic system. Specifically, ATL cTBS increased activation in the right ATL after ATL stimulation compared to pre-stimulation, along with increased functional connectivity between the left and right ATL, between the left ATL and right semantic control regions (IFG and pMTG), and between the left ATL and right angular gyrus. These results replicate Jung & Lambon Ralph (2016), Cerebral Cortex.

      However, it is important to note that we did not find any significant correlations between ATL GABA changes and cTBS-induced changes in the functional connectivity. Consequently, we are currently preparing another paper that specifically addresses the network-level changes induced by ATL cTBS. In the current study, our decision to focus on the mechanistic link between ATL GABA, task-induced activity, and individual semantic task performance reflects our intention to provide a detailed exploration of the role of GABA in the ATL and semantic neuroplasticity.

      (3) On a related note, I think the provided link between GABAergic modulation and behavioral changes after TMS is somehow incomplete because it ignores the stimulation effects on task-related activity. Could these be linked in a regression analysis with two predictors (with behavior or GABA level as a criterion and the other two variables as predictors)?

      In response to R2’s suggestion, we performed a multiple regression analysis, modelling cTBS-induced ATL GABA changes (POST-PRE), task-related BOLD signal changes (POST-PRE), and semantic task performance (IE) changes (POST-PRE). The model with GABA changes (POST-PRE) as a criterion was significant (F<sub>2, 14</sub> = 8.77, p = 0.003), explaining 56% of cTBS-induced ATL GABA changes (adjusted R<sup>2</sup>) with cTBS-related ATL BOLD signal changes and semantic task performance changes. However, the model with semantic task performance change (POST-PRE) as a criterion was not significant (F = 0.26, p = 0.775). Therefore, cTBS-induced changes in ATL BOLD signals and semantic task performance jointly predicted the cTBS-induced ATL GABA changes. Of the two predictors, only cTBS-induced ATL BOLD signal changes significantly predicted cTBS-induced GABA changes in the ATL (β = -4.184, p = 0.001), aligning with the results of our partial correlation analysis.

      Author response table 1.
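
The two-predictor regression described above can be sketched as follows. This is a minimal illustration in plain Python, not the analysis pipeline used for the study (the response states that SPSS and RStudio were used). The function name `fit_two_predictor_ols` and all data values are hypothetical and serve only to show how a criterion is regressed on two predictors and how R<sup>2</sup> and the F statistic with (2, n − 3) degrees of freedom are obtained.

```python
from statistics import fmean


def fit_two_predictor_ols(y, x1, x2):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2 (hypothetical helper)."""
    n = len(y)
    my, m1, m2 = fmean(y), fmean(x1), fmean(x2)
    # Centred sums of squares and cross-products
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero when the two predictors are collinear
    if det == 0:
        raise ValueError("predictors are collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    # Goodness of fit
    ss_tot = sum((c - my) ** 2 for c in y)
    ss_res = sum((c - (b0 + b1 * a + b2 * b)) ** 2 for a, b, c in zip(x1, x2, y))
    r2 = 1.0 - ss_res / ss_tot
    df_model, df_resid = 2, n - 3
    f_stat = (r2 / df_model) / ((1.0 - r2) / df_resid) if r2 < 1.0 else float("inf")
    return {"b0": b0, "b1": b1, "b2": b2, "r2": r2, "f": f_stat,
            "df": (df_model, df_resid)}


# Hypothetical POST-PRE change scores, invented for illustration only
bold_change = [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, -0.2, 0.0]
ie_change = [1.0, 0.5, -0.5, 0.3, -0.8, 0.1, 0.9, -0.2]
noise = [0.05, -0.02, 0.03, -0.04, 0.01, -0.03, 0.02, -0.02]
gaba_change = [0.5 - 4.0 * b + 0.2 * s + e
               for b, s, e in zip(bold_change, ie_change, noise)]

result = fit_two_predictor_ols(gaba_change, bold_change, ie_change)
print(result)
```

Swapping which variable is passed as `y` gives the second model mentioned in the response, with behaviour as the criterion.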

      (4) Several statements in the intro and discussion need to be rephrased or toned down. For example, I would not agree that TBS "made healthy individuals mimic semantic dementia patients". This is clearly overstated. TMS protocols slightly modulate brain functions, but this is not similar to lesions or brain damage. Please rephrase. In the discussion, it is stated that the results provide "strong evidence". I disagree based on the overall low values for most comparisons.

      Hence, we have revised both the Introduction and the Discussion.

      “Perturbing the ATL with inhibitory repetitive transcranial magnetic stimulation (rTMS) and theta burst stimulation (TBS) resulted in healthy individuals exhibiting slower reaction times during semantic processing.”

      “Our results demonstrated an increase in regional GABA levels following inhibitory cTBS in human associative cortex, specifically in the ATL, a representational semantic hub.”

      (5) Changes in the BOLD signal in the ATL: There is a weak interaction between stimulation and VOI and post hoc comparisons with very low values reported. Are these corrected for multiple comparisons? I think that selectively reporting weak values with small-volume corrections (if they were performed) does not provide strong evidence. What about whole-brain effects and proper corrections for multiple comparisons?

      There was no significant interaction between stimulation (ATL vs. Vertex) and session (pre vs. post) in the ATL BOLD signal changes (p = 0.29). Our previous work combining rTMS with fMRI (Binney et al., 2015; Jung & Lambon Ralph, 2016) demonstrated that there were no significant rTMS effects in the whole-brain analysis and that only ROI analyses revealed the subtle but significant rTMS effects at the target site (reduction of task-induced ATL activity). In the current study, we focused our hypothesis on the anticipated decrease in task-induced regional activity in the ATL during semantic processing following the inhibitory cTBS. Accordingly, we conducted planned paired t-tests specifically within the ATL for BOLD signal changes without applying multiple comparison corrections. Note that these results were derived from regions of interest (ROIs) and not from small-volume corrections. Furthermore, no significant findings emerged from the comparison of the ATL post-session vs. Vertex post-session and the ATL pre-session vs. ATL post-session in the whole-brain analysis (see Supplementary figure 2).

      Accordingly, we have added Figure S2 to the Supplementary Information.

      (6) Differences between selected VOIs: Numerically, the activity (BOLD signal effect) is higher in the vertex than the ATL, even in the pre-TMS session (Figure 2D). What does that mean? Does that indicate that the vertex also plays a role in semantic memory?

      We apologise that the figure was not clear. Fig. 2D displays the BOLD signal changes in the ATL VOI for the ATL and Vertex stimulation. As there was no activation in the vertex during semantic processing, we did not present the fMRI results of vertex VOI (please, see Author response image 3 below). Accordingly, we have revised the label of Y axis of the Figure 2D – ATL BOLD signal change.

      Author response image 3.

      The cTBS effects within the Vertex VOI during semantic processing

      (7) Could you provide the e-field for the vertex condition?

      We have added it in the Supplementary Information as Supplementary Figure 6.

      (8) Stimulation effects on performance (RTs): There is a main effect of the session in the control task. Post-hoc tests show that control performance is faster in the post-pre comparison, while the semantic task is not faster after ATL TMS (as it might be delayed). I think you need to perform a 3-way ANOVA here including the factor task if you want to show task specificity (e.g., differences for the control but not semantic task) and then a step-down ANOVA or t-tests.

      Thanks for R2’s suggestion. We have addressed this issue in reply to R1. Please, see the reply to R1 for semantic task performance analysis.

      Minor issue:

      In the visualization of the design, it would be helpful to have the timing/duration of the different measures to directly understand how long the experiment took.

      We have added the duration of each part of the experimental design to Figure 1.

      Reviewer #3 (Recommendations For The Authors):

      Further Recommendations:

      • Pg. 6; lines 138-147: There is a sense of uncertainty about the hypothesis conveyed by expressions such as 'may' or 'could be'. A more confident tone would be beneficial.

      Thanks for R3’s thoughtful suggestion. We have revised the Introduction.

      • Pg. 6; line 155: left or bilateral ATL, please specify.

      We have added ‘left’ in the manuscript.

      • Pg. 8; line 188: Can the authors provide a table with peak activations to complement the figure?

      We have added the Table for the fMRI results in the Supplementary Information (Table S1).

      • Pg 9; Figure 2C: The ATL activation elicited by the semantic task seems rather medial. What are the exact peak coordinates for this cluster, and how can the authors demonstrate that the electric fields induced by TMS, which seem rather lateral (Figure 2A), also impacted this area? Please explain.

      We apologise that the Figure was not clear. cTBS was delivered to the peak coordinate of the left ventral ATL [-36, -15, -30] determined by previous fMRI studies (Binney et al., 2010; Visser et al., 2012). To confirm the cTBS effects at the target region, we conducted an ROI analysis centred on the ventral ATL [-36, -15, -30], and the results demonstrated reduced ATL activity after ATL stimulation during semantic processing (t = -2.43, p = 0.014) (please see Author response image 4 below). Thus, cTBS successfully modulated ATL activity at the target coordinate.

      Author response image 4.

      • Pg.23; line 547: What was the centre coordinate of the ROI (VOI), and was it consistent across all participants? Please specify.

      We used the ATL MRS VOI (a hexahedron of 4 cm × 2 cm × 2 cm) for our region-of-interest analysis, and the central coordinate was around [-45, -12, -20] (see Author response image 5). As we showed in Fig. 1C, the location of the ATL VOI was consistent across all participants.

      Author response image 5.

      • Pg. 24; line 556-570: What software was used for performing the statistical analyses? Please specify.

      We have added the following sentence.

      “Statistical analyses were undertaken using Statistics Package for the Social Sciences (SPSS, Version 25, IBM Cary, NC, USA) and RStudio (2023).”

      • Pg. 21; line 472-480: It is not clear if and how neuronavigation was used (e.g. were T1scans or an average MNI template used, what was the exact coordinate of stimulation and how was it decided upon). Please specify.

      We apologise that the description was not clear. We have added a paragraph describing the procedure.

      “The target site in the left ATL was delineated based on the peak coordinate (MNI -36 -15 -30), which represents maximal peak activation observed during semantic processing in previous distortion-corrected fMRI studies (38, 41). This coordinate was transformed to each individual’s native space using Statistical Parametric Mapping software (SPM8, Wellcome Trust Centre for Neuroimaging, London, UK). T1 images were normalised to the MNI template and then the resulting transformations were inverted to convert the target MNI coordinate back to the individual's untransformed native space coordinate. These native-space ATL coordinates were subsequently utilized for frameless stereotaxy, employing the Brainsight TMS-MRI co-registration system (Rogue Research, Montreal, Canada). The vertex (Cz) was designated as a control site following the international 10–20 system.”
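
The idea of inverting a normalisation to carry an MNI target back into native space can be sketched as below. This is a deliberately simplified, affine-only illustration: SPM's normalisation also includes nonlinear warps, which are not modelled here, and the matrix `M` and translation `t` are made-up values rather than a transform from any participant. The function names are hypothetical.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m_rc * v_c for m_rc, v_c in zip(row, v)) for row in m]


def native_to_mni(native_xyz, M, t):
    """Forward (normalising) affine: mni = M @ native + t."""
    return [p + q for p, q in zip(matvec(M, native_xyz), t)]


def mni_to_native(mni_xyz, M, t):
    """Inverse affine: native = M^-1 @ (mni - t)."""
    shifted = [p - q for p, q in zip(mni_xyz, t)]
    return matvec(invert_3x3(M), shifted)


# Hypothetical native-to-MNI affine for one participant (illustrative values)
M = [[1.05, 0.02, 0.00],
     [-0.01, 0.98, 0.03],
     [0.00, -0.02, 1.10]]
t = [2.0, -3.5, 1.0]

mni_target = [-36.0, -15.0, -30.0]  # ATL target coordinate quoted in the text
native_target = mni_to_native(mni_target, M, t)
print("native-space target (mm):", [round(x, 2) for x in native_target])
```

Applying `native_to_mni` to the result should recover the original MNI coordinate, which is a quick sanity check on any inverted transform.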

      • Miscellaneous

      - line 57: insert 'about' to the following sentence: '....little is known the mechanisms linking'

      - line 329: 'Previous, we demonstrated'....should be Previously we demonstrated....

      We thank R3 for the thorough evaluation of our manuscript. We have corrected both sentences.

      Furthermore, it would be an advantage to make the data freely available for the benefit of the broader scientific community.

      We appreciate Reviewer 3’s suggestion. Currently, this data is being used in other unpublished work. However, upon acceptance of this manuscript, we will make the data freely available for the benefit of the broader scientific community.

      Chiou R, Sowman PF, Etchell AC, Rich AN (2014) A conceptual lemon: theta burst stimulation to the left anterior temporal lobe untangles object representation and its canonical color. J Cogn Neurosci 26:1066-1074.

      Jung J, Lambon Ralph MA (2016) Mapping the Dynamic Network Interactions Underpinning Cognition: A cTBS-fMRI Study of the Flexible Adaptive Neural System for Semantics. Cereb Cortex 26:3580-3590.

      Jung J, Williams SR, Sanaei Nezhad F, Lambon Ralph MA (2017) GABA concentrations in the anterior temporal lobe predict human semantic processing. Sci Rep 7:15748.

      Jung J, Williams SR, Nezhad FS, Lambon Ralph MA (2022) Neurochemical profiles of the anterior temporal lobe predict response of repetitive transcranial magnetic stimulation on semantic processing. Neuroimage 258:119386.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Weaknesses

      (1) The authors face a technical challenge (which they acknowledge): they use two numbers (mean and variance) to characterize synaptic variability, whereas in the brain there are three numbers (number of vesicles, release probability, and quantal size). Turning biological constraints into constraints on the variance, as is done in the paper, seems somewhat arbitrary. This by no means invalidates the results, but it means that future experimental tests of their model will be somewhat nuanced.

      Agreed. There are two points to make here.

      First, the mean and variance are far more experimentally accessible than n, p and q. The EPSP mean and variance is measured directly in paired-patch experiments, whereas getting n, p and q either requires far more extensive experimentation, or making strong assumptions. For instance, the data from Ko et al. (2013) gives the EPSP mean and variance, but not (directly) n, p and q. Thus, in some ways, predictions about means and variances are easier to test than predictions about n, p and q.

      That said, we agree that in the absence of an extensive empirical accounting of the energetic costs at the synapse, there is inevitably some arbitrariness as we derive our energetic costs. That was why we considered four potential functional forms for the connection between the variance and energetic cost, which covered a wide range of sensible forms for this energetic cost. Our results were robust to this wide range of functional forms, indicating that the patterns we describe are not specifically due to the particular functional form, but arise in many settings where there is an energetic cost for reliable synaptic transmission.
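
The point that n, p and q are harder to pin down than the mean and variance can be illustrated with a short sketch. It assumes the standard binomial release model, in which the PSP mean is n·p·q and the variance is n·p·(1 − p)·q²; this model and the numerical values below are assumptions for illustration and are not taken from the data discussed here. Under that assumption, any choice of n admits a (p, q) pair reproducing the same two moments.

```python
def binomial_moments(n, p, q):
    """Mean and variance of the PSP under a binomial release model."""
    mean = n * p * q
    var = n * p * (1.0 - p) * q ** 2
    return mean, var


def solve_p_q(n, mean, var):
    """For a chosen n, return the (p, q) that reproduce the given mean and variance."""
    p = 1.0 / (1.0 + n * var / mean ** 2)
    q = mean / (n * p)
    return p, q


# Hypothetical target moments (e.g. mV and mV^2), chosen only for illustration
target_mean, target_var = 1.0, 0.25

candidates = {n: solve_p_q(n, target_mean, target_var) for n in (1, 5, 20)}
for n, (p, q) in candidates.items():
    m, v = binomial_moments(n, p, q)
    print(f"n={n:2d}  p={p:.3f}  q={q:.3f}  mean={m:.3f}  var={v:.3f}")
```

Every row reports the same mean and variance despite very different n, p and q, which is why an additional assumption (such as holding n or p fixed) is needed to turn a constraint on the three quantal parameters into a constraint on the variance.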

      (2) The prediction that the learning rate should increase with variability relies on an optimization scheme in which the learning rate is scaled by the inverse of the magnitude of the gradients (Eq. 7). This seems like an extra assumption; the energy efficiency framework by itself does not predict that the learning rate should increase with variability. Further work will be needed to disentangle the assumption about the optimization scheme from the energy efficiency framework.

      Agreed. The assumption that learning rates scale with synapse importance is separate. However, it is highly plausible as almost all modern state-of-the-art deep learning training runs use such an optimization scheme, as in practice it learns far faster than other older schemes. We have added a sentence to the main text (line 221), indicating that this is ultimately an assumption.
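
As a generic illustration of a scheme in which the learning rate is scaled by the inverse of the gradient magnitude, the sketch below uses an RMSprop-style update. It is not a reproduction of Eq. 7 from the paper; the function name, hyperparameters and gradient values are assumptions chosen for illustration.

```python
import math


def rmsprop_step(w, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop-style update; returns new weights, new state and effective rates."""
    # Running average of squared gradients, one entry per parameter
    v_new = [beta * vi + (1.0 - beta) * g * g for vi, g in zip(v, grad)]
    # Effective per-parameter learning rate: base rate over RMS gradient magnitude
    eff_lr = [lr / (math.sqrt(vi) + eps) for vi in v_new]
    w_new = [wi - e * g for wi, e, g in zip(w, eff_lr, grad)]
    return w_new, v_new, eff_lr


w, v = [0.0, 0.0], [0.0, 0.0]
grad = [10.0, 0.1]  # hypothetical: one large-gradient and one small-gradient parameter
for _ in range(50):
    w, v, eff_lr = rmsprop_step(w, grad, v)

print("effective learning rates:", eff_lr)
```

The parameter with the smaller gradients ends up with the larger effective learning rate, which is the sense in which the learning-rate prediction depends on the choice of optimizer rather than on the energy-efficiency argument alone.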

      Major

      (1) The correspondence between the entropy term in the variational inference description and the reliability cost in the energetic description is a bit loose. Indeed, the entropy term scales as −log(σ) while reliability cost scales as σ<sup>−ρ</sup>. While the authors do make the point that σ<sup>−ρ</sup> upper bounds −log(σ) (up to some constant), those two cost terms are different. This raises two important questions:

      a. Is this difference important, i.e. are there scenarios for which the two frameworks would have different predictions due to their different cost functions?

      b. Alternatively, is there a way to make the two frameworks identical (e.g. by choosing a proposal distribution Q(w) different from a Gaussian distribution (and tuneable by a free parameter that could be related to ρ) and therefore giving rise to an entropy term consistent with the reliability cost of the energy efficiency framework)?

      To answer b first, there is no natural way to make the two frameworks identical (unless we assume the reliability cost is proportional to log σ<sub>syn</sub>, and we don’t think there’s a biophysical mechanism that would give rise to such a cost). Now, to answer a, in Fig. 7 we extensively assessed the differences between the energy efficient σ<sub>syn</sub> and the Bayesian σ<sub>post</sub>. In Fig. 7bc, we find that σ<sub>syn</sub> and σ<sub>post</sub> are positively correlated in all models. This positive correlation indicates that the qualitative predictions made by the two frameworks (Bayesian inference and energy efficiency) are likely to be very similar. Importantly though, there are systematic differences highlighted by Fig. 7ab. Specifically, the energy efficient σ<sub>syn</sub> tends to vary less than the Bayesian σ<sub>post</sub>. This appears in Fig. 7b, which shows the relationship between σ<sub>syn</sub> (on the y-axis) and σ<sub>post</sub> (on the x-axis). Specifically, this plot has a slope that is smaller than one for all our models of the biophysical cost. Further, the pattern also appears in the covariance ellipses in Fig. 7a, in that the Bayesian covariance ellipses tend to be long and thin, while the energy efficient covariance ellipses are rounder. Critically though, both sets of covariance ellipses show the same pattern in that there is more noise along less important directions (as measured by the Hessian).

      We have added a sentence (line 273) noting that the search for a theoretical link is motivated by our observations in Fig. 7 of a strong, but not perfect link between the pattern of variability predicted by Bayesian and energy-efficient synapses.
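
For reference, one elementary way to see why a cost of the form σ<sup>−ρ</sup> upper bounds −log(σ) up to a scale and a constant is sketched below. This is a standard inequality argument, assuming ρ > 0 and σ > 0; the paper's own derivation may take a different route.

```latex
% Elementary inequality: \log x \le x - 1 for all x > 0.
% Substitute x = \sigma^{-\rho} with \rho > 0 and \sigma > 0:
\begin{align}
  -\rho \log \sigma
    &= \log\!\left(\sigma^{-\rho}\right)
     \le \sigma^{-\rho} - 1, \\
  -\log \sigma
    &\le \frac{1}{\rho}\,\sigma^{-\rho} - \frac{1}{\rho}.
\end{align}
% The bound becomes tight in the limit of small exponents:
\begin{equation}
  \lim_{\rho \to 0} \frac{\sigma^{-\rho} - 1}{\rho} = -\log \sigma .
\end{equation}
```

The two costs therefore penalise small σ in the same direction, but σ<sup>−ρ</sup> grows faster than −log(σ) as σ shrinks whenever ρ > 0, consistent with the two frameworks agreeing qualitatively while differing quantitatively.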

      (2) Even though I appreciate the effort of the authors to look for experimental evidence, I still find that the experimental support (displayed in Fig. 6) is moderate for three reasons.

      a. First, the experimental and simulation results are not displayed in a consistent way. Indeed, Fig 6a displays the relative weight change |Δw|/w as a function of the normalised variability σ<sup>2</sup>/|µ| in experiments whereas the simulation results in Fig 5c display the variance σ<sup>2</sup> as a function of the learning rate. Also, Fig 6b displays the normalised variability σ<sup>2</sup>/|µ| as a function of the input rate whereas Fig 5b displays the variance σ<sup>2</sup> as a function of the input rate. As a consequence the comparison between experimental and simulation results is difficult.

      b. Secondly, the actual power-law exponents in the experiments (see Fig 6a resp. 6b) should be compared to the power-law exponents obtained in simulation (see Fig 5c resp. Fig 5b). The difficulty relies here on the fact that the power-law exponents obtained in the simulations directly depend on the (free) parameter ρ. So far the authors precisely avoided committing to a specific ρ, but rather argued that different biophysical mechanisms lead to different reliability exponents ρ. Therefore, since there are many possible exponents ρ (and consequently many possible power-law exponents in simulation results in Fig 5), it is likely that one of them will match the experimental data. For the argument to be stronger, one would need to argue which synaptic mechanism is dominating and therefore come up with a single prediction that can be falsified experimentally (see also point 4 below).

      c. Finally, the experimental data presented in Fig 6 are still “clouds of points”. A coefficient of r = 0.52 (in Fig 6a) is moderate evidence while the coefficient of r = −0.26 (in Fig 6b) is weak evidence.

      The key thing to remember is that our paper is not about whether synapses are “really” Bayesian or energy efficient (or both/neither). Instead, the key point of our paper, as expressed in the title, is to show that the experimental predictions of Bayesian synapses are very similar to the predictions from energy efficient synapses, and therefore that energy efficient synapses are very difficult to distinguish experimentally from Bayesian synapses. In that context, the two plots in Fig. 6 are not really intended to present evidence in favour of energy efficient / Bayesian synapses. In fact, Fig. 6 isn’t meant to constitute a contribution of the paper at all. Instead, Fig. 6 serves merely as an illustration of the kinds of experimental results that have (Aitchison et al. 2021) or might (Schug et al. 2021) be used to support Bayesian synapses. As such, Fig. 6 serves merely as a jumping-off point for discussing how very similar results might equally arise out of Bayesian and energy-efficiency viewpoints.

      We have modified our description of Fig. 6 to further re-emphasise that the panels in Fig. 6 are not our contribution, but are taken directly from Schug et al. 2021 and Aitchison et al. 2021 (we have also modified Fig. 6 to be precisely what was plotted in Schug et al. 2021, again to re-emphasise this point). Further, we have modified the presentation to emphasise that these plots serve merely as jumping-off points to discuss the kinds of predictions that we might consider for Bayesian and energy efficient synapses.

      This is important, because we would argue that the “strength of support” should be assessed for our key claim, made in the title, that “Signatures of Bayesian inference emerge from energy efficient synapses”.

      a) To emphasise that these are previously published results, we have chosen axes to match those used in the original work (Aitchison et al. 2021) and (Schug et al. 2021).

      b) We agree that a close match between power-law exponents would constitute strong evidence for energy-efficiency / Bayesian inference, and might even allow us to distinguish them. We did consider such a comparison, but found it was difficult for two reasons. First, while the confidence intervals on the slopes exclude zero, they are pretty broad. Secondly, while the slopes in a one-layer network are consistent and match theory (Appendix 5), the slopes in deeper networks are far more inconsistent. This is likely to be due to a number of factors such as details of the optimization algorithm and initialization. Critically, if details of the optimization algorithm matter in simulation, they may also matter in the brain. Therefore, it is not clear to us that a comparison of the actual slopes can be relied upon.

      To reiterate, the point of our article is not to make judgements about the strength of evidence in previously published work, but to argue that Bayesian and energy efficient synapses are difficult to distinguish experimentally as they produce similar predictions. That said, it is very difficult to make blanket statements about the strength of evidence for an effect based merely on a correlation coefficient. It is perfectly possible to have a moderate correlation coefficient along with very strong evidence of an effect (and e.g. very strong p-values), e.g. if there is a lot of data. Likewise, it is possible to have a very large correlation coefficient along with weak evidence of an effect (e.g. if we only have three or four datapoints, which happen to lie in a straight line). A small correlation coefficient is much more closely related to the effect-size, specifically the effect-size relative to the “noise”, which usually arises from unmeasured factors of variation. Here, we know there are many, many unmeasured factors of variation, so even in the case that synapses are really Bayesian / energy-efficient, the best we can hope for is low correlation coefficients.
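
The distinction between the size of a correlation coefficient and the strength of evidence for a non-zero correlation can be made concrete with a small sketch. It uses the textbook t statistic for a Pearson correlation and an approximate 95% confidence interval from Fisher's z-transform. The sample sizes are hypothetical and are not those of the datasets shown in Fig. 6, and the Fisher approximation is crude for very small n, so the second case is indicative only.

```python
import math


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for r via Fisher's z-transform."""
    if n <= 3:
        raise ValueError("need n > 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical scenarios: (correlation coefficient, number of datapoints)
cases = {
    "small r, many datapoints": (0.26, 1000),
    "large r, few datapoints": (0.90, 4),
}
for label, (r, n) in cases.items():
    lo, hi = fisher_ci(r, n)
    print(f"{label}: r={r:.2f}, n={n}, t={t_statistic(r, n):.2f}, "
          f"95% CI=({lo:.2f}, {hi:.2f})")
```

In the first scenario the interval excludes zero despite the small coefficient, whereas in the second the interval spans zero despite the large coefficient.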

      As mentioned in the public review, a weakness in the paper is the derivation of the constraints on σi given the biophysical costs, for two reasons.

      a. First, it seemed a bit arbitrary whether you hold n fixed or p fixed.

      b. Second, at central synapses, n is usually small – possibly even usually 1: REF(Synaptic vesicles transiently dock to refill release sites, Nature Neuroscience 23:1329-1338, 2020); REF(The ubiquitous nature of multivesicular release, Trends Neurosci. 38:428-438, 2015). Fixing n would radically change your cost function. Possibly you can get around this because when two neurons are connected there are multiple contacts (and so, effectively, reasonably large n). It seems like this is worth discussing.

      a) Ultimately, we believe that the “real” biological cost function is very complex, and most likely cannot be written down in a simple functional form. Further, we certainly do not have the experimental evidence now, and are unlikely to have experimental evidence for a considerable period into the future to pin down this cost function precisely. In that context, we are forced to resort to two strategies. First, using simplifying assumptions to derive a functional form for the cost (such as holding n or p fixed). Second, considering a wide range of functional forms for the cost, and ensuring our argument works for all of them.

      b) We appreciate the suggestion that the number of connections could be used as a surrogate where synapses have only a single release site. As you suggest, we can propose an alternative model for this case where n represents the number of connections between neurons. We have added this alternative interpretation to our introduction of the quantal model under the title “Biophysical costs". For a fixed PSP mean we could either have many connections with small vesicles or fewer connections with larger vesicles. Similarly, for the actin cost we would certainly require more actin if the number of connections were increased.

      Minor

      (1) A few additional references could further strengthen some claims of the paper:

      Davis, Graeme W., and Martin Muller. “Homeostatic Control of Presynaptic Neurotransmitter Release." Annual Review of Physiology 77, no. 1 (February 10, 2015): 251-70. https://doi.org/10.1146/annurev-physiol-021014-071740. This paper provides elegant experimental support for the claim (in line 538 now 583) that µ is kept constant and q acts as a compensatory variable.

      Jegminat, Jannes, Simone Carlo Surace, and Jean-Pascal Pfister. “Learning as Filtering: Implications for Spike-Based Plasticity." Edited by Blake A Richards. PLOS Computational Biology 18, no. 2 (February 23, 2022): e1009721. https://doi.org/10.1371/journal.pcbi.1009721.

      This paper also showed that a lower uncertainty implies a lower learning rate (see e.g. in line 232), but in the context of spiking neurons.

      Figure 1 of the first suggested paper indeed shows that quantal size is a candidate for homeostatic scaling (fixing µ). This review also references further evidence of quantal scaling, and evidence for both presynaptic and postsynaptic scaling of q, leaving space for speculation on whether vesicle radius or postsynaptic receptor number is the source of a compensatory q. On line 583 we have added a few lines pointing to the suggested review paper.

      The second reference demonstrates Bayesian plasticity in the context of STDP, proposing learning rates tuned to the covariance in spike timing. We have added this as extra support for assuming an optimisation scheme that tunes learning rates to synapse importance and synapse variability (line 232).

      In the numerical simulations, the reliability cost is implemented with a single power-law expression (reliability cost ). However, in principle, all the reliability costs will play in conjunction, i.e. reliability cost . While I do recognise that it may be difficult to estimate the biophysical values of the various ci, it might be still relevant to comment on this.

      Agreed. Limitations in the literature meant that we could only form a cursory review of the relative scale of each cost using estimates by Atwell (2001) and Engl (2015). On line 135 we have added a paragraph explaining the rationale for considering each cost independently.

      (3) In Eq. 8: σ² doesn’t depend on variability in q, which would add another term; barring algebra mistakes, it’s . It seems worth mentioning why you didn’t include it. Can you argue that it’s a small effect?

      Agreed. Ultimately, we dropped this term because we expected it to be small relative to variability in vesicle release, and because it would be difficult to quantify. In practice, the variability is believed to be contributed mostly by variability in vesicle release. The primary evidence for this is histograms of EPSP amplitudes, which show a classic multi-peak structure, corresponding to one, two, three, etc. EPSPs. Examples of these plots include:

      - “The end-plate potential in mammalian muscle”, Boyd and Martin (1956); Fig. 8.

      - “Structure and function of a neocortical synapse”, Holler-Rickauer et al. (2019); Extended Figure 5.
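
      To make the size of the dropped term concrete, here is a minimal sketch assuming the standard binomial quantal model: each of n release sites releases with probability p, and each released vesicle contributes an independent quantal amplitude with mean mu_q and standard deviation sigma_q. Under that assumption the law of total variance gives Var[PSP] = n·p·(1 − p)·mu_q² + n·p·sigma_q². The names mu_q and sigma_q and the parameter values are hypothetical, chosen only for illustration; Eq. 8 itself is not reproduced here.

```python
import random
import statistics


def psp_variance_terms(n, p, mu_q, sigma_q):
    """Return (release_term, quantal_term) of the PSP variance under the assumed model."""
    release_term = n * p * (1 - p) * mu_q ** 2  # variability in how many vesicles release
    quantal_term = n * p * sigma_q ** 2         # variability in the size of each quantum
    return release_term, quantal_term


def simulate_psp_variance(n, p, mu_q, sigma_q, trials=50_000, seed=0):
    """Monte Carlo estimate of the PSP variance, as a check on the analytic terms."""
    rng = random.Random(seed)
    psps = []
    for _ in range(trials):
        psp = 0.0
        for _ in range(n):
            if rng.random() < p:                 # this site releases a vesicle
                psp += rng.gauss(mu_q, sigma_q)  # with a variable quantal amplitude
        psps.append(psp)
    return statistics.pvariance(psps)


# Hypothetical parameters: 5 sites, p = 0.5, quantal coefficient of variation 0.3.
release, quantal = psp_variance_terms(5, 0.5, 1.0, 0.3)
simulated = simulate_psp_variance(5, 0.5, 1.0, 0.3)
print(f"release term {release:.3f}, quantal term {quantal:.3f}, simulated total {simulated:.3f}")
```

      With these illustrative values the quantal-size term is a modest fraction of the total; how small it is in real synapses depends on the actual variability in q, which the sketch does not address.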

      (3) On pg. 7 now pg. 8, when the Hessian is introduced, why not say what it is? Or at least the diagonal elements, for which you just sum up the squared activity. That will make it much less mysterious. Or are we relying too much on the linear model given in App 2? If so, you should tell us how the Hessian was calculated in general. Probably in an appendix.

      With the intention of maintaining the interest of a wide audience we made the decision to avoid a mathematical definition of the Hessian, opting instead for a written definition i.e. line 192 - “Hii; the second derivatives of the objective with respect to wi.” and later on a schematic (Fig. 4) for how the second derivative can be understood as a measure of curvature and synapse importance. Nonetheless, this review point has made us aware that the estimated Hessian values plotted in Fig. 5a have been insufficiently explained so we have added a reference on line 197 to the appendix section where we show how we estimated the diagonal values of the Hessian.
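
      For readers who want the concrete form raised in the comment, the following is a minimal sketch for a single linear neuron with squared loss, where the diagonal Hessian entries are the summed squared input activities, H_ii = Σ_t x_ti². This linear setting and the toy data are assumptions for illustration; the estimator used for Fig. 5a is the one described in the appendix and is not reproduced here.

```python
def squared_loss(w, xs, ys):
    """0.5 * sum of squared errors for a linear neuron with weights w."""
    return 0.5 * sum(
        (y - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, y in zip(xs, ys)
    )


def hessian_diagonal(xs):
    """Analytic diagonal of the Hessian: summed squared activity of each input."""
    n_weights = len(xs[0])
    return [sum(x[i] ** 2 for x in xs) for i in range(n_weights)]


def finite_difference_diagonal(w, xs, ys, eps=1e-3):
    """Numerical second derivative of the loss with respect to each weight."""
    base = squared_loss(w, xs, ys)
    diag = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        diag.append(
            (squared_loss(up, xs, ys) - 2 * base + squared_loss(down, xs, ys)) / eps ** 2
        )
    return diag


# Toy (hypothetical) inputs, targets, and weights.
xs = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [2.0, 1.0, 1.0]]
ys = [1.0, 0.0, 2.0]
w = [0.1, -0.2, 0.3]

print(hessian_diagonal(xs))
print(finite_difference_diagonal(w, xs, ys))
```

      In this setting the curvature does not depend on the weights or targets, only on the inputs, which is why more active inputs correspond to more "important" synapses.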

      (4) Fig. 5: assuming we understand things correctly, Hessian ∝ |x|². Why also plot σ² versus |x|? Or are we getting the Hessian wrong?

      The Hessian is proportional to . If you assume that time steps are small and neurons spike, then , and . It is difficult to say what timestep is relevant in practice.

      (5) To get Fig. 6a, did you start with Fig. Appendix 1-figure 4 from Schug et al, and then use , drop the q, and put 1 − p on the x-axis? Either way, you should provide details about where this came from. It could be in Methods.

      We have modified Fig. 6 to use the same axes as in the original papers.

      (6) Lines 190-3: “The relationship between input firing rate and synaptic variability was first observed by Aitchison et al. (2021) using data from Ko et al. (2013) (Fig. 6a). The relationship between learning rate and synaptic variability was first observed by Schug et al. (2021), using data from Sjostrom et al. (2003) as processed by Costa et al. (2017) (Fig. 6b)." We believe 6a and 6b should be interchanged in that sentence.

      Thank you. We have switched the text appropriately.

      (7) What is posterior variance? This seems kind of important.

      This refers to the “posterior variance" obtained using a Bayesian interpretation of the problem of obtaining good synaptic weights (Aitchison et al. 2021). In our particular setting, we estimate posterior variances by setting up the problem as variational inference: see Appendix 4 and 5, which is now referred to in line 390.

      (8) Lines 244-5: “we derived the relationships between the optimized noise, σi and the posterior variable, σpost as a function of ρ (Fig. 7b;) and as a function of c (Fig. 7c)." You should tell the reader where you derived this. Which is Eq. 68c now 54c. Except you didn’t actually derive it; you just wrote it down. And since we don’t know what posterior variance is, we couldn’t figure it out.

      If H is the Hessian of the log-likelihood, and if the prior is negligible relative to the likelihood, then we get Eq. 69c. We have added a note on this point to the text.
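
      The general idea can be sketched for the simplest conjugate case: a single weight w, a Gaussian likelihood y ~ N(w·x, noise_var), and a Gaussian prior on w. The Hessian of the log-likelihood is then H = −Σ x²/noise_var, and the exact posterior variance is 1/(1/prior_var + Σ x²/noise_var), which approaches −1/H as the prior becomes broad. This one-dimensional setup and its names are assumptions for illustration and are not the form of Eq. 69c.

```python
def log_likelihood_hessian(xs, noise_var):
    """Second derivative of the Gaussian log-likelihood with respect to w (always <= 0)."""
    return -sum(x * x for x in xs) / noise_var


def posterior_variance(xs, noise_var, prior_var):
    """Exact posterior variance of w under a zero-mean Gaussian prior with variance prior_var."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    return 1.0 / precision


# Toy (hypothetical) inputs and noise level.
xs = [1.0, 2.0, 0.5]
noise_var = 0.5

h = log_likelihood_hessian(xs, noise_var)
print("Hessian of log-likelihood:", h)
print("-1/H:", -1.0 / h)
print("posterior variance, informative prior:", posterior_variance(xs, noise_var, 1.0))
print("posterior variance, broad prior:", posterior_variance(xs, noise_var, 1e6))
```

      The negative sign of H here also illustrates why an expression written in terms of the log-likelihood needs a minus sign that the same expression written in terms of a cost does not.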

      (9) We believe Fig. 7a shows an example pair of synapses. Is this typical? And what about Figs. 7b and c. Also an example pair? Or averages? It would be helpful to make all this clear to the reader.

      Fig. 7a shows an illustrative pair of synapses, chosen to best display the relative patterns of variability under energy efficient and Bayesian synapses. We have noted this point in the legend for Fig. 7. Fig. 7bc show analytic relationships between energy efficient and Bayesian synapses, so each line shows a whole continuum of synapses (we have deleted the misleading points at the ends of the lines in Fig. 7bc).

      (10)  The y-axis of Fig 6a refers to the synaptic weight as w while the x-axis refers to the mean synaptic weight as mu. Shouldn’t it be harmonised? It would be particularly nice if both were divided by µ, because then the link to Fig. 5c would be more clear.

      We have changed the y-axis label of Fig. 6a from w to µ. Regarding the normalised variance, we did try this but our Gaussian posteriors allowed the mean to become small in our simulations, giving a very high normalised variance. To remedy this we would likely need to assume a log-posterior, but this was out of scope for the present work.

      (11) Line 250 (now line 281): “Finally, in the Appendix". Please tell us which Appendix. Also, why not point out here that the bound is tightest at small ρ?

      We have added the reference to the section of the appendix with the derivation of the biological cost as a bound on the ELBO. We have also referenced the equation that gives the limit of the biological cost as ρ tends to zero.

      (12) When symbols appear that previously appeared more than about two paragraphs ago, please tell us where they came from. For instance, we spent a lot of time hunting for ηi. And below we’ll complain about undefined symbols. Which might mean we just missed them; if you told us where they were, that problem would be eliminated.

      We have added extra references for the symbols in the text following Eq. 69.

      (13) Line 564, typo (we think): should be σ−2.

      Good spot. This has been fixed.

      (14)  A bit out of order, but we don’t think you ever say explicitly that r is the radius of a vesicle. You do indicate it in Fig. 1, but you should say it in the main text as well.

      We have added a note on this to the legend in Fig. 1.

      (15) Eq. 14: presumably there’s a cost only if the vesicle is outside the synapse? Probably worth saying, since it’s not clear from the mechanism.

      Looking at Pulido and Ryan (2021) carefully, it is clear that they are referring to a cost for vesicles inside the presynaptic side of the synapse. (Importantly, vesicles don’t really exist outside the synapse; during the release process, the vesicle membrane becomes part of the cell membrane, and the contents of the vesicle are ejected into the synaptic cleft).

      (16) App. 2: why solve for mu, and why compute the trace of the Hessian? Not that it hurts, but things are sort of complicated, and the fewer side points the better.

      Agreed, we have removed the solution for μ, and the trace, and generally rewritten Appendix 2 to clarify definitions, the Hessian etc.

      (17) Eq. 35: we believe you need a minus sign on one side of the equation. And we don’t believe you defined p(d|w). Also, are you assuming g = partial log p(d|w)/partial w? This should be stated, along with its implications. And presumably, it’s not really true; people just postulate that p(d|w) ∝ exp(−log loss)?

      We have replaced p(d|w) with p(y, x|w), and we replaced “overall cost” with log P(y|w, x). Yes, we are also postulating that p(y|w, x) ∝ exp(−log loss), though in our case that does make sense as it corresponds to a squared loss.

      As regards the minus sign, in the original manuscript, we had the second derivative of the cost. There is no minus sign for the cost, as the Hessian of the cost at the mode is positive semi-definite. However, once we write the expression in terms of a log-likelihood, we do need a minus sign (as the Hessian of the log-likelihood at a mode is negative semi-definite).

      (18) Eq. 47 now Eq. 44: first mention of CBi;i?

      We have added a note describing CB around these equations.

      (19) The “where" doesn’t make sense for Eqs. 49 and 50; those are new definitions.

      We have modified the introduction of these equations to avoid the problematic “where”.

      (20) Eq. 57 and 58 are really one equation. More importantly: where does Eq. 58 come from? Is this the H that was defined previously? Either way, you should make that clear.

      We have removed the problematic additional equation line number, and added a reference to where H comes from.

      (21) In Eq. 59 now Eq. 60 aren’t you taking the trace of a scalar? Seems like you could skip this.

      We have deleted this derivation, as it repeats material from the new Appendix 2.

      (22) Eq. 66 is exactly the same as Eq. 32. Which is a bit disconcerting. Are they different derivations of the same quantity? You should comment on this.

      We have deleted lots of the stuff in Appendix 5 as, we agree, it repeats material from Appendix 2 (which has been rewritten and considerably clarified).

      (23) Eq. 68 now 54, left column: please derive. we got:

      gai = gradient for weight i on trial

      where the second equality came from Eq. 20. Thus

      Is that correct? If so, it’s a lot to expect of the reader. Either way, a derivation would be helpful.

      We agree it was unnecessary and overly complex, so we have deleted it.

      (24) App 5–Figure 2: presumably the data for panel b came from Fig. 6a, with the learning rate set to Δw/w? And the data for panel c from Fig. 6b? This (or the correct statement, if this is wrong) should be mentioned.

      Yes, the data for panel c came from Fig. 6b. We have deleted the data in panel b, as there are some subtleties in interpretation of the learning rates in these settings.

      (25) line 952 now 946: typo, “and the from".

      Corrected to “and from".

    1. Author response:

      The following is the authors’ response to the original reviews

      Response to the Editors’ Comments

      Thank you for this summary of the reviews and recommendations for corrections. We respond to each in turn, and have documented each correction with specific examples contained within our response to reviewers below.

      ‘They all recommend to clarify the link between hypotheses and analyses, ground them more clearly in, and conduct critical comparisons with existing literature, and address a potential multiple comparison problem.’

      We have restructured our introduction to include the relevant literature outlined by the reviewers, and to more clearly ground the goals of our model and broader analysis. We have additionally corrected for multiple comparisons within our exploratory associative analyses. We have also signposted exploratory tests more clearly.

      ‘Furthermore, R1 also recommends to include a formal external validation of how the model parameters relate to participant behaviour, to correct an unjustified claim of causality between childhood adversity and separation of self, and to clarify role of therapy received by patients.’

      We have now tempered our language in the abstract which unintentionally implied causality in the associative analysis between childhood trauma and other-to-self generalisation. To note, in the sense that our models provide causal explanations for behaviour across all three phases of the task, we argue that our model comparison provides some causal evidence for algorithmic biases within the BPD phenotype. We have included further details of the exclusion and inclusion criteria of the BPD participants within the methods.

      ‘R2 specifically recommends to clarify, in the introduction, the specific aim of the paper, what is known already, and the approach to addressing it.’

      We have more thoroughly outlined the current state of the art concerning behavioural and computational approaches to self insertion and social contagion, in health and within BPD. We have linked these more clearly to the aims of the work.

      ‘R2 also makes various additional recommendations regarding clarification of missing information about model comparison, fit statistics and group comparison of parameters from different models.’

      Our model comparison approach and algorithm are outlined within the original paper for Hierarchical Bayesian Model comparison (Piray et al., 2019). We have outlined the concepts of this approach in the methods. We have now additionally improved clarity by placing descriptions of this approach more obviously in the results, and added points of greater detail in the methods, such as which statistics for comparison we extracted on the group and individual level.

      In addition, in response to the need for greater comparison of parameters from different models, we have also hierarchically force-fitted the full suite of models (M1-M4) to all participants. We report all group differences from each model individually – assuming their explanation of the data - in Table S2. We have also demonstrated strong associations between parameters of equivalent meaning from different models to support our claims in Fig S11. Finally, we show minimal distortion to parameter estimates in between-group analysis when models are either fitted hierarchically to the entire population, or group wise (Figure S10).

      ‘R3 additionally recommends to clarify the clinical and cognitive process relevance of the experiment, and to consider the importance of the Phase 2 findings.’

      We have now included greater reference to the assumptions in the social value orientation paradigm we use in the introduction. We have also responded to the specific point about the shift in central tendencies in phase 2 from the BPD group, noting that, while BPD participants do indeed become relatively more competitive vs. CON participants, they remain strikingly neutral with respect to the overall state space. Importantly, model M4 does not preclude more competitive distributions existing.

      ‘Critically, they also share a concern about analyzing parameter estimates fit separately to two groups, when the best-fitting model is not shared. They propose to resolve this by considering a model that can encompass the full dynamics of the entire sample.’

      We have hierarchically force-fitted the full suite of models (M1-M4) to all participants to allow for comparison between parameters within each model assumption. We report all group differences from each model individually – assuming their explanation of the data - in Table S2 and Table S3. We have also demonstrated strong associations between parameters of equivalent meaning from different models to support our claims in Fig S11. We also show minimal distortion to parameter estimates in between-group analysis when models are either fitted hierarchically to the entire population, or group wise (Figure S10).

      Within model M1 and M2, the parameters quantify the degree to which participants believe their partner to be different from themselves. Under M1 and M2 model assumptions, BPD participants have meaningfully larger versus CON (Fig S10), which supports the notion that a new central tendency may be more parsimonious in phase 2 (as in the case of the optimal model for BPD, M4). We also show strong correlations across models between under M1 and M2, and the shift in central tendencies of beliefs between phase 1 and 2 under M3 and M4. This supports our primary comparison, and shows that even under non-dominant model assumptions, parameters demonstrate that BPD participants expect their partner’s relative reward preferences to be vastly different from themselves versus CON.

      ‘A final important point concerns the psychometric individual difference analyses which seem to be conducted on the full sample without considering the group structure.’

      We have now more clearly focused our psychometric analysis. We control for multiple comparisons, and compare parameters across the same model (M3) when assessing the relationship between paranoia, trauma, trait mentalising, and social contagion. We have relegated all other exploratory analyses to the supplementary material and noted where p values survive correction using False Discovery Rate.
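
      As a minimal sketch of what False Discovery Rate control involves, the block below implements the Benjamini-Hochberg step-up procedure, which is the most common FDR method. Which FDR variant was used in the analysis is not stated here, so treating it as Benjamini-Hochberg is an assumption, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the p-value survives BH correction at level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Every p-value ranked at or below max_rank is declared significant.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            survives[idx] = True
    return survives


# Hypothetical p-values from eight exploratory correlations.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example_p))
```

      In this made-up example five p-values fall below 0.05 uncorrected, but only the two smallest survive correction, which is why reporting corrected and uncorrected results separately matters for exploratory analyses.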

      Reviewer 1:

      ‘The manuscript's primary weakness relates to the number of comparisons conducted and a lack of clarity in how those comparisons relate to the authors' hypotheses. The authors specify a primary prediction about disruption to information generalization in social decision making & learning processes, and it is clear from the text how their 4 main models are supposed to test this hypothesis. With regards to any further analyses however (such as the correlations between multiple clinical scales and eight different model parameters, but also individual parameter comparisons between groups), this is less clear. I recommend the authors clearly link each test to a hypothesis by specifying, for each analysis, what their specific expectations for conducted comparisons are, so a reader can assess whether the results are/aren't in line with predictions. The number of conducted tests relating to a specific hypothesis also determines whether multiple comparison corrections are warranted or not. If comparisons are exploratory in nature, this should be explicitly stated.’

      We have now corrected for multiple comparisons when examining the relationship between psychometric findings and parameters, using partial correlations and bootstrapping for robustness. These latter analyses were indeed not preregistered, and so we have more clearly signposted that these tests were exploratory. We chose to focus on the influence of psychometrics of interest on social contagion under model M3 given that this model explained a reasonable minority of behaviour in each group. We have now fully edited this section in the main text in response, and relegated all other correlations to the supplementary materials.

      ‘Furthermore, the authors present some measures for external validation of the models, including comparison between reaction times and belief shifts, and correlations between model predicted accuracy and behavioural accuracy/total scores. However it would be great to see some more formal external validation of how the model parameters relate to participant behaviour, e.g., the correlation between the number of pro-social choices and ß-values, or the correlation between the change in absolute number of pro-social choices and the change in ß. From comparing the behavioural and computational results it looks like they would correlate highly, but it would be nice to see this formally confirmed.’

      We have included this further examination within the Generative Accuracy and Recovery section:

      ‘We also assessed the relationship (Pearson rs) between modelled participant preference parameters in phase 1 and actual choice behaviour: was negatively correlated with prosocial versus competitive choices (r=-0.77, p<0.001) and individualistic versus competitive choices (r=-0.59, p<0.001); was positively correlated with individualistic versus competitive choices (r=0.53, p<0.001) and negatively correlated with prosocial versus individualistic choices (r=-0.69, p<0.001).’

      ‘The statement in the abstract that 'Overall, the findings provide a clear explanation of how self-other generalisation constrains and assists learning, how childhood adversity disrupts this through separation of internalised beliefs' makes an unjustified claim of causality between childhood adversity and separation of self - and other beliefs, although the authors only present correlations. I recommend this should be rephrased to reflect the correlational nature of the results.’

      Sorry – this was unfortunate wording: we did not intend to imply causation with our second clause in the sentence mentioned. We have amended the language to make it clear this relationship is associative:

      ‘Overall, the findings provide a clear explanation of how self-other generalisation constrains and assists learning, how childhood adversity is associated with separation of internalised beliefs, and makes clear causal predictions about the mechanisms of social information generalisation under uncertainty.’

      ‘Currently, from the discussion the findings seem relevant in explaining certain aberrant social learning and -decision making processes in BPD. However, I would like to see a more thorough discussion about the practical relevance of their findings in light of their observation of comparable prediction accuracy between the two groups.’

      We have included a new paragraph in the discussion to address this:

      ‘Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participants in predicting their partners. All participants were more concerned with relative versus absolute reward; only those with BPD changed their strategy based on this focus. Practically this difference in BPD is captured either through disintegrated priors with a new median (M4) or very noisy, but integrated priors over partners (M1) if we assume M1 can account for the full population. In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference. In future work, it would be important to assess this mechanism alongside momentary assessments of mood to understand whether more entropic learning processes contribute to distressing mood fluctuation.’

      ‘Relatedly, the authors mention that a primary focus of mentalization based therapy for BPD is 'restoring a stable sense of self' and 'differentiating the self from the other'. These goals are very reminiscent of the findings of the current study that individuals with BPD show lower uncertainty over their own and relative reward preferences, and that they are less susceptible to social contagion. Could the observed group differences therefore be a result of therapy rather than adverse early life experiences?’

      This is something that we wish to explore in further work. While verbal and model descriptions appear parsimonious, this is not straightforward. As we see, clinical observation and phenomenological dynamics may not necessarily match in an intuitive way to parameters of interest. It may be that compartmentalisation of self and other – as we see in BPD participants within our data – may counter-intuitively express as a less stable self. The evolutionary mechanisms that make social insertion and contagion enduring may also be the same that foster trust and learning.

      ‘Regarding partner similarity: It was unclear to me why the authors chose partners that were 50% similar when it would be at least equally interesting to investigate self-insertion and social contagion with those that are more than 50% different to ourselves? Do the authors have any assumptions or even data that shows the results still hold for situations with lower than 50% similarity?’

      While our task algorithm had a high probability to match individuals who were approximately 50% different with respect to their observed behaviour, there was variation either side of this value. The value of 50% median difference was chosen for two reasons: 1. We wanted to ensure participants had to learn about their partner to some degree relative to their own preferences and 2. we did not want to induce extreme over or under familiarity given the (now replicated) relationship between participant-partner similarity and intentional attributions (see below). Nevertheless, we did have some variation around the 50% median. Figure 3A in the top left panel demonstrates this fluctuation in participant-partner similarity and the figure legend further described this distribution (mean = 49%, sd = 12%). In future work we want to more closely manipulate the median similarity between participants and partners to understand how this facilitates or inhibits learning and generalisation.

      There is some analysis of the relationship between degrees of similarity and behaviour. In the third paragraph of page 15 we report the influence of participant-partner similarity on reaction times. In prior work (Barnby et al., 2022; Cognition) we had shown that similarity was associated with reduced attributions of harm about a partner, irrespective of their true parameters (e.g. whether they were prosocial/competitive). We replicate this previous finding with a double dissociation illustrated in Figure 4, showing that greater discrepancies in participant-partner prosociality increase explicit harmful intent attributions (but not self-interest), and discrepancies in participant-partner individualism reduce explicit self-interest attributions (but not harmful intent). We have made these clearer in our results structure, and included FDR correction values for multiple comparisons.

      The methods section is rather dense and at least I found it difficult to keep track of the many different findings. I recommend the authors reduce the density by moving some of the secondary analyses in the supplementary materials, or alternatively, to provide an overall summary of all presented findings at the end of the Results section.

      We have now moved several of our exploratory findings into the supplementary materials, notably the analysis of participant-partner similarity on reaction times (Fig S9), as well as the uncorrected correlation between parameters (Fig S7).

      Fig 2C) and Discussion p. 21: What do the authors mean by 'more sensitive updates'? more sensitive to what?

      We have now edited the wording to specify ‘more belief updating’ rather than ‘sensitive’ to be clearer in our language.

      P14 bottom: please specify what is meant by axial differences.

      We have changed this to ‘preference type’ rather than using the term ‘axial’.

      It may be helpful to have Supplementary Figure 1 in the main text.

      Thank you for this suggestion. Given the volume of information in the main text we hope that it is acceptable for Figure S1 to remain in the supplementary materials.

      Figure 3D bottom panel: what is the difference between left and right plots? Should one of them be alpha not beta?

      The left and right plots are of the change in standard deviation (left) and central tendency (right) of participant preference change between phase 1 and 3. This is currently noted in the figure legend, but we have added some text to be clearer that this is over prosocial-competitive beliefs specifically. We chose to use this belief as an example given the centrality of prosocial-competitive beliefs in the learning process in Figure 2. We also noticed a small labelling error in the bottom panels of 3D which should have noted that each plot was either with respect to the precision or mean-shift in beliefs during phase 3.

      ‘The relationship between uncertainty over the self and uncertainty over the other with respect to the change in the precision (left) and median-shift (right) in phase 3 prosocial-competitive beliefs.’

      Supplementary Figure 4: The prior presented does not look neutral to me, but rather right-leaning, so competitive, and therefore does indeed look like it was influenced by the self-model? If I am mistaken please could the authors explain why.

      This example distribution is taken from a single BPD participant. In this case, indeed, the prior is somewhat right-shifted. However, on a group level, priors over the partner were closely centred around 0 (see reported statistics in paragraph 2 under the heading ‘Phase 2 – BPD Participants Use Disintegrated and Neutral Priors’). However, we understand how this may come across as misleading. For clarity we have expanded upon Figure S4 to include the phase 1 and prior phase 2 distributions for the entire BPD population for both prosocial and individualistic beliefs. This further demonstrates that those with BPD held surprisingly neutral beliefs over the expectations about their partners’ prosociality, but had minor shifts between their own individualistic preferences and the expected individualistic preferences of their partners. This is also visible in Figure S2.

      Reviewer 2:

      ‘There are two major weaknesses. First, the paper lacks focus and clarity. The introduction is rather vague and, after reading it, I remained confused about the paper's aims. Rather than relying on specific predictions, the analysis is exploratory. This implies that it is hard to keep track, and to understand the significance, of the many findings that are reported.’

      Thank you for this opportunity to be clearer in our framing of the paper. While the model makes specific causal predictions with respect to behavioural dynamics conditional on algorithmic differences, our other analyses were indeed exploratory. We did not preregister this work, but given the intriguing findings we now intend to preregister our future analyses.

      We have made our introduction clearer with respect to the aims of the paper:

      ‘Our present work sought to achieve two primary goals: 1. Extend prior causal computational theories to formalise the interrelation between self-insertion and social contagion within an economic paradigm, the Intentions Game and 2., Test how a diagnosis of BPD may relate to deficits in these forms of generalisation. We propose a computational theory with testable predictions to begin addressing this question. To foreshadow our results, we found that healthy participants employ a mixed process of self-insertion and contagion to predict and align with the beliefs of their partners. In contrast, individuals with BPD exhibit distinct, disintegrated representations of self and other, despite showing similar average accuracy in their learning about partners. Our model and data suggest that the previously observed computational characteristics in BPD, such as reduced self-anchoring during ambiguous learning and a relative impermeability of the self, arise from the failure of information about others to transfer to and inform the self. By integrating separate computational findings, we provide a foundational model and a concise, dynamic paradigm to investigate uncertainty, generalization, and regulation in social interactions.’

      ‘Second, although the computational approach employed is clever and sophisticated, there is important information missing about model comparison which ultimately makes some of the results hard to assess from the perspective of the reader.’

      Our model comparison employed state-of-the-art random-effects Bayesian model comparison (Piray et al., 2019; PLOS Comp. Biol.). It initially fits each individual to each model using Laplace approximation, and subsequently ‘races’ the models against each other at the group and individual level through hierarchical constraints and random-effect considerations. We included this in the methods but have now expanded on the description of how we compared models:

      In the results -

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019; see Methods for more information). We report individual and group-level model responsibility, in addition to protected exceedance probabilities between-groups to assess model dominance.’

      We added to our existing description in the methods –

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019). During fitting we added a small noise floor to distributions ($2.22 \times 10^{-16}$) before normalisation for numerical stability. Parameters were estimated using the HBI in untransformed space drawing from broad priors ($\mu_M = 0$, $\sigma^2_M = 6.5$; where $M = \{M1, M2, M3, M4\}$). This process was run independently for each group. Parameters were transformed into model-relevant space for analysis. All models and hierarchical fitting were implemented in Matlab (Version R2022B). All other analyses were conducted in R (version 4.3.3; arm64 build) running on Mac OS (Ventura 13.0). We extracted individual and group level responsibilities, as well as the protected exceedance probability to assess model dominance per group.’
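
      As an illustration only, the noise-floor step described above can be sketched as follows. This is a minimal Python sketch rather than the Matlab implementation used for the paper; the function name `normalise_with_floor` and the example weights are hypothetical, and only the floor value is taken from the methods text.

```python
import math

# Floor value quoted in the methods; everything else here is illustrative.
NOISE_FLOOR = 2.22e-16


def normalise_with_floor(weights, floor=NOISE_FLOOR):
    """Add a small constant to every entry, then rescale so the entries sum to 1."""
    floored = [w + floor for w in weights]
    total = sum(floored)
    return [w / total for w in floored]


# A hypothetical discrete belief with one exactly-zero entry.
belief = normalise_with_floor([0.0, 0.25, 0.75])

# Without the floor, log(0) would raise an error in any log-likelihood step.
log_belief = [math.log(p) for p in belief]
```

      The floor leaves non-zero probabilities essentially unchanged while keeping every entry strictly positive, so that logarithms stay finite.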

      (1) P3, third paragraph: please define self-insertion

      We have now more clearly defined this in the prior paragraph when introducing concepts.

      ‘To reduce uncertainty about others, theories of the relational self (Anderson & Chen, 2002) suggest that people have available to them an extensive and well-grounded representation of themselves, leading to a readily accessible initial belief (Allport, 1924; Kreuger & Clement, 1994) that can be projected or integrated when learning about others (self-insertion).’

      (2) Introduction: the specific aim of the paper should be clarified - at the moment, it is rather vague. The authors write: "However, critical questions remain: How do humans adjudicate between self-insertion and contagion during interaction to manage interpersonal generalization? Does the uncertainty in self-other beliefs affect their generalizability? How can disruptions in interpersonal exchange during sensitive developmental periods (e.g., childhood maltreatment) inform models of psychiatric disorders?". Which of these questions is the focus of the paper? And how does the paper aim at addressing it?

      (3) Relatedly, from the introduction it is not clear whether the goal is to develop a theory of self-insertion and social contagion and test it empirically, or whether it is to study these processes in BPD, or both (or something else). Clarifying which specific question(s) is addressed is important (also clarifying what we already know about that specific question, and how the paper aims at elucidating that specific question).

      We have now included the specific aims of the paper. We note this in the above response to the reviewer's general comments.

      (4) "Computational models have probed social processes in BPD, linking the BPD phenotype to a potential over-reliance on social versus internal cues (Henco et al., 2020), 'splitting' of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others' irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Previous studies have typically overlooked how self and other are represented in tandem, prompting further investigation into why any of these BPD phenotypes manifest." Not clear what the link between the first and second sentence is. Does it mean that previous computational models have focused exclusively on how other people are represented in BPD, and not on how the self is represented? Please spell this out.

      Thank you for the opportunity to be clearer in our language. We have now spelled out our point more precisely, and included some extra relevant literature helpfully pointed out by another reviewer.

      ‘Computational models have probed social processes in BPD, although almost exclusively during observational learning. The BPD phenotype has been associated with a potential over-reliance on social versus internal cues (Henco et al., 2020), ‘splitting’ of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others’ irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Associative models have also been adapted to characterize  ‘leaky’ self-other reinforcement learning (Ereira et al., 2018), finding that those with BPD overgeneralize (leak updates) about themselves to others (Story et al., 2024). Altogether, there is currently a gap in the direct causal link between insertion, contagion, and learning (in)stability.’

      (5) P5, first paragraph. The description of the task used in phase 1 should be more detailed. The essential information for understanding the task is missing.

      We have updated this section to point toward Figure 1 and the Methods where the details of the task are more clearly outlined. We hope that it is acceptable not to explain the full task at this point for brevity and to not interrupt the flow of the results.

      ‘Detailed descriptions of the task can be found in the methods section and Figure 1.’

      (6) P5, second paragraph: briefly state how the Psychometric data were acquired (e.g., self-report).

      We have now clarified this in the text.

      ‘All participants also self-reported their trait paranoia, childhood trauma, trust beliefs, and trait mentalizing (see methods).’

      (7) "For example, a participant could make prosocial (self=5; other=5) versus individualistic (self=10; other=5) choices, or prosocial (self=10; other=10) versus competitive (self=10; other=5) choices". Not sure what criteria are used for distinguishing between individualistic and competitive - they look the same?

      We apologise for the lack of clarity. The paragraph did not make explicit that the interpretation of a choice depends on both members of the pair of options. Here, in one pair {(self=5,other=5) vs (self=10,other=5)}, it is highly prosocial for the self to choose (5,5), sacrificing 5 points for the sake of equality. In the second pair {(self=10,other=10) vs (self=10,other=5)}, it is highly competitive to choose (10,5), denying the other 5 points at no benefit to the self. We have clarified this:

      ‘We analyzed the ‘types’ of choices participants made in each phase (Supplementary Table 1). The interpretation of a participant’s choice depends on both values in a choice. For example, a participant could make prosocial (self=5; other=5) versus individualistic (self=10; other=5) choices, or prosocial (self=10; other=10) versus competitive (self=10; other=5) choices. There were 12 of each pair in phases 1 and 3 (individualistic vs. prosocial; prosocial vs. competitive; individualistic vs. competitive).’  
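
      To make the pair-dependence concrete, the sketch below compares a chosen option with the one that was turned down. It is a hypothetical Python illustration, not part of the task code; the names `Option` and `describe_choice` and the three summary quantities are assumptions introduced here. Only the point values come from the example above.

```python
from typing import NamedTuple


class Option(NamedTuple):
    self_pts: int
    other_pts: int


def describe_choice(chosen: Option, rejected: Option) -> dict:
    """Summarise a choice relative to the alternative that was rejected."""
    return {
        # Points the chooser gains (or gives up) by picking this option.
        "self_gain": chosen.self_pts - rejected.self_pts,
        # Points the partner gains (or loses) as a result.
        "other_gain": chosen.other_pts - rejected.other_pts,
        # Change in the chooser's lead over the partner.
        "advantage_gain": (chosen.self_pts - chosen.other_pts)
        - (rejected.self_pts - rejected.other_pts),
    }


same_option = Option(10, 5)

# Against (5, 5): the chooser earns 5 more and the partner loses nothing.
vs_equal_low = describe_choice(same_option, Option(5, 5))

# Against (10, 10): the chooser earns nothing extra and the partner loses 5.
vs_equal_high = describe_choice(same_option, Option(10, 10))
```

      The same option (self=10; other=5) reads as individualistic in the first pair and as competitive in the second, which is why choices are labelled per pair rather than per option.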

      (8) "In phase 1, both CON and BPD participants made prosocial choices over competitive choices with similar frequency (CON=9.67[3.62]; BPD=9.60[3.57])" please report t-test - the same applies also various times below.

      We have now included the t-test statistics with each instance.

      ‘In phase 3, both CON and BPD participants continued to make equally frequent prosocial versus competitive choices (CON=9.15[3.91]; BPD=9.38[3.31]; t=-0.54, p=0.59); CON participants continued to make significantly fewer prosocial versus individualistic choices (CON=2.03[3.45]; BPD=3.78 [4.16]; t=2.31, p=0.02). Both groups chose equally frequent individualistic versus competitive choices (CON=10.91[2.40]; BPD=10.18[2.72]; t=-0.49, p=0.62).’

      (9) P 9: "Models M2 and M3 allow for either self-insertion or social contagion to occur independently" what's the difference between M2 and M3?

      Model M2 hypothesises that participants use their own self representation as priors when learning about the other in phase 2, but are not influenced by their partner. M3 hypothesises that participants form an uncoupled prior (no self-insertion) about their partner in phase 2, and their choices in phase 3 are influenced by observing their partner in phase 2 (social contagion). In Figure 1 we illustrate the difference between M2 and M3. In Table 1 we specifically report the parameterisation differences between M2 and M3. We have also now included a correlational analysis of parameters between models to demonstrate the relationship between model parameters of equivalent value between models (Fig S11). We have also force fitted all models (M1-M4) to the data independently and reported group differences within each (see Table S2 and Table S3).

      (10) P 9, last paragraph: I did not understand the description of the Beta model.

      The beta model is outlined in detail in Table 1. We have also clarified the description of the beta model on page 9:

      ‘The ‘Beta model’ is equivalent to M1 in its causal architecture (both self-insertion and social contagion are hypothesized to occur) but differs in richness: it accommodates the possibility that participants might only consider a single dimension of relative reward allocation, which is typically emphasized in previous studies (e.g., Hula et al., 2018).’

      (11) P 9: I wonder whether one could think about more intuitive labels for the models, rather than M1, M2 etc.. This is just a suggestion, as I am not sure a short label would be feasible here.

      Thank you for this suggestion. We apologise that the labels are not very intuitive. Given the various terms we use to explain the different processes of generalisation that might occur between self and other, and given that each model is a different combination of these, we felt that numbering the models was the lesser evil. We hope that the reader will be able to reference both Figure 1 and Table 1 to get a good feel for how the models and their causal implications differ.

      (12) Model comparison: the information about what was done for model comparison is scant, and little about fit statistics is reported. At the moment, it is hard for a reader to assess the results of the model comparison analysis.

      Model comparison and fitting were conducted using simultaneous hierarchical fitting and random-effects comparison. This is implemented in the HBI package (Piray et al., 2019), where the assumptions and fitting procedures are outlined in great detail. In short, our approach allows for individual and group-level hierarchical fitting and comparison. This overcomes the issue of interdependence between and within model fitting within a population, which is often estimated separately.

      We have outlined this in the methods, although appreciate we do not touch upon it until the reader reaches that point. We have added a clarification statement on page 9 to rectify this:

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019; see Methods for more information). We report individual and group-level model responsibility, in addition to protected exceedance probabilities between-groups to assess model dominance.’

      (13) P 14, first paragraph: "BPD participants were also more certain about both types of preference" what are the two types of preferences?

      The two types of preferences are relative (prosocial-competitive) and absolute (individualistic) reward utility. These are expressed as b and a respectively. We have expanded the sentence in question to make this clearer:

      ‘BPD participants were also more certain about both self-preferences for absolute and relative reward ( = -0.89, 95%HDI: -1.01, -0.75; = -0.32, 95%HDI: -0.60, -0.04) versus CON participants (Figure 2B).’

      (14) "Parameter Associations with Reported Trauma, Paranoia, and Attributed Intent" the results reported here are intriguing, but not fully convincing as there is the problem of multiple comparisons. The combinations between parameters and scales are rather numerous. I suggest to correct for multiple comparisons and to flag only the findings that survive correction.

      We have now corrected this and controlled for multiple comparisons through partial correlation analysis, bootstrapping assessment for robustness, permutation testing, and False Discovery Rate correction. We only report those associations that survive bootstrapping and permutation testing, reporting both corrected (p[fdr]) and uncorrected (p) significance.
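
      For readers unfamiliar with how corrected values such as p[fdr] relate to uncorrected ones, the sketch below shows one common adjustment, the Benjamini-Hochberg step-up procedure. It is a generic Python illustration under the assumption that a Benjamini-Hochberg style adjustment is meant; the response does not state which FDR procedure was used, and the p-values shown are hypothetical, not results from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical uncorrected p-values for four parameter-scale correlations.
p_uncorrected = [0.01, 0.04, 0.03, 0.20]
p_fdr = benjamini_hochberg(p_uncorrected)
```

      Each adjusted value is at least as large as its uncorrected counterpart, so an association that remains below the chosen threshold after adjustment is the more conservative finding.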

      (15) Results page 14 and page 15. The authors compare the various parameters between groups. I would assume that these parameters come from M1 for controls and from M4 for BPD? Please clarify if this is indeed the case. If it is the case, I am not sure this is appropriate. To my knowledge, it is appropriate to compare parameters between groups only if the same model is fit to both groups. If two different models are fit to each group, then the parameters are not comparable, as the parameters have, so to speak, different "meaning" in two models. Now, I want to stress that my knowledge on this matter may be limited, and that the authors' approach may be sound. However, to be reassured that the approach is indeed sound, I would appreciate a clarification on this point and a reference to relevant sources about this approach.

      This is an important point. First, we confirmed all our main conclusions about parameter differences using the maximal model M1 to fit all the participants. We added Supplementary Table 2 to report the outcome of this analysis. Second, we did the same for parameters across all models M1-M4, fitting each to participants without comparison. This is particularly relevant for M3, since at least a minority of participants of both groups were best explained by this model. We report these analyses in Fig S11:

      Since M4 is nested within M1, we argue that this comparison is still meaningful, and we note explanations in the text for why the effects observed between groups may occur given the differences in their causal meaning, for example in the results under phase 2 analyses:

      ‘Belief updating in phase 2 was less flexible in BPD participants. Median change in beliefs (from priors to posteriors) about a partner’s preferences was lower versus CON ( = -5.53, 95%HDI: -7.20, -4.00; = -10.02, 95%HDI: -12.81, -7.30). Posterior beliefs about partner were more precise in BPD versus CON ( = -0.94, 95%HDI: -1.50, -0.45; = -0.70, 95%HDI: -1.20, -0.25). This is unsurprising given the disintegrated priors of the BPD group in M4, meaning they need to ‘travel less’ in state space. Nevertheless, even under assumptions of M1 and M2 for both groups, BPD showed smaller posterior median changes versus CON in phase 2 (see Table T2). These results converge to suggest those with BPD form rigid posterior beliefs.’

      (16) "We built and tested a theory of interpersonal generalization in a population of matched participants" this sentence seems to be unwarranted, as there is no theory in the paper (actually, as it is now, the paper looks rather exploratory)

      We thank the reviewer for their perspective. Formal models can be used as a theoretical statement on the causal algorithmic process underlying decision making and choice behaviour; the development of formal models is an essential theoretical tool for precision and falsification (Haslbeck et al., 2022). In this sense, we have built several competing formal theories that test, using causal architectures, whether the latent distribution(s) that generate one’s choices generalise into one’s predictions about another person, and simultaneously whether one’s latent distribution(s) that represent beliefs about another person are used to inform future choices.

      Reviewer 3:

      ‘My broad question about the experiment (in terms of its clinical and cognitive process relevance): Does the task encourage competition or give participants a reason to take advantage of others? I don't think it does, so it would be useful to clarify the normative account for prosociality in the introduction (e.g., some of Robin Dunbar's work).’

      We agree that our paradigm does not encourage competition. We use a reward structure that requires participants to overcome a particular threshold before earning rewards, but there is no competitive element to this, in that points earned or not earned by partners have no bearing on the outcomes for the participant. This is important given the consideration of recursive properties that arise through mixed-motive games; we wanted to focus purely on observational learning in phase 2, and repercussion-free choices made by participants in phase 1 and 3, meaning the choices of participants, and the decisions of a partner, are theoretically in line with self-preferences irrespective of the judgement of others. We have included a clearer statement of the structure of this type of task, and more clearly cited the origin for its structure (Murphy & Ackerman, 2011):

      ‘Our present work sought to achieve two primary goals. 1. Extend prior causal computational theories to formalise and test the interrelation between self-insertion and social contagion on learning and behaviour to better probe interpersonal generalisation in health, and 2., Test whether previous computational findings of social learning changes in BPD can be explained by infractions to self-other generalisation. We accomplish these goals by using a dynamic, sequential social value economic paradigm, the Intentions Game, building upon a Social Value Orientation Framework (Murphy & Ackerman, 2011) that assumes motivational variation in joint reward allocation.’

      Given the introduction's structure as it stands, we felt that providing another paragraph on the normative assumptions of such a game was outside the scope of this article.

      ‘The finding that individuals with BPD do not engage in self-other generalization on this task of social intentions is novel and potentially clinically relevant. The authors find that BPD participants' tendency to be prosocial when splitting points with a partner does not transfer into their expectations of how a partner will treat them in a task where they are the passive recipient of points chosen by the partner. In the discussion, the authors reasonably focus on model differences between groups (Bayesian model comparison), yet I thought this finding -- BPD participants not assuming prosocial tendencies in phase 2 while CON participants did -- merited greater attention. Although the BPD group was close to 0 on the \beta prior in Phase 2, their difference from CON is still in the direction of being more mistrustful (or at least not assuming prosociality). This may line up with broader clinical literature on mistrustfulness and attributions of malevolence in the BPD literature (e.g., a 1992 paper by Nigg et al. in Journal of Abnormal Psychology). My broad point is to consider further the Phase 2 findings in terms of the clinical interpretation of the shift in \beta relative to controls.’

      This is an important point that we contextualize within the parameterisation of our utility model. While the shift toward 0 in the BPD participants is indeed more competitive, as the reviewer notes, it is surprisingly centred closely around 0, with only a slight bias to be prosocial (mean = -0.47;  = -6.10, 95%HDI: -7.60, -4.60). Charitably we might argue that BPD participants are expecting more competitive preferences from their partner. Even so, given the variance around their priors in phase 2, they are uncertain or unconfident about this. We take a more conservative approach in the paper and say that given the tight proximity to 0 and the variance of their group priors, they are likely to be ‘hedging their bets’ on whether their partner is going to be prosocial or competitive. While the movement from phase 1 to 2 is indeed in the competitive direction, it still lands in neutral territory. Model M4 does not preclude central tendencies at the start of Phase 2 being more in the competitive direction.

      ‘First, the authors note that they have "proposed a theory with testable predictions" (p. 4 but also elsewhere) but they do not state any clear predictions in the introduction, nor do they consider what sort of patterns will be observed in the BPD group in view of extant clinical and computational literature. Rather, the paper seems to be somewhat exploratory, largely looking at group differences (BPD vs. CON) on all of the shared computational parameters and additional indices such as belief updating and reaction times. Given this, I would suggest that the authors make stronger connections between extant research on intention representation in BPD and their framework (model and paradigm). In particular, the authors do not address related findings from Ereira (2020) and Story (2024) finding that in a false belief task that BPD participants *overgeneralize* from self to other. A critical comparison of this work to the present study, including an examination of how the two tasks differ in the processes they measure, is important.’

      Thank you for this opportunity to include more of the important work that has preceded the present manuscript. Prior work has tended to focus either on descriptive explanations of self-other generalisation (e.g. through the use of RW type models) or on observational learning instability in the absence of a causal model of where initial self-other beliefs may arise. While the prior work cited by the reviewer [Ereira (2020; Nat. Comms.) and Story (2024; Trans. Psych.)] does examine the inter-trial updating between self-other, it does not integrate a self model into a self’s belief about an other prior to observation. Rather, it focuses almost exclusively on prediction error ‘leakage’ generated during learning about individual reward (i.e. one sided reward). These findings are important, but lie in a slightly different domain. They also do not cut against ours, and in fact, we argue in the discussion that the sort of learning instability described above and splitting (as we cite from Story et al., 2024; Psych. Rev.) may result from a lack of the self anchoring typical of CON participants. Nevertheless, we agree these works provide an important premise to contrast and set the groundwork for our present analysis and have included them in the framing of our introduction, as well as contrasting them to our data in the discussion.

      In the introduction:

      ‘The BPD phenotype has been associated with a potential over-reliance on social versus internal cues (Henco et al., 2020), ‘splitting’ of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others’ irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Associative models have also been adapted to characterize  ‘leaky’ self-other reinforcement learning (Ereira et al., 2018), finding that those with BPD overgeneralize (leak updates) about themselves to others (Story et al., 2024). Altogether, there is currently a gap in the direct causal link between insertion, contagion, and learning (in)stability.’

      In the discussion:

      ‘Disruptions in self-to-other generalization provide an explanation for previous computational findings related to task-based mentalizing in BPD. Studies tracking observational mentalizing reveal that individuals with BPD, compared to those without, place greater emphasis on social over internal reward cues when learning (Henco et al., 2020; Fineberg et al., 2018). Those with BPD have been shown to exhibit reduced belief adaptation (Siegel et al., 2020) along with ‘splitting’ of latent social representations (Story et al., 2024a). BPD is also shown to be associated with overgeneralisation in self-to-other belief updates about individual outcomes when using a one-sided reward structure (where participant responses had no bearing on outcomes for the partner; Story et al., 2024b). Our analyses show that those with BPD are equal to controls in their generalisation of absolute reward (outcomes that only affect one player) but disintegrate beliefs about relative reward (outcomes that affect both players) through adoption of a new, neutral belief. We interpret this together in two ways: 1. There is a strong concern about social relativity when those with BPD form beliefs about others, 2. The absence of constrained self-insertion about relative outcomes may predispose to brittle or ‘split’ beliefs. In other words, those with BPD assume ambiguity about the social relativity preferences of another (i.e. how prosocial or punitive) and are quicker to settle on an explanation to resolve this. Although self-insertion may be counter-intuitive to rational belief formation, it has important implications for sustaining adaptive, trusting social bonds via information moderation.’

      ‘In addition, perhaps it is fairer to note more explicitly the exploratory nature of this work. Although the analyses are thorough, many of them are not argued for a priori (e.g., rate of belief updating in Figure 2C) and the reader amasses many individual findings that need to be synthesized.’

      We have now noted the primary goals of our work in the introduction, and have included caveats about the exploratory nature of our analyses. We would note that our model is in effect a causal combination of prior work cited within the introduction (Barnby et al., 2022; Moutoussis et al., 2016). This renders our computational models in effect a causal theory to test, although we agree that our dissection of the results are exploratory. We have more clearly signposted this:

      ‘Our present work sought to achieve two primary goals. 1. Extend prior causal computational theories to formalise and test the interrelation between self-insertion and social contagion on learning and behaviour to better probe interpersonal generalisation in health, and 2., Test whether previous computational findings of social learning changes in BPD can be explained by infractions to self-other generalisation. We accomplish these goals by using a dynamic, sequential economic paradigm, the Intentions Game, building upon a Social Value Orientation Framework (Murphy & Ackerman, 2011) that assumes innate motivational variation in joint reward allocation.’

      ‘Second, in the discussion, the authors are too quick to generalize to broad clinical phenomena in BPD that are not directly connected to the task at hand. For example, on p. 22: "Those with a diagnosis of BPD also show reduced permeability in generalising from other to self. While prior research has predominantly focused on how those with BPD use information to form impressions, it has not typically examined whether these impressions affect the self." Here, it's not self-representation per se (typically, identity or one's view of oneself), but instead cooperation and prosocial tendencies in an economic context. It is important to clarify what clinical phenomena may be closely related to the task and which are more distal and perhaps should not be approached here.’

      Thank you for this important point. We agree that social value orientation, and particularly in this economically-assessed form, is but one aspect of the self, and we did not test any others. A version of the social contagion phenomena is also present in other aspects of the self in intertemporal (Moutoussis et al., 2016), economic (Suzuki et al., 2016) and moral preferences (Yu et al., 2021). It would be most interesting to attempt to correlate the degrees of insertion and contagion across the different tasks.

      We take seriously the wider concern that behaviour in our tasks based on economic preferences may not have clinical validity. This issue is central to the whole field of computational psychiatry, much of which is based on generalizing from tasks like ours, and discussing correlations with psychometric measures. We hope that it is acceptable to leave such discussions to the many reviews on computational psychiatry (Montague et al., 2012; Hitchcock et al., 2022; Huys et al., 2016). Here, we have just put a caveat in the discussion:

      ‘Finally, a limitation may be that behaviour in tasks based on economic preferences may not have clinical validity. This issue is central to the field of computational psychiatry, much of which is based on generalising from tasks like that within this paper and discussing correlations with psychometric measures. Extrapolating  economic tasks into the real world has been the topic of discussion for the many reviews on computational psychiatry (e.g. Montague et al., 2012; Hitchcock et al., 2022; Huys et al., 2016). We note a strength of this work is the use of model comparison to understand causal algorithmic differences between those with BPD and matched healthy controls. Nevertheless, we wish to further pursue how latent characteristics captured in our models may directly relate to real-world affective change.’

      ‘On a more technical level, I had two primary concerns. First, although the authors consider alternative models within a hierarchical Bayesian framework, some challenges arise when one analyzes parameter estimates fit separately to two groups, particularly when the best-fitting model is not shared. In particular, although the authors conduct a model confusion analysis, they do not as far I could tell (and apologies if I missed it) demonstrate that the dynamics of one model are nested within the other. Given that M4 has free parameters governing the expectations on the absolute and relative reward preferences in Phase 2, is it necessarily the case that the shared parameters between M1 and M4 can be interpreted on the same scale? Relatedly, group-specific model fitting has virtues when believes there to be two distinct populations, but there is also a risk of overfitting potentially irrelevant sample characteristics when parameters are fit group by group.

      To resolve these issues, I saw one straightforward solution (though in modeling, my experience is that what seems straightforward on first glance may not be so upon further investigation). M1 assumes that participants' own preferences (posterior central tendency) in Phase 1 directly transfer to priors in Phase 2, but presumably the degree of transfer could vary somewhat without meriting an entirely new model (i.e., the authors currently place this question in terms of model selection, not within-model parameter variation). I would suggest that the authors consider a model parameterization fit to the full dataset (both groups) that contains free parameters capturing the *deviations* in the priors relative to the preceding phase's posterior. That is, the free parameters $\bar{\alpha}_{par}^m$ and $\bar{\beta}_{par}^m$ govern the central tendency of the Phase 2 prior parameter distributions directly, but could be reparametrized as deviations from Phase 1 $\theta^m_{ppt}$ parameters in an additive form. This allows for a single model to be fit all participants that encompasses the dynamics of interest such that between-group parameter comparisons are not biased by the strong assumptions imposed by M1 (that phase 1 preferences and phase 2 observations directly transfer to priors). In the case of controls, we would expect these deviation parameters to be centred on 0 insofar as the current M1 fit them best, whereas for BPD participants should have significant deviations from earlier-phase posteriors (e.g., the shift in \beta toward prior neutrality in phase 2 compared to one's own prosociality in phase 1). 
I think it's still valid for the authors to argue for stronger model constraints for Bayesian model comparison, as they do now, but inferences regarding parameter estimates should ideally be based on a model that can encompass the full dynamics of the entire sample, with simpler dynamics (like posterior -> prior transfer) being captured by near-zero parameter estimates.’

      Thank you for the chance to be clearer in our modelling. In particular, the suggestion to fit a single model to all participants that allows for something like partial social insertion, to check whether the results stand, can be accomplished with our existing models. That is, the parameter that governs the flexibility over beliefs in phase 2 under models M1 (dominant for CON participants) and M2 parameterises the degree to which participants think their partner may be different from themselves. Thus, forcibly fitting M1 and M2 hierarchically to all participants, and then separately to BPD and CON participants, can quantify the issue raised: if BPD participants indeed regard partners as different enough from themselves to warrant a new central tendency, this flexibility parameter should be quantitatively higher in BPD than in CON participants under M1 and M2.

      We therefore tested this, reporting the distributional differences in this parameter between BPD and CON participants under M1, both when fitted together as a population and as separate groups. Because it is higher for BPD participants under both conditions for M1 and M2, the result supports our claim and adds context for the comparison: the flexibility may be large enough in BPD that a new central tendency to anchor beliefs is a more parsimonious explanation.

      We cross-checked this result by assessing the discrepancy between the participant’s and assumed partner’s central tendencies for both prosocial and individualistic preferences via the best-fitting model M4 for the BPD group. We thereby examined whether belief disintegration is uniform across preferences (relative vs absolute reward) or whether one tendency was shifted dramatically more than another. We found that beliefs over prosocial-competitive preferences were dramatically shifted, whereas those over individualistic preferences were not.

      We have added the following to the main text results to explain this:

      Model Comparison:

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Protected Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Protected Exceedance Probability = 0.86; Figure 2A). We first analyse the results of these separate fits. Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful. We refer to both types of analysis below.’

      Phase 1:

      ‘These differences were replicated when considering parameters between groups when we fit all participants to the same models (M1-M4; see Table S2).’

      Phase 2:

      ‘To check that these conclusions about self-insertion did not depend on the different models, we found that only under M1 and M2 were consistently larger in BPD versus CON. This supports the notion that new central tendencies for BPD participants in phase 2 were required, driven by expectations about a partner’s relative reward. (see Fig S10 & Table S2). and parameters under assumptions of M1 and M2 were strongly correlated with median change in belief between phase 1 and 2 under M3 and M4, suggesting convergence in outcome (Fig S11).’

      ‘Furthermore, even under assumptions of M1-M4 for both groups, BPD showed smaller posterior median changes versus CON in phase 2 (see Table T2). These results converge to suggest those with BPD form rigid posterior beliefs.’

      ‘Assessing this same relationship under M1- and M2-only assumptions reveals a replication of this group effect for absolute reward, but the effect is reversed for relative reward (see Table S3). This accords with the context of each model, where under M1 and M2, BPD participants had larger phase 2 prior flexibility over relative reward (leading to larger initial surprise), which was better accounted for by a new central tendency under M4 during model comparison. When comparing both groups under M1-M4 informational surprise over absolute reward was consistently restricted in BPD (Table S3), suggesting a diminished weight of this preference when forming beliefs about an other.’

      Phase 3

      ‘In the dominant model for the BPD group, M4, participants are not influenced in their phase 3 choices following exposure to their partner in phase 2. To further confirm this, we also analysed absolute change in median participant beliefs between phase 1 and 3 under the assumption that M1 or M3 (both of which allow contagion to occur) was the dominant model for both groups. This analysis aligns with our primary model comparison using M1 for CON and M4 for BPD (Figure 2C). CON participants altered their median beliefs between phase 1 and 3 more than BPD participants (M1: linear estimate = 0.67, 95%CI: 0.16, 1.19; t = 2.57, p = 0.011; M3: linear estimate = 1.75, 95%CI: 0.73, 2.79; t = 3.36, p < 0.001). Relative reward was overall more susceptible to contagion versus absolute reward (M1: linear estimate = 1.40, 95%CI: 0.88, 1.92; t = 5.34, p < 0.001; M3: linear estimate = 2.60, 95%CI: 1.57, 3.63; t = 4.98, p < 0.001). There was an interaction between group and belief type under M3 but not M1 (M3: linear estimate = 2.13, 95%CI: 0.09, 4.18, t = 2.06, p = 0.041). There was only a main effect of belief type on precision under M3 (linear estimate = 0.47, 95%CI: 0.07, 0.87, t = 2.34, p = 0.02); relative reward preferences became more precise across the board. Derived model estimates of preference change between phase 1 and 3 strongly correlated between M1 and M3 along both belief types (see Table S2 and Fig S11).’

      ‘My second concern pertains to the psychometric individual difference analyses. These were not clearly justified in the introduction, though I agree that they could offer potentially meaningful insight into which scales may be most related to model parameters of interest. So, perhaps these should be earmarked as exploratory and/or more clearly argued for. Crucially, however, these analyses appear to have been conducted on the full sample without considering the group structure. Indeed, many of the scales on which there are sizable group differences are also those that show correlations with psychometric scales. So, in essence, it is unclear whether most of these analyses are simply recapitulating the between-group tests reported earlier in the paper or offer additional insights. I think it's hard to have one's cake and eat it, too, in this regard and would suggest the authors review Preacher et al. 2005, Psychological Methods for additional detail. One solution might be to always include group as a binary covariate in the symptom dimension-parameter analyses, essentially partialing the correlations for group status. I remain skeptical regarding whether there is additional signal in these analyses, but such controls could convince the reader. Nevertheless, without such adjustments, I would caution against any transdiagnostic interpretations such as this one in the Highlights: "Higher reported childhood trauma, paranoia, and poorer trait mentalizing all diminish other-to-self information transfer irrespective of diagnosis." Since many of these analyses relate to scales on which the groups differ, the transdiagnostic relevance remains to be demonstrated.’

      We have restructured the psychometric section to ensure transparency and clarity in our analysis. Namely, in response to these comments and those of the other reviewers, we have opted to remove the parameter analyses that aimed to cross-correlate psychometric scores with latent parameters from different models: as the reviewer points out, we do not have parity between dominant models for each group to warrant this, and fitting the same model to both groups artificially makes the parameters qualitatively different. Instead we have opted to focus on social contagion, or rather restrictions on it, between phases 1 and 3 explained by M3. This provides us with an opportunity to examine social contagion at the whole-population level, isolated from self-insertion biases. We performed bootstrapping (1000 reps) and permutation testing (1000 reps) to assess the stability and significance of each edge in the partial correlation network, and then applied FDR correction (p[fdr]), thus controlling for multiple comparisons. We note that while we focused on M3 to isolate the effect across the population, social contagion across both relative and absolute reward under M3 strongly correlated with social contagion under M1 (see Fig S11).

      ‘We explored whether social contagion may be restricted as a result of trauma, paranoia, and less effective trait mentalizing under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). We conducted partial correlation analysis to estimate relationships conditional on all other associations and retained all that survived bootstrapping (1000 reps), permutation testing (1000 reps), and subsequent FDR correction. Persecution and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p = 0.004, p[fdr] = 0.043; CTQ r = 0.354, 95%CI: 0.13, 0.56, p = 0.019, p[fdr] = 0.02). MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p = 0.026, p[fdr] = 0.043). CTQ scores were also directly and negatively associated with shifts in individualistic preferences (r = -0.24, 95%CI: -0.44, -0.13, p = 0.052, p[fdr] = 0.065). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired mentalising (Figure 4A).’
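
      The two building blocks of the analysis described above, partial correlations and FDR correction, can be sketched as follows. This is a minimal illustration on simulated data, not the authors' pipeline: it assumes NumPy, the function names `partial_correlations` and `benjamini_hochberg` are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method, and the bootstrapping and permutation steps are omitted.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse covariance (precision) matrix.

    data: array of shape (n_observations, n_variables).
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.outer(np.diag(precision), np.diag(precision)))
    pcor = -precision / scale
    np.fill_diagonal(pcor, 1.0)
    return pcor

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Simulated chain x -> z -> y: x and y are related only through z.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = x + rng.normal(size=5000)
y = z + rng.normal(size=5000)
pcor = partial_correlations(np.column_stack([x, y, z]))
```

      In the simulated chain, the partial correlation between `x` and `y` is close to zero once `z` is conditioned on, which is the sense in which each edge is estimated "conditional on all other associations".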

      (1) As far as I could tell, the authors didn't provide an explanation of this finding on page 5: "However, CON participants made significantly fewer prosocial choices when individualistic choices were available" While one shouldn't be forced to interpret every finding, the paper is already in that direction and I found this finding to be potentially relevant to the BPD-control comparison.

      Thank you for this observation. This sentence reports the fact that CON participants were effectively more selfish than BPD participants. This is captured by the lower value of the corresponding parameter reported in Figure 2, and suggests that CON participants were more focused on absolute value – acting in a more ‘economically rational’ manner – versus BPD participants. This fits in with our fourth paragraph of the discussion where we discuss prior work that demonstrates a heightened social focus in those with BPD. Indeed, the finding the reviewer highlights further emphasises the point that those with BPD are much more sensitive to, and motivated to choose, options concerning relative reward than are CON participants. The text in the discussion reads:

      ‘We also observe this in self-generated participant choice behaviour, where CON participants were more concerned over absolute reward versus their BPD counterparts, suggesting a heightened focus on relative vs. absolute reward in those with BPD.’

      (2) The adaptive algorithm for adjusting partner behavior in Phase 2 was clever and effective. Did the authors conduct a manipulation check to demonstrate that the matching resulted in approximately 50% difference between one's behavior in Phase 1 and the partner in Phase 2? Perhaps Supplementary Figure suffices, but I wondered about a simpler metric.

      Thanks for this point. We highlight this in Figure 3B and in its legend, although we appreciate that the panel is quite small and may be missed. We have now highlighted this manipulation check more clearly in the behavioural analysis section of the main text:

      ‘Server matching between participant and partner in phase 2 was successful, with participants being approximately 50% different to their partners with respect to the choices each would have made on each trial in phase 2 (mean similarity=0.49, SD=0.12).’
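
      As a rough illustration of the similarity metric reported above, the proportion of trials on which participant and partner would choose the same option can be computed as below. The function name `choice_similarity` and the example choice sequences are hypothetical; the authors' exact computation is not given in this response.

```python
def choice_similarity(participant_choices, partner_choices):
    """Proportion of trials on which both agents would pick the same option."""
    if len(participant_choices) != len(partner_choices):
        raise ValueError("choice sequences must have the same length")
    if not participant_choices:
        raise ValueError("choice sequences must not be empty")
    matches = sum(a == b for a, b in zip(participant_choices, partner_choices))
    return matches / len(participant_choices)

# Hypothetical choices (option 1 or 2) over eight phase 2 trials.
participant = [1, 1, 2, 1, 2, 2, 1, 2]
partner = [1, 2, 2, 2, 2, 1, 1, 1]
similarity = choice_similarity(participant, partner)  # 4 matches out of 8 trials
```

      A value near 0.5, as in this toy example, corresponds to a partner who is about 50% different from the participant.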

      (3) The resolution of point-range plots in Figure 4 was grainy. Perhaps it's not so in the separate figure file, but I'd suggest checking.

      Apologies. We have now updated and reorganised the figure to improve clarity.

      (4) p. 21: Suggest changing to "different" as opposed to "opposite" since the strategies are not truly opposing: "but employed opposite strategies."

      We have amended this.

      (5) p. 21: I found this sentence unclear, particularly the idea of "similar updating regime." I'd suggest clarifying: "In phase 2, CON participants exhibited greater belief sensitivity to new information during observational learning, eventually adopting a similar updating regime to those with BPD."

      We have clarified this statement:

      ‘In observational learning in phase 2, CON participants initially updated their beliefs in response to new information more quickly than those with BPD, but eventually converged to a similar rate of updating.’

      (6) p. 23: The content regarding psychosis seemed out of place, particularly as the concluding remark. I'd suggest keeping the focus on the clinical population under investigation. If you'd like to mention the paradigm's relevance to psychosis (which I think could be omitted), perhaps include this as a future direction when describing the paradigm's strengths above.

      We agree the paragraph is somewhat speculative. We have omitted it in aid of keeping the messaging succinct and to the point.

      (7) p. 24: Was BPD diagnosis assessed using an unstructured clinical interview? Although psychosis was exclusionary, what about recent manic or hypomanic episodes or Bipolar diagnosis? A bit more detail about BPD sample ascertainment would be useful, including any instruments used to make a diagnosis and information about whether you measured inter-rater agreement.

      Participants diagnosed with BPD were recruited from specialist personality disorder services across various London NHS mental health trusts. The diagnosis of BPD was established by trained assessors at the clinical services and confirmed using the Structured Clinical Interview for DSM-IV (SCID-II) (First et al., 1997). Individuals with a history of psychotic episodes, severe learning disability or neurological illness/trauma were excluded. We have now included this extra detail within our methods in the paper:

      ‘The majority of BPD participants were recruited through referrals by psychiatrists, psychotherapists, and trainee clinical psychologists within personality disorder services across 9 NHS Foundation Trusts in London, and 3 NHS Foundation Trusts across England (Devon, Merseyside, Cambridgeshire). Four BPD participants were also recruited by self-referral through the UCLH website, where the study was advertised. To be included in the study, all participants needed to have, or meet criteria for, a primary diagnosis of BPD (or emotionally-unstable personality disorder or complex emotional needs) based on a professional clinical assessment conducted by the referring NHS trust (for self-referrals, the presence of a recent diagnosis was ascertained through thorough discussion with the participant, whereby two of the four also provided clinical notes). The patient participants also had to be under the care of the referring trust or have a general practitioner whose details they were willing to provide. Individuals with psychotic or mood disorders, recent acute psychotic episodes, severe learning disability, or current or past neurological disorders were not eligible for participation and were therefore not referred by the clinical trusts.’

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1:

      Point 1.1

      Summary: This paper describes a reanalysis of data collected by Gagne et al. (2020), who investigated how human choice behaviour differs in response to changes in environmental volatility. Several studies to date have demonstrated that individuals appear to increase their learning rate in response to greater volatility and that this adjustment is reduced amongst individuals with anxiety and depression. The present authors challenge this view and instead describe a novel Mixture of Strategies (MOS) model, that attributes individual differences in choice behaviour to different weightings of three distinct decision-making strategies. They demonstrate that the MOS model provides a superior fit to the data and that the previously observed differences between patients and healthy controls may be explained by patients opting for a less cognitively demanding, but suboptimal, strategy. 

      Strengths: 

      The authors compare several models (including the original winning model in Gagne et al., 2020) that could feasibly fit the data. These are clearly described and are evaluated using a range of model diagnostics. The proposed MOS model appears to provide a superior fit across several tests. 

      The MOS model output is easy to interpret and has good face validity. This allows for the generation of clear, testable, hypotheses, and the authors have suggested several lines of potential research based on this. 

      We appreciate the efforts in understanding our manuscript. This is a good summary.

      Point 1.2

      The authors justify this reanalysis by arguing that learning rate adjustment (which has previously been used to explain choice behaviour on volatility tasks) is likely to be too computationally expensive and therefore unfeasible. It is unclear how to determine how "expensive" learning rate adjustment is, and how this compares to the proposed MOS model (which also includes learning rate parameters), which combines estimates across three distinct decision-making strategies. 

      We are sorry for this confusion. Our actual motivation is that previous models consider only learning rate adaptation to different levels of environmental volatility. The drawback of these models is that they require a large number of parameters in multi-context experiments. We feel that learning rate adaptation may not be the only mechanism, or at least that alternative explanations may exist. Understanding the true mechanisms is particularly important for rehabilitation purposes, especially in our case of anxiety and depression. To clarify, we have removed all claims that learning rate adaptation is “too complex to understand”.

      Point 1.3

      As highlighted by the authors, the model is limited in its explanation of previously observed learning differences based on outcome value. It's currently unclear why there would be a change in learning across positive/negative outcome contexts, based on strategy choice alone. 

      Thanks for mentioning this limitation. We want to highlight two aspects of our work.

      First, we developed the MOS6 model primarily to account for the learning rate differences between stable and volatile contexts, and between healthy controls and patients, not for those between positive and negative outcomes. In other words, our model does not eliminate the possibility of different learning rates for positive and negative outcomes.

      Second, Figure 3A shows that FLR (containing different learning parameters for positive/negative outcomes) performed even worse than MOS6 (which sets an identical learning rate for positive/negative outcomes). This result calls into question whether learning rate differences between positive/negative outcomes exist in our dataset.

      Action: We now include this limitation in lines 784-793 in discussion:

      “The MOS model is developed to offer context-free interpretations for the learning rate differences observed both between stable and volatile contexts and between healthy individuals and patients. However, we also recognize that the MOS account may not justify other learning rate effects based solely on strategy preferences. One such example is the valence-specific learning rate differences, where learning rates for better-than-expected outcomes are higher than those for worse-than-expected outcomes (Gagne et al., 2020). When fitted to the behavioral data, the context-dependent MOS22 model does not reveal valence-specific learning rates (Supplemental Note 4). Moreover, the valence-specific effect was not replicated in the FLR22 model when fitted to the synthesized data of MOS6.”

      Point 1.4

      Overall the methods are clearly presented and easy to follow, but lack clarity regarding some key features of the reversal learning task.

      Throughout the method the stimuli are referred to as "right" and "left". It's not uncommon in reversal learning tasks for the stimuli to change sides on a trial-by-trial basis or counterbalanced across stable/volatile blocks and participants. It is not stated in the methods whether the shapes were indeed kept on the same side throughout. If this is the case, please state it. If it was not (and the shapes did change sides throughout the task) this may have important implications for the interpretation of the results. In particular, the weighting of the habitual strategy (within the Mixture of Strategies model) could be very noisy, as participants could potentially have been habitual in choosing the same side (i.e., performing the same motor movement), or in choosing the same shape. Does the MOS model account for this? 

      We are sorry for the confusion. Yes, the two shapes did change sides throughout the task. We replaced “left” and “right” with “stimulus 1” and “stimulus 2”. We also acknowledge the possibility that participants may develop a habitual preference for a particular side, rather than a shape. Because of the counterbalanced design, a habitual preference for a side would introduce random selection noise in choices, which should be captured by the MOS model through the inverse temperature parameter.

      Point 1.5

      Line 164: "Participants received points or money in the reward condition and an electric shock in the punishment condition." What determined whether participants received points or money, and did this differ across participants? 

      Thanks! We have clarified the design in lines 187-188:

      “Each participant was instructed to complete two blocks of the volatile reversal learning task, one in the reward context and the other in the aversive context”,

      and in lines:

      “A total of 79 participants completed tasks in both feedback contexts. Four participants only completed the task in the reward context, while three participants only completed the aversive task.”

      Point 1.6

      Line 167: "The participant received feedback only after choosing the correct stimulus and received nothing else" Is this correct? In Figure 1a it appears the participant receives feedback irrespective of the stimulus they chose, by either being shown the amount 1-99 they are being rewarded/shocked, or 0. Additionally, what does the "correct stimulus" refer to across the two feedback conditions? It seems intuitive that in the reward version, the correct answer would be the rewarding stimulus - in the loss version is the "correct" answer the one where they are not receiving a shock? 

      Thanks for raising this issue. We removed the term “correct stimulus” and revised the lines 162-166 accordingly:

      “Only one of the two stimuli was associated with actual feedback (0 for the other one). The feedback magnitude, ranging from 1 to 99, was sampled uniformly and independently for each shape from trial to trial. Actual feedback was delivered only if the stimulus associated with feedback was chosen; otherwise, a number “0” was displayed on the screen, signifying that the chosen stimulus returned nothing.”
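
      The feedback rule quoted above can be sketched as a single simulated trial. This is an illustration only: the function name `simulate_trial`, the parameter `p_stimulus1` (the probability that stimulus 1 carries the feedback on a given trial), and the use of integer magnitudes are assumptions, not details taken from the task code.

```python
import random

def simulate_trial(p_stimulus1, choice, rng):
    """Simulate one trial; return (magnitudes, feedback_stimulus, outcome)."""
    # Magnitudes are drawn uniformly and independently for each stimulus (1 to 99 inclusive).
    magnitudes = {1: rng.randint(1, 99), 2: rng.randint(1, 99)}
    # Exactly one stimulus is associated with actual feedback on this trial.
    feedback_stimulus = 1 if rng.random() < p_stimulus1 else 2
    # Feedback is delivered only if that stimulus was chosen; otherwise 0 is shown.
    outcome = magnitudes[choice] if choice == feedback_stimulus else 0
    return magnitudes, feedback_stimulus, outcome

rng = random.Random(0)
trials = [simulate_trial(0.7, choice=1, rng=rng) for _ in range(200)]
```

      In the reward context the outcome would be points or money, and in the aversive context a shock of the corresponding magnitude; the sketch does not distinguish the two.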

      Point 1.7

      Line 176: "The whole experiment included two runs each for the two feedback conditions." Does this mean participants completed the stable and volatile blocks twice, for each feedback condition? (i.e., 8 blocks total, 4 per feedback condition). 

      Thanks! We have removed the term “block”, and now we refer to it as “context”. In particular, we removed phrases like “stable block” and “volatile block” and used “context” instead.

      Action: See lines 187-189 for the revised version.

      “Each participant was instructed to complete two runs of the volatile reversal learning task, one in the reward context and the other in the aversive context. Each run consisted of 180 trials, with 90 trials in the stable context and 90 in the volatile context (Fig. 1B).”

      Point 1.8

      In the expected utility (EU) strategy of the Mixture of Strategies model, the expected value of the stimulus on each trial is produced by multiplying the magnitude and probability of reward/shock. In Gagne et al.'s original paper, they found that an additive mixture of these components better-captured participant choice behaviour - why did the authors not opt for the same strategy here? 

      Thanks for asking this. Their strategy basically amounts to a mixture of PF+MO+HA, where PF stands for the feedback probability (e.g., 0.3 or 0.7) without multiplying by the feedback magnitude. Ours, however, is EU+MO+HA, where EU stands for feedback probability × feedback magnitude. We compared these two strategies, and the model using their strategy performed much worse than ours (see the red box below).

      Author response image 1.

      Thorough model comparison.
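
      The difference between the EU+MO+HA and PF+MO+HA mixtures discussed in this point can be illustrated with a small sketch. It is not the authors' MOS implementation. The following are assumptions made for illustration: each value-based strategy produces a choice probability through a two-option softmax with inverse temperature `beta`, MO is treated as a magnitude-only comparison, HA is a fixed habitual tendency `habit1`, magnitudes are rescaled to [0, 1], and all function and parameter names are hypothetical.

```python
import math

def softmax_pair(v1, v2, beta):
    """Probability of choosing stimulus 1 under a two-option softmax."""
    return 1.0 / (1.0 + math.exp(-beta * (v1 - v2)))

def mixture_choice_prob(p1, m1, m2, habit1, weights, beta, use_eu=True):
    """Probability of choosing stimulus 1 under a three-strategy mixture.

    p1: believed probability that stimulus 1 carries the feedback.
    m1, m2: feedback magnitudes, rescaled to [0, 1].
    habit1: habitual tendency to pick stimulus 1 (a probability).
    weights: (w_first, w_mo, w_ha), non-negative and summing to 1.
    use_eu: True uses probability x magnitude (EU); False uses probability only (PF).
    """
    w_first, w_mo, w_ha = weights
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if use_eu:
        first = softmax_pair(p1 * m1, (1.0 - p1) * m2, beta)
    else:
        first = softmax_pair(p1, 1.0 - p1, beta)
    magnitude_only = softmax_pair(m1, m2, beta)
    return w_first * first + w_mo * magnitude_only + w_ha * habit1

# A likely-but-small option (stimulus 1) against an unlikely-but-large one (stimulus 2).
eu = mixture_choice_prob(0.7, 0.2, 0.9, 0.5, (1.0, 0.0, 0.0), beta=5.0)
pf = mixture_choice_prob(0.7, 0.2, 0.9, 0.5, (1.0, 0.0, 0.0), beta=5.0, use_eu=False)
```

      In this toy case the EU component favours stimulus 2 (0.7 × 0.2 is smaller than 0.3 × 0.9), whereas the PF component favours stimulus 1, so the two formulations can make opposite predictions on the same trial.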

      Point 1.9

      How did the authors account for individuals with poor/inattentive responding, my concern is that the habitual strategy may be capturing participants who did not adhere to the task (or is this impossible to differentiate?). 

      The current MOS6 model distinguishes between the HA strategy and inattentive responding. Because of the counterbalanced design, the HA strategy requires participants to actively track the stimuli on the screen. In contrast, inattentive responding, such as repeating the same motor movement mentioned in Point 1.4, should appear as random selection in the behavioral data, which should be accounted for by the inverse temperature parameter.

      Point 1.10

      The authors provide a clear rationale for, and description of, each of the computational models used to capture participant choice behaviour. 

      • Did the authors compare different combinations of strategies within the MOS model (e.g., only including one or two strategies at a time, and comparing fit?) I think more explanation is needed as to why the authors opted for those three specific strategies. 

      We appreciate this great advice. Following it, we conducted a thorough model comparison. Please refer to Figure R1 above. The detailed text descriptions of all the models in Figure R1 are included in Supplemental Note 1.

      Point 1.11

      Please report the mean and variability of each of the strategy weights, per group. 

      Thanks. We updated the mean and variability of the strategy weights in lines 490-503:

      “We first focused on the fitted parameters of the MOS6 model. We compared the weight parameters of the three strategies (EU, MO, and HA) across groups and conducted statistical tests on their logits. The patient group showed a ~37% preference towards the EU strategy, which is significantly weaker than the ~50% preference in healthy controls (EU logit, healthy controls: M = 0.991, SD = 1.416; patients: M = 0.196, SD = 1.736; t(54.948) = 2.162, p = 0.035, Cohen’s d = 0.509; Fig. 4A). Meanwhile, the patients exhibited a weaker preference (~27%) for the HA strategy compared to healthy controls (~36%) (HA logit, healthy controls: M = 0.657, SD = 1.313; patients: M = -0.162, SD = 1.561; t(56.311) = 2.455, p = 0.017, Cohen’s d = 0.574), but a stronger preference for the MO strategy (36% vs. 14%; MO logit, healthy controls: M = -1.647, SD = 1.930; patients: M = -0.034, SD = 2.091; t(63.746) = -3.510, p = 0.001, Cohen’s d = 0.801). Most importantly, we also examined the learning rate parameter in the MOS6 but found no group differences (t(68.692) = 0.690, p = 0.493, Cohen’s d = 0.151). These results strongly suggest that the differences in decision strategy preferences can account for the learning behaviors in the two groups without necessitating any differences in learning rate per se.”
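
      The relationship between strategy weights and their logits, and the kind of group comparison reported above, can be sketched as follows. Two assumptions are made here and are not confirmed by the response: that the weights are a softmax of the logits, and that the group test is a Welch (unequal-variance) t-test, which would be consistent with the fractional degrees of freedom reported. The function names and the toy samples are hypothetical.

```python
import math
from statistics import mean, variance

def logits_to_weights(logits):
    """Map unconstrained logits to weights that are positive and sum to 1 (softmax)."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Toy logits in the order (EU, MO, HA) for one hypothetical participant.
weights = logits_to_weights([1.0, -1.5, 0.5])

# Toy samples of one logit for two hypothetical groups.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

      Testing on the logit scale avoids the constraint that the weights are bounded and sum to one, which is one plausible reason for running the statistics on logits instead of on the weights themselves.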

      Point 1.12

      The authors compare the strategy weights of patients and controls and conclude that patients favour more simpler strategies (see Line 417), based on the fact that they had higher weights for the MO, and lower on the EU.

      (1) However, the finding that control participants were more likely to use the habitual strategy was largely ignored. Within the control group, were the participants significantly more likely to opt for the EU strategy, over the HA? 2) Further, on line 467 the authors state "Additionally, there was a significant correlation between symptom severity and the preference for the HA strategy (Pearson's r = -0.285, p = 0.007)." Apologies if I'm mistaken, but does this negative correlation not mean that the greater the symptoms, the less likely they were to use the habitual strategy?

      I think more nuance is needed in the interpretation of these results, particularly in the discussion. 

      Thanks. The healthy participants seemed more likely to opt for the EU strategy, although this difference did not reach significance (paired-t(53) = 1.258, p = 0.214, Cohen’s d = 0.242). We systematically explored the role of HA. Compared to the MO, the HA saves cognitive resources but yields a significantly higher hit rate (Fig. 4A). Therefore, a preference for the HA over the MO strategy may reflect a more sophisticated balance between reward and complexity within an agent: when healthier subjects run out of cognitive resources for the EU strategy, they will cleverly resort to the HA strategy, adopting a simpler strategy but still achieving a certain level of hit rate. This explains the negative symptom-HA correlation. As clever as the HA strategy is, it is not surprising that the healthy control participants opt more for the HA during decision-making.

      However, we are cautious about drawing strong conclusions from (1) the non-significant difference between EU and HA within healthy controls and (2) the negative symptom-HA correlation. The reason is that the MOS22, the context-dependent variant, (1) exhibited a significantly higher preference for EU over HA (paired-t(53) = 4.070, p < 0.001, Cohen’s d = 0.825) and (2) did not replicate this negative correlation (Supplemental Information Figure S3).

      Action: Simulation analysis on the effects of HA was introduced in lines 556-595 and Figure 4. We discussed the effects of HA in lines 721-733:

      “Although many observed behavioral differences can be explained by a shift in preference from the EU to the MO strategy among patients, we also explore the potential effects of the HA strategy. Compared to the MO, the HA strategy also saves cognitive resources but yields a significantly higher hit rate (Fig. 4A). Therefore, a preference for the HA over the MO strategy may reflect a more sophisticated balance between reward and complexity within an agent (Gershman, 2020): when healthier participants exhaust their cognitive resources for the EU strategy, they may cleverly resort to the HA strategy, adopting a simpler strategy but still achieving a certain level of hit rate. This explains the stronger preference for the HA strategy in the HC group (Fig. 3A) and the negative correlation between HA preferences and symptom severity  (Fig. 5). Apart from shedding light on the cognitive impairments of patients, the inclusion of the HA strategy significantly enhances the model’s fit to human behavior (see examples in Daw et al. (2011); Gershman (2020); and also Supplemental Note 1 and Supplemental Figure S3).”

      Point 1.13

      Line 513: "their preference for the slowest decision strategy" - why is the MO considered the slowest strategy? Is it not the least cognitively demanding, and therefore, the quickest? 

      Sorry for the confusion. In Fig. 5C, we conducted simulations to estimate the learning speed for each strategy. As shown below, the MO strategy exhibits a flat learning curve. Our claim on the learning speed was based solely on simulation outcomes without referring to cognitive demands. Note that our analysis did not aim to compare the cognitive demands of the MO and HA strategies directly.

      Action: We explain the learning speed of the three strategies in lines 571-581.

      Point 1.14

      The authors argue that participants chose suboptimal strategies, but do not actually report task performance. How does strategy choice relate to the performance on the task (in terms of number of rewards/shocks)? Did healthy controls actually perform any better than the patient group? 

      Thanks for the suggestion. The answers are: (1) the EU strategy is the most rewarding, followed by the HA and then the MO (Fig. 5A), and (2) yes, healthy controls did perform better than patients in terms of hit rate (Fig. 2).

      Action: We included additional sections on the above analyses in lines 561-570 and lines 397-401.

      Point 1.15

      The authors speculate that Gagne et al. (2020) did not study the relationship between the decision process and anxiety and depression, because it was too complex to analyse. It's unclear why the FLR model would be too complex to analyse. My understanding is that the focus of Gagne's paper was on learning rate (rather than noise or risk preference) due to this being the main previous finding. 

      Thanks! Yes, our previous arguments were vague and confusing. We have removed all arguments of this kind.

      Point 1.16

      Minor Comments: 

      • Line 392: Modeling fitting > Model fitting 

      • Line 580 reads "The MO and HA are simpler heuristic strategies that are cognitively demanding."

      - should this read as less cognitively demanding? 

      • Line 517: health > healthy 

      • Line 816: Desnity > density 

      Sorry for the typos! They have all been fixed.

      Reviewer #2:

      Point 2.1

      Summary: Previous research shows that humans tend to adjust learning in environments where stimulus-outcome contingencies become more volatile. This learning rate adaptation is impaired in some psychiatric disorders, such as depression and anxiety. In this study, the authors reanalyze previously published data on a reversal-learning task with two volatility levels. Through a new model, they provide some evidence for an alternative explanation whereby the learning rate adaptation is driven by different decision-making strategies and not learning deficits. In particular, they propose that adjusting learning can be explained by deviations from the optimal decision-making strategy (based on maximizing expected utility) due to response stickiness or focus on reward magnitude. Furthermore, a factor related to the general psychopathology of individuals with anxiety and depression negatively correlated with the weight on the optimal strategy and response stickiness, while it correlated positively with the magnitude strategy (a strategy that ignores the probability of outcome). 

      Thanks for evaluating our paper. This is a good summary.

      Point 2.2

      My main concern is that the winning model (MOS6) does not have an error term (inverse temperature parameter beta is fixed to 8.804). 

      (1) It is not clear why the beta is not estimated and how were the values presented here chosen. It is reported as being an average value but it is not clear from which parameter estimation. Furthermore, with an average value for participants that would have lower values of inverse temperature (more stochastic behaviour) the model is likely overfitting.

      (2) In the absence of a noise parameter, the model will have to classify behaviour that is not explained by the optimal strategy (where participants simply did not pay attention or were not motivated) as being due to one of the other two strategies.

      We apologize for any confusion caused by our writing. We did set the inverse temperature as a free parameter and estimated it quantitatively during model fitting and comparison. We also created a table to show the free parameters for each model. In the previous manuscript, we did mention “temperature parameter beta is fixed to 8.804”, but only for the model simulation part, which was conducted to interpret some model behaviors.

      We agree with the concern that using the averaged value of the inverse temperature could lead to overfitting to more stochastic behaviors. To mitigate this issue, we now use the median as a more representative value for the population during simulation. Nonetheless, this change does not affect our conclusion (see simulation results in Figures 4&6).

      Action: We now use the term “free parameter” to emphasize that the inverse temperature was fitted rather than fixed. We also created a new table, “Table 1”, in line 458 to show all the free parameters within a model. We also updated the simulation details in lines 363-391 for clarity.

      Point 2.3

      (3) A model comparison among models with inverse temperature and variable subsets of the three strategies (EU + MO, EU + HA) would be interesting to see. Similarly, a comparison of the MOS6 model to other models where the inverse temperature parameter is fixed to 8.804 would be informative.

      This is an important limitation because the same simulation as with the MOS model in Figure 3b can be achieved by a more parsimonious (but less interesting) manipulation of the inverse temperature parameter.

      Thanks, we added a comparison between the MOS6 and the two lesion models (EU + MO, EU + HA). Please refer to the figure below and Point 1.8.

      We also realize that the MO strategy could exhibit averaged learning curves similar to random selection. To confirm that patients' slower learning rates are due to a preference for the MO strategy, we compared the MOS6 model with a variant (see the red box below) in which the MO strategy is replaced by Random (RD) selection that assigns a 0.5 probability to both choices. This comparison showed that the original MOS6 model with the MO strategy better fits human data.

      Author response image 2.

      Point 2.4

      Furthermore, the claim that the EU represents an optimal strategy is a bit overstated. The EU strategy is the only one of the three that assumes participants learn about the stimulus-outcomes contingencies. Higher EU strategy utilisation will include participants that are more optimal (in maximum utility maximisation terms), but also those that just learned better and completely ignored the reward magnitude.

      Thank you for your feedback. We have now revised the paper to remove all statements that the “EU strategy is the optimal” and replaced them with “EU strategy is rewarding but complex”. We agree that both the EU strategy and the strategy focusing only on feedback probability (i.e., ignoring the reward magnitude, referred to as the PF strategy) are rewarding but more complex than the two simple heuristics. We also included the latter strategy in our model comparisons (see the next section, Point 2.5).

      Point 2.5

      The mixture strategies model is an interesting proposal, but seems to be a very convoluted way to ask: to what degree are decisions of subjects affected by reward, what they've learned, and response stickiness? It seems to me that the same set of questions could be addressed with a simpler model that would define choice decisions through a softmax with a linear combination of the difference in rewards, the difference in probabilities, and a stickiness parameter. 

      Thanks for suggesting this model. We did include the proposed linear combination model (see “linear comb.” in the red box below) and found that it performed significantly worse than the MOS6.
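
      For readers unfamiliar with the alternative the reviewer describes, a minimal sketch of such a linear-combination choice rule follows. It is an illustration under stated assumptions, not the implementation compared in the paper: the function name, the weight names, and the coding of the previous choice are all hypothetical. With two options, a softmax over the combined values reduces to a logistic function of their difference.

```python
import math

def choice_prob_linear(p1, p2, m1, m2, prev_choice, w_prob, w_mag, w_stick):
    """Probability of choosing stimulus 1 under a linear-combination rule.

    p1, p2: learned feedback probabilities; m1, m2: outcome magnitudes;
    prev_choice: 1 or 2 for the previous choice, None on the first trial.
    """
    if prev_choice == 1:
        stick = 1.0       # stickiness favours repeating stimulus 1
    elif prev_choice == 2:
        stick = -1.0      # stickiness favours repeating stimulus 2
    else:
        stick = 0.0       # no previous choice
    z = w_prob * (p1 - p2) + w_mag * (m1 - m2) + w_stick * stick
    return 1.0 / (1.0 + math.exp(-z))  # two-option softmax

p_repeat = choice_prob_linear(0.5, 0.5, 10, 10, prev_choice=1,
                              w_prob=3.0, w_mag=0.1, w_stick=1.0)
```

      In this sketch all three influences enter additively, which is the property that distinguishes it from a rule that multiplies probability and magnitude.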

      Action: We justified our model selection criterion in the Supplemental Note 1.

      Author response image 3.

      Point 2.6

      Learning rate adaptation was also shown with tasks where decision-making strategies play a less important role, such as the Predictive Inference task (see for instance Nassar et al, 2010). When discussing the merit of the findings of this study on learning rate adaptation across volatility blocks, this work would be essential to mention. 

      Thanks for mentioning this great experimental paradigm, which provides an ideal solution for dissociating probability learning from the decision process. We have discussed this paradigm and the associated papers in the discussion (lines 749-751, 763-765, and 796-801).

      Point 2.7

      Minor mistakes that I've noticed:

      Equation 6: The learning rate for response stickiness is sometimes defined as alpha_AH or alpha_pi.

      Supplementary material (SM) Contents are lacking in Note1. SM talks about model MOS18, but it is not defined in the text (I am assuming it is MOS22 that should be talked about here).

      Thanks! Fixed.

      Reviewer #3:

      Point 3.1

      Summary: This paper presents a new formulation of a computational model of adaptive learning amid environmental volatility. Using a behavioral paradigm and data set made available by the authors of an earlier publication (Gagne et al., 2020), the new model is found to fit the data well. The model's structure consists of three weighted controllers that influence decisions on the basis of (1) expected utility, (2) potential outcome magnitude, and (3) habit. The model offers an interpretation of psychopathology-related individual differences in decision-making behavior in terms of differences in the relative weighting of the three controllers.

      Strengths: The newly proposed "mixture of strategies" (MOS) model is evaluated relative to the model presented in the original paper by Gagne et al., 2020 (here called the "flexible learning rate" or FLR model) and two other models. Appropriate and sophisticated methods are used for developing, parameterizing, fitting, and assessing the MOS model, and the MOS model performs well on multiple goodness-of-fit indices. The parameters of the model show decent recoverability and offer a novel interpretation for psychopathology-related individual differences. Most remarkably, the model seems to be able to account for apparent differences in behavioral learning rates between high-volatility and low-volatility conditions even with no true condition-dependent change in the parameters of its learning/decision processes. This finding calls into question a class of existing models that attribute behavioral adaptation to adaptive learning rates. 

      Thanks for evaluating our paper. This is a good summary.

      Point 3.2

      (1) Some aspects of the paper, especially in the methods section, lacked clarity or seemed to assume context that had not been presented. I found it necessary to set the paper down and read Gagne et al., 2020 in order to understand it properly.

      (3) Clarification-related suggestions for the methods section:

      - Explain earlier that there are 4 contexts (reward/shock crossed with high/low volatility). Lines 252-307 contain a number of references to parameters being fit separately per context, but "context" was previously used only to refer to the two volatility levels. 

      Action: We have placed the explanation as well as the table about the 4 contexts (stable-reward/stable-aversive/volatile-reward/volatile-aversive) earlier in the section that introduces the experiment paradigm (lines 177-186):

      “Participants were asked to complete this learning and decision-making task in four experimental contexts (Fig. 1A): two feedback contexts (reward or aversive) × two volatility contexts (stable or volatile). Participants received points in the reward context and an electric shock in the aversive context. The reward points in the reward context were converted into a monetary bonus by the end of the task, ranging from £0 to £10. In the stable context, the dominant stimulus (i.e., the stimulus that induces the feedback with a higher probability) provided feedback with a fixed probability of 0.75, while the other one yielded feedback with a probability of 0.25. In the volatile context, the dominant stimulus’s feedback probability was 0.8, but the dominant stimulus switched between the two every 20 trials. Hence, this design required participants to actively learn and infer the changing stimulus-feedback contingency in the volatile context.”
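
      The contingency structure described in this passage can be sketched as a trial-by-trial schedule. This is only an illustration: the function name and the number of trials are hypothetical, and the probability of the non-dominant stimulus in the volatile context (taken here as 1 minus the dominant probability) is an assumption, since the passage states it only for the stable context.

```python
def feedback_schedule(n_trials, volatile, block_len=20):
    """Return a list of (p_stimulus1, p_stimulus2) feedback probabilities."""
    schedule = []
    for t in range(n_trials):
        if volatile:
            p_dom = 0.8
            # The dominant stimulus switches every `block_len` trials.
            stim1_dominant = (t // block_len) % 2 == 0
        else:
            p_dom = 0.75
            stim1_dominant = True  # fixed contingency in the stable context
        p_other = round(1 - p_dom, 2)  # assumption for the volatile context
        schedule.append((p_dom, p_other) if stim1_dominant else (p_other, p_dom))
    return schedule

stable = feedback_schedule(40, volatile=False)
volatile = feedback_schedule(40, volatile=True)
```

      Which stimulus starts as dominant is arbitrary in this sketch.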

      - It would be helpful to provide an initial outline of the four models that will be described since the FLR, RS, and PH models were not foreshadowed in the introduction. For the FLR model in particular, it would be helpful to give a narrative overview of the components of the model before presenting the notation. 

      Action: We now include an overview paragraph in the computational model section to outline the four models as well as the hypotheses embodied in each model (lines 202-220).

      - The subsection on line 343, describing the simulations, lacks context. There are references to three effects being simulated (and to "the remaining two effects") but these are unclear because there's no statement in this section of what the three effects are.

      - Lines 352-353 give group-specific weighting parameters used for the simulations of the HC and PAT groups in Figure 4B. A third, non-group-specific set of weighting parameters is given above on lines 348-349. What were those used for?

      - Line 352 seems to say Figure 4A is plotting a simulation, but the figure caption seems to say it is plotting empirical data. 

      These paragraphs have been rewritten and the abovementioned issues have been clarified. See lines 363-392.

      Point 3.2

      (2) There is little examination of why the MOS model does so well in terms of model fit indices. What features of the data is it doing a better job of capturing? One thing that makes this puzzling is that the MOS and FLR models seem to have most of the same qualitative components: the FLR model has parameters for additive weighting of magnitude relative to probability (akin to the MOS model's magnitude-only strategy weight) and for an autocorrelative choice kernel (akin to the MOS model's habit strategy weight). So it's not self-evident where the MOS model's advantage is coming from.

      An intuitive understanding of the FLR model is that it estimates the stimulus value through a linear combination of feedback probability (PF) and (non-linear) magnitude. See equation:

      Also, the FLR model includes the mechanisms of HA as:

      In other words, the FLR model combines mechanisms based on the probability of feedback (PF), MO, and HA (see Eq. XX in the original study), whereas our MOS model combines EU, MO, and HA. The key qualitative difference between FLR and MOS is the use of the expected utility formula (EU) instead of the probability of feedback (PF). The advantage of our MOS model is supported by our model comparisons, which indicate that human participants multiply probability and magnitude rather than only considering probability. The EU strategy is also supported by a large body of literature (Gershman et al., 2015; Von Neumann & Morgenstern, 1947).

      Making decisions based on the multiplication of feedback probability and magnitude can often yield very different results compared to decisions based on a linear combination of the two, especially when the two magnitudes have a small absolute difference but a large ratio. Let’s consider two cases:

      (1) Stimulus 1: vs. Stimulus 2:

      (2) Stimulus 1: vs. Stimulus 2:

      The EU strategy may opt for stimulus 2 in both cases, since stimulus 2 always has a larger expected value. However, it is very likely for the PF+MO to choose stimulus 1 in the first case. For example, when . If we want the PF+MO to also choose stimulus 2 to align with the EU strategy, we need to increase the weight on magnitude. Note that in this example we divided the magnitude value by 100 to ensure that probability and magnitude are on the same scale to help illustration.
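
      The contrast between the two decision rules can be made concrete with a small sketch. All numbers below are hypothetical and chosen only to reproduce the qualitative pattern described above; they are not the values from our example or from the dataset. For simplicity the magnitude enters linearly, whereas the FLR model uses a non-linear magnitude term; `lam` denotes the weight on probability, so `1 - lam` is the weight on magnitude.

```python
def eu_choice(p1, m1, p2, m2):
    """Expected utility rule: multiply probability by magnitude."""
    return 1 if p1 * m1 > p2 * m2 else 2

def pf_mo_choice(p1, m1, p2, m2, lam, scale=100.0):
    """Linear rule: weighted sum of probability and rescaled magnitude."""
    v1 = lam * p1 + (1 - lam) * m1 / scale
    v2 = lam * p2 + (1 - lam) * m2 / scale
    return 1 if v1 > v2 else 2

# Hypothetical case: small absolute magnitude difference, large ratio.
p1, m1, p2, m2 = 0.7, 1, 0.4, 5

eu = eu_choice(p1, m1, p2, m2)                      # 0.7 vs. 2.0
linear_balanced = pf_mo_choice(p1, m1, p2, m2, lam=0.5)
linear_mag_heavy = pf_mo_choice(p1, m1, p2, m2, lam=0.05)
```

      With these values the multiplicative rule selects stimulus 2, the linear rule with equal weights selects stimulus 1, and the linear rule agrees with the multiplicative one only once most of the weight is shifted onto magnitude.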

      In the dataset reported by Gagne et al. (2020), the described scenario seems to occur more often in the aversive context than in the reward context. To accurately capture human behaviors, the FLR22 model requires a significantly larger weight for magnitude in the aversive context than in the reward context. Interestingly, when the weights for magnitude in different contexts are forced to be equal, the model (FLR6) fails, exhibiting an almost chance-level performance throughout learning (Fig. 3E, G). In contrast, the MOS6 model, and even the RS3 model, exhibit good performance using one identical set of parameters across contexts. Both MOS6 and RS3 include the EU strategy during decision-making. These findings suggest humans make decisions using the EU strategy rather than PF+MO.

      The focus of our paper is to show that a good-enough model can interpret the same dataset from a completely different perspective, not necessarily to explore improvements to the FLR model.

      Point 3.3

      One of the paper's potentially most noteworthy findings (Figure 5) is that when the FLR model is fit to synthetic data generated by the expected utility (EU) controller with a fixed learning rate, it recovers a spurious difference in learning rate between the volatile and stable environments. Although this is potentially a significant finding, its interpretation seems uncertain for several reasons: 

      - According to the relevant methods text, the result is based on a simulation of only 5 task blocks for each strategy. It would be better to repeat the simulation and recovery multiple times so that a confidence interval or error bar can be estimated and added to the figure. 

      - It makes sense that learning rates recovered for the magnitude-oriented (MO) strategy are near zero, since behavior simulated by that strategy would have no reason to show any evidence of learning. But this makes it perplexing why the MO learning rate in the volatile condition is slightly positive and slightly greater than in the stable condition. 

      - The pure-EU and pure-MO strategies are interpreted as being analogous to the healthy control group and the patient group, respectively. However, the actual difference in estimated EU/MO weighting between the two participant groups was much more moderate. It's unclear whether the same result would be obtained for a more empirically plausible difference in EU/MO weighting. 

      - The fits of the FLR model to the simulated data "controlled all parameters except for the learning rate parameters across the two strategies" (line 522). If this means that no parameters except learning rate were allowed to differ between the fits to the pure-EU and pure-MO synthetic data sets, the models would have been prevented from fitting the difference in terms of the relative weighting of probability and magnitude, which better corresponds to the true difference between the two strategies. This could have interfered with the estimation of other parameters, such as learning rate. 

      - If, after addressing all of the above, the FLR model really does recover a spurious difference in learning rate between stable and volatile blocks, it would be worth more examination of why this is happening. For example, is it because there are more opportunities to observe learning in those blocks?

      I would recommend performing a version of the Figure 5 simulations using two sets of MOS-model parameters that are identical except that they use healthy-control-like and patient-like values of the EU and MO weights (similar to the parameters described on lines 346-353, though perhaps with the habit controller weight equated). Then fit the simulated data with the FLR model, with learning rate and other parameters free to differ between groups. The result would be informative as to (1) whether the FLR model still misidentifies between-group strategy differences as learning rate differences, and (2) whether the FLR model still identifies spurious learning rate differences between stable and volatile conditions in the control-like group, which become attenuated in the patient-like group. 

      Many thanks for this great advice. Following your suggestions, we now conduct simulations using the median of the fitted parameters. The representative agents for healthy controls and patients have identical parameters, except for the three preference parameters; moreover, the habit weights are not controlled to be equal. We ran 20 simulations for each representative agent, each comprising 4 task sequences sampled from the behavioral data. This allowed us to create error bars and perform statistical tests. We found that the differences in learning rates between stable and volatile conditions, as well as the learning rate adaptation differences between healthy controls and patients, still persisted.

      Combined with the discussion in Point 3.2, we justify why a mixture of strategies can account for learning rate adaptation as follows. Due to (unknown) differences in task sequences, the MOS6 model exhibits more MO-like behaviors due to the usage of the EU strategy. To capture this behavior pattern, the FLR22 model has to increase its weighting parameter 1-λ for magnitude, which could ultimately drive the FLR22 to adjust the fitted learning rate parameters, exhibiting a learning rate adaptation effect. Our simulations suggest that estimating learning rate just by model fitting may not be the only way to interpret the data.

      Action: We included the simulation details in the method section (lines 381-391):

      “In one simulated experiment, we sampled the four task sequences from the real data. We simulated 20 experiments with the parameters of to mimic the behavior of the healthy control participants. The first three are the median of the fitted parameters across all participants; the latter three were chosen to approximate the strategy preferences of real healthy control participants (Figure 4A). Similarly, we also simulated 20 experiments for the patient group with the identical values of , and , but different strategy preferences . In other words, the only difference in the parameters of the two groups is the switched and . We then fitted the FLR22 to the behavioral data generated by the MOS6 and examined the learning rate differences across groups and volatile contexts (Fig. 6).”

      Point 3.4

      Figure 4C shows that the habit-only strategy is able to learn and adapt to changing contingencies, and some of the interpretive discussion emphasizes this. (For instance, line 651 says the habit strategy brings more rewards than the MO strategy.) However, the habit strategy doesn't seem to have any mechanism for learning from outcome feedback. It seems unlikely it would perform better than chance if it were the sole driver of behavior. Is it succeeding in this example because it is learning from previous decisions made by the EU strategy, or perhaps from decisions in the empirical data?

      Yes, the intuition is that the HA strategy seems to have no learning mechanism. In reality, however, it yields a higher hit rate than the MO simply by learning from previous decisions made by the EU strategy. We ran simulations to confirm this (Figure 4B).
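
      The intuition can be illustrated with a minimal sketch of a habit (choice-kernel) update. It assumes a simple delta rule that moves the habit policy towards the most recent choice; the function name, the learning rate value, and the choice sequence are hypothetical and do not reproduce our simulations.

```python
def update_habit(policy, choice, alpha_ha):
    """Move the habit policy towards the option that was just chosen.

    policy: [P(stimulus 1), P(stimulus 2)]; choice: 1 or 2.
    """
    target = [1.0, 0.0] if choice == 1 else [0.0, 1.0]
    return [p + alpha_ha * (t - p) for p, t in zip(policy, target)]

# Hypothetical choices, mostly favouring stimulus 1, as might be produced
# when another strategy has already learned that stimulus 1 is dominant.
choices = [1, 1, 1, 2, 1, 1, 1, 1, 2, 1]

habit = [0.5, 0.5]  # start without any preference
for c in choices:
    habit = update_habit(habit, c, alpha_ha=0.2)
```

      The habit policy never sees outcome feedback in this sketch, yet it ends up preferring the stimulus that was chosen more often, which is how it can inherit information from feedback-driven choices.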

      Point 3.5

      For the model recovery analysis (line 567), the stated purpose is to rule out the possibility that the MOS model always wins (line 552), but the only result presented is one in which the MOS model wins. To assess whether the MOS and FLR models can be differentiated, it seems necessary also to show model recovery results for synthetic data generated by the FLR model. 

      Sure, we conducted a model recovery analysis that includes all models, and it demonstrates that MOS and FLR can be fully differentiated. The results of the new model recovery analysis are shown in Fig. 7.

      Point 3.6

      To the best of my understanding, the MOS model seems to implement valence-specific learning rates in a qualitatively different way from how they were implemented in Gagne et al., 2020, and other previous literature. Line 246 says there were separate learning rates for upward and downward updates to the outcome probability. That's different from using two learning rates for "better"- and "worse"-than-expected outcomes, which will depend on both the direction of the update and the valence of the outcome (reward or shock). Might this relate to why no evidence for valence-specific learning rates was found even though the original authors found such evidence in the same data set? 

      Thanks. Following the suggestion, we have corrected our implementation of valence-specific learning rate in all models (see lines 261-268).

      “To keep consistent with Gagne et al., (2020), we also explored the valence-specific learning rate,

      is the learning rate for better-than-expected outcome, and for worse-than-expected outcome. It is important to note that Eq. 6 was only applied to the reward context, and the definitions of “better-than-expected” and “worse-than-expected” should change accordingly in the aversive context, where we defined for and for .

      No main effect of valence on learning rate was found (see Supplemental Information Note 3).

      Point 3.7

      The discussion (line 649) foregrounds the finding of greater "magnitude-only" weights with greater "general factor" psychopathology scores, concluding it reflects a shift toward simplifying heuristics. However, the picture might not be so straightforward because "habit" weights, which also reflect a simplifying heuristic, correlated negatively with the psychopathology scores. 

      Thanks. In contrast to the detrimental effects of “MO”, “habit” is actually beneficial for the task. Please refer to Point 1.12.

      Point 3.8

      The discussion section contains some pejorative-sounding comments about Gagne et al. 2020 that lack clear justification. Line 611 says that the study "did not attempt to connect the decision process to anxiety and depression traits." Given that linking model-derived learning rate estimates to psychopathology scores was a major topic of the study, this broad statement seems incorrect. If the intent is to describe a more specific step that was not undertaken in that paper, please clarify. Likewise, I don't understand the justification for the statement on line 615 that the model from that paper "is not understandable" - please use more precise and neutral language to describe the model's perceived shortcomings. 

      Sorry for the confusion. We have removed all abovementioned pejorative-sounding comments.

      Point 3.9

      4. Minor suggestions: 

      - Line 114 says people with psychiatric illness "are known to have shrunk cognitive resources" - this phrasing comes across as somewhat loaded. 

      Thanks. We have removed this argument.

      - Line 225, I don't think the reference to "hot hand bias" is correct. I understand hot hand bias to mean overestimating the probability of success after past successes. That's not the same thing as habitual repetition of previous responses, which is what's being discussed here. 

      Response: Thanks for mentioning this. We have removed all discussions about “hot hand bias”.

      - There may be some notational inconsistency if alpha_pi on line 248 and alpha_HA on line 253 are referring to the same thing. 

      Thanks! Fixed!

      - Check the notation on line 285 - there may be some interchanging of decimals and commas.

      Thanks! Fixed!

      Also, would the interpretation in terms of risk seeking and risk aversion be different for rewarding versus aversive outcomes? 

      Thanks for asking. If we understand correctly, risk seeking and risk aversion mechanisms are only present in the RS models, which show clearly worse fitting performance. We thus decided not to over-interpret the fitted parameters in the RS models.

      - Line 501, "HA and PAT groups" looks like a typo. 

      - In Figure 5, better graphical labeling of the panels and axes would be helpful. 

      Response: Thanks! Fixed!

      REFERENCES

      Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6), 1204-1215.

      Gagne, C., Zika, O., Dayan, P., & Bishop, S. J. (2020). Impaired adaptation of learning to contingency volatility in internalizing psychopathology. eLife, 9.

      Gershman, S. J. (2020). Origin of perseveration in the trade-off between reward and complexity. Cognition, 204, 104394.

      Gershman, S. J., Horvitz, E. J., & Tenenbaum, J. B. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245), 273-278.

      Von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper investigates the neural mechanisms underlying the change in perception when viewing ambiguous figures. Each possible percept is related to an attractor-like brain state and a perceptual switch corresponds to a transition between these states. The hypothesis is that these switches are promoted by bursts of noradrenaline that change the gain of neural circuits. The authors present several lines of evidence consistent with this view: pupil diameter changes during the time point of the perceptual change; a gain change in neural network models promotes a state transition; and large-scale fMRI dynamics in a different experiment suggests a lower barrier between brain states at the change point. However, some assumptions of the computational model seem not well justified and the theoretical analysis is incomplete. The paper would also benefit from a more in-depth analysis of the experimental data.

      Strengths:

      The main strength of the paper is that it attempts to combine experimental measurements - from psychophysics, pupil measurements, and fMRI dynamics - and computational modeling to provide an emerging picture of how a perceptual switch emerges. This integrative approach is highly useful because the model has the potential to make the underlying mechanisms explicit and to make concrete predictions.

      Weaknesses:

      A general weakness is that the link between the three parts of the paper is not very strong. Pupil and fMRI measurements come from different experiments and additional analysis showing that the two experiments are comparable should be included. Crucially, the assumptions underlying the RNN modeling are unclear and the conclusions drawn from the simulation may depend on those assumptions.

      With this comment in mind, we have made a substantial effort to better integrate the three aspects of our paper. On the pupillometry side, we now show that the dynamic uncertainty associated with perceptual categorisation shares a similar waveform with the observed fluctuations in pupil diameter around the switch point (Fig 2B). To better link the modelling to the behaviour, we have also made the gain of the activation function of each sigmoidal unit change dynamically as a function of the uncertainty (i.e. the entropy) of the network’s classification. This generates phasic changes in gain that mimic the observed phasic changes in pupil dilation, explicitly linking the dynamics of gain in the RNN to the observed dynamics of pupil diameter (our non-invasive proxy for neuromodulatory tone). In addition, the predictions of the RNN (a flattened egocentric landscape and peaks in low-dimensional brain state velocity at the time point of the perceptual switch) were tested directly in the whole-brain BOLD data, which links the modelling and the BOLD analysis. Finally, we note that whilst we agree that an experiment in which pupillometry and BOLD data were collected simultaneously would be ideal, these data were not available to us at the time of this study.

      Main points:

      Perceptual tasks in pupil and fMRI experiments: how comparable are these two tasks? It seems that the timing is very different, with long stimulus presentations and breaks in the fMRI task and a rapid sequence in the pupil task. Detailed information about the task timing in the pupil task is missing. What evidence is there that the same mechanisms underlie perceptual switches at these different timescales? Quantification of the distributions of switching times/switching points in both tasks is missing. Do the subjects in the fMRI task show the same overall behavior as in the pupil task? More information is needed to clarify these points.

      We recognize the need for a more detailed and comparative analysis of the perceptual tasks used in our pupil and fMRI experiments, particularly regarding differences in timing, task structure, and instructions. The fMRI task incorporates jittered inter-trial intervals (ITIs) of 2, 4, 6, and 8 seconds, designed to enable effective deconvolution of the BOLD response (Stottinger et al., 2018). In contrast, the pupil task presents a more rapid sequence of stimuli without ITIs. These timing differences are reflected in the mean perceptual switch points: the 8th image in the fMRI task and the 9th image in the pupil task. This small yet consistent difference suggests subtle influences of task design on behavior.

      Despite these structural and instructional differences, our analyses indicate that overall behavioral patterns remain consistent across the two modalities. The distributions of switching times align closely, and no significant behavioral deviations were observed that might suggest a fundamental difference in the underlying mechanisms driving perceptual switches. These findings suggest that the additional time and structural differences in the fMRI task do not significantly alter the behavioral outcomes compared to the pupil task.

      To address these issues, we have added paragraphs in the Results, Methods, and Limitations sections of the manuscript. In the Results section, we provide a detailed comparison of switching point distributions across the two tasks, emphasizing behavioral consistencies and any observed variations. In the Methods section, we include an expanded description of task timing, instructions, and the presence or absence of catch trials to ensure clarity regarding the experimental setups. Finally, in the Limitations section, we acknowledge the structural differences between the tasks, particularly the lack of catch trials and rapid stimulus presentation in the pupil task, and discuss how these differences may influence perceptual dynamics.

      These additions aim to clarify how task-specific factors, such as timing, instructions, and catch trials, influence perceptual dynamics while highlighting the consistency in behavioral outcomes across both experimental setups. We believe these revisions address the concerns raised and enhance the manuscript’s transparency and rigor.

      Computational model:

      (1) Modeling noradrenaline effects in the RNN: The pupil data suggests phasic bursts of NA would promote perceptual switches. But as I understand, in the RNN neuromodulation is modeled as different levels of gain throughout the trial. Making the neural gain time-dependent would allow investigation of whether a phasic gain change can explain the experimentally observed distribution of switching times.

      We thank the reviewer for this very helpful suggestion. We updated the RNN so that, post-training, gain changes dynamically as a function of the network's classification uncertainty (i.e. the entropy of the network's output). Specifically, the gain dynamics of each unit in the neural network are governed by a linear ODE with a forcing function given by the entropy of the network’s classification (i.e. the uncertainty of the classification). This explicitly tests the hypothesis that uncertainty-driven increases in gain near the perceptual switch (when the input is maximally ambiguous) speed perceptual switches, and allows us to distinguish between tonic and phasic increases in gain (in the absence of uncertainty forcing, gain decays exponentially to a tonic value of 1). Importantly, in line with our hypothesis, we found that switch times decreased as we increased the impact of uncertainty on gain (i.e. switch times decreased as the magnitude of uncertainty forcing increased). Finally, we wish to note that although making gain dynamical is relatively simple conceptually, actually implementing it and then analysing the dynamics turned out to be highly non-trivial. To our knowledge, our model is the first RNN of reasonable size to implement dynamical gain, which required us to push the RNN modelling beyond the current state of the art (see Fig 2 - 4).
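
      As an illustration only, the entropy-forced gain dynamics described above can be sketched as a first-order linear ODE integrated with Euler steps. The time constant `tau`, the forcing magnitude `kappa`, the step size `dt`, and the toy output probabilities are all assumptions introduced for this sketch; they are not the values or the code used in the manuscript.

```python
import numpy as np

def output_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a categorical output distribution."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def simulate_gain(probs, dt=0.01, tau=0.1, kappa=1.0, tonic=1.0):
    """Euler-integrate tau * dg/dt = -(g - tonic) + kappa * H(t).

    probs : array of shape (T, n_classes), the output distribution per step.
    With kappa = 0 (no uncertainty forcing) gain stays at its tonic value.
    """
    gain = np.empty(len(probs))
    g = tonic
    for t, p in enumerate(probs):
        forcing = kappa * output_entropy(p)
        g = g + (dt / tau) * (-(g - tonic) + forcing)
        gain[t] = g
    return gain

# Toy morph (hypothetical): probability of category 1 falls smoothly from ~1 to ~0.
time = np.linspace(0.0, 1.0, 101)
p1 = 1.0 / (1.0 + np.exp(20.0 * (time - 0.5)))
probs = np.stack([p1, 1.0 - p1], axis=1)

gain = simulate_gain(probs)                    # phasic rise around the ambiguous midpoint
gain_tonic = simulate_gain(probs, kappa=0.0)   # no forcing: gain remains at 1
```

      In this sketch, gain rises transiently when the output is most ambiguous and relaxes back towards the tonic value of 1 afterwards, which is the qualitative behaviour the response describes.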

      (2) Modeling perceptual switches: in the results, it is described that the networks were trained to output a categorical response, but the firing rates in Fig 2B do not seem categorical but rather seem to follow the input stimulus. The output signals of the network are not shown. If I understand correctly, a trivial network that would just represent the two input signals without any internal computation and relay them to the output would do the task correctly (because "the network's choice at each time point was the maximum of the two-dimensional output", p. 22). This seems like cheating: the very operation that the model should perform is to signal the change, in a categorical manner, not to represent the gradually changing input signals.

      The output of the network was indeed trained to be categorical via a cross-entropy loss function, with the output defined by the max of the projection of the excitatory hidden units onto the output weights, which is boilerplate RNN modelling practice. As requested, we now show the output in Fig 2B. On the broader question of whether a trivially small network could solve the task, we are in total agreement that, with the right set of hand-crafted weights, a two-neuron sigmoidal network with winner-take-all readout could solve the task. We disagree, however, that using an RNN is cheating in any way. Many tasks in neuroscience can be trivially solved with a very small number of recurrent units (e.g. basically all 2AF tasks). The question we were interested in is how the brain might solve the task, and more specifically how neuromodulatory control of gain changes the dynamics of our admittedly very simple task. We could have done this by hand-crafting a small network to solve the task, but we wanted to use the RNN modelling as a means of both hypothesis testing and hypothesis generation. We now expand on and justify this modelling choice in the second paragraph of the discussion:

      “We chose to use an RNN, instead of a simpler (more transparent) model as we wanted to use the RNN as a means of both hypothesis generation and hypothesis testing. Specifically, unlike more standard neuronal models which are handcrafted to reproduce a specific effect, when building an RNN the modeller only specifies the network inputs, labels, and the parameter constraints (e.g. Dale’s law) in advance. The dynamics of the RNN are entirely determined by optimisation. Post-training manipulations of the RNN are not built in, or in any way guaranteed to work, making them more analogous to experimental manipulations of an approximately task-optimal brain-like system. Confirmatory results are arguably, therefore, a first step towards an in vitro experimental test.”
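
      For reference, the trivially small solution that both the reviewer and our response allude to can be sketched as follows. This is a hypothetical illustration, not a model used in the study: the sequence length `n_images`, the linear morph, and the slope `beta` are assumptions made only for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_choice(u, beta=10.0):
    """Non-recurrent readout: sigmoid of the difference between the two inputs.

    u : array of shape (T, 2) holding the evidence for category 1 and category 2.
    Returns the per-step choice (0 = category 1, 1 = category 2) and p(category 2).
    """
    p2 = sigmoid(beta * (u[:, 1] - u[:, 0]))
    return (p2 > 0.5).astype(int), p2

# Hypothetical linear morph from category 1 to category 2 over 16 images.
n_images = 16
morph = np.linspace(0.0, 1.0, n_images)
u = np.stack([1.0 - morph, morph], axis=1)

choice, p2 = feedforward_choice(u)
switch_index = int(np.argmax(choice == 1))  # first image reported as category 2
```

      Such a readout switches as soon as the second input exceeds the first, but it has no internal dynamics to manipulate, which is why it could not serve the hypothesis-generation purpose described in the quoted paragraph.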

      (3) The mechanism of how increased gain leads to faster switches remains unclear to me. My first intuition was that increasing the gain of excitatory populations (the situation shown in Fig. 2E) in discrete attractor models would lead to deeper attractor wells and this would make it more difficult to switch. That is, a higher gain should lead to slower decisions in this case. However, here the switching time remains constant for a gain between 1 and 1.5. Lowering the gain, on the other hand, leads to slower switching. It is, of course, possible that the RNN behaves differently than classical point attractor models or that my intuition is incorrect (though I believe it is consistent with previous literature, e.g. Niyogi & Wong-Lin 2013 (doi:10.1371/journal.pcbi.1003099) who show higher firing rates - more stable attractors - for increased excitatory gain).

      We thank the reviewer for the astute observation, which we entirely agree with. The energy landscape analysis is a method still under active development within our group and we are still learning how to best explain it and its relationship to more traditional ways of quantifying potential-like energy functions of dynamical systems which we think the reviewer has in mind. We have now included a second type of energy landscape analysis which gives a complementary perspective on the RNN dynamics and is more straightforwardly comparable to typical potential functions. We describe the new analysis in the section “Large-scale neural predictions of recurrent neural network model” as follows:

      “Crucially, there are two complementary viewpoints from which we can construct an energy landscape: the first, allocentric (i.e., third-person view) perspective quantifies the energy associated with each position in state space, whereas the second, egocentric (i.e., first-person view) perspective quantifies the energy associated with relative changes, independent of the direction of movement or the location in state space. The allocentric perspective is straightforwardly comparable to the potential function of a dynamical system but can only be applied to low-dimensional data in settings where a position-like quantity is meaningfully defined. The egocentric perspective is analogous to taking the point of view of a single particle in a physical setting and quantifying the energy associated with movement relative to the particle's initial location. An egocentric framework is thus more applicable when signal magnitude is relative rather than absolute. See materials and methods, and Fig S4 for an intuitive explanation of the allocentric and egocentric energy landscape analysis on a toy dynamical system.”

      From the allocentric perspective it is entirely true that increasing gain increases the depth of the landscape, equivalent to increasing the depth of the attractor. However, because the input to the network changes dynamically the location of the approximate fixed-point attractor changes and the network state “chases” this attractor over the course of the trial. Importantly, the location of the energy minima changes more rapidly as gain increases, effectively forcing the network to rapidly change course at the point of the perceptual switch (see Fig 4). To quantify this effect we constructed a new measure - neural work - which describes the amount of “force” exerted on the low-dimensional neural trajectory by the vector field quantified by the allocentric landscape. Specifically we treat the allocentric landscape as analogous to a potential function and then leverage the fact that force is equal to the negative gradient of potential energy to calculate the work (force x displacement) done on the low dimensional trajectory at each time point. This showed that as gain increases the amount of work done on the neuronal trajectory at turning points increases analogous to the application of an external force transiently increasing the kinetic energy of an object. From the perspective of the egocentric landscape this results in a flattening of the landscape as there is a lower energy (i.e. higher probability) assigned to large deviations in the neuronal trajectory around the perceptual switch.
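
      A minimal sketch of the work calculation described above, under stated assumptions: here the allocentric landscape is replaced by the known potential of a one-dimensional supercritical pitchfork system (a toy stand-in, not the landscape estimated from the RNN or BOLD data), force is taken as the negative numerical gradient of that potential, and work at each step is force times displacement. The function names, the parameter `r`, and the trajectory are hypothetical.

```python
import numpy as np

def potential(x, r=1.0):
    """Potential of the supercritical pitchfork system dx/dt = r*x - x**3."""
    return -0.5 * r * x**2 + 0.25 * x**4

def force(x, r=1.0, h=1e-5):
    """Force as the negative gradient of the potential (central difference)."""
    return -(potential(x + h, r) - potential(x - h, r)) / (2.0 * h)

def work_along_trajectory(x, r=1.0):
    """Work done at each step: force at the current state times displacement."""
    displacement = np.diff(x)
    return force(x[:-1], r) * displacement

# Toy trajectory: noiseless Euler integration starting near the unstable
# point x = 0 and rolling into the well at x = sqrt(r).
dt = 0.01
x = np.empty(2000)
x[0] = 0.01
for t in range(len(x) - 1):
    x[t + 1] = x[t] + dt * force(x[t])

work = work_along_trajectory(x)
total_work = work.sum()  # approximately the drop in potential along the path
```

      Because the toy system is conservative, the summed work approximately equals the drop in potential between the start and end of the trajectory, which gives a simple sanity check for the calculation.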

      Because of the novelty of the analyses we went to great lengths to carefully explain the methods in the updated manuscript. In addition we wrote a short tutorial style MATLAB script implementing both the allocentric and egocentric landscape analysis on a toy dynamical system with a known potential function (a supercritical pitchfork bifurcation).

      (4) From the RNN model it is not clear how changes in excitatory and inhibitory gain lead to slower/faster switching. In order to better understand the role of inhibitory and excitatory gain on switching, I would suggest studying a simple discrete attractor model (a rate model, for example as in Wong and Wang 2006 or Roxin and Ledberg, Plos Comp. Bio 2008) which will allow to study these effects in terms of a very few model parameters. The Roxin paper also shows how to map rate models onto simplified one-dimensional systems such as the one in Fig S3. Setting up the model using this framework would allow for making much stronger, principled statements about how gain changes affect the energy landscape, and under which conditions increased inhibitory gain leads to faster switching.

      One possibility is that increasing the excitatory gain in the RNN leads to saturated firing rates. If this is the reason for the different effects of excitatory and inhibitory gain changes, it should be properly explained. Moreover, the biological relevance of this effect should be discussed (assuming that saturation is indeed the explanation).

      We thank the reviewer for this excellent suggestion. After some consideration we decided that studying a reduced model would likely not do justice to the dynamical mechanisms of the RNN, especially after making gain dynamical rather than stationary. Still, we very much share the reviewer’s concern that we need a stronger link between the (now dynamical) gain alterations and energy landscape dynamics. To this end we now describe and interrogate the dynamics of the RNN at a circuit level through selectivity and lesion-based analyses, at a population level through analysis of the dynamical regime traversed by the network, and finally, through an extended energy landscape framework which has far stronger links to traditional potential-based descriptions of low-dimensional dynamical systems (see also our response to comment 3 above).

      At a circuit level, the speeding of perceptual switches is mediated by inhibition of the initially dominant population, which we describe in paragraphs 7 and 8 of the section “Computational evidence for neuromodulatory-mediated perceptual switches in a recurrent neural network” as follows:

      “Having confirmed our hypothesis that increasing gain as a function of the network uncertainty increased the speed of perceptual switches, we next sought to understand the mechanisms governing this effect starting with the circuit level and working our way up to the population level (c.f. Sherringtonian and Hopfieldian modes of analysis (66)). Because of the constraint that the input and output weights are strictly positive, we could use their (normalised) value as a measure of stimulus selectivity. Inspection of the firing rates sorted by input weights revealed that the networks had learned to complete the task by segregating both excitatory and inhibitory units into two stimulus-selective clusters (Fig 2C). As the inhibitory units could not contribute to the network's readout, we hypothesised that they likely played an indirect role in perceptual switching by inhibiting the population of excitatory neurons selective for the currently dominant stimulus, allowing the competing population to take over and a perceptual switch to occur.

      To test this hypothesis, we sorted the inhibitory units by the selectivity of the excitatory units they inhibit (i.e. by the normalised value of the readout weights). Inspecting the histogram of this selectivity metric revealed a bimodal distribution with peaks at each extreme strongly inhibiting a stimulus selective excitatory population at the exclusion of the other (Fig S2). Based on the fact that leading up to the perceptual switch point both the input and firing rate of the dominant population are higher than the competing population, we hypothesized that gain likely speeds perceptual switches by actively inhibiting the currently dominant population rather than exciting/disinhibiting the competing population. We predicted, therefore, that lesioning the inhibitory units selective for the stimulus that is initially dominant would dramatically slow perceptual switches, whilst lesioning the inhibitory units selective for the stimulus the input is morphing into would have a comparatively minor slowing effect on switch times since the population is not receiving sufficient input to take over until approximately half way through the trial irrespective of the inhibition it receives. As selectivity is not entirely one-to-one, we expect both lesions to slow perceptual switches but differ in magnitude. In line with our prediction, lesioning the inhibitory units strongly selective for the initially dominant population greatly slowed perceptual switches (Fig 3F upper), whereas lesioning the population selective for the stimulus the input morphs into removed the speeding effect of gain but had a comparatively small slowing effect on perceptual switches (Fig 3F lower).”

      At the population level we characterised the dynamics of the 2D parameter space (defined by gain and the difference between the input dimensions) traversed by the network over the course of a trial as input and gain dynamically change. We describe this in paragraphs 9-14 of the section “Computational evidence for neuromodulatory-mediated perceptual switches in a recurrent neural network”, which we reprint below for the reviewer's convenience:

      “Based on the selectivity of the network firing rates we hypothesised that the dynamics were shaped by a fixed-point attractor whose location and existence were determined by gain and the difference between the input dimensions, and thus changed dynamically over the course of a single trial (67-70). Because of the large size of the network, we could not solve for the fixed points or study their stability analytically. Instead we opted for a numerical approach and characterised the dynamical regime (i.e. the location and existence of approximate fixed-point attractors) across all combinations of gain and input difference visited by the network. Specifically, for each combination of elements in this parameter space we ran 100 simulations with initial conditions (firing rates) drawn from a uniform distribution between [0,1], and let the dynamics run for 10 seconds of simulation time (10 times the length of the task - longer simulation times did not qualitatively change the results) without noise. As we were interested in the existence of fixed-point attractors rather than their precise location, at each time point we computed the difference in firing rate between successive time points across the network. For each simulation we computed both the proportion of trials that converged to a value below 10^-2, giving us a proxy for the presence of fixed points, and the time to convergence, giving us a measure of the “strength” of the attractor.

      Across gain values when input had unambiguous values, the network rapidly converged across all initialisations (Fig 3A & 3C-H). When input became ambiguous, however, the dynamics acquired a decaying oscillation and did not converge within the time frame of the simulation. As gain increased, the range of input-difference values characterised by oscillatory dynamics broadened. Crucially, for sufficiently high values of gain, ambiguous input-difference values transitioned the network into a regime characterised by high amplitude inhibition-driven oscillations (Fig 3D & 3G). Each trial can, therefore, be characterised by a trajectory through this 2-dimensional parameter space, with dynamics shaped by the dynamical regimes of each location visited (Fig 3A-B).

      When uncertainty has a small impact on gain the network has a trajectory through an initial regime characterised by the rapid convergence to a fixed point where the population representing the initial stimulus dominated whilst the other was silent (Fig 3C), an uncertain regime characterised by oscillations with all neurons partially activated (Fig 3D), and after passing through the oscillatory regime, the network once again enters a new fixed-point regime where the population representing the initial stimulus is now silent and the other is dominant (Fig 3E).

      For high gain trials, the network again started and finished in states characterised by a rapid convergence to a fixed point representing the dominant input dimension (Fig 3F-H), but differed in how it transitioned between these states. Uncertain inputs now generated high amplitude oscillations with the network flip-flopping between active and silent states (Fig 3G). We hypothesised that, within the task, this has the effect of silencing the initially dominant population, and boosting the competing population. To test this we initialised each network with parameter values well inside the oscillatory regime (u = [.5, .5], gain = 1.5) with initial conditions determined by the selectivity of each unit. Excitatory units selective for input dimension 1, as well as the associated inhibitory units projecting to this population, were fully activated, whilst the excitatory units selective for input dimension 2 and the associated inhibitory units were silenced. As we predicted, when initialised in this state the network dynamics displayed an out of phase oscillation where the initially dominant population was rapidly silenced and the competing population was boosted after a brief delay (219 ms, +/- 114; Fig S3).”
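
      The numerical characterisation of dynamical regimes described in the passage above can be sketched as follows. This is an illustrative toy, not the trained RNN: the two-unit weight matrix `W`, the input weights `B`, the time constant `tau`, the step size `dt`, and the exact convergence criterion (largest per-unit change between successive steps staying below the tolerance until the end of the run) are all assumptions. Only the 100 random initial conditions drawn from [0, 1], the 10 seconds of noise-free simulation, and the 10^-2 threshold are taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convergence_stats(W, B, u, gain, n_init=100, T=10.0, dt=0.01,
                      tau=0.1, tol=1e-2, seed=0):
    """Proportion of runs that converge and their mean time to convergence.

    Simulates tau * dr/dt = -r + sigmoid(gain * (W r + B u)) without noise
    from n_init random initial firing rates drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, 1.0, size=(n_init, W.shape[0]))
    n_steps = int(round(T / dt))
    drive = B @ u
    last_above = np.full(n_init, -1)  # last step whose change was >= tol
    for step in range(n_steps):
        r_new = r + (dt / tau) * (-r + sigmoid(gain * (r @ W.T + drive)))
        change = np.max(np.abs(r_new - r), axis=1)
        last_above[change >= tol] = step
        r = r_new
    converged = last_above < n_steps - 1
    proportion = float(converged.mean())
    times = (last_above[converged] + 1) * dt
    mean_time = float(times.mean()) if converged.any() else float("nan")
    return proportion, mean_time

# Hypothetical two-unit network with mutual inhibition and an ambiguous input.
W = np.array([[0.0, -4.0], [-4.0, 0.0]])
B = np.eye(2)
prop, mean_time = convergence_stats(W, B, u=np.array([0.5, 0.5]), gain=1.5)

# Sanity case: without recurrence each unit simply relaxes to a fixed point.
prop_ff, time_ff = convergence_stats(np.zeros((2, 2)), B,
                                     u=np.array([1.0, 0.0]), gain=1.0)
```

      Sweeping `gain` and `u` over a grid and storing the two returned values would give maps analogous in spirit to those described for Fig 3, although the toy network is far too small to reproduce the reported regimes.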

      From this we concluded that at a population level, heightened gain leading up to the perceptual switch speeds the switch by transiently pushing the dynamics into an unstable dynamical regime replacing the fixed-point attractor representing the input with an oscillatory regime that actively inhibits the currently dominant population and boosts the competing population before transitioning back into a regime with a stable (approximate) fixed-point attractor representing the new stimulus (Fig 3F-H & Fig S3).

      As we describe in our response to comment 3 above, our extended energy-landscape analysis framework now includes an explicit link between the potential of the dynamical system and the allocentric landscape, whilst also explaining how a transient deepening of the allocentric landscape (which can essentially be thought of as analogous to a traditional potential function) relates to the flattening of the egocentric landscape.

      Finally, whilst we appreciate the interest in further characterising the effect of inhibitory gain compared with excitatory gain, the topic is largely orthogonal to the aims of our paper, so we have removed the discussion of inhibitory vs excitatory gain. Still, we understand that we need to do our due diligence and check that our results do not break down when we manipulate either inhibitory or excitatory gain in isolation. To this end we checked that dynamical gain still speeded perceptual switches when the effect was restricted to either inhibitory or excitatory cells. We show the behavioural plots below for the reviewer’s interest.

      Author response image 1.

      Switch time as a function of uncertainty forcing

      Alternative mechanisms:

      It is mentioned in the introduction that changes in attention could drive perceptual switches. A priori, attention signals originating in the frontal cortex may be plausible mechanisms for perceptual switches, as an alternative to LC-controlled gain modulation. Does the observed fMRI dynamics allow us to distinguish these two hypotheses? In any case, I would suggest including alternative scenarios that may be compatible with the observed findings in the discussion.

      We agree with the reviewer, in that attention is itself a confound and a process that is challenging to disentangle from the perceptual switching process in the current task. Importantly, we were not arguing for exclusivity in our manuscript, but merely testing the veracity of the hypothesis that the ascending arousal system may play a causal role in mediating and/or speeding perceptual switches. Future work with experiments that more specifically aim to dissociate these different features will be required to tease apart these different possibilities.

      Reviewer #2 (Public Review):

      Strengths

      - the study combines different methods (pupillometry, RNNs, fMRI).

      - the study combines different viewpoints and fields of the scientific literature, including neuroscience, psychology, physics, dynamical systems.

      - This combination of methods and viewpoints is rarely done, it is thus very useful.

      - Overall well-written.

      Weaknesses

      - The study relies on a report paradigm: participants report when they identify a switch in the item category. The sequence corresponds to the drawing of an object being gradually morphed into another object. Perceptual switches are therefore behaviorally relevant, and it is not clear whether the effect reported correspond to the perceptual switch per se, or the detection of an event that should change behavior (participant press a button indicating the perceived category, and thus switch buttons when they identify a perceptual change). The text mentions that motor actions are controlled for, but this fact only indicates that a motor action is performed on each trial (not only on the switch trial); there is still a motor change confounded with the switch. As a result, it is not clear whether the effect reported in pupil size, brain dynamics, and brain states is related to a perceptual change, or a decision process (to report this change).

      We agree with the reviewer that the coupling of the motor change with the perceptual switch is confounded to some degree, but since motor preparation occurs on every trial we suspect that it is more accurate to describe it as confounded with task-relevance more than motor preparation per se.  While it is possible that pupil diameter, network topology and energy landscape features are all related to motor change rather than the perceptual switch, we note that the weight of evidence is against this interpretation, given the simple mechanistic explanation created by the coupling of perceptual uncertainty to network gain.

      - The study presents events that co-occur (perceptual switch, change in pupil size, energy landscape of brain dynamics) but we cannot identify the causes and consequences. Yet, the paper makes several claims about causality (e.g. in the abstract "neuromodulatory tone ... causally mediates perceptual switches", in the results "the system flattening the energy landscape ... facilitated an updating of the content of perception").

      We have made an effort to soften the causal language, where appropriate. In addition, we note that we have changed the title to “Gain neuromodulation mediates task-relevant perceptual switches: evidence from pupillometry, fMRI, and RNN Modelling” to reflect the fact that our claims do not extend to cases of perceptual switches where the stimulus is only passively observed.

      - Some effects may reflect the expectation of a perceptual switch, rather than the perceptual switch per se. Given the structure of the task, participants know that there will be a perceptual switch occurring once during a sequence of morphed drawings. This change is expected to occur roughly in the middle of the sequence, making early switches more surprising, and later switches less surprising. Differences in pupil response to early, medium, and late switches could reflect this expectation. The authors interpret this effect very differently ("the speed of a perceptual switch should be dependent on LC activity").

      The task includes catch trials designed to reduce the expectation of a perceptual switch. In these trials, a perceptual switch occurs either earlier or later than usual. While these trials are valuable for mitigating predictability, we did not focus extensively on them, as they were thoroughly discussed in the original paper. Additionally, due to the limited number of catch trials, it is difficult—if not impossible—to calculate a reliable mean surprise per image set.

      It is also worth noting that the pupil study does not include catch trials, which could contribute to differences in how perceptual switches are processed and interpreted between the fMRI and pupil experiments.

      - The RNN is far more complex than needed for the task. It has two input units that indicate the level of evidence for the two categories being morphed, and it is trained to output the dominant category. A (non-recurrent) network with only these two units and an output unit whose activity is a sigmoid transform of the difference in the inputs can solve the task perfectly. The RNN activity is almost 1-dimensional probably for this reason. In addition, the difficult part of the computation done by the human brain in this task is already solved in the input that is provided to the network (the brain is not provided with the evidence level for each category, and in fact, it does not know in advance what the second category will be).

      We agree that a simpler model could perform the task. We opted to use an RNN rather than handcraft a simpler model because we wanted to use the model as a method of both hypothesis testing and hypothesis generation. We now expand on and justify this modelling choice in the second paragraph of the discussion (also see our response to Reviewer 1 comment 4):

      “We chose to use an RNN, instead of a simpler (more transparent) model as we wanted to use the RNN as a means of both hypothesis generation and hypothesis testing. Specifically, unlike more standard neuronal models which are handcrafted to reproduce a specific effect, when building an RNN the modeller only specifies the network inputs, labels, and the parameter constraints (e.g. Dale’s law) in advance. The dynamics of the RNN are entirely determined by optimisation. Post-training manipulations of the RNN are not built in, or in any way guaranteed to work, making them more analogous to experimental manipulations of an approximately task-optimal brain-like system. Confirmatory results are arguably, therefore, a first steps towards an in vitro experimental test.”

      In other words, a simpler model would not have been appropriate for our aims. In addition, we note that low-dimensional dynamics are extremely common in the RNN literature and are in no way unique to our model.

      - Basic fMRI results are missing and would be useful, before using elaborate analyses. For instance, what are the regions that are more active when a switch is detected?

      We explicitly chose to not run a standard voxelwise statistical parametric approach on these data, as the results were reported extensively in the original study (Stottinger et al., 2018).

      - The use of methods from physics may obscure some simple facts and simpler explanations. For instance, does the flatter energy landscape in the higher gain condition reflect a smaller number of states visited in the state space of the RNN because the activity of each unit gets in the saturation range? If correct, then it may be a more straightforward way of explaining the results.

      We appreciate the reviewer's concern, as this would indeed be a problem. However, this is not the case for our network. At the time point of the perceptual switch, where the egocentric landscape dynamics are at their flattest, the RNN firing rates are approximately 50% activated, nowhere near the saturation point. In addition, a flatter landscape in the egocentric and allocentric landscape analyses only occurs, mathematically speaking, when more states are visited, not fewer.

      In addition, we note that we are very sympathetic to concerns about the complexity of our physics-based analyses and have gone to great lengths to describe them in an accessible manner in both the main text and methods. We have also included tutorial-style code demonstrating how the analysis can be used on a toy dynamical system in the supplementary material.

      - Some results are not as expected as the authors claim, at least in the current form of the paper. For instance, they show that, when trained to identify which of two inputs u1 and u2 is the larger (with u2=1-u1, starting with u1=1 and gradually decreasing u1), a higher gain results in the RNN reporting a switch in dominance before the true switch (e.g. when u1=0.6 and u2=0.4), and vice versa with a lower gain. In other words, it seems to correspond to a change in criterion or bias in the RNN's decision. The authors should discuss more specifically how this result is related to previous studies and models on gain modulation. An alternative finding could have been that the network output is a more (or less) deterministic function of its inputs, but this aspect is not reported.

      We appreciate this comment, but it is simply not applicable to our network. There is no criterion in the RNN. We could certainly add one, but this would be a significant departure from how decisions are typically modelled in RNNs. The (deterministic) readout is the max of the projection of the (instantaneous) excitatory firing rate onto the readout weights. A shift in criterion would imply that the dynamics are unaffected and that the effect can be explained by a shift in the readout weights; this cannot be the case because the readout weights are stationary: the change occurs at the level of the activation function.
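
      The distinction drawn here, between changing the gain of the activation function and shifting the readout, can be sketched in a toy feedforward snapshot. This is an illustration under stated assumptions and not the authors' recurrent network: the sigmoidal activation, the three unit inputs, and the weight values are all hypothetical.

```python
import math


def firing_rates(inputs, gain):
    """Sigmoidal activation; `gain` scales the slope, not the readout."""
    return [1.0 / (1.0 + math.exp(-gain * x)) for x in inputs]


def readout(rates, readout_weights):
    """Project rates onto each row of fixed readout weights and take the argmax."""
    projections = [sum(w * r for w, r in zip(row, rates)) for row in readout_weights]
    return projections.index(max(projections)), projections


# Hypothetical toy values: three units, two output categories.
unit_inputs = [0.8, -0.3, 0.1]
fixed_weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

low_gain_rates = firing_rates(unit_inputs, gain=0.5)
high_gain_rates = firing_rates(unit_inputs, gain=2.0)
low_choice, _ = readout(low_gain_rates, fixed_weights)
high_choice, _ = readout(high_gain_rates, fixed_weights)
```

      In the sketch the readout weights are identical in both conditions; only the firing rates move, and they move further from the half-activation point as gain increases.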

      We are aware that there is a large literature in decision making and psychophysics that uses the term gain in a slightly different way. Here we are strictly referring to the gain of the activation function. Although we agree that it would be interesting and important to discuss the differing uses of the term gain, this is beyond the scope of the present paper.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their thoughtful comments and constructive suggestions. Point-by-point responses to comments are given below:

      Reviewer #1 (Recommendations For The Authors):

      This manuscript provides an important case study for in-depth research on the adaptability of vertebrates in deep-sea environments. Through analysis of the genomic data of the hadal snailfish, the authors found that this species may have entered and fully adapted to extreme environments only in the last few million years. Additionally, the study revealed the adaptive features of hadal snailfish in terms of perceptions, circadian rhythms and metabolisms, and the role of ferritin in high-hydrostatic pressure adaptation. Besides, the reads mapping method used to identify events such as gene loss and duplication avoids false positives caused by genome assembly and annotation. This ensures the reliability of the results presented in this manuscript. Overall, these findings provide important clues for a better understanding of deep-sea ecosystems and vertebrate evolution.

      Reply: Thank you very much for your positive comments and encouragement.

      However, there are some issues that need to be further addressed.

      1. L119: Please indicate the source of any data used.

      Reply: Thank you very much for the suggestion. All data sources used are indicated in Supplementary file 1.

      2. L138: The demographic history of hadal snailfish suggests a significant expansion in population size over the last 60,000 years, but the results only show some species. Do the results for all individuals support this conclusion?

      Reply: Thank you for this suggestion. The estimated demographic history of the hadal snailfish reveals a significant population increase over the past 60,000 years for all individuals. The corresponding results have been incorporated into Figure 1-figure supplements 8B.

      Author response image 1.

      (B) Demographic history for 5 hadal snailfish individuals and 2 Tanaka’s snailfish individuals inferred by PSMC. The generation time of one year for Tanaka snailfish and three years for hadal snailfish.

      3. Figure 1-figure supplements 8: Is there a clear source of evidence for the generation time of 1 year chosen for the PSMC analysis?

      Reply: We apologize for the inclusion of an incorrect generation time in Figure 1-figure supplements 8. It is important to note that different generation times do not change the shape of the PSMC curve; they only shift the curve along the time axis. Due to the absence of definitive evidence regarding the generation time of the hadal snailfish, we have referred to Wang et al., 2019, assuming a generation time of one year for Tanaka snailfish and three years for hadal snailfish. The generation time has been incorporated into the main text (lines 516-517): “The generation time of one year for Tanaka snailfish and three years for hadal snailfish.”
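
      The statement that generation time only moves the curve along the time axis can be checked with a short sketch of the conventional PSMC rescaling. It assumes the usual conversion (N0 = θ0 / (4μ) / bin size, time in years = 2·N0·t·g, Ne = N0·λ) with the per-generation mutation rate held fixed; the function name and all numerical values are hypothetical and not taken from the study.

```python
def scale_psmc(scaled_times, scaled_lambdas, theta0, mu_per_generation,
               generation_years, bin_size=100):
    """Convert PSMC's scaled output to years and effective population size."""
    n0 = theta0 / (4.0 * mu_per_generation) / bin_size
    years = [2.0 * n0 * t * generation_years for t in scaled_times]
    effective_sizes = [n0 * lam for lam in scaled_lambdas]
    return years, effective_sizes


# Hypothetical toy values, not taken from the study.
times = [0.01, 0.05, 0.2, 1.0]
lambdas = [1.5, 0.8, 1.2, 2.0]
theta0, mu = 0.02, 2.5e-8

years_g1, ne_g1 = scale_psmc(times, lambdas, theta0, mu, generation_years=1)
years_g3, ne_g3 = scale_psmc(times, lambdas, theta0, mu, generation_years=3)
```

      Under these assumptions the effective population sizes are identical for both generation times, and every time point is multiplied by the same factor of three, which appears as a uniform shift on a log-scaled time axis. If the mutation rate were instead fixed per year, the per-generation rate would change with generation time and the Ne axis would be rescaled as well.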

      4. L237: Transcriptomic data suggest that the greatest changes occur in the brain of hadal snailfish compared to Tanaka's snailfish. What functions are these changes specifically associated with, and how do these functions relate to deep-sea adaptation?

      Reply: Thank you for this suggestion. Through comparative transcriptome analysis, we identified 3,587 up-regulated genes and 3,433 down-regulated genes in the brains of hadal snailfish compared to Tanaka's snailfish. Subsequently, we conducted Gene Ontology (GO) functional enrichment analysis on the differentially expressed genes, revealing that the up-regulated genes were primarily associated with cilium, DNA repair, protein binding, ATP binding, and microtubule-based movement. Conversely, the down-regulated genes were associated with membranes, GTP-binding, proton transmembrane transport, and synaptic vesicles, as shown in the following table (Supplementary file 15). Previous studies have shown that high hydrostatic pressure induces DNA strand breaks and damage; the DNA repair-related genes up-regulated in the brain may therefore help hadal snailfish overcome these challenges.

      Author response table 1.

      GO enrichment of expression up-regulated and down-regulated genes in hadal snailfish brain.

      We have added new results (Supplementary file 15) and descriptions to show the changes in the brains of hadal snailfish (lines 250-255): “Specifically, there are 3,587 up-regulated genes and 3,433 down-regulated genes in the brain of hadal snailfish compared to Tanaka snailfish, and Gene Ontology (GO) functional enrichment analyses revealed that up-regulated genes in the hadal snailfish are associated with cilium, DNA repair, and microtubule-based movement, while down-regulated genes are enriched in membranes, GTP-binding, proton transmembrane transport, and synaptic vesicles (Supplementary file 15).”

      5. L276: What is the relationship between low bone mineralization and deep-sea adaptation, and can low mineralization help deep-sea fish better adapt to the deep sea?

      Reply: Thank you for this suggestion. The hadal snailfish exhibits lower bone mineralization compared to Tanaka's snailfish, which may have facilitated its adaptation to the deep sea. On one hand, this reduced bone mineralization could have contributed to the hadal snailfish's ability to maintain neutral buoyancy without excessive energy expenditure. On the other hand, the lower bone mineralization may have also rendered their skeleton more flexible and malleable, enhancing their resilience to high hydrostatic pressure. Accordingly, we added the following new descriptions (lines 295-300): “Nonetheless, micro-CT scans have revealed shorter bones and reduced bone density in hadal snailfish, from which it has been inferred that this species has reduced bone mineralization (M. E. Gerringer et al., 2021); this may be a result of lowering density by reducing bone mineralization, allowing to maintain neutral buoyancy without expending too much energy, or it may be a result of making its skeleton more flexible and malleable, which is able to better withstand the effects of HHP.”

      6. L293: The abbreviation HHP was mentioned earlier in the article and does not need to be abbreviated here.

      Reply: Thank you for the correction. We have corrected the word. Line 315.

      7. L345: It should be "In addition, the phylogenetic relationships between different individuals clearly indicate that they have successfully spread to different trenches about 1.0 Mya".

      Reply: Thank you for the correction. We have corrected the word. Line 374.

      8. It is curious what functions are associated with the up-regulated and down-regulated genes in all tissues of hadal snailfish compared to Tanaka's snailfish, and what functions have hadal snailfish lost in order to adapt to the deep sea?

      Reply: Thank you for this suggestion. We added a description of this finding in the results section (lines 337-343): “Next, we identified 34 genes that are significantly more highly expressed in all organs of hadal snailfish in comparison to Tanaka’s snailfish and zebrafish, while only seven genes were found to be significantly more highly expressed in Tanaka’s snailfish using the same criterion (Figure 5-figure supplements 1). The 34 genes are enriched in only one GO category, GO:0000077: DNA damage checkpoint (Adjusted P-value: 0.0177). Moreover, five of the 34 genes are associated with DNA repair.” This suggests that up-regulated genes in all tissues in hadal snailfish are associated with DNA repair in response to DNA damage caused by high hydrostatic pressure, whereas down-regulated genes do not show enrichment for a particular function.

      Overall, the functions lost in hadal snailfish adapted to the deep sea are mainly related to the effects of the dark environment, which can be summarized as follows (lines 375-383): “The comparative genomic analysis revealed that the complete absence of light had a profound effect on the hadal snailfish. In addition to the substantial loss of visual genes and loss of pigmentation, many rhythm-related genes were also absent, although some rhythm genes were still present. The gene loss may not only come from relaxation of natural selection, but also for better adaptation. For example, the grpr gene copies are absent or down-regulated in hadal snailfish, which could in turn increased their activity in the dark, allowing them to survive better in the dark environment (Wada et al., 1997). The loss of gpr27 may also increase the ability of lipid metabolism, which is essential for coping with short-term food deficiencies (Nath et al., 2020).”

      Reviewer #2 (Recommendations For The Authors):

      I have pointed out some of the examples that struck me as worthy of additional thought/writing/comments from the authors. Any changes/comments are relatively minor.

      Reply: Thank you very much for your positive comments on this work.

      For comparative transcriptome analyses, reads were mapped back to reference genomes and TPM values were obtained for gene-level count analyses. 1:1 orthologs were used for differential expression analyses. This is indeed the only way to normalize counts across species, by comparing the same gene set in each species. Differential expression statistics were run in DESeq2. This is a robust way to compare gene expression across species and where fold-change values are reported (e.g. Fig 3, creatively by coloring the gene name) the values are best-practice.

      In other places, TPM values are reported (e.g. Fig 2D, Fig 4C, Fig 5A, Fig 4-Fig supp 4) to illustrate expression differences within a tissue across species. The comparisons look robust, although it is not made clear how the values were obtained in all cases. For example, in Fig 2D the TPM values appear to be from eyes of individual fish, but in Fig 4C and 5A they must be some kind of average? I think that information should be added to the figure legends.

      Of note: TPM values are sensitive to the shape of the RNA abundance distribution from a given sample: A small number of very highly expressed genes might bias TPM values downward for other genes. From one individual to another or from one species to another, it is not obvious to me that we should expect the same TPM distribution from the same tissues, making it a challenging metric for comparison across samples, and especially across species. An alternative measure of RNA abundance is normalized counts that can be output from DESeq2. See:

      Zhao, Y., Li, M.C., Konaté, M.M., Chen, L., Das, B., Karlovich, C., Williams, P.M., Evrard, Y.A., Doroshow, J.H. and McShane, L.M., 2021. TPM, FPKM, or normalized counts? A comparative study of quantification measures for the analysis of RNA-seq data from the NCI patient-derived models repository. Journal of translational medicine, 19(1), pp.1-15.
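
      The composition effect described in this comment can be illustrated with a minimal sketch of the standard TPM calculation. The gene lengths and read counts below are hypothetical toy values, not data from the study.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize, then scale to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical toy data: four genes, lengths in kilobases.
lengths_kb = [1.0, 2.0, 1.5, 1.0]
sample_a = [100, 200, 150, 50]     # no dominant gene
sample_b = [100, 200, 150, 5000]   # the fourth gene is very highly expressed

tpm_a = tpm(sample_a, lengths_kb)
tpm_b = tpm(sample_b, lengths_kb)
```

      The first three genes have identical raw counts in both samples, yet their TPM values are much lower in the second sample because TPM is a proportion of the total: one dominant transcript shrinks the share left for all others.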

      If the authors would like to keep the TPM values, I think it would be useful for them to visualize the TPM value distribution that the numbers were derived from. One way to do this would be to make a violin plot for species/tissue and plot the TPM values of interest on that. That would give a visualization of the ranked value of the gene within the context of all other TPM values. A more highly expressed gene would presumably have a higher rank in context of the specific tissue/species and be more towards the upper tail of the distribution. An example violin plot can be found in Fig 6 of:

      Burns, J.A., Gruber, D.F., Gaffney, J.P., Sparks, J.S. and Brugler, M.R., 2022. Transcriptomics of a Greenlandic Snailfish Reveals Exceptionally High Expression of Antifreeze Protein Transcripts. Evolutionary Bioinformatics, 18, p.11769343221118347.

      Alternatively, a comparison of TPM and normalized count data (heatmaps?) would be of use for at least some of the reported TPM values to show whether the different normalization methods give comparable outputs in terms of differential expression. One reason for these questions is that DESeq2 uses normalized counts for statistical analyses, but values are expressed as TPM in the noted figures (yes, TPM accounts for transcript length, but can still be subject to distribution biases).

      Reply: Thank you for your suggestions. Following your suggestions, we modified Fig 2D, Fig 4C, Fig 4-Fig supp 4, and Fig 5-Fig supp 1. In the differential expression analyses, DESeq2 normalized counts are available only for one-to-one orthologues of hadal snailfish and Tanaka's snailfish, so we now show the DESeq2 normalized counts in Fig 2D, Fig 4C, Fig 4-Fig supp 4, and Fig 5-Fig supp 1. For Fig 5A, because the fthl27 gene has undergone a specific copy-number expansion in hadal snailfish, we instead visualized the ranking of all fthl27 genes across tissues with violin plots in Fig 5-Fig supp 2.
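
      For readers unfamiliar with the normalized counts referred to here, the following is a minimal pure-Python sketch of the median-of-ratios idea that underlies DESeq2 size factors. It is a simplified re-implementation for illustration, not a call into DESeq2 itself, and the count matrix is a hypothetical toy example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; `counts` is a list of genes x samples."""
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Log geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(gene[j]) - m for gene, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide each raw count by the size factor of its sample."""
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]


# Hypothetical toy data: sample 2 is sequenced twice as deeply as sample 1,
# and the last gene is strongly differentially expressed.
raw_counts = [[10, 20], [100, 200], [50, 100], [5, 1000]]
factors = size_factors(raw_counts)
normalized = normalize(raw_counts, factors)
```

      Because the size factor is a median over genes, the single strongly differential gene does not distort the scaling: the three unchanged genes end up with equal normalized counts in both samples, while the fourth retains its large difference.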

      Author response image 2.

      (D) Log10-transformed DESeq2 normalized counts (COUNTDESEQ2) of vision-related genes in the eyes of hadal snailfish and Tanaka's snailfish. * represents genes significantly downregulated in hadal snailfish (corrected P < 0.05).

      Author response image 3.

      (C) Deletion of one copy of grpr and down-regulated expression of the other copy in hadal snailfish. The relative positions of genes on chromosomes are indicated by arrows, with arrows to the right representing the forward strand and arrows to the left representing the reverse strand. The heatmap presented is the average of the normalized counts for DESeq2 (COUNTDESEQ2) in all replicate samples from each tissue. * represents tissue in which the grpr-1 was significantly down-regulated in hadal snailfish (corrected P < 0.05).

      Author response image 4.

      Expression of the vitamin D related genes in various tissues of hadal snailfish and Tanaka's snailfish. The heatmap presented is the average of the normalized counts for DESeq2 (COUNTDESEQ2) in all replicate samples from each tissue.

      Author response image 5.

      (B) Expression of the ROS-related genes in different tissues of hadal snailfish and Tanaka's snailfish. The heatmap presented is the average of the normalized counts for DESeq2 (COUNTDESEQ2) in all replicate samples from each tissue.

      Author response image 6.

      Ranking of the expression of individual copies of fthl27 gene in hadal snailfish and Tanaka's snailfish in various tissues showed that all copies of fthl27 in hadal snailfish have high expression. The gene expression presented is the average of TPM in all replicate samples from each tissue.

      Line 96: Which BUSCOs? In the methods it is noted that the actinopterygii_odb10 BUSCO set was used. I think it should also be noted here so that it is clear which BUSCO set was used for completeness analysis. It could even be informally the ray-finned fish BUSCOs or Actinopterygii BUSCOs.

      Reply: Thank you for this suggestion. We used Actinopterygii_odb10 database and we added the BUSCO set to the main text as follows (lines 92-95): “The new assembly filled 1.26 Mb of gaps that were present in our previous assembly and have a much higher level of genome continuity and completeness (with complete BUSCOs of 96.0 % [Actinopterygii_odb10 database]) than the two previous assemblies.”

      Lines 102-105: The medaka genome paper proposes the notion that the ancestral chromosome number between medaka, tetraodon, and zebrafish is 24. There may be other evidence of that too. Some of that evidence should be cited here to support the notion that sticklebacks had chromosome fusions to get to 21 chromosomes rather than scorpionfish having chromosome fissions to get to 24. Here's the medaka genome paper:

      Kasahara, M., Naruse, K., Sasaki, S., Nakatani, Y., Qu, W., Ahsan, B., Yamada, T., Nagayasu, Y., Doi, K., Kasai, Y. and Jindo, T., 2007. The medaka draft genome and insights into vertebrate genome evolution. Nature, 447(7145), pp.714-719.

      Reply: Thank you for your great suggestion. Accordingly, we modified the sentence and added the citation as follows (lines 100-105): “We noticed that there is no major chromosomal rearrangement between hadal snailfish and Tanaka’s snailfish, and chromosome numbers are consistent with the previously reported MTZ-ancestor (the last common ancestor of medaka, Tetraodon, and zebrafish) (Kasahara et al., 2007), while the stickleback had undergone several independent chromosomal fusion events (Figure 1-figure supplements 4).”

      Line 161-173: "Along with the expression data, we noticed that these genes exhibit a different level of relaxation of natural selection in hadal snailfish (Figure 2B; Figure 2-figure supplements 1)." With the above statement and evidence, the authors are presumably referring to gene losses and differences in expression levels. I think that since gene expression was not measured in a controlled way it may not be a good measure of selection throughout. The reported genes could be highly expressed under some other condition, selection intact. I find Fig2-Fig supp 1 difficult to interpret. I assume I am looking for regions where Tanaka’s snailfish reads map and Hadal snailfish reads do not, but it is not abundantly clear. Also, other measures of selection might be good to investigate: accumulation of mutations in the region could be evidence of relaxed selection, for example, where essential genes will accumulate fewer mutations than conditional genes or (presumably) genes that are not needed at all. The authors could complete a mutational/SNP analysis using their genome data on the discussed genes if they want to strengthen their case for relaxed selection. Here is a reference (from Arabidopsis) showing these kinds of effects:

      Monroe, J.G., Srikant, T., Carbonell-Bejerano, P., Becker, C., Lensink, M., Exposito-Alonso, M., Klein, M., Hildebrandt, J., Neumann, M., Kliebenstein, D. and Weng, M.L., 2022. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature, 602(7895), pp.101-105.

      Reply: Thank you for pointing out this important issue. Following your suggestion, we have removed the mention of the down-regulation of some visual genes in the eyes of hadal snailfish and the results of the original Fig2-Fig supp 1 that were based on reads mapping to confirm whether the genes were lost or not. To investigate the potential relaxation of natural selection in the opn1sw2 gene in hadal snailfish, we conducted precise gene structure annotation. Our findings revealed that the opn1sw2 gene is pseudogenized in hadal snailfish, indicating a relaxation of natural selection. We have included this result in Figure 2-figure supplements 1.

      Author response image 7.

      Pseudogenization of opn1sw2 in hadal snailfish. The deletion changed the protein’s sequence, causing its premature termination.

      Accordingly, we have toned down the related conclusions in the main text as follows (lines 164-173): “We noticed that the lws gene (long wavelength) has been completely lost in both hadal snailfish and Tanaka’s snailfish; rh2 (central wavelength) has been specifically lost in hadal snailfish (Figure 2B and 2C); sws2 (short wavelength) has undergone pseudogenization in hadal snailfish (Figure 2-figure supplements 1); while rh1 and gnat1 (perception of very dim light) is both still present and expressed in the eyes of hadal snailfish (Figure 2D). A previous study has also proven the existence of rhodopsin protein in the eyes of hadal snailfish using proteome data (Yan, Lian, Lan, Qian, & He, 2021). The preservation and expression of genes for the perception of very dim light suggests that they are still subject to natural selection, at least in the recent past.”

      Line 161-170: What tissue were the transcripts derived from for looking at expression level of opsins? Eyes?

      Reply: Thank you for your suggestions. The transcripts used to assess the expression levels of opsins were obtained from the eye.

      Line 191: What does tmc1 do specifically?

      Reply: Thank you for this suggestion. The tmc1 gene encodes transmembrane channel-like protein 1, involved in the mechanotransduction process in sensory hair cells of the inner ear that facilitates the conversion of mechanical stimuli into electrical signals used for hearing and homeostasis. We added functional annotations for the tmc1 in the main text (lines 190-196): “Of these, the most significant upregulated gene is tmc1, which encodes transmembrane channel-like protein 1, involved in the mechanotransduction process in sensory hair cells of the inner ear that facilitates the conversion of mechanical stimuli into electrical signals used for hearing and homeostasis (Maeda et al., 2014), and some mutations in this gene have been found to be associated with hearing loss (Kitajiri, Makishima, Friedman, & Griffith, 2007; Riahi et al., 2014).”

      Line 208: "it is likely" is a bit proscriptive

      Reply: Thank you for this suggestion. We rephrased the sentence as follows (lines 213-215): “Expansion of cldnj was observed in all resequenced individuals of the hadal snailfish (Supplementary file 10), which provides an explanation for the hadal snailfish breaks the depth limitation on calcium carbonate deposition and becomes one of the few species of teleost in hadal zone.”

      Line 199: maybe give a little more info on exactly what cldnj does? e.g. "cldnj encodes a claudin protein that has a role in tight junctions through calcium independent cell-adhesion activity" or something like that.

      Reply: Thank you for this suggestion. We have added functional annotations for the cldnj to the main text (lines 200-204): “Moreover, the gene involved in lifelong otolith mineralization, cldnj, has three copies in hadal snailfish, but only one copy in other teleost species, encodes a claudin protein that has a role in tight junctions through calcium independent cell-adhesion activity (Figure 3B, Figure 3C) (Hardison, Lichten, Banerjee-Basu, Becker, & Burgess, 2005).”

      Lines 199-210: Paragraph on cldnj: there are extra cldnj genes in the hadal snailfish, but no apparent extra expression. Could the authors mention that in their analysis/discussion of the data?

      Reply: Thank you for your suggestions. Despite not observing significant changes in cldnj expression in the brain tissue of hadal snailfish compared to Tanaka's snailfish, it is important to consider that the brain may not be the primary site of cldnj expression. Previous studies in zebrafish have consistently shown expression of cldnj in the otocyst during the critical early growth phase of the otolith, with a lower level of expression observed in the zebrafish brain. However, due to the unavailability of otocyst samples from hadal snailfish in our current study, our findings do not provide confirmation of any additional expression changes resulting from cldnj amplification. Consequently, it is crucial to conduct future comprehensive investigations to explore the expression patterns of cldnj specifically in the otocyst of hadal snailfish. Accordingly, we added a discussion of this result in the main text (lines 209-214): “In our investigation, we found that the expression of cldnj was not significantly up-regulated in the brain of the hadal snailfish than in Tanaka’s snailfish, which may be related to the fact that cldnj is mainly expressed in the otocyst, while the expression in the brain is lower. However, due to the immense challenge in obtaining samples of hadal snailfish, the expression of cldnj in the otocyst deserves more in-depth study in the future.”

      Lines 225-231: I wonder whether low expression of a circadian gene might be a time of day effect rather than an evolutionary trait. Could the authors comment?

      Reply: Thank you for your suggestions. Previous studies have shown that the grpr gene is expressed relatively consistently in the mouse suprachiasmatic nucleus (SCN) throughout the day (Figure 4-figure supplements 1), and we therefore hypothesize that the low expression of the grpr-1 gene in hadal snailfish is an evolutionary trait rather than a time-of-day effect. We have modified this result in the main text (lines 232-242): “In addition, in the teleosts closely related to hadal snailfish, there are usually two copies of grpr encoding the gastrin-releasing peptide receptor; we noticed that in hadal snailfish one of them is absent and the other is barely expressed in brain (Figure 4C), whereas a previous study found that the grpr gene in the mouse suprachiasmatic nucleus (SCN) did not fluctuate significantly during a 24-hour light/dark cycle and had a relatively stable expression (Pembroke, Babbs, Davies, Ponting, & Oliver, 2015) (Figure 4-figure supplements 1). It has been reported that grpr deficient mice, while exhibiting normal circadian rhythms, show significantly increased locomotor activity in dark conditions (Wada et al., 1997; Zhao et al., 2023). We might therefore speculate that the absence of that gene might in some way benefit the activity of hadal snailfish under complete darkness.”

      Author response image 8.

      (B) Expression of grpr over a 24-hour light/dark cycle in the mouse suprachiasmatic nucleus (SCN). Data source: http://www.wgpembroke.com/shiny/SCNseq.

      Line 253: What is gpr27? G protein coupled receptor?

      Reply: We apologize for the ambiguous description. Gpr27 is a G protein-coupled receptor, belonging to the family of cell surface receptors. We introduced gpr27 in the main text as follows (lines 270-273): “Gpr27 is a G protein-coupled receptor, belonging to the family of cell surface receptors, involved in various physiological processes and expressed in multiple tissues including the brain, heart, kidney, and immune system.”

      Line 253: Fig4 Fig supp 3 is a good example of pseudogenization!

      Reply: Thank you very much for your recognition.

      Line 279: What is bglap? It regulates bone mineralization, but what specifically does that gene do?

      Reply: We apologize for the ambiguous description. The bglap gene encodes a highly abundant bone protein secreted by osteoblasts that binds calcium and hydroxyapatite and regulates bone remodeling and energy metabolism. We introduced bglap in the main text as follows (lines 300-304): “The gene bglap, which encodes a highly abundant bone protein secreted by osteoblasts that binds calcium and hydroxyapatite and regulates bone remodeling and energy metabolism, had been found to be a pseudogene in hadal fish (K. Wang et al., 2019), which may contribute to this phenotype.”

      Line 299: Introduction of another gene without providing an exact function: acaa1.

      Reply: We apologize for the ambiguous description. The acaa1 gene encodes acetyl-CoA acetyltransferase 1, a key regulator of fatty acid β-oxidation in the peroxisome, which plays a controlling role in fatty acid elongation and degradation. We introduced acaa1 in the main text as follows (lines 319-324): “In regard to the effect of cell membrane fluidity, relevant genetic alterations had been identified in previous studies, i.e., the amplification of acaa1 (encoding acetyl-CoA acetyltransferase 1, a key regulator of fatty acid β-oxidation in the peroxisome, which plays a controlling role in fatty acid elongation and degradation) may increase the ability to synthesize unsaturated fatty acids (Fang et al., 2000; K. Wang et al., 2019).”

      Fig 5 legend: The DCFH-DA experiment is not an immunofluorescence assay. It is better described as a redox-sensitive fluorescent probe. Please take note throughout.

      Reply: Thank you for pointing out our mistake. We have corrected the wording in lines 1048 and 1151 as follows: “ROS levels were confirmed by redox-sensitive fluorescent probe using DCFH-DA molecular probe in 293T cell culture medium with or without fthl27-overexpression plasmid added with H2O2 or FAC for 4 hours.”

      Line 326: Manuscript notes that ROS levels in transfected cells are "significantly lower" than the control group, but there is no quantification or statistical analysis of ROS levels. In the methods, I noticed the mention of flow cytometry, but do not see any data from that experiment. Proportion of cells with DCFH-DA fluorescence above a threshold would be a good statistic for the experiment... Another could be average fluorescence per cell. Figure 5B shows some images with green dots and it looks like more green in the "control" (which could better be labeled as "mock-transfection") than in the fthl27 overexpression, but this could certainly be quantified by flow cytometry. I recommend that data be added.

      Reply: Thank you for your suggestions. We apologize for the error in the main text: we used a fluorescence microscope to observe fluorescence in our experiments, not a flow cytometer. We have corrected it in the methods section as follows (lines 651-653): “ROS levels were measured using a DCFH-DA molecular probe, and fluorescence was observed through a fluorescence microscope with an optional FITC filter, with the background removed to observe changes in fluorescence.” We also processed the images with ImageJ to obtain the respective mean fluorescence intensities (MFI) and found that the MFI of the fthl27-overexpression cells was lower than that of the control group, which indicated that the ROS levels of the fthl27-overexpression cells were significantly lower than those of the control group. MFI has been added to Figure 5B.

      Author response image 9.

      ROS levels were confirmed by redox-sensitive fluorescent probe using DCFH-DA molecular probe in 293T cell culture medium with or without fthl27-overexpression plasmid added with H2O2 or FAC for 4 hours. Images are merged from bright field images with fluorescent images using ImageJ, while the mean fluorescence intensity (MFI) is also calculated using ImageJ. Green, cellular ROS. Scale bars equal 100 μm.

      Regarding the ROS experiment: Transfection of HEK293T cells should be reasonably straightforward, and the experiment was controlled appropriately with a mock transfection, but some additional parameters are still needed to help interpret the results. Those include: Direct evidence that the transfection worked, like qPCR, western blots (is the fthl27 tagged with an antigen?), coexpression of a fluorescent protein. Then transfection efficiency should be calculated and reported.

      Reply: Thank you for your suggestions. To assess the success of the transfection, we randomly selected a subset of fthl27-transfected HEK293T cells for transcriptome sequencing. This approach allowed us to examine the gene expression profiles and confirm the efficacy of the transfection process. As control samples, we obtained transcriptome data from two untreated HEK293T cells (SRR24835259 and SRR24835265) from NCBI. Subsequently, we extracted the fthl27 gene sequence of the hadal snailfish, along with 1,000 bp upstream and downstream regions, as a separate scaffold. This scaffold was then merged with the human genome to assess the expression levels of each gene in the three transcriptome datasets. The results demonstrated that the fthl27 gene exhibited the highest expression in fthl27-transfected HEK293T cells, while in the control group, the expression of the fthl27 gene was negligible (TPM = 0). Additionally, the expression patterns of other highly expressed genes were similar to those observed in the control group, confirming the successful fthl27 transfection. These findings have been incorporated into Figure 5-figure supplements 3.

      Author response image 10.

      (B) Read depth of the fthl27 gene in fthl27-transfected HEK293T cells and 2 untreated HEK293T cells (SRR24835259 and SRR24835265) transcriptome data. (C) Expression of each gene in the transcriptome data of fthl27-transfected HEK293T cells and 2 untreated HEK293T cells (SRR24835259 and SRR24835265), where the genes shown are the 4 most highly expressed genes in each sample.
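
      For readers unfamiliar with the TPM unit used in the reply above, the following is a minimal sketch of the standard transcripts-per-million normalisation. It is not the authors' pipeline: the gene names other than fthl27, the read counts, and the transcript lengths are hypothetical values chosen only to show why a gene with no mapped reads has TPM = 0.

```python
def tpm(read_counts, lengths_bp):
    """Standard TPM: length-normalise each gene's reads, then scale so all genes sum to 1e6."""
    # Reads per base for each gene (length normalisation).
    rates = {gene: read_counts[gene] / lengths_bp[gene] for gene in read_counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("no reads mapped to any gene")
    # Rescale so that the values across all genes sum to one million.
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical counts and lengths for an untreated sample (illustration only).
counts = {"fthl27": 0, "GENE_A": 100, "GENE_B": 300}
lengths = {"fthl27": 1000, "GENE_A": 1000, "GENE_B": 3000}
untreated_tpm = tpm(counts, lengths)
```

      Because TPM values are proportions of the length-normalised read pool, a value of exactly 0 in this sketch arises only when no reads map to the gene.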

      Lines 383-386: expression of DNA repair genes is mentioned, but not shown anywhere in the results?

      Reply: Thank you for your suggestions. Accordingly, we added a description of this finding in the results section (lines 337-343): “Next, we identified 34 genes that are significantly more highly expressed in all organs of hadal snailfish in comparison to Tanaka’s snailfish and zebrafish, while only seven genes were found to be significantly more highly expressed in Tanaka’s snailfish using the same criterion (Figure 5-figure supplements 1). The 34 genes are enriched in only one GO category, GO:0000077: DNA damage checkpoint (Adjusted P-value: 0.0177). Moreover, five of the 34 genes are associated with DNA repair.” We also added this information to Figure 5-figure supplements 1C.

      Author response image 11.

      (C) Genes were significantly more highly expressed in all tissues of the hadal snailfish compared to Tanaka's snailfish, and 5 genes (purple) were associated with DNA repair.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This important study explores infants' attention patterns in real-world settings using advanced protocols and cutting-edge methods. The presented evidence for the role of EEG theta power in infants' attention is currently incomplete. The study will be of interest to researchers working on the development and control of attention.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The paper investigates the physiological and neural processes that relate to infants' attention allocation in a naturalistic setting. Contrary to experimental paradigms that are usually employed in developmental research, this study investigates attention processes while letting the infants be free to play with three toys in the vicinity of their caregiver, which is closer to a common, everyday life context. The paper focuses on infants at 5 and 10 months of age and finds differences in what predicts attention allocation. At 5 months, attention episodes are shorter and their duration is predicted by autonomic arousal. At 10 months, attention episodes are longer, and their duration can be predicted by theta power. Moreover, theta power predicted the proportion of looking at the toys, as well as a decrease in arousal (heart rate). Overall, the authors conclude that attentional systems change across development, becoming more driven by cortical processes.

      Strengths:

      I enjoyed reading the paper, I am impressed with the level of detail of the analyses, and I am strongly in favour of the overall approach, which tries to move beyond in-lab settings. The collection of multiple sources of data (EEG, heart rate, looking behaviour) at two different ages (5 and 10 months) is a key strength of this paper. The original analyses, which build onto robust EEG preprocessing, are an additional feat that improves the overall value of the paper. The careful consideration of how theta power might change before, during, and in the prediction of attention episodes is especially remarkable. However, I have a few major concerns that I would like the authors to address, especially on the methodological side.

      Points of improvement

      (1) Noise

      The first concern is the level of noise across age groups, periods of attention allocation, and metrics. Starting with EEG, I appreciate the analysis of noise reported in supplementary materials. The analysis focuses on a broad level (average noise in 5-month-olds vs 10-month-olds) but variations might be more fine-grained (for example, noise in 5mos might be due to fussiness and crying, while at 10 months it might be due to increased movements). More importantly, noise might even be the same across age groups, but correlated to other aspects of their behaviour (head or eye movements) that are directly related to the measures of interest. Is it possible that noise might co-vary with some of the behaviours of interest, thus leading to either spurious effects or false negatives? One way to address this issue would be for example to check if noise in the signal can predict attention episodes. If this is the case, noise should be added as a covariate in many of the analyses of this paper. 

      We thank the reviewer for this comment. We certainly have evidence that even the most state-of-the-art cleaning procedures (such as machine-learning trained ICA decompositions, as we applied here) are unable to remove eye movement artifact entirely from EEG data (Haresign et al., 2021; Phillips et al., 2023). (This applies to our data but also to others’ where confounding effects of eye movements are generally not considered.) Importantly, however, our analyses have been designed very carefully with this explicit challenge in mind. All of our analyses compare changes in the relationship between brain activity and attention as a function of age, and there is no evidence to suggest that different sources of noise (e.g. crying vs. movement) would associate differently with attention durations nor change their interactions with attention over developmental time. And figures 5 and 7, for example, both look at the relationship of EEG data at one moment in time to a child’s attention patterns hundreds or thousands of milliseconds before and after that moment, for which there is no possibility that head or eye movement artifact can have systematically influenced the results.

      Moving onto the video coding, I see that inter-rater reliability was not very high. Is this due to the fine-grained nature of the coding (20ms)? Is it driven by differences in expertise among the two coders? Or because coding this fine-grained behaviour from video data is simply too difficult? The main dependent variable (looking duration) is extracted from the video coding, and I think the authors should be confident they are maximising measurement accuracy.

      We appreciate the concern. To calculate IRR we used this function (Cardillo G. (2007) Cohen's kappa: compute the Cohen's kappa ratio on a square matrix. http://www.mathworks.com/matlabcentral/fileexchange/15365). Our “Observed agreement” was 0.7 (std= 0.15). However, we decided to report the Cohen's kappa coefficient, which is generally thought to be a more robust measure as it takes into account the agreement occurring by chance. We conducted the training meticulously (refer to response to Q6, R3), and we have confidence that our coders performed to the best of their abilities.
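
      To illustrate the distinction drawn above between observed agreement and Cohen's kappa, here is a minimal sketch. It is not the cited MATLAB function, and the 2 x 2 table of counts is hypothetical; it only shows how kappa discounts the agreement expected by chance from the two coders' marginal frequencies.

```python
def agreement_and_kappa(confusion):
    """Return (observed agreement, Cohen's kappa) for a square table of counts.

    Rows index the category chosen by coder 1, columns the category chosen by coder 2.
    """
    size = len(confusion)
    if any(len(row) != size for row in confusion):
        raise ValueError("confusion must be a square matrix")
    total = sum(sum(row) for row in confusion)
    # Proportion of items on which both coders chose the same category.
    p_observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the product of the marginal proportions.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(size)) for j in range(size)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa


# Hypothetical counts of coded frames (e.g. "looking at toy" vs "not looking").
example_table = [[40, 10],
                 [20, 30]]
p_o, kappa = agreement_and_kappa(example_table)
```

      In this hypothetical table the observed agreement is 0.7 while kappa is lower, because half of the agreement would be expected by chance alone.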

      (2) Cross-correlation analyses

      I would like to raise two issues here. The first is the potential problem of using auto-correlated variables as input for cross-correlations. I am not sure whether theta power was significantly autocorrelated. If it is, could it explain the cross-correlation result? The fact that the cross-correlation plots in Figure 6 peak at zero, and are significant (but lower) around zero, makes me think that it could be a consequence of periods around zero being autocorrelated. Relatedly: how does the fact that the significant lag includes zero, and a bit before, affect the interpretation of this effect? 

      Just to clarify this analysis, we did include a plot showing autocorrelation of theta activity in the original submission (Figs 7A and 7B in the revised paper). These indicate that theta shows little to no autocorrelation. And we can see no way in which this might have influenced our results. From their comments, the reviewer seems rather to be thinking of phasic changes in the autocorrelation, and whether the possibility that greater stability in theta during the time period around looks might have caused the cross-correlation result shown in 7E. Again though we can see no way in which this might be true, as the cross-correlation indicates that greater theta power is associated with a greater likelihood of looking, and this would not have been affected by changes in the autocorrelation.

      A second issue with the cross-correlation analyses is the coding of the looking behaviour. If I understand correctly, if an infant looked for a full second at the same object, they would get a maximum score (e.g., 1) while if they looked at 500ms at the object and 500ms away from the object, they would receive a score of e.g., 0.5. However, if they looked at one object for 500ms and another object for 500ms, they would receive a maximum score (e.g., 1). The reason seems unclear to me because these are different attention episodes, but they would be treated as one. In addition, the authors also show that within an attentional episode theta power changes (for 10mos). What is the reason behind this scoring system? Wouldn't it be better to adjust by the number of attention switches, e.g., with the formula: looking-time/(1+N_switches), so that if infants looked for a full second, but made 1 switch from one object to the other, the score would be .5, thus reflecting that attention was terminated within that episode? 

      We appreciate this suggestion. This is something we did not consider, and we thank the reviewer for raising it. In response to their comment, we have now rerun the analyses using the new measure (looking-time/(1+N_switches)), and we are reassured to find that the results remain highly consistent. Please see Author response image 1 below, where the original results are shown in orange and the new measure in blue at 5 and 10 months.

      Author response image 1.
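
      The switch-adjusted measure discussed above, looking-time/(1+N_switches), can be sketched as follows. This is an illustration rather than the authors' code: the frame labels are hypothetical, a 1-second window is assumed to hold 50 frames of 20 ms each, and a switch is assumed to be counted whenever the looked-at object differs from the previously looked-at object.

```python
def switch_adjusted_score(frames):
    """Proportion of frames spent looking at any object, divided by (1 + number of switches).

    `frames` holds one label per coded video frame: an object name, or None for looking away.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    looked = [label for label in frames if label is not None]
    proportion_looking = len(looked) / len(frames)
    # Assumption: a switch is any change of object between consecutive looked-at frames.
    n_switches = sum(1 for prev, cur in zip(looked, looked[1:]) if prev != cur)
    return proportion_looking / (1 + n_switches)


# Three hypothetical 1-second windows (50 frames of 20 ms each).
full_look = switch_adjusted_score(["toy_a"] * 50)
half_look = switch_adjusted_score(["toy_a"] * 25 + [None] * 25)
one_switch = switch_adjusted_score(["toy_a"] * 25 + ["toy_b"] * 25)
```

      Under these assumptions a full second on one object scores 1, whereas a second split between two objects scores 0.5, matching the reviewer's example.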

      (3) Clearer definitions of variables, constructs, and visualisations

      The second issue is the overall clarity and systematicity of the paper. The concept of attention appears with many different names. Only in the abstract, it is described as attention control, attentional behaviours, attentiveness, attention durations, attention shifts and attention episode. More names are used elsewhere in the paper. Although some of them are indeed meant to describe different aspects, others are overlapping. As a consequence, the main results also become more difficult to grasp. For example, it is stated that autonomic arousal predicts attention, but it's harder to understand what specific aspect (duration of looking, disengagement, etc.) it is predictive of. Relatedly, the cognitive process under investigation (e.g., attention) and its operationalization (e.g., duration of consecutive looking toward a toy) are used interchangeably. I would want to see more demarcation between different concepts and between concepts and measurements.

      We appreciate the comment and we have clarified the concepts and their operationalisation throughout the revised manuscript.

      General Remarks

      In general, the authors achieved their aim in that they successfully showed the relationship between looking behaviour (as a proxy of attention), autonomic arousal, and electrophysiology. Two aspects are especially interesting. First, the fact that at 5 months, autonomic arousal predicts the duration of subsequent attention episodes, but at 10 months this effect is not present. Conversely, at 10 months, theta power predicts the duration of looking episodes, but this effect is not present in 5-month-old infants. This pattern of results suggests that younger infants have less control over their attention, which mostly depends on their current state of arousal, but older infants have gained cortical control of their attention, which in turn impacts their looking behaviour and arousal.

      We thank the reviewer for the close attention that they have paid to our manuscript, and for their insightful comments.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript explores infants' attention patterns in real-world settings and their relationship with autonomic arousal and EEG oscillations in the theta frequency band. The study included 5- and 10-month-old infants during free play. The results showed that the 5-month-old group exhibited a decline in HR forward-predicted attentional behaviors, while the 10-month-old group exhibited increased theta power following shifts in gaze, indicating the start of a new attention episode. Additionally, this increase in theta power predicted the duration of infants' looking behavior.

      Strengths:

      The study's strengths lie in its utilization of advanced protocols and cutting-edge techniques to assess infants' neural activity and autonomic arousal associated with their attention patterns, as well as the extensive data coding and processing. Overall, the findings have important theoretical implications for the development of infant attention.

      Weaknesses:

      Certain methodological procedures require further clarification, e.g., details on EEG data processing. Additionally, it would be beneficial to eliminate possible confounding factors and consider alternative interpretations, e.g., whether the differences observed between the two age groups were partly due to varying levels of general arousal and engagement during the free play.

      We thank the reviewer for their suggestions and have addressed them in our point-by-point responses below.

      Reviewer #3 (Public Review):

      Summary:

      Much of the literature on attention has focused on static, non-contingent stimuli that can be easily controlled and replicated--a mismatch with the actual day-to-day deployment of attention. The same limitation is evident in the developmental literature, which is further hampered by infants' limited behavioral repertoires and the general difficulty in collecting robust and reliable data in the first year of life. The current study engages young infants as they play with age-appropriate toys, capturing visual attention, cardiac measures of arousal, and EEG-based metrics of cognitive processing. The authors find that the temporal relations between measures are different at age 5 months vs. age 10 months. In particular, at 5 months of age, cardiac arousal appears to precede attention, while at 10 months of age attention processes lead to shifts in neural markers of engagement, as captured in theta activity.

      Strengths:

      The study brings to the forefront sophisticated analytical and methodological techniques to bring greater validity to the work typically done in the research lab. By using measures in the moment, they can more closely link biological measures to actual behaviors and cognitive stages. Often, we are forced to capture these measures in separate contexts and then infer in-the-moment relations. The data and techniques provide insights for future research work.

      Weaknesses:

      The sample is relatively modest, although this is somewhat balanced by the sheer number of data points generated by the moment-to-moment analyses. In addition, the study is cross-sectional, so the data cannot capture true change over time. Larger samples, followed over time, will provide a stronger test for the robustness and reliability of the preliminary data noted here. Finally, while the method certainly provides for a more active and interactive infant in testing, we are a few steps removed from the complexity of daily life and social interactions.

      We thank the reviewer for their suggestions and have addressed them in our point-by-point responses below.

      Reviewer #1 (Recommendations For The Authors):

      Here are some specific ways in which clarity can be improved:

      A. Regarding the distinction between constructs, or measures and constructs:

      i. In the results section, I would prefer to mention looking duration and heart rate as metrics that have been measured, while in the introduction and discussion, a clear 1-to-1 link between construct/cognitive process and behavioural or (neuro)psychophysical measure can be made (e.g., sustained attention is measured via looking durations; autonomic arousal is measured via heart-rate). 

      The way attention and arousal were operationalised is now clarified throughout the text, especially in the results.

      ii. Relatedly, the "attention" variable is not really measuring attention directly. It is rather measuring looking time (proportion of looking time to the toys?), which is the operationalisation, which is hypothesised to be related to attention (the construct/cognitive process). I would make the distinction between the two stronger.

      This distinction between looking and paying attention is clearer now in the revised manuscript as per R1 and R3’s suggestions. We have also added a paragraph in the Introduction to clarify it and point out its limitations (see pg. 5).

      B. Each analysis should be set out to address a specific hypothesis. I would rather see hypotheses in the introduction (without direct reference to the details of the models that were used), and how a specific relation between variables should follow from such hypotheses. This would also solve the issue that some analyses did not seem directly necessary to the main goal of the paper. For example:

      i. Are ACF and survival probability analyses aimed at proving different points, or are they different analyses to prove the same point? Consider either making clearer how they differ or moving one to supplementary materials.

      We clarified this in pg. 4 of the revised manuscript.

      ii. The autocorrelation results are not mentioned in the introduction. Are they aiming to show that the variables can be used for cross-correlation? Please clarify their role or remove them.

      We clarified this in pg. 4 of the revised manuscript.

      C. Clarity of cross-correlation figures. To ensure clarity when presenting a cross-correlation plot, it's important to provide information on the lead-lag relationships and which variable is considered X and which is Y. This could be done by labelling the axes more clearly (e.g., the left-hand side of the x-axis specifies x leads y, right hand specifies y leads x) or adding a legend (e.g., dashed line indicates x leading y, solid line indicates y leading x). Finally, the limits of the x-axis are consistent across plots, but the limits of the y-axis differ, which makes it harder to visually compare the different plots. More broadly, the plots could have clearer labels, and their resolution could also be improved. 

      This information on what variable precedes/ follows was in the caption of the figures. However, we have edited the figures as per the reviewer’s suggestion and added this information in the figures themselves. We have also uploaded all the figures in higher resolution.

      D. Figure 7 was extremely helpful for understanding the paper, and I would rather have it as Figure 1 in the introduction. 

      We have moved figure 7 to figure 1 as per this request.

      E. Statistics should always be reported, and effects should always be described. For example, results of autocorrelation are not reported, and from the plot, it is also not clear if the effects are significant (the caption states that red dots indicate significance, but there are no red dots. Does this mean there is no autocorrelation?).

      We apologise – this was hard to read in the original. We have clarified that there is no autocorrelation present in Fig 7A and 7D.

      And if so, given that theta is a wave, how is it possible that there is no autocorrelation (connected to point 1)? 

      We thank the reviewer for raising this point. In fact, theta power is looking at oscillatory activity in the EEG within the 3-6Hz window (i.e. 3 to 6 oscillations per second). Whereas we were analysing the autocorrelation in the EEG data by looking at changes in theta power between consecutive 1 second long windows. To say that there is no autocorrelation in the data means that, if there is more 3-6Hz activity within one particular 1-second window, there tends not to be significantly more 3-6Hz activity within the 1-second windows immediately before and after.
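
      The distinction made above, between oscillation within a window and autocorrelation of theta power across windows, can be sketched with synthetic data. This is an illustration under stated assumptions, not the authors' analysis: the 100 Hz sampling rate, the 4 Hz sine, and the randomly drawn amplitudes are all hypothetical, and band power is taken from a plain discrete Fourier transform at 3, 4, 5 and 6 Hz.

```python
import math
import random

FS = 100  # assumed sampling rate in Hz, so a 1-second window holds FS samples


def band_power(window, f_lo=3, f_hi=6):
    """Summed DFT power at integer frequencies f_lo..f_hi Hz for one 1-second window."""
    n = len(window)  # the window spans 1 second, so DFT bin f corresponds to f Hz
    power = 0.0
    for f in range(f_lo, f_hi + 1):
        re = sum(x * math.cos(2 * math.pi * f * k / n) for k, x in enumerate(window))
        im = sum(x * math.sin(2 * math.pi * f * k / n) for k, x in enumerate(window))
        power += (re * re + im * im) / n ** 2
    return power


def autocorrelation(series, lag):
    """Pearson correlation between series[t] and series[t + lag], for lag >= 1."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    a, b = series[:-lag], series[lag:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic signal: a 4 Hz sine in every window, with an amplitude drawn
# independently for each 1-second window (hypothetical values).
random.seed(0)
powers = []
for _ in range(200):
    amplitude = random.uniform(0.5, 1.5)
    window = [amplitude * math.sin(2 * math.pi * 4 * k / FS) for k in range(FS)]
    powers.append(band_power(window))

r1 = autocorrelation(powers, lag=1)
```

      Every window here contains a strongly periodic 4 Hz signal, yet the lag-1 autocorrelation of the power series is expected to be close to zero, because the amount of theta in one window carries no information about the amount in the next.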

      F. Alpha power is introduced later on, and in the discussion, it is mentioned that the effects that were found go against the authors' expectations. However, alpha power and the authors' expectations about it are not mentioned in the introduction. 

      We thank the reviewer for this comment. We have added a paragraph on alpha in the introduction (pg.4).

      Minor points:

      (1) At the end of 1st page of introduction, the authors state that: 

      “How children allocate their attention in experimenter-controlled, screen-based lab tasks differs, however, from actual real-world attention in several ways (32-34). For example, the real-world is interactive and manipulable, and so how we interact with the world determines what information we, in turn, receive from it: experiences generate behaviours (35).”

      I think there's more to this though - Lab-based studies can be made interactive too (e.g., Meyer et al., 2023, Stahl & Feigenson, 2015). What remains unexplored is how infants actively and freely initiate and self-structure their attention, rather than how they respond to experimental manipulations.

      Meyer, M., van Schaik, J. E., Poli, F., & Hunnius, S. (2023). How infant‐directed actions enhance infants' attention, learning, and exploration: Evidence from EEG and computational modeling. Developmental Science, 26(1), e13259.

      Stahl, A. E., & Feigenson, L. (2015). Observing the unexpected enhances infants' learning and exploration. Science, 348(6230), 91-94.

      We thank the reviewer for this suggestion and added their point in pg. 4.

      (2) Regarding analysis 4:

      a. In analysis 1 you showed that the duration of attentional episodes changes with age. Is it fair to keep the same start, middle, and termination ranges across age groups? Is 3-4 seconds "middle" for 5-month-olds? 

      We appreciate the comment. There are many ways we could have run these analyses and, in fact, in other papers we have done it differently, for example by splitting each look in 3, irrespective of its duration (Phillips et al., 2023).

      However, one aspect we took into account was the observation that 5-month-old infants exhibited more short looks than older infants. We recognized that dividing each look into 3 parts, regardless of its duration, might have impacted the results. Presumably, the activity during the middle and termination phases of a 1.5-second look differs from that of a look lasting over 7 seconds.

      Two additional factors gave us confidence in our approach: 1) while the definition of "middle" was somewhat arbitrary, it allowed us to maintain consistency in our analyses across the two age points; and 2) we obtained a comparable number of observations across the two time points (e.g. for “middle” we had 172 events at 5 months and 194 events at 10 months).

      b. It is recommended not to interpret lower-level interactions if more complex interactions are not significant. How are the interaction effects in a simpler model in which the 3-way interaction is removed? 

      We appreciate the comment. We tried to follow the same steps as in (Xie et al., 2018). However, we have re-analysed the data removing the 3-way interaction and the significance of the results stayed the same. Please see Author response image 2 below (first: new analyses without the 3-way interactions, second: original analyses that included the 3-way interaction).

      Author response image 2.

      (3) Figure S1: there seems to be an outlier in the bottom-right panel. Do results hold excluding it? 

      We re-ran these analyses as per this suggestion and the results stayed the same (refer to SM pg. 2).

      (4) Figure S2 should refer to 10 months instead of 12.

      We thank the reviewer for noticing this typo; we have corrected it in the revised manuscript (see SM pg. 3). 

      (5) In the 2nd paragraph of the discussion, I found this sentence unclear: "From Analysis 1 we found that infants at both ages showed a preferred modal reorientation rate". 

      We clarified this on pg. 10 of the revised manuscript.

      (6) Discussion: many (infant) studies have used theta in anticipation of receiving information (Begus et al., 2016), surprising events (Meyer et al., 2023), and especially exploration (Begus et al., 2015). Can you make a broader point on how these findings inform our interpretation of theta in the infant population (go more from description to underlying mechanisms)? 

      We have expanded on this point about interpreting frequency bands on pg. 13 of the revised manuscript and thank the reviewer for bringing it up.

      Begus, K., Gliga, T., & Southgate, V. (2016). Infants' preferences for native speakers are associated with an expectation of information. Proceedings of the National Academy of Sciences, 113(44), 12397-12402.

      Meyer, M., van Schaik, J. E., Poli, F., & Hunnius, S. (2023). How infant‐directed actions enhance infants' attention, learning, and exploration: Evidence from EEG and computational modeling. Developmental Science, 26(1), e13259.

      Begus, K., Southgate, V., & Gliga, T. (2015). Neural mechanisms of infant learning: differences in frontal theta activity during object exploration modulate subsequent object recognition. Biology letters, 11(5), 20150041.

      (7) 2nd page of discussion, last paragraph: "preferred modal reorientation timer" is not a neural/cognitive mechanism, just a resulting behaviour. 

      We agree with this comment and thank the reviewer for bringing it to our attention. We clarified this on pg. 12 and pg. 13 of the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      I have a few comments and questions that I think the authors should consider addressing in a revised version. Please see below:

      (1) During preprocessing (steps 5 and 6), it seems like the "noisy channels" were rejected using the pop_rejchan.m function and then interpolated. This procedure is common in infant EEG analysis, but a concern arises: was there no upper limit for channel interpolation? Did the authors still perform bad channel interpolation even when more than 30% or 40% of the channels were identified as "bad" at the beginning with the continuous data? 

      We did state in the original manuscript that “participants with fewer than 30% channels interpolated at 5 months and 25% at 10 months made it to the final step (ICA) and final analyses”. In the revised version we have re-written this section in order to make this more clear (pg. 17).

      (2) I am also perplexed about the sequencing of the ICA pruning step. If the intention of ICA pruning is to eliminate artificial components, would it be more logical to perform this procedure before the conventional artifacts' rejection (i.e., step 7), rather than after? In addition, what was the methodology employed by the authors to identify the artificial ICA components? Was it done through manual visual inspection or utilizing specific toolboxes? 

      We agree that the ICA is often run before, however, the decision to reject continuous data prior to ICA was to remove the very worst sections of data (where almost all channels were affected), which can arise during times when infants fuss or pull the caps. Thus, this step was applied at this point in the pipeline so that these sections of really bad data were not inputted into the ICA. This is fairly widespread practice in cleaning infant data.

      Concerning the reviewer’s second question, of how ICA components were removed – the answer to this is described in considerable detail in the paper that we refer to in that section of the manuscript. This was done by training a classifier specially designed to clean naturalistic infant EEG data (Haresign et al., 2021) and has since been employed in similar studies (e.g. Georgieva et al., 2020; Phillips et al., 2023).

      (3) Please clarify how the relative power was calculated for the theta (3-6Hz) and alpha (6-9Hz) bands. Were they calculated by dividing the ratio of theta or alpha power to the power between 3 and 9Hz, or the total power between 1 (or 3) and 20 Hz? In other words, what does the term "all frequency bands" refer to in section 4.3.7? 

      We thank the reviewer for this comment; we have now clarified this in pg. 22.
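
      To make the reviewer's question concrete, the two candidate normalisations can be sketched as follows. This is a minimal illustration, not the authors' method: the response does not state which denominator was used, and the toy spectrum, the function names and the half-open band edges (so that 6 Hz counts as alpha) are all assumptions.

```python
def band_power(freqs, psd, lo, hi):
    """Sum the PSD values whose frequency falls in the half-open band [lo, hi)."""
    return sum(p for f, p in zip(freqs, psd) if lo <= f < hi)


def relative_power(freqs, psd, band, reference):
    """Power in `band` divided by power in `reference` (both are (lo, hi) tuples)."""
    total = band_power(freqs, psd, *reference)
    if total == 0:
        raise ValueError("reference range contains no power")
    return band_power(freqs, psd, *band) / total


# Toy spectrum: 1 Hz bins from 1 to 20 Hz with a 1/f-like fall-off (illustrative only).
freqs = list(range(1, 21))
psd = [1.0 / f for f in freqs]

theta = (3, 6)  # theta band as defined in the review, 3-6 Hz
alpha = (6, 9)  # alpha band as defined in the review, 6-9 Hz

# Option A: normalise by theta plus alpha only (3-9 Hz).
theta_rel_narrow = relative_power(freqs, psd, theta, (3, 9))
alpha_rel_narrow = relative_power(freqs, psd, alpha, (3, 9))

# Option B: normalise by the whole 1-20 Hz range
# (upper edge set to 21 so that the 20 Hz bin is included).
theta_rel_broad = relative_power(freqs, psd, theta, (1, 21))
```

      With Option A the theta and alpha values sum to one by construction, so they are not independent; with Option B they are not tied together in this way, which is why the choice of denominator matters for interpretation.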

      (4) One of the key discoveries presented in this paper is the observation that attention shifts are accompanied by a subsequent enhancement in theta band power shortly after the shifts occur. Is it possible that this effect or alteration might be linked to infants' saccades, which are used as indicators of attention shifts? Would it be feasible to analyze the disparities in amplitude between the left and right frontal electrodes (e.g., Fp1 and Fp2, which could be viewed as virtual horizontal EOG channels) in relation to theta band power, in order to eliminate the possibility that the augmentation of theta power was attributable to the intensity of the saccades? 

      We appreciate the concern. Average saccade duration in infants is about 40ms (Garbutt et al., 2007). Our finding that the positive cross-correlation between theta and look duration is present not only when we examine zero-lag data but also when we examine how theta forwards-predicts attention 1-2 seconds afterwards seems therefore unlikely to be directly attributable to saccade-related artifact. Concerning the reviewer’s suggestion – this is something that we have tried in the past. Unfortunately, however, our experience is that identifying saccades based on the disparity between Fp1 and Fp2 is much too unreliable to be of any use in analysing data. Even if specially positioned HEOG electrodes are used, we still find the saccade detection to be insufficiently reliable. In ongoing work we are tracking eye movements separately, in order to be able to address this point more satisfactorily.

      (5) The following question is related to my previous comment. Why is the duration of the relationship between theta power and moment-to-moment changes in attention so short? If theta is indeed associated with attention and information processing, shouldn't the relationship between the two variables strengthen as the attention episode progresses? Given that the authors themselves suggest that "One possible interpretation of this is that neural activity associates with the maintenance more than the initiation of attentional behaviors," it raises the question of (is in contradiction to) why the duration of the relationship is not longer but declines drastically (Figure 6). 

      We thank the reviewer for raising this excellent point. Certainly we argue that this, together with the low autocorrelation values for theta documented in Fig 7A and 7D, challenges many conventional ways of interpreting theta. We are continuing to investigate this question in ongoing work.

      (6) Have the authors conducted a comparison of alpha relative power and HR deceleration durations between 5 and 10-month-old infants? This analysis could provide insights into whether the differences observed between the two age groups were partly due to varying levels of general arousal and engagement during free play.

      We thank the reviewer for this suggestion. Indeed, this is an aspect we investigated but ultimately, given that our primary emphasis was on the theta frequency, and considering the length of the manuscript, we decided not to incorporate it. However, we have attached Author response image 3 below, showing that there was no significant interaction between HR and the alpha band.

      Author response image 3.

      Reviewer #3 (Recommendations For The Authors):

      (1) In reading the manuscript, the language used seems to imply longitudinal data or at the very least the ability to detect change or maturation. Given the cross-sectional nature of the data, the language should be tempered throughout. The data are illustrative but not definitive. 

      We thank the reviewer for this comment. We have now clarified that “Data was analysed in a cross-sectional manner” in pg15.

      (2) The sample size is quite modest, particularly in the specific age groups. This is likely tempered by the sheer number of data points available. This latter argument is implied in the text, but not as explicitly noted. (However, I may have missed this as the text is quite dense). I think more notice is needed on the reliability and stability of the findings given the sample. 

      We have clarified this in pg16.

      (3) On a related note, how was the sample size determined? Was there a power analysis to help guide decision-making for both recruitment and choosing which analyses to proceed with? Again, the analytic approach is quite sophisticated and the questions are of central interest to researchers, but I was left feeling maybe these two aspects of the study were out-sprinting the available data. The general impression is that the sample is small, but it is not until looking at table s7, that it is in full relief. I think this should be more prominent in the main body of the study.

      We have clarified this in pg16.

      (4) The manuscript devotes a few sentences to the relation between looking and attention. However, this distinction is central to the design of the study, and any philosophical differences regarding what take-away points can be generated. In my reading, I think this point needs to be more heavily interrogated. 

      This distinction between looking and paying attention is clearer now in the reviewed manuscript as per R1 and R3’s suggestions. We have also added a paragraph in the Introduction to clarify it and pointed out its limitations (see pg.5).

      (5) I would temper the real-world attention language. This study is certainly a great step forward, relative to static faces on a computer screen. However, there are still a great number of artificial constraints that have been added. That is not to say that the constraints are bad--they are necessary to carry out the work. However, it should be acknowledged that it constrains the external validity. 

      We have added a paragraph to acknowledge the limitations of the setup in pg. 14.

      (6) The kappa on the coding is not strong. The authors chose to proceed nonetheless. Given that, I think more information is needed on how coders were trained, how they were standardized, and what parameters were used to decide they were ready to code independently. Again, with the sample size and the kappa presented, I think more discussion is needed regarding the robustness of the findings. 

      We appreciate the concern. As per our answer to R1, we chose to report the most stringent measure of inter-rater reliability, but other calculation methods (i.e., percent agreement) return higher scores (see response to R1).

      As for the training, we wrote an extensively detailed coding scheme, describing exactly how to code each look, that was handed to our coders. Throughout the initial months of training, we met with the coders on a weekly basis to discuss questions and individual frames that looked ambiguous. After each session, we would revise the coding scheme to incorporate additional details, aiming to make the coding process progressively less subjective. During this period, every coder analysed the same interactions, and inter-rater reliability (IRR) was assessed weekly, comparing their evaluations with mine (Marta). With time, the coders had fewer questions and IRR increased. At that point, we deemed them sufficiently trained, and began assigning them different interactions from each other. Periodically, though, we all assessed the same interaction and met to review and discuss our coding outputs.
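
      The difference between kappa and percent agreement mentioned in this response can be illustrated with a short sketch. The label sequences are hypothetical frame-by-frame codes invented for the example (they are not the study's data), and the function names are ours; the formula is the standard Cohen's kappa, which corrects observed agreement for the agreement expected by chance.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of frames on which the two coders gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of the coders' marginal rates.
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten frames (illustrative only).
coder_1 = ["toy"] * 8 + ["away"] * 2
coder_2 = ["toy"] * 7 + ["away"] + ["toy"] + ["away"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

      In this toy case the coders agree on 8 of 10 frames, yet kappa is considerably lower because one label dominates and much of the agreement would be expected by chance. This is the sense in which kappa is the more stringent statistic.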

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      These ingenious and thoughtful studies present important findings concerning how people represent and generalise abstract patterns of sensory data. The issue of generalisation is a core topic in neuroscience and psychology, relevant across a wide range of areas, and the findings will be of interest to researchers across areas in perception, learning, and cognitive science. The findings have the potential to provide compelling support for the outlined account, but there appear other possible explanations, too, that may affect the scope of the findings but could be considered in a revision.

      Thank you for sending the feedback from the three peer reviewers regarding our paper. Please find below our detailed responses addressing the reviewers' comments. We have incorporated these suggestions into the paper and provided explanations for the modifications made.

      We have specifically addressed the point of uncertainty highlighted in eLife's editorial assessment, which concerned alternative explanations for the reported effect. In response to Reviewer #1, we have clarified how Exp. 2c and Exp. 3c address the potential alternative explanation related to "attention to dimensions." Further, we present a supplementary analysis to account for differences in asymptotic learning, as noted by Reviewer #2. We have also clarified how our control experiments address effects associated with general cognitive engagement in the task. Lastly, we have further clarified the conceptual foundation of our paper, addressing concerns raised by Reviewers #2 and #3.

      Reviewer #1 (Public Review):

      Summary:

      This manuscript reports a series of experiments examining category learning and subsequent generalization of stimulus representations across spatial and nonspatial domains. In Experiment 1, participants were first trained to make category judgments about sequences of stimuli presented either in nonspatial auditory or visual modalities (with feature values drawn from a two-dimensional feature manifold, e.g., pitch vs timbre), or in a spatial modality (with feature values defined by positions in physical space, e.g., Cartesian x and y coordinates). A subsequent test phase assessed category judgments for 'rotated' exemplars of these stimuli: i.e., versions in which the transition vectors are rotated in the same feature space used during training (near transfer) or in a different feature space belonging to the same domain (far transfer). Findings demonstrate clearly that representations developed for the spatial domain allow for representational generalization, whereas this pattern is not observed for the nonspatial domains that are tested. Subsequent experiments demonstrate that if participants are first pre-trained to map nonspatial auditory/visual features to spatial locations, then rotational generalization is facilitated even for these nonspatial domains. It is argued that these findings are consistent with the idea that spatial representations form a generalized substrate for cognition: that space can act as a scaffold for learning abstract nonspatial concepts.

      Strengths:

      I enjoyed reading this manuscript, which is extremely well-written and well-presented. The writing is clear and concise throughout, and the figures do a great job of highlighting the key concepts. The issue of generalization is a core topic in neuroscience and psychology, relevant across a wide range of areas, and the findings will be of interest to researchers across areas in perception and cognitive science. It's also excellent to see that the hypotheses, methods, and analyses were pre-registered.

      The experiments that have been run are ingenious and thoughtful; I particularly liked the use of stimulus structures that allow for disentangling of one-dimensional and two-dimensional response patterns. The studies are also well-powered for detecting the effects of interest. The model-based statistical analyses are thorough and appropriate throughout (and it's good to see model recovery analysis too). The findings themselves are clear-cut: I have little doubt about the robustness and replicability of these data.

      Weaknesses:

      I have only one significant concern regarding this manuscript, which relates to the interpretation of the findings. The findings are taken to suggest that "space may serve as a 'scaffold', allowing people to visualize and manipulate nonspatial concepts" (p13). However, I think the data may be amenable to an alternative possibility. I wonder if it's possible that, for the visual and auditory stimuli, participants naturally tended to attend to one feature dimension and ignore the other - i.e., there may have been a (potentially idiosyncratic) difference in salience between the feature dimensions that led to participants learning the feature sequence in a one-dimensional way (akin to the 'overshadowing' effect in associative learning: e.g., see Mackintosh, 1976, "Overshadowing and stimulus intensity", Animal Learning and Behaviour). By contrast, we are very used to thinking about space as a multidimensional domain, in particular with regard to two-dimensional vertical and horizontal displacements. As a result, one would naturally expect to see more evidence of two-dimensional representation (allowing for rotational generalization) for spatial than nonspatial domains.

      In this view, the impact of spatial pre-training and (particularly) mapping is simply to highlight to participants that the auditory/visual stimuli comprise two separable (and independent) dimensions. Once they understand this, during subsequent training, they can learn about sequences on both dimensions, which will allow for a 2D representation and hence rotational generalization - as observed in Experiments 2 and 3. This account also anticipates that mapping alone (as in Experiment 4) could be sufficient to promote a 2D strategy for auditory and visual domains.

      This "attention to dimensions" account has some similarities to the "spatial scaffolding" idea put forward in the article, in arguing that experience of how auditory/visual feature manifolds can be translated into a spatial representation helps people to see those domains in a way that allows for rotational generalization. Where it differs is that it does not propose that space provides a scaffold for the development of the nonspatial representations, i.e., that people represent/learn the nonspatial information in a spatial format, and this is what allows them to manipulate nonspatial concepts. Instead, the "attention to dimensions" account anticipates that ANY manipulation that highlights to participants the separable-dimension nature of auditory/visual stimuli could facilitate 2D representation and hence rotational generalization. For example, explicit instruction on how the stimuli are constructed may be sufficient, or pre-training of some form with each dimension separately, before they are combined to form the 2D stimuli.

      I'd be interested to hear the authors' thoughts on this account - whether they see it as an alternative to their own interpretation, and whether it can be ruled out on the basis of their existing data.

      We thank the Reviewer for their comments. We agree with the Reviewer that the “attention to dimensions” hypothesis is an interesting alternative explanation. However, we believe that the results of our control experiments Exp. 2c and Exp. 3c are incompatible with this alternative explanation.

      In Exp. 2c, participants are pre-trained in the visual modality and then tested in the auditory modality. In the multimodal association task, participants have to associate the auditory stimuli and the visual stimuli: on each trial, they hear a sound and then have to click on the corresponding visual stimulus. It is thus necessary to pay attention to both auditory dimensions and both visual dimensions to perform the task. To give an example, the task might involve mapping the fundamental frequency and the amplitude modulation of the auditory stimulus to the colour and the shape of the visual stimulus, respectively. If participants pay attention to only one dimension, this would lead to a maximum of 25% accuracy on average (because they would be at chance on the other dimension, with four possible options). We observed that 30/50 participants reached an accuracy > 50% in the multimodal association task in Exp. 2c. This means that we know for sure that at least 60% of the participants paid attention to both dimensions of the stimuli. Nevertheless, there was a clear difference between participants that received a visual pre-training (Exp. 2c) and those who received a spatial pre-training (Exp. 2a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer). In fact, only 3/50 participants were best fit by a 2D model when vision was the pre-training modality compared to 29/50 when space was the pre-training modality. Thus, the benefit of the spatial pre-training cannot be due solely to a shift in attention toward both dimensions.

      This effect was replicated in Exp. 3c. Similarly, 33/48 participants reached an accuracy > 50% in the multimodal association task in Exp. 3c, meaning that we know for sure that at least 68% of the participants actually paid attention to both dimensions of the stimuli. Again, there was a clear difference between participants who received a visual pre-training (frequency of 1D vs 2D models between conditions, Exp. 3c) and those who received a spatial pre-training (Exp. 3a) (BF > 100 in near transfer and far transfer).

      Thus, we believe that the alternative explanation raised by the Reviewer is not supported by our data. We have added a paragraph in the discussion:

      “One alternative explanation of this effect could be that the spatial pre-training encourages participants to attend to both dimensions of the non-spatial stimuli. By contrast, pretraining in the visual or auditory domains (where multiple dimensions of a stimulus may be relevant less often naturally) encourages them to attend to a single dimension. However, data from our control experiments Exp. 2c and Exp. 3c, are incompatible with this explanation. Around ~65% of the participants show a level of performance in the multimodal association task (>50%) which could only be achieved if they were attending to both dimensions (performance attending to a single dimension would yield 25% and chance performance is at 6.25%). This suggests that participants are attending to both dimensions even in the visual and auditory mapping case.”
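
      The accuracy levels quoted in this reply can be reproduced with a short calculation. The sketch assumes two feature dimensions with four possible values each, which is inferred from the "four possible options" per dimension and the 6.25% chance level stated above; the function name is hypothetical.

```python
from fractions import Fraction


def expected_accuracy(n_attended, n_dims=2, n_values=4):
    """Expected accuracy when n_attended of n_dims dimensions are tracked.

    Attended dimensions are assumed to be matched perfectly; each unattended
    dimension is a uniform guess among n_values options.
    """
    if not 0 <= n_attended <= n_dims:
        raise ValueError("n_attended must lie between 0 and n_dims")
    return Fraction(1, n_values) ** (n_dims - n_attended)


chance = expected_accuracy(0)           # 1/16, i.e. 6.25%
one_dimension = expected_accuracy(1)    # 1/4, i.e. 25%
both_dimensions = expected_accuracy(2)  # 1, the ceiling

# Share of participants above the 50% threshold, from the counts in the response.
exp_2c = Fraction(30, 50)           # 60%
exp_3c = Fraction(33, 48)           # 68.75%
pooled = Fraction(30 + 33, 50 + 48)  # 63/98, about 64%
```

      Because attending to a single dimension caps expected accuracy at 25%, an accuracy above 50% is hard to reach without tracking both dimensions, and the pooled share of such participants is in line with the "~65%" figure in the added paragraph.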

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, L&S investigate the important general question of how humans achieve invariant behavior over stimuli belonging to one category given the widely varying input representation of those stimuli and more specifically, how they do that in arbitrary abstract domains. The authors start with the hypothesis that this is achieved by invariance transformations that observers use for interpreting different entries and furthermore, that these transformations in an arbitrary domain emerge with the help of the transformations (e.g. translation, rotation) within the spatial domain by using those as "scaffolding" during transformation learning. To provide the missing evidence for this hypothesis, L&S used behavioral category learning studies within and across the spatial, auditory, and visual domains, where rotated and translated 4-element token sequences had to be learned to categorize and then the learned transformation had to be applied in new feature dimensions within the given domain. Through single- and multiple-day supervised training and unsupervised tests, L&S demonstrated by standard computational analyses that in such setups, space and spatial transformations can, indeed, help with developing and using appropriate rotational mapping whereas the visual domain cannot fulfill such a scaffolding role.

      Strengths:

      The overall problem definition and the context of spatial mapping-driven solution to the problem is timely. The general design of testing the scaffolding effect across different domains is more advanced than any previous attempts clarifying the relevance of spatial coding to any other type of representational codes. Once the formulation of the general problem in a specific scientific framework is done, the following steps are clearly and logically defined and executed. The obtained results are well interpretable, and they could serve as a good stepping stone for deeper investigations. The analytical tools used for the interpretations are adequate. The paper is relatively clearly written.

      Weaknesses:

      Some additional effort to clarify the exact contribution of the paper, the link between analyses and the claims of the paper, and its link to previous proposals would be necessary to better assess the significance of the results and the true nature of the proposed mechanism of abstract generalization.

      (1) Insufficient conceptual setup: The original theoretical proposal (the Tolman-Eichenbaum-Machine, Whittington et al., Cell 2020) that L&S relate their work to proposes that just as in the case of memory for spatial navigation, humans and animals create their flexible relational memory system of any abstract representation by a conjunction code that combines on the one hand, sensory representation and on the other hand, a general structural representation or relational transformation. The TEM also suggests that the structural representation could contain any graph-interpretable spatial relations, albeit in their demonstration 2D neighbor relations were used. The goal of L&S's paper is to provide behavioral evidence for this suggestion by showing that humans use representational codes that are invariant to relational transformations of non-spatial abstract stimuli and moreover, that humans obtain these invariances by developing invariance transformers with the help of available spatial transformers. To obtain such evidence, L&S use the rotational transformation. However, the actual procedure they use actually solved an alternative task: instead of interrogating how humans develop generalizations in abstract spaces, they demonstrated that if one defines rotation in an abstract feature space embedded in a visual or auditory modality that is similar to the 2D space (i.e. has two independent dimensions that are clearly segregable and continuous), humans cannot learn to apply rotation of 4-piece temporal sequences in those spaces while they can do it in 2D space, and with co-associating a one-to-one mapping between locations in those feature spaces with locations in the 2D space an appropriate shaping mapping training will lead to the successful application of rotation in the given task (and in some other feature spaces in the given domain). 

      While this is an interesting and challenging demonstration, it does not shed light on how humans learn and generalize, only that humans CAN do learning and generalization in this highly constrained scenario. This result is a demonstration of how a stepwise learning regimen can make use of one structure for mapping a complex input into a desired output. The results clarify neither how generalizations would develop in abstract spaces nor whether this generalization uses transformations developed in the abstract space. The specific training procedure ensures success in the presented experiments but the availability and feasibility of an equivalent procedure in a natural setting is a crucial part of validating the original claim and that has not been done in the paper.

      We thank the Reviewer for their detailed comments on our manuscript. We reply to the three main points in turn.

      First, concerning the conceptual grounding of our work, we would point out that the TEM model (Whittington et al., 2020), however interesting, is not our theoretical starting point. Rather, as we hope the text and references make clear, we ground our work in theoretical work from the 1990s/2000s proposing that space acts as a scaffold for navigating abstract spaces (such as Gärdenfors, 2000). We acknowledge that the TEM model and other experimental work on the implication of the hippocampus, the entorhinal cortex and the parietal cortex in relational transformations of nonspatial stimuli provide evidence for this general theory. However, our work is designed to test a more basic question: whether there is behavioural evidence that space scaffolds learning in the first place. To achieve this, we perform behavioural experiments with a causal manipulation (spatial pre-training vs no spatial pre-training), which have the potential to provide such direct evidence. This is why we claim that:

      “This theory is backed up by proof-of-concept computational simulations [13], and by findings that brain regions thought to be critical for spatial cognition in mammals (such as the hippocampal-entorhinal complex and parietal cortex) exhibit neural codes that are invariant to relational transformations of nonspatial stimuli. However, whilst promising, this theory lacks direct empirical evidence. Here, we set out to provide a strong test of the idea that learning about physical space scaffolds conceptual generalisation.“

      Second, we agree with the Reviewer that we do not provide an explicit model for how generalisation occurs, and how precisely space acts as a scaffold for building representations and/or applying the relevant transformations to non-spatial stimuli to solve our task. Rather, we investigate in our Exp. 2-4 which aspects of the training are necessary for rotational generalisation to happen (and conclude that a simple training with the multimodal association task is sufficient for ~20% of participants). We now acknowledge in the discussion the fact that we do not provide an explicit model and leave that for future work:

      “We acknowledge that our study does not provide a mechanistic model of spatial scaffolding but rather delineate which aspects of the training are necessary for generalisation to happen.”

      Finally, we also agree with the Reviewer that our task is non-naturalistic. As is common in experimental research, one must sacrifice the naturalistic elements of the task in exchange for the control and the absence of prior knowledge of the participants. We decided to minimise the participants’ prior knowledge as far as possible, to make sure that our task involved learning something completely new and that the pre-training was really causing the better learning/generalisation. The effects we report are consistent across the experiments so we feel confident about them, but we agree with the Reviewer that an external validation with more naturalistic stimuli/tasks would be a nice addition to this work. We have included a sentence in the discussion:

      “All the effects observed in our experiments were consistent across near transfer conditions (rotation of patterns within the same feature space), and far transfer conditions (rotation of patterns within a different feature space, where features are drawn from the same modality). This shows the generality of spatial training for conceptual generalisation. We did not test transfer across modalities nor transfer in a more natural setting; we leave this for future studies.”

      (2) Missing controls: The asymptotic performance in experiment 1 after training in the three tasks was quite different in the three tasks (intercepts 2.9, 1.9, 1.6 for spatial, visual, and auditory, respectively; p. 5. para. 1, Fig 2BFJ). It seems that the statement "However, our main question was how participants would generalise learning to novel, rotated exemplars of the same concept." assumes that learning and generalization are independent. Wouldn't it be possible, though, that the level of generalization depends on the level of acquiring a good representation of the "concept" and after obtaining an adequate level of this knowledge, generalization would kick in without scaffolding? If so, a missing control is to equate the levels of asymptotic learning and see whether there is a significant difference in generalization. A related issue is that we have no information on what kind of learning in the three different domains was performed, albeit we probably suspect that in space the 2D representation was dominant while in the auditory and visual domains not so much. Thus, a second missing piece of evidence is the model-fitting results of the ⦰ condition that would show which way the original sequences were encoded (similar to Fig 2 CGK and DHL). If the reason for lower performance is not individual stimulus difficulty but the natural tendency to encode the given stimulus type by a combo of random + 1D strategy that would clarify that the result of the cross-training is, indeed, transferring the 2D-mapping strategy.

      We agree with the Reviewer that a good further control is to equate performance during training. Thus, we have run a complementary analysis where we select only the participants that reach > 90% accuracy in the last block of training in order to equate asymptotic performance after training in Exp. 1. The results (see Author response image 1) replicate those that we report in the main text: there is a large difference between groups (relative likelihood of 1D vs. 2D models, all BF > 100 in favour of a difference between the auditory and the spatial modalities, between the visual and the spatial modalities, in both near and far transfer, “decisive” evidence). We prefer not to include this figure in the paper for clarity, and because we believe this result is expected given the fact that 0/50 and 0/50 of the participants in the auditory and visual condition used a 2D strategy – thus, selecting subgroups of these participants cannot change our conclusions.

      Author response image 1.

      Results of Exp. 1 when selecting participants that reached > 90% accuracy in the last block of training. Captions are the same as Figure 2 of the main text.

      Second, the Reviewer suggested that we run the model fitting analysis only on the ⦰ condition (training) in Exp. 1 to reveal whether participants use a 1D or a 2D strategy already during training. Unfortunately, we cannot provide the model fits only in the ⦰ condition in Exp. 1 because all models make the same predictions for this condition (see Fig S4). However, note that this is done by design: participants were free to apply whatever strategy they wanted during training; we then used the generalisation phase with the rotated stimuli precisely to reveal this strategy. Further, we do believe that the strategy used by the participants during training and the strategy during transfer are the same, partly because – starting from block #4 – participants have no idea whether the current trial is a training trial or a transfer trial, as both trial types are randomly interleaved with no cue signalling the trial type. We have made this clear in the methods:

      “They subsequently performed 105 trials (with trialwise feedback) and 105 transfer trials including rotated and far transfer quadruplets (without trialwise feedback) which were presented in mixed blocks of 30 trials. Training and transfer trials were randomly interleaved, and no clue indicated whether participants were currently on a training trial or a transfer trial before feedback (or absence of feedback in case of a transfer trial).”

      Reviewer #3 (Public Review):

      Summary:

      Pesnot Lerousseau and Summerfield aimed to explore how humans generalize abstract patterns of sensory data (concepts), focusing on whether and how spatial representations may facilitate the generalization of abstract concepts (rotational invariance). Specifically, the authors investigated whether people can recognize rotated sequences of stimuli in both spatial and nonspatial domains and whether spatial pre-training and multi-modal mapping aid in this process.

      Strengths:

      The study innovatively examines a relatively underexplored but interesting area of cognitive science, the potential role of spatial scaffolding in generalizing sequences. The experimental design is clever and covers different modalities (auditory, visual, spatial), utilizing a two-dimensional feature manifold. The findings are backed by strong empirical data, good data analysis, and excellent transparency (including preregistration) adding weight to the proposition that spatial cognition can aid abstract concept generalization.

      Weaknesses:

      The examples used to motivate the study (such as "tree" = oak tree, family tree, taxonomic tree) may not effectively represent the phenomena being studied, possibly confusing linguistic labels with abstract concepts. This potential confusion may also extend to doubts about the real-life applicability of the generalizations observed in the study and raises questions about the nature of the underlying mechanism being proposed.

      We thank the Reviewer for their comments. We agree that we could have explained more clearly how these examples motivate our study. The similarity between “oak tree” and “family tree” is not just the verbal label. Rather, it is the arrangement of the parts (nodes and branches) in a nested hierarchy. Oak trees and family trees share the same relational structure. The reason that invariance is relevant here is that the similarity in relational structure is retained under rigid body transformations such as rotation or translation. For example, an upside-down tree can still be recognised as a tree, just as a family tree can be plotted with the oldest ancestors at either top or bottom. Similarly, in our study, the quadruplets are defined by the relations between stimuli: all quadruplets use the same basic stimuli, but the categories are defined by the relations between successive stimuli. In our task, generalising means recognising that relations between stimuli are the same despite changes in the surface properties (for example in far transfer). We have clarified this in the introduction:

      “For example, the concept of a “tree” implies an entity whose structure is defined by a nested hierarchy, whether this is a physical object whose parts are arranged in space (such as an oak tree in a forest) or a more abstract data structure (such as a family tree or taxonomic tree). [...] Despite great changes in the surface properties of oak trees, family trees and taxonomic trees, humans perceive them as different instances of a more abstract concept defined by the same relational structure.”

      Next, the study does not explore whether scaffolding effects could be observed with other well-learned domains, leaving open the question of whether spatial representations are uniquely effective or simply one instance of a familiar 2D space, again questioning the underlying mechanism.

      We would like to mention that Reviewer #2 had a similar comment. We agree with both Reviewers that our task is non-naturalistic. As is common in experimental research, one must sacrifice the naturalistic elements of the task in exchange for control and the absence of prior knowledge in the participants. We decided to minimise, as far as possible, the participants' prior knowledge to make sure that our task involved learning a completely new task and that the pre-training was really causing the better learning/generalisation. The effects we report are consistent across the experiments, so we feel confident about them, but we agree with the Reviewer that an external validation with more naturalistic stimuli/tasks would be a nice addition to this work. We have included a sentence in the discussion:

      “All the effects observed in our experiments were consistent across near transfer conditions (rotation of patterns within the same feature space), and far transfer conditions (rotation of patterns within a different feature space, where features are drawn from the same modality). This shows the generality of spatial training for conceptual generalisation. We did not test transfer across modalities nor transfer in a more natural setting; we leave this for future studies.”

      Further doubt on the underlying mechanism is cast by the possibility that the observed correlation between mapping task performance and the adoption of a 2D strategy may reflect general cognitive engagement rather than the spatial nature of the task. Similarly, the surprising finding that a significant number of participants benefited from spatial scaffolding without seeing spatial modalities may further raise questions about the interpretation of the scaffolding effect, pointing towards potential alternative interpretations, such as shifts in attention during learning induced by pre-training without changing underlying abstract conceptual representations.

      The Reviewer is concerned that the spatial pre-training could benefit the participants by increasing global cognitive engagement rather than providing a scaffold for learning invariances. It is correct that the participants in the control group in Exp. 2c perform more poorly on average than participants that benefit from the spatial pre-training in Exp. 2a and 2b. The better performance of the participants in Exp. 2a and 2b could be due to either the spatial nature of the pre-training (as we claim) or a difference in general cognitive engagement.

      However, if we look closely at the results of Exp. 3, we can see that the general cognitive engagement hypothesis is not well supported by the data. Indeed, the participants in the control condition (Exp. 3c) perform at a level similar to the other groups during training. Rather, the difference is in the strategy they use, as revealed by the transfer condition. The majority of them use a 1D strategy, contrary to the participants that benefited from a spatial pre-training (Exp. 3a and 3b). We have included a sentence in the results:

      “Further, the results show that participants who did not experience spatial pre-training were still engaged in the task, but were not using the same strategy as the participants who experienced spatial pre-training (1D rather than 2D). Thus, the benefit of the spatial pre-training is not simply to increase the cognitive engagement of the participants. Rather, spatial pre-training provides a scaffold to learn rotation-invariant representation of auditory and visual concepts even when rotation is never explicitly shown during pre-training.”

      Finally, Reviewer #1 had a related concern about a potential alternative explanation that involved a shift in attention. We reproduce our response here: we agree with the Reviewer that the “attention to dimensions” hypothesis is an interesting (and potentially concerning) alternative explanation. However, we believe that the results of our control experiments Exp. 2c and Exp. 3c are not compatible with this alternative explanation.

      Indeed, in Exp. 2c, participants are pre-trained in the visual modality and then tested in the auditory modality. In the multimodal association task, participants have to associate the auditory stimuli and the visual stimuli: on each trial, they hear a sound and then have to click on the corresponding visual stimulus. It is necessary to pay attention to both auditory dimensions and both visual dimensions to perform well in the task. To give an example, the task might involve mapping the fundamental frequency and the amplitude modulation of the auditory stimulus to the colour and the shape of the visual stimulus, respectively. If participants pay attention to only one dimension, this would lead to a maximum of 25% accuracy on average (because they would be at chance on the other dimension, with four possible options). We observed that 30/50 participants reached an accuracy > 50% in the multimodal association task in Exp. 2c. This means that we know for sure that at least 60% of the participants actually paid attention to both dimensions of the stimuli. Nevertheless, there was a clear difference between participants that received a visual pre-training (Exp. 2c) and those who received a spatial pre-training (Exp. 2a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer). In fact, only 3/50 participants were best fit by a 2D model when vision was the pre-training modality compared to 29/50 when space was the pre-training modality. Thus, the benefit of the spatial pre-training cannot be due solely to a shift in attention toward both dimensions.

      This effect was replicated in Exp. 3c. Similarly, 33/48 participants reached an accuracy > 50% in the multimodal association task in Exp. 3c, meaning that we know for sure that at least 68% of the participants actually paid attention to both dimensions of the stimuli. Again, there was a clear difference between participants who received a visual pre-training (Exp. 3c) and those who received a spatial pre-training (Exp. 3a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer).
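
The accuracy thresholds used in this argument can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes four levels per stimulus dimension and two dimensions (16 response options in total), that attended dimensions are always answered correctly, and that unattended dimensions are guessed uniformly. The function and variable names are hypothetical.

```python
# Illustrative assumptions: 4 levels per feature dimension, 2 dimensions,
# hence 4 * 4 = 16 possible responses in the multimodal association task.
LEVELS_PER_DIMENSION = 4
N_DIMENSIONS = 2


def expected_accuracy(dimensions_attended: int) -> float:
    """Expected accuracy if attended dimensions are always correct
    and each unattended dimension is guessed uniformly at random."""
    unattended = N_DIMENSIONS - dimensions_attended
    return (1 / LEVELS_PER_DIMENSION) ** unattended


chance = expected_accuracy(0)           # guessing on both dimensions
one_dimension = expected_accuracy(1)    # correct on one, guessing on the other
both_dimensions = expected_accuracy(2)  # upper bound when both are attended

# Proportion of participants above the 50% accuracy criterion,
# using the counts reported in the response.
share_exp2c = 30 / 50
share_exp3c = 33 / 48

print(f"chance: {chance:.4f}, one dimension: {one_dimension:.2f}")
print(f"Exp. 2c: {share_exp2c:.1%}, Exp. 3c: {share_exp3c:.1%}")
```

Under these assumptions, any accuracy above 25% already exceeds what a single attended dimension can deliver, so the 50% criterion is a conservative threshold.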

      Thus, we believe that the alternative explanation raised by the Reviewer is not supported by our data. We have added a paragraph in the discussion:

      “One alternative explanation of this effect could be that the spatial pre-training encourages participants to attend to both dimensions of the non-spatial stimuli. By contrast, pretraining in the visual or auditory domains (where multiple dimensions of a stimulus may be relevant less often naturally) encourages them to attend to a single dimension. However, data from our control experiments Exp. 2c and Exp. 3c, are incompatible with this explanation. Around ~65% of the participants show a level of performance in the multimodal association task (>50%) which could only be achieved if they were attending to both dimensions (performance attending to a single dimension would yield 25% and chance performance is at 6.25%). This suggests that participants are attending to both dimensions even in the visual and auditory mapping case.”

      Conclusions:

      The authors successfully demonstrate that spatial training can enhance the ability to generalize in nonspatial domains, particularly in recognizing rotated sequences. The results for the most part support their conclusions, showing that spatial representations can act as a scaffold for learning more abstract conceptual invariances. However, the study leaves room for further investigation into whether the observed effects are unique to spatial cognition or could be replicated with other forms of well-established knowledge, as well as further clarifications of the underlying mechanisms.

      Impact:

      The study's findings are likely to have a valuable impact on cognitive science, particularly in understanding how abstract concepts are learned and generalized. The methods and data can be useful for further research, especially in exploring the relationship between spatial cognition and abstract conceptualization. The insights could also be valuable for AI research, particularly in improving models that involve abstract pattern recognition and conceptual generalization.

      In summary, the paper contributes valuable insights into the role of spatial cognition in learning abstract concepts, though it invites further research to explore the boundaries and specifics of this scaffolding effect.

      Reviewer #1 (Recommendations For The Authors):

      Minor issues / typos:

      P6: I think the example of the "signed" mapping here should be "e.g., ABAB maps to one category and BABA maps to another", rather than "ABBA maps to another" (since ABBA would always map to another category, whether the mapping is signed or unsigned).

      Done.

      P11: "Next, we asked whether pre-training and mapping were systematically associated with 2Dness...". I'd recommend changing to: "Next, we asked whether accuracy during pre-training and mapping were systematically associated with 2Dness...", just to clarify what the analyzed variables are.

      Done.

      P13, paragraph 1: "only if the features were themselves are physical spatial locations" either "were" or "are" should be removed.

      Done.

      P13, paragraph 1: should be "neural representations of space form a critical substrate" (not "for").

      Done.

      Reviewer #2 (Recommendations For The Authors):

      The authors use in multiple places in the manuscript the phrases "learn invariances" (Abstract), "formation of invariances" (p. 2, para. 1), etc. It might be just me, but this feels a bit like 'sloppy' wording: we do not learn or form invariances, rather we learn or form representations or transformations by which we can perform tasks that require invariance over particular features or transformation of the input such as the case of object recognition and size- translation- or lighting-invariance. We do not form size invariance, we have representations of objects and/or size transformations allowing the recognition of objects of different sizes. The authors might change this way of referring to the phenomenon.

      We respectfully disagree with this comment. An invariance occurs when neurons make the same response under different stimulation patterns. The objects or features to which a neuron responds are shaped by its inputs. Those inputs are in turn determined by experience-dependent plasticity. This process is often called “representation learning”. We think that our language here is consistent with this status quo view in the field.

      Reviewer #3 (Recommendations For The Authors):

      • I understand that the objective of the present experiment is to study our ability to generalize abstract patterns of sensory data (concepts). In the introduction, the authors present examples like the concept of a "tree" (encompassing a family tree, an oak tree, and a taxonomic tree) and "ring" to illustrate the idea. However, I am sceptical as to whether these examples effectively represent the phenomena being studied. From my perspective, these different instances of "tree" do not seem to relate to the same abstract concept that is translated or rotated but rather appear to share only a linguistic label. For instance, the conceptual substance of a family tree is markedly different from that of an oak tree, lacking significant overlap in meaning or structure. Thus, to me, these examples do not demonstrate invariance to transformations such as rotations.

      To elaborate further, typically, generalization involves recognizing the same object or concept through transformations. In the case of abstract concepts, this would imply a shared abstract representation rather than a mere linguistic category. While I understand the objective of the experiments and acknowledge their potential significance, I find myself wondering about the real-world applicability and relevance of such generalizations in everyday cognitive functioning. This, in turn, casts some doubt on the broader relevance of the study's results. A more fitting example, or an explanation that addresses my concerns about the suitability of the current examples, would be beneficial to further clarify the study's intent and scope.

      Response in the public review.

      • Relatedly, the manuscript could benefit from greater clarity in defining key concepts and elucidating the proposed mechanism behind the observed effects. Is it plausible that the changes observed are primarily due to shifts in attention induced by the spatial pre-training, rather than a change in the process of learning abstract conceptual invariances (i.e., modifications to the abstract representations themselves)? While the authors conclude that spatial pre-training acts as a scaffold for enhancing the learning of conceptual invariances, it raises the question: does this imply participants simply became more focused on spatial relationships during learning, or might this shift in attention represent a distinct strategy, and an alternative explanation? A more precise definition of these concepts and a clearer explanation of the authors' perspective on the mechanism underlying these effects would reduce any ambiguity in this regard.

      Response in the public review.

      • I am wondering whether the effectiveness of spatial representations in generalizing abstract concepts stems from their special nature or simply because they are a familiar 2D space for participants. It is well-established that memory benefits from linking items to familiar locations, a technique used in memory training (method of loci). This raises the question: Are we observing a similar effect here, where spatial dimensions are the only tested familiar 2D spaces, while the other 2 spaces are simply unfamiliar, as also suggested by the lower performance during training (Fig.2)? Would the results be replicable with another well-learned, robustly encoded domain, such as auditory dimensions for professional musicians, or is there something inherently unique about spatial representations that aids in bootstrapping abstract representations?

      On the other side of the same coin, are spatial representations qualitatively different, or simply more efficient because they are learned more quickly and readily? This leads to the consideration that if visual pre-training and visual-to-auditory mapping were continued until a similar proficiency level as in spatial training is achieved, we might observe comparable performance in aiding generalization. Thus, the conclusion that spatial representations are a special scaffold for abstract concepts may not be exclusively due to their inherent spatial nature, but rather to the general characteristic of well-established representations. This hypothesis could be further explored by either identifying alternative 2D representations that are equally well-learned or by extending training in visual or auditory representations before proceeding with the mapping task. At the very least I believe this potential explanation should be explored in the discussion section.

      Response in the public review.

      I had some difficulty in following an important section of the introduction: "... whether participants can learn rotationally invariant concepts in nonspatial domains, i.e., those that are defined by sequences of visual and auditory features (rather than by locations in physical space, defined in Cartesian or polar coordinates) is not known." This was initially puzzling to me as the paragraph preceding it mentions: "There is already good evidence that nonspatial concepts are represented in a translation invariant format." While I now understand that the essential distinction here is between translation and rotation, this was not immediately apparent upon first reading. This crucial distinction, especially in the context of conceptual spaces, was not clearly established before this point in the manuscript. For better clarity, it would be beneficial to explicitly contrast and define translation versus rotation in this particular section and stress that the present study concerns rotations in abstract spaces.

      Done.

      • The multi-modal association is crucial for the study, however to my knowledge, it is not depicted or well explained in the main text or figures (Results section). In my opinion, the details of this task should be explained and illustrated before the details of the associated results are discussed.

      We have included an illustration of a multimodal association trial in Fig. S3B.

      Author response image 2.

      • The observed correlation between the mapping task performance and the adoption of a 2D strategy is logical. However, this correlation might not exclusively indicate the proposed underlying mechanism of spatial scaffolding. Could it also be reflective of more general factors like overall performance, attention levels, or the effort exerted by participants? This alternative explanation suggests that the correlation might arise from broader cognitive engagement rather than specifically from the spatial nature of the task. Addressing this possibility could strengthen the argument for the unique role of spatial representations in learning abstract concepts, or at least this alternative interpretation should be mentioned.

      Response in the public review.

      • To me, the finding that ~30% of participants benefited from the spatial scaffolding effect for example in the auditory condition merely through exposure to the mapping (Fig 4D), without needing to see the quadruplets in the spatial modality, was somewhat surprising. This is particularly noteworthy considering that only ~60% of participants adopted the 2D strategy with exposure to rotated contingencies in Experiment 3 (Fig 3D). How do the authors interpret this outcome? It would be interesting to understand their perspective on why such a significant effect emerged from mere exposure to the mapping task.

      • I appreciate the clarity Fig.1 provides in explaining a challenging experimental setup. Is it possible to provide example trials, including an illustration that shows which rotations produce the trial and an intuitive explanation of which response maps onto the 1D vs 2D strategies respectively, to aid the reader in better understanding this core manipulation?

      • I like that the authors provide transparency by depicting individual subject's data points in their results figures (e.g. Figs. 2 B, F, J). However, with an n=~50 per condition, it becomes difficult to intuit the distribution, especially for conditions with higher variance (e.g., Auditory). The figures might be more easily interpretable with alternative methods of displaying variances, such as violin plots per data point, conventional error shading using 95%CIs, etc.

      • Why are the authors not reporting exact BFs in the results sections at least for the most important contrasts?

      • While I understand why the authors report the frequencies for the best model fits, this may become difficult to interpret in some sections, given the large number of reported values. Alternatives or additional summary statistics supporting inference could be beneficial.

      As the Reviewer states, there are a large number of figures that we can report in this study. We have chosen to keep this number at a minimum to be as clear as possible. To illustrate the distribution of individual data points, we have opted to display only the group's mean and standard error (the standard errors are included, but the substantial number of participants per condition provides precise estimates, resulting in error bars that can be smaller than the mean point). This decision stems from our concern that including additional details could lead to a cluttered representation with unnecessary complexity. Finally, we report what we believe to be the critical BFs for the comprehension of the reader in the main text, and choose a cutoff of 100 when BFs are high (corresponding to the label “decisive” evidence, some BFs are larger than 10<sup>12</sup>). All the exact BFs are in the supplementary for the interested readers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The manuscript considers a mechanistic extension of MacArthur's consumer-resource model to include chasing down food and potential encounters between the chasers (consumers) that lead to less efficient feeding in the form of negative feedback. After developing the model, a deterministic solution and two forms of stochastic solutions are presented, in agreement with each other. Finally, the model is applied to explain observed coexistence and rank-abundance data.

      We thank the reviewer for the accurate summary of our manuscript.

      Strengths:

      The application of the theory to natural rank-abundance curves is impressive. The comparison with the experiments that reject the competitive exclusion principle is promising. It would be fascinating to see if in, e.g. insects, the specific interference dynamics could be observed and quantified and whether they would agree with the model.

      The results are clearly presented; the methods adequately described; the supplement is rich with details.

      There is much scope to build upon this expansion of the theory of consumer-resource models. This work can open up new avenues of research.

      We appreciate the reviewer for the very positive comments. We have followed many of the suggestions raised by the reviewer, and the manuscript is much improved as a result.

      Following the reviewer’s suggestions, we have now used Shannon entropies to quantify the model comparison with experiments that reject the Competitive Exclusion Principle (CEP). Specifically, for each time point of each experimental or model-simulated community, we calculated the Shannon entropies using the formula:

      H(t) = −Σ<sub>i</sub> P<sub>i</sub>(t) log P<sub>i</sub>(t), with the sum running over all consumer species (i = 1,…,S<sub>C</sub>), where P<sub>i</sub>(t) is the probability that a consumer individual belongs to species C<sub>i</sub> at time t. The comparison of the Shannon entropy time series between the experimental data and the SSA results shown in Fig. 2D-E is presented in Appendix-fig. 7C-D. The time averages and standard deviations (δH) of the Shannon entropies for these experimental or SSA model-simulated communities are as follows:

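      [time averages and standard deviations of the Shannon entropies for the experimental and SSA model-simulated communities]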

      Meanwhile, we have calculated the time averages and standard deviations (δC<sub>i</sub>) of the species’ relative/absolute abundances for the experimental or SSA model-simulated communities shown in Fig. 2D-E, which are as follows:

      [time averages and standard deviations of the species' relative/absolute abundances], where the superscript “(R)” denotes relative abundances.

      From the results of Shannon entropies shown in Author response image 1 (which are identical to those of Appendix-fig. 7C-D) and the quantitative comparison of the time average and standard deviation between the model and experiments presented above, it is evident that the model results in Fig. 2D-E exhibit good consistency with the experimental data. They share roughly identical time averages and standard deviations in both Shannon entropies and the species' relative/absolute abundances for most of the comparisons. All these analyses are included in the appendices and mentioned in the main text.

      Author response image 1.

      Shannon Entropies of the experimental data and SSA results in Fig. 2D-E, redrawn from Appendix-fig. 7C-D.
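
The entropy comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the logarithm base (2 here) and the use of the population standard deviation are assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(abundances, base=2.0):
    """Shannon entropy of one community snapshot.

    `abundances` holds the abundance of each consumer species at one time
    point; each is converted to the probability that a randomly drawn
    individual belongs to that species.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("community must contain at least one individual")
    entropy = 0.0
    for count in abundances:
        if count > 0:  # species with zero abundance contribute nothing
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


def entropy_summary(time_series, base=2.0):
    """Time average and standard deviation of the entropy across snapshots."""
    entropies = [shannon_entropy(snapshot, base) for snapshot in time_series]
    return mean(entropies), pstdev(entropies)


# Made-up two-species time series, for illustration only.
example_series = [[10, 10], [12, 8], [9, 11]]
average_entropy, entropy_sd = entropy_summary(example_series)
```

Applying `entropy_summary` separately to the experimental and simulated time series gives the pair of numbers (time average and standard deviation) that the comparison relies on.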

      Weaknesses:

      I am questioning the use of carrying capacity (Eq. 4) instead of using nutrient limitation directly through Monod consumption (e.g. Posfai et al. who the authors cite). I am curious to see how these results hold or are changed when Monod consumption is used.

      We thank the reviewer for raising this question. To explain it more clearly, the equation combining the third equation in Eq. 1 and Eq. 4 of our manuscript is presented below as Eq. R1:

      where x<sub>il</sub> represents the population abundance of the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, and κ<sub>l</sub> stands for the steady-state population abundance of species R<sub>l</sub> (the carrying capacity) in the absence of consumer species. In the case with no consumer species, x<sub>il</sub> = 0 since C<sub>i</sub> = 0 (i = 1,…,S<sub>C</sub>), and thus R<sub>l</sub> = κ<sub>l</sub> at steady state (dR<sub>l</sub>/dt = 0).

      Eq. R1 for the case of abiotic resources is comparable to Eq. (1) in Posfai et al., which we present below as Eq. R2:

      where c<sub>i</sub> represents the concentration of nutrient i, and thus corresponds to our R<sub>l</sub> ; n<sub>σ</sub>(t) is the population of species σ, which corresponds to our C<sub>i</sub> ; s<sub>i</sub> stands for the nutrient supply rate, which corresponds to our ζl ; µi denotes the nutrient loss rate, corresponding to our is the coefficient of the rate of species σ for consuming nutrient i, which corresponds to our in Posfai et al. is the consumption rate of nutrient i by the population of species σ, which corresponds to our x<sub>il</sub>.

      In Posfai et al., is the Monod function: and thus

      In our model, however, since predator interference is not involved in Posfai et al., we need to analyze the form of x<sub>il</sub> presented in the functional form of x<sub>il</sub> ({R<sub>l</sub>},{C<sub>i</sub>}) in the case involving only chasing pairs. Specifically, for the case of abiotic resources, the population dynamics can be described by Eq. 1 combined with Eq. R1:

      where and . For convenience, we consider the case of S<sub>R</sub> \=1 where the Monod form was derived (Monod, J. (1949). Annu. Rev. Microbiol., 3, 371-394.). From , we have

      where , and l =1. If the population abundance of the resource species is much larger than that of all consumer species (i.e., ), then,

      and R<sub>l</sub><sup>(F)</sup> ≈ R<sub>l</sub>. Combined with R5, and noting that C<sub>i</sub> = C<sub>i</sub><sup>(F)</sup> + x<sub>il</sub>, we can solve for x<sub>il</sub>:

      with l = 1 since S<sub>R</sub> = 1. Comparing Eq. R6 with Eq. R3, and considering the symbol correspondence explained in the text above, it is now clear that our model reduces to the Monod consumption form in the case of S<sub>R</sub> = 1, for which the Monod form was originally derived.
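
The reduction argued above can be sketched as a short quasi-steady-state calculation. This is only an illustration under stated assumptions: the chasing-pair kinetics are taken to have an encounter rate a<sub>il</sub> and an escape rate d<sub>il</sub> (symbols assumed here, not shown in this response), the pair abundance is taken to be at steady state, and the resource is taken to be abundant enough that R<sub>l</sub><sup>(F)</sup> ≈ R<sub>l</sub>.

```latex
% Assumed chasing-pair kinetics (a_{il}: encounter rate, d_{il}: escape rate,
% k_{il}: capture rate); the symbols a_{il} and d_{il} are illustrative assumptions.
\frac{\mathrm{d}x_{il}}{\mathrm{d}t}
  = a_{il}\, C_i^{(F)} R_l^{(F)} - \left(d_{il} + k_{il}\right) x_{il}

% Quasi-steady state of the chasing pair (\mathrm{d}x_{il}/\mathrm{d}t = 0):
x_{il} = \frac{C_i^{(F)} R_l^{(F)}}{K_{il}},
\qquad K_{il} \equiv \frac{d_{il} + k_{il}}{a_{il}}

% Abundant resource, R_l^{(F)} \approx R_l, with C_i = C_i^{(F)} + x_{il}:
x_{il} = \frac{\left(C_i - x_{il}\right) R_l}{K_{il}}
\;\Longrightarrow\;
x_{il} = \frac{C_i\, R_l}{R_l + K_{il}}
```

Under these assumptions the pair abundance per consumer, x<sub>il</sub>/C<sub>i</sub>, saturates in R<sub>l</sub> with half-saturation constant K<sub>il</sub>, which is the Monod functional form.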

      Following on the previous comment, I am confused by the fact that the nutrient consumption term in Eq. 1 and how growth is modeled (Eq. 4) are not obviously compatible and would be hard to match directly to experimentally accessible quantities such as yield (nutrient to biomass conversion ratio). Ultimately, there is a conservation of mass ("flux balance"), and therefore the dynamics must obey it. I don't quite see how conservation of mass is imposed in this work.

      We thank the reviewer for raising this question. Indeed, the population dynamics of our model must adhere to flux balance, with the most pertinent equation restated here as Eq. R7:

      Below is the explanation of how Eq. R7, and thus Eqs. 1 and 4 of our manuscript, adhere to the constraint of flux balance. The interactions and fluxes between consumer and resource species occur solely through chasing pairs. At the population level, the scenario of chasing pairs among consumer species C<sub>i</sub> and resource species R<sub>l</sub> is presented in the following expression:

      where the superscripts "(F)" and "(P)" represent the freely wandering individuals and those involved in chasing pairs, respectively, and "(+)" stands for the biomass gained by consumer C<sub>i</sub> from resource R<sub>l</sub>. In our manuscript, we use x<sub>il</sub> to represent the population abundance (or equivalently, the concentration, for a well-mixed system with a given size) of the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, and thus the net flow from resource species R<sub>l</sub> to consumer species C<sub>i</sub> per unit time is k<sub>il</sub>x<sub>il</sub>. Since there is only one R<sub>l</sub> individual within the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, the net effect on the population dynamics of species R<sub>l</sub> is −k<sub>il</sub>x<sub>il</sub>. However, since a consumer individual from species C<sub>i</sub> could be much heavier than a species R<sub>l</sub> individual, and energy is dissipated when nutrients are converted into biomass, we introduce a mass conversion ratio w<sub>il</sub> in our manuscript. For example, if a species C<sub>i</sub> individual is ten times the weight of a species R<sub>l</sub> individual, then without energy dissipation the mass conversion ratio w<sub>il</sub> should be 1/10 (i.e., w<sub>il</sub> = 0.1); if, however, half of the chemical energy is dissipated into heat during the conversion of nutrients into biomass, then w<sub>il</sub> = 0.1 × 0.5 = 0.05. Consequently, the net effect of the flux from resource species R<sub>l</sub> to consumer species C<sub>i</sub> per unit time on the population dynamics of C<sub>i</sub> is w<sub>il</sub>k<sub>il</sub>x<sub>il</sub>, and flux balance is clearly satisfied.

      For the population dynamics of a consumer species C<sub>i</sub>, we need to consider all the biomass influx from different resource species, and thus there is a summation over all resource species, which leads to the term ∑<sub>l</sub> w<sub>il</sub>k<sub>il</sub>x<sub>il</sub> in Eq. R7. Similarly, for the population dynamics of a resource species R<sub>l</sub>, we need to sum all the biomass outflow into different consumer species, resulting in the term −∑<sub>i</sub> k<sub>il</sub>x<sub>il</sub> in Eq. R7.

      Consequently, Eq. R7 and our model satisfy the constraint of flux balance.
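
      The flux-balance bookkeeping described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the values chosen for k<sub>il</sub> and x<sub>il</sub> are hypothetical and are not taken from the manuscript or its code repository. Only the mass conversion example (a tenfold mass difference with half of the energy dissipated) comes from the text.

```python
# Illustrative flux-balance bookkeeping for one resource species R_l
# shared by several consumer species C_i.
# Function names and the k, x values are hypothetical (not from the manuscript).

def mass_conversion_ratio(resource_mass, consumer_mass, retained_fraction=1.0):
    """Return w_il: consumer individuals gained per resource individual consumed."""
    return (resource_mass / consumer_mass) * retained_fraction


def fluxes(k, x, w):
    """Per-unit-time fluxes for one resource species.

    k[i], x[i], w[i] stand for k_il, x_il, w_il of consumer species i.
    Returns (resource_loss, consumer_gains), in numbers of individuals.
    """
    consumer_gains = [w_i * k_i * x_i for w_i, k_i, x_i in zip(w, k, x)]
    resource_loss = sum(k_i * x_i for k_i, x_i in zip(k, x))
    return resource_loss, consumer_gains


# Example from the text: consumer ten times heavier, half of the energy dissipated.
w_example = mass_conversion_ratio(resource_mass=1.0, consumer_mass=10.0,
                                  retained_fraction=0.5)

k = [0.2, 0.1]     # hypothetical consumption rates k_il
x = [30.0, 50.0]   # hypothetical chasing-pair abundances x_il
w = [w_example, w_example]

resource_loss, consumer_gains = fluxes(k, x, w)

# Dividing each consumer gain by w_il recovers the resource individuals consumed,
# so the recovered total must equal the resource loss if the flux is balanced.
recovered = sum(g / w_i for g, w_i in zip(consumer_gains, w))
```

      In this sketch, flux balance corresponds to `recovered` matching `resource_loss`: every resource individual removed is accounted for in some consumer's gain once the conversion ratio is divided out.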

      These models could be better constrained by more data, in principle, thereby potential exists for a more compelling case of the relevance of this interference mechanism to natural systems.

      We thank the reviewer for raising this question. Indeed, our model could benefit from the inclusion of more experimental data. In our manuscript, we primarily set the parameters by estimating their reasonable range. Following the reviewer's suggestions, we have now specified the data we used to set the parameters. For example, in Fig. 2D, we set 𝐷<sub>2</sub> = 0.01 with τ = 0.4 days, resulting in an expected lifespan of Drosophila serrata in our model setting of 𝜏⁄𝐷<sub>2</sub> = 40 days, which roughly agrees with experimental data showing that the average lifespan of D. serrata is 34 days for males and 54 days for females (lines 321-325 in the appendices; reference: Narayan et al. J Evol Biol. 35: 657–663 (2022)). To explain biodiversity and quantitatively illustrate the rank-abundance curves across diverse communities, the competitive differences across consumer species, exemplified by the coefficient of variation of the mortality rates (a key parameter influencing the rank-abundance curve), were estimated from experimental data in the reference article (Patricia Menon et al., Water Research (2003) 37, 4151) using the two-sigma rule (lines 344-347 in the appendices).
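
      The lifespan estimate quoted above can be reproduced with a short check. The numbers are those given in the text; the variable names are only illustrative.

```python
# Expected lifespan implied by the mortality rate quoted for Fig. 2D.
tau_days = 0.4   # time scale tau in days, from the text
D2 = 0.01        # mortality rate D_2, from the text

expected_lifespan_days = tau_days / D2

# Average lifespans of D. serrata quoted in the text (Narayan et al. 2022).
reported_days = {"male": 34.0, "female": 54.0}

# True if the model estimate lies between the two reported averages.
within_reported_range = (
    min(reported_days.values())
    <= expected_lifespan_days
    <= max(reported_days.values())
)
```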

      Still, we admit that many factors other than intraspecific interference, such as temporal variation, spatial heterogeneity, etc., are involved in breaking the limits of CEP in natural systems, and it is still challenging to differentiate each contribution in wild systems. However, for the two classical experiments that break CEP (Francisco Ayala, 1969; Thomas Park, 1954), intraspecific interference could probably be the most relevant mechanism, since factors such as temporal variation, spatial heterogeneity, cross-feeding, and metabolic tradeoffs are not involved in those two experimental systems.

      The underlying frameworks, B-D and MacArthur are not properly exposed in the introduction, and as a result, it is not obvious what is the specific contribution in this work as opposed to existing literature. One needs to dig into the literature a bit for that.

      The specific contribution exists, but it might be more clearly separated and better explained. In the process, the introduction could be expanded a bit to make the paper more accessible, by reviewing key features from the literature that are used in this manuscript.

      We thank the reviewer for these very insightful suggestions. Following these suggestions, we have now added a new paragraph and revised the introduction part of our manuscript (lines 51-67 in the main text) to address the relevant issues. Our paper is much improved as a result.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Kang et al investigates how the consideration of pairwise encounters (consumer-resource chasing, intraspecific consumer pair, and interspecific consumer pair) influences the community assembly results. To explore this, they presented a new model that considers pairwise encounters and intraspecific interference among consumer individuals, which is an extension of the classical Beddington-DeAngelis (BD) phenomenological model, incorporating detailed considerations of pairwise encounters and intraspecific interference among consumer individuals. Later, they connected with several experimental datasets.

      Strengths:

      They found that the negative feedback loop created by the intraspecific interference allows a diverse range of consumer species to coexist with only one or a few types of resources. Additionally, they showed that some patterns of their model agree with experimental data, including time-series trajectories of two small in-lab community experiments and the rank-abundance curves from several natural communities. The presented results here are interesting and present another way to explain how the community overcomes the competitive exclusion principle.

      We appreciate the reviewer for the positive comments and the accurate summary of our manuscript.

      Weaknesses:

      The authors only explore the case with interspecific interference or intraspecific interference exists. I believe they need to systematically investigate the case when both interspecific and intraspecific interference exists. In addition, the text description, figures, and mathematical notations have to be improved to enhance the article's readability. I believe this manuscript can be improved by addressing my comments, which I describe in more detail below.

      We thank the reviewer for these valuable suggestions. We have followed many of the suggestions raised by the reviewer, and the manuscript is much improved as a result.

      (1) In nature, it is really hard for me to believe that only interspecific interference or intraspecific interference exists. I think a hybrid between interspecific interference and intraspecific interference is very likely. What would happen if both the interspecific and intraspecific interference existed at the same time but with different encounter rates? Maybe the authors can systematically explore the hybrid between the two mechanisms by changing their encounter rates. I would appreciate it if the authors could explore this route.

      We thank the reviewer for raising this question. Indeed, interspecific interference and intraspecific interference simultaneously exist in real cases. To differentiate the separate contributions of inter- and intra-specific interference to biodiversity, we considered different scenarios involving inter- or intra-specific interference. In fact, the previous version of our manuscript also considered the scenario involving both inter- and intra-specific interference for the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1, where two consumer species compete for one resource species (Appendix-fig. 5, and lines 147-148, 162-163 in the main text of the old version, or lines 160-161, 175-177 in the new version).

      Following the reviewer’s suggestions, we have now systematically investigated the cases of S<sub>C</sub> = 6, S<sub>R</sub> = 1, and S<sub>C</sub> = 20, S<sub>R</sub> = 1, where six or twenty consumer species compete for one resource species in scenarios involving chasing pairs and both inter- and intra-specific interference, using both ordinary differential equations (ODEs) and the stochastic simulation algorithm (SSA). These newly added ODE and SSA results are shown in Appendix-fig. 5 F-H, and we have added a new paragraph to describe these results in our manuscript (lines 212-215 in the main text). Consistent with our findings in the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1, the species coexistence behavior in both cases (S<sub>C</sub> = 6, S<sub>R</sub> = 1 and S<sub>C</sub> = 20, S<sub>R</sub> = 1) is very similar to that without interspecific interference: all consumer species coexist with one type of resource at constant population densities in the ODE studies, and the SSA results fluctuate around the population dynamics of the ODEs.

      As for the encounter rates of interspecific and intraspecific interference, in a well-mixed system these encounter rates can be derived from the mobility rates of the consumer species using the mean-field method. For a system with a size of L<sup>2</sup>, the interspecific encounter rate between consumer species C<sub>i</sub> and C<sub>j</sub> (i ≠ j) is given by a mean-field expression in terms of r<sup>(I)</sup>, v<sub>C<sub>i</sub></sub>, and v<sub>C<sub>j</sub></sub> (please refer to lines 100-102, 293-317 in the main text, and see also Appendix-fig. 1), where r<sup>(I)</sup> is the upper distance for interference, while v<sub>C<sub>i</sub></sub> and v<sub>C<sub>j</sub></sub> represent the mobility rates of species C<sub>i</sub> and C<sub>j</sub>, respectively. Meanwhile, the intraspecific encounter rates within species C<sub>i</sub> and species C<sub>j</sub> are denoted a’<sub>ii</sub> and a’<sub>jj</sub>, respectively.

      Thus, once the intraspecific encounter rates a’<sub>ii</sub> and a’<sub>jj</sub> are given, the interspecific encounter rate between species C<sub>i</sub> and C<sub>j</sub> is determined. Consequently, we could not tune the encounter rates of interspecific and intraspecific interference at will in our study, especially since, for clarity, we have used the mortality rate as the only parameter that varies among the consumer species throughout this study. Instead, we have systematically analyzed the influence of varying the separation rate and escape rate on species coexistence in the case of two consumers competing for a single type of resource (see Appendix-fig. 5A).

      (2) In the first two paragraphs of the introduction, the authors describe the competitive exclusion principle (CEP) and past attempts to overcome the CEP. Moving on from the first two paragraphs to the third paragraph, I think there is a gap that needs to be filled to make the transition smoother and help readers understand the motivations. More specifically, I think the authors need to add one more paragraph dedicated to explaining why predator interference is important, how considering the mechanism of predator interference may help overcome the CEP, and whether predator interference has been investigated or under-investigated in the past. Then building upon the more detailed introduction and movement of predator interference, the authors may briefly introduce the classical B-D phenomenological model and what are the conventional results derived from the classical B-D model as well as how they intend to extend the B-D model to consider the pairwise encounters.

      We thank the reviewer for these very insightful suggestions. Following these suggestions, we have added a new paragraph and revised the introduction part of our paper (lines 51-67 in the main text). Our manuscript is significantly improved as a result.

      (3) The notations for the species abundances are not very informative. I believe some improvements can be made to make them more meaningful. For example, I think using Greek letters for consumers and English letters for resources might improve readability. Some sub-scripts are not necessary. For instance, R^(l)_0 can be simplified to g_l to denote the intrinsic growth rate of resource l. Similarly, K^(l)_0 can be simplified to K_l. Another example is R^(l)_a, which can be simplified to s_l to denote the supply rate. In addition, right now, it is hard to find all definitions across the text. I would suggest adding a separate illustrative box with all mathematical equations and explanations of symbols.

      We thank the reviewer for these very useful suggestions. We have now followed many of the suggestions to improve the readability of our manuscript. Given that we have used many English letters for consumers and there are already many symbols of English and Greek letters for different variables and parameters in the appendices, we have opted to use Greek letters for parameters specific to resource species and English letters for those specific to consumer species. Additionally, we have now added Appendix-tables 1-2 in the appendices (pages 16-17 in the appendices) to illustrate the symbols used throughout our manuscript.

      (4) What is the f_i(R^(F)) on line 131? Does it refer to the growth rate of C_i? I noticed that f_i(R^(F)) is defined in the supplementary information. But please ensure that readers can understand it even without reading the supplementary information. Otherwise, please directly refer to the supplementary information when f_i(R^(F)) occurs for the first time. Similarly, I don't think the readers can understand \Omega^\prime_i and G^\prime_i on lines 135-136.

      We thank the reviewer for raising these questions. We apologize for not illustrating those symbols and functions clearly enough in our previous version of the manuscript. f<sub>i</sub>(R<sup>(F)</sup>) is a function of the variable R<sup>(F)</sup> with the index i, which is defined as and for i=2. Following the reviewer’s suggestions, we have now added clear definitions for symbols and functions and resolved these issues. The definitions of \Omega_i, \Omega^\prime_i, G, and G^\prime are overly complex, and hence we directly refer to the Appendices when they occur for the first time in the main text.

      Reviewer #3 (Public Review):

      Summary:

      A central question in ecology is: Why are there so many species? This question gained heightened interest after the development of influential models in theoretical ecology in the 1960s, demonstrating that under certain conditions, two consumer species cannot coexist on the same resource. Since then, several mechanisms have been shown to be capable of breaking the competitive exclusion principle (although, we still lack a general understanding of the relative importance of the various mechanisms in promoting biodiversity).

      One mechanism that allows for breaking the competitive exclusion principle is predator interference. The Beddington-DeAngelis is a simple model that accounts for predator interference in the functional response of a predator. The B-D model is based on the idea that when two predators encounter one another, they waste some time engaging with one another which could otherwise be used to search for resources. While the model has been influential in theoretical ecology, it has also been criticized at times for several unusual assumptions, most critically, that predators interfere with each other regardless of whether they are already engaged in another interaction. However, there has been considerable work since then which has sought either to find sets of assumptions that lead to the B-D equation or to derive alternative equations from a more realistic set of assumptions (Ruxton et al. 1992; Cosner et al. 1999; Broom et al. 2010; Geritz and Gyllenberg 2012). This paper represents another attempt to more rigorously derive a model of predator interference by borrowing concepts from chemical reaction kinetics (the approach is similar to previous work: Ruxton et al. 1992). The main point of difference is that the model in the current manuscript allows for 'chasing pairs', where a predator and prey engage with one another to the exclusion of other interactions, a situation Ruxton et al. (1992) do not consider. While the resulting functional response is quite complex, the authors show that under certain conditions, one can get an analytical expression for the functional response of a predator as a function of predator and resource densities. They then go on to show that including intraspecific interference allows for the coexistence of multiple species on one or a few resources, and demonstrate that this result is robust to demographic stochasticity.

      We thank the reviewer for carefully reading our manuscript and for the positive comments on the rigorously derived model of predator interference presented in our paper. We also appreciate the reviewer for providing a thorough introduction to the research background of our study, especially the studies related to the Beddington-DeAngelis model. We apologize for our oversight in not fully appreciating the related study by Ruxton et al. (1992) at the time of our first submission. Indeed, as suggested by the reviewer, Ruxton et al. (1992) is relevant to our study in that we both borrowed concepts from chemical reaction kinetics. Now, we have reworked the introduction and discussion sections of our manuscript, cited, and acknowledged the contributions of related works, including Ruxton et al. (1992).

      Strengths:

      I appreciate the effort to rigorously derive interaction rates from models of individual behaviors. As currently applied, functional responses (FRs) are estimated by fitting equations to feeding rate data across a range of prey or predator densities. In practice, such experiments are only possible for a limited set of species. This is problematic because whether a particular FR allows stability or coexistence depends on not just its functional form, but also its parameter values. The promise of the approach taken here is that one might be able to derive the functional response parameters of a particular predator species from species traits or more readily measurable behavioral data.

      We appreciate the reviewer's positive comments regarding the rigorous derivation of our model. Indeed, all parameters of our model can be derived from measurable behavioral data for a specific set of predator species.

      Weaknesses:

      The main weakness of this paper is that it devotes the vast majority of its length to demonstrating results that are already widely known in ecology. We have known for some time that predator interference can relax the CEP (e.g., Cantrell, R. S., Cosner, C., & Ruan, S. 2004).

      While the model presented in this paper differs from the functional form of the B-D in some cases, it would be difficult to formulate a model that includes intraspecific interference (that increases with predator density) that does not allow for coexistence under some parameter range. Thus, I find it strange that most of the main text of the paper deals with demonstrating that predator interference allows for coexistence, given that this result is already well known. A more useful contribution would focus on the extent to which the dynamics of this model differ from those of the B-D model.

      We appreciate the reviewer for raising this question and apologize for not sufficiently clarifying the contribution of our manuscript in the context of existing knowledge upon our initial submission. We have now significantly revised the introduction part of our manuscript (lines 51-67 in the main text) to make this clearer. Indeed, with the application of the Beddington-DeAngelis (B-D) model, several studies (e.g., Cantrell, R. S., Cosner, C., & Ruan, S. 2004) have already shown that intraspecific interference promotes species coexistence, and it is certain that the mechanism of intraspecific interference could lead to species coexistence if modeled correctly. However, while we acknowledge that the B-D model is a brilliant phenomenological model of intraspecific interference, for the specific research topic of our manuscript (breaking the CEP and explaining the paradox of the plankton), the validity of applying the B-D model to obtain compelling results is highly questionable.

      Specifically, the functional response in the B-D model of intraspecific interference can be formally derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)). Since we have demonstrated that the scenario involving only chasing pairs is under the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), and given the identical functional response mentioned above, the validity of studies relying on the B-D model to break CEP or explain the paradox of the plankton is thus highly questionable.

      Consequently, one of the major objectives of our manuscript is to resolve whether the mechanism of intraspecific interference can truly break CEP and explain the paradox of the plankton in a rigorous manner. By modeling intraspecific predator interference from a mechanistic perspective and applying rigorous mathematical analysis and numerical simulations, our work resolves these issues and demonstrates that intraspecific interference enables a wide range of consumer species to coexist with only one or a handful of resource species. This naturally breaks CEP, explains the paradox of plankton, and quantitatively illustrates a broad spectrum of experimental results.

      For intuitive understanding, we introduced a functional response in our model (presented as Eq. 5 in the main text), which indeed involves approximations. However, to rigorously break the CEP or explain the paradox of plankton, all simulation results in our study were directly derived from equations 1 to 4 (main text), without relying on the approximate functional response presented in Eq. 5.

      The formulation of chasing-pair engagements assumes that prey being chased by a predator are unavailable to other predators. For one, this seems inconsistent with the ecology of most predator-prey systems. In the system in which I work (coral reef fishes), prey under attack by one predator are much more likely to be attacked by other predators (whether it be a predator of the same species or otherwise). I find it challenging to think of a mechanism that would give rise to chased prey being unavailable to other predators. The authors also critique the B-D model: "However, the functional response of the B-D model involving intraspecific interference can be formally derived from the scenario involving only chasing pairs without predator interference (Wang and Liu, 2020; Huisman and De Boer, 1997) (see Eqs. S8 and S24). Therefore, the validity of applying the B-D model to break the CEP is questionable.".

      We appreciate the reviewer for raising this question. We fully agree with the reviewer that in many predator-prey systems (e.g., coral reef fishes as mentioned by the reviewer, wolves, and even microbial species such as Myxococcus xanthus; related references: Berleman et al., FEMS Microbiol. Rev. 33, 942-957 (2009)), prey under attack by one predator can be targeted by another predator (which we term a chasing triplet) or even by additional predator individuals (which we define as higher-order terms). However, we have already demonstrated in a previous study (Xin Wang, Yang-Yu Liu, iScience 23, 101009 (2020)) from a mechanistic perspective that a scenario involving chasing triplets or higher-order terms can naturally break the CEP, whereas our manuscript focuses on whether pairwise encounters between individuals can break the CEP and explain the paradox of plankton. We therefore deliberately excluded confounding factors that are already known to promote biodiversity, just as we excluded prevalent factors such as cross-feeding and temporal variations in our model.

      However, the way "chasing pairs" are formulated does result in predator interference because a predator attacking prey interferes with the ability of other predators to encounter the prey. I don't follow the author's logic that B-D isn't a valid explanation for coexistence because a model incorporating chasing pairs engagements results in the same functional form as B-D.

      We thank the reviewer for raising this question, and we apologize for not making this point clear enough at the time of our initial submission. We have now revised the related part of our manuscript (lines 56-62 in the main text) to make this clearer.

      In our definition, predator interference means the pairwise encounter between consumer individuals, while a chasing pair is formed by a pairwise encounter between a consumer individual and a resource individual. Thus, in these definitions, a scenario involving only chasing pairs does not involve pairwise encounters between consumer individuals (which is our definition of predator interference).

      We acknowledge that there can be different definitions of predator interference, and the reviewer's interpretation is based on a definition of predator interference that incorporates indirect interference without pairwise encounters between consumer individuals. We do not wish to argue about the appropriateness of definitions. However, we have proven that scenarios involving only chasing pairs are under the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), while the functional response of the B-D model can be derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)). The validity of applying the B-D model to break CEP is thus highly questionable.

      More broadly, the specific functional form used to model predator interference is of secondary importance to the general insight that intraspecific interference (however it is modeled) can allow for coexistence. Mechanisms of predator interference are complex and vary substantially across species. Thus it is unlikely that any one specific functional form is generally applicable.

      We thank the reviewer for raising this issue. We agree that the general insight that intraspecific predator interference can facilitate species coexistence is of great importance. We also acknowledge that any functional form of a functional response is unlikely to be universally applicable, as explicit functional responses inevitably involve approximations. However, we must reemphasize the importance of verifying whether intraspecific predator interference can truly break CEP and explain the paradox of plankton, which is one of the primary objectives of our study. As mentioned above, the B-D model can be derived from the scenario involving only chasing pairs (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), while we have demonstrated that scenarios involving only chasing pairs are subject to the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)). The validity of applying the B-D model to break CEP is therefore highly questionable.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I do not see any code or data sharing. They should exist in a prominent place. The authors should make their simulations and the analysis scripts freely available to download, e.g. by GitHub. This is always true but especially so in a journal like eLife.

      We appreciate the reviewer for these recommendations. We apologize for our oversight regarding the unsuccessful upload of the data in our initial submission, as the data size was considerable and we neglected to double-check for this issue. Following the reviewer’s recommendation, we have now uploaded the code and dataset to GitHub (accessible at https://github.com/SchordK/Intraspecific-predator-interference-promotesbiodiversity-in-ecosystems), where they are freely available for download.

      The introduction section should include more background, including about BD but also about consumer-resource models. Part of the results section could be moved/edited to the introduction. You should try that the results section should contain only "new" stuff whereas the "old" stuff should go in the introduction.

      We thank the reviewer for these recommendations. Following these suggestions, we have now reorganized our manuscript by adding a new paragraph to the introduction section (lines 51-62 in the main text) and revising related content in both the introduction and results sections (lines 63-67, 81-83 in the main text).

      I found myself getting a little bogged down in the general/formal description of the model before you go to specific cases. I found the most interesting part of the paper to be its second half. This is a dangerous strategy, a casual reader may miss out on the most interesting part of the paper. It's your paper and do what you think is best, but my opinion is that you could improve the presentation of the model and background to get to the specific contribution and specific use case quickly and easily, then immediately to the data. You can leave the more general formulation and the details to later in the paper or even the appendix. Ultimately, you have a simple idea and a beautiful application on interesting data-that is your strength I think, and so, I would focus on that.

      We appreciate the reviewer for the positive comments and valuable suggestions. Following these recommendations, we have revised the presentation of the background information to clarify the contribution of our manuscript, and we have refined our model presentation to enhance clarity. Meanwhile, as we need to address the concerns raised by other reviewers, we continue to maintain systematic investigations for scenarios involving different forms of pairwise encounters in the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1 before applying our model to the experimental data.

      Reviewer #2 (Recommendations For The Authors):

      (1) I believe the surfaces in Figs. 1F-H corresponds to the zero-growth isoclines. The authors should directly point it out in the figure captions and text descriptions.

      We thank the reviewer for this suggestion, and we have followed it to address the issue.

      (2) After showing equations 1 or 2, I believe it will help readers understand the mechanism of equations by adding text such as "(see Fig. 1B)" to the sentences following the equations.

      We appreciate the reviewer's suggestion, and we have implemented it to address the issue.

      (3) Lines 12, 129 143 & 188: "at steady state" -> "at a steady state"

      (4) Line 138: "is doom to extinct" -> "is doomed to extinct"

      (5) Line 170: "intraspecific interference promotes species coexistence along with stochasticity" -> "intraspecific interference still robustly promotes species coexistence when stochasticity is considered"

      (6) Line 190: "The long-term coexistence behavior are exemplified" -> "The long-term coexistence behavior is exemplified"

      (7) Line 227: "the coefficient of variation was taken round 0.3" -> "the coefficient of variation was taken around 0.3"?

      (8) Line 235: "tend to extinct" -> "tend to be extinct"

      We thank the reviewer for all these suggestions, and we have implemented each of them to revise our manuscript.

      Reviewer #3 (Recommendations For The Authors):

      I think this would be a much more useful paper if the authors focused on how the behavior of this model differs from existing models rather than showing that the new formation also generates the same dynamics as the existing theory.

      We thank the reviewer for this suggestion, and we apologize for not explaining the limitations of the B-D model and the related studies on the topic of CEP clearly enough at the time of our initial submission. As we have explained in the responses above, we have now revised the introduction part of our manuscript (lines 51-67 in the main text) to make the following point clear. The functional response in the B-D model can be derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals, while we have demonstrated that a scenario involving only chasing pairs is under the constraint of CEP. The validity of studies relying on the B-D model to break CEP or explain the paradox of the plankton is thus highly questionable. Consequently, one of the major objectives of our manuscript is to resolve whether the mechanism of intraspecific interference can truly break CEP and explain the paradox of the plankton in a rigorous manner. By modeling from a mechanistic perspective, we resolve the above issues and quantitatively illustrate a broad spectrum of experimental results, including two classical experiments that violate CEP and the rank-abundance curves across diverse ecological communities.

      Things that would be of interest:

      What are the conditions for coexistence in this model? Presumably, it depends heavily on the equilibrium abundances of the consumers and resources as well as the engagement times/rates.

      We thank the reviewer for raising this question. We have shown that there is a wide range of parameter space for species coexistence in our model. Specifically, for the case involving two consumer species and one resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1), we have conducted a systematic study on the parameter region for promoting species coexistence. For clarity, we set the mortality rate 𝐷<sub>i</sub> (i = 1, 2) as the only parameter that varies with the consumer species, and the order of magnitude of all model parameters was estimated from behavioral data. The results for scenarios involving intraspecific predator interference are shown in Appendix-figs. 4B-D, 5A, 6C-D, and we redraw some of them here as Fig. R2, including both ODE and SSA results, wherein Δ = (𝐷<sub>1</sub>-𝐷<sub>2</sub>)/𝐷<sub>2</sub> represents the competitive difference between the two consumer species. For example, Δ = 1 means that species C<sub>2</sub> is twice as competitive as species C<sub>1</sub>. In Fig. R2 (see also Appendix-figs. 4B-D, 5A, 6C-D), we see that the two consumer species can coexist with a large competitive difference in both ODE and SSA simulation studies.
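
      As a minimal sketch of the definition of Δ given above, the helper below computes the competitive difference from two mortality rates. The function name and the numerical values are illustrative assumptions, not parameters taken from the manuscript.

```python
def competitive_difference(d1: float, d2: float) -> float:
    """Return Delta = (D1 - D2) / D2 for two consumer mortality rates."""
    if d2 <= 0:
        raise ValueError("d2 must be positive")
    return (d1 - d2) / d2


# Illustrative values only: C1 has twice the mortality rate of C2,
# which corresponds to Delta = 1 in the definition above.
delta = competitive_difference(d1=0.02, d2=0.01)
print(delta)
```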

      Author response image 2.

      The parameter region for two consumer species coexisting with one type of abiotic resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). (A) The region below the blue surface and above the red surface represents stable coexistence of the three species at constant population densities. (B) The blue region represents stable coexistence at a steady state for the three species. (C) The color indicates (refer to the color bar) the coexisting fraction for long-term coexistence of the three species. Figure redrawn from Appendix-figs. 4B, 6C-D.

      For systems shown in Fig. 3A-D, where the number of consumer species is much larger than that of the resource species, we set each consumer species with unique competitiveness through a distinctive 𝐷<sub>i</sub> (i = 1,…, S<sub>C</sub>). In Fig. 3A-D (see also Appendix-fig. 10), we see that hundreds of consumer species may coexist with one or three types of resources when the coefficient of variation (CV) of the consumer species’ competitiveness was taken around 0.3, which indicates a large parameter region for promoting species coexistence.

      Is there existing data to estimate the parameters in the model directly from behavioral data? Do these parameter ranges support the hypothesis that predator interference is significant enough to allow for the coexistence of natural predator populations?

      We thank the reviewer for raising this question. Indeed, the parameters in our model were primarily determined by estimating their reasonable range from behavioral data. Following the reviewer's suggestions, we have now specified the data we used to set the parameters. For instance, in Fig. 2D, we set 𝐷<sub>2</sub> = 0.01 with τ = 0.4 day, resulting in an expected lifespan of Drosophila serrata in our model setting of τ/𝐷<sub>2</sub> = 40 days, which roughly agrees with experimental behavioral data showing that the average lifespan of D. serrata is 34 days for males and 54 days for females (lines 321-325 in the appendices; reference: Narayan et al. J Evol Biol. 35: 657–663 (2022)). To account for competitive differences, we set the mortality rate as the only parameter that varies among the consumer species. As specified in the Appendices, the CV of the mortality rate is the only parameter that was used to fit the experiments within the range of 0.15-0.43. This parameter range (i.e., 0.15-0.43) was directly estimated from experimental data in the reference article (Patricia Menon et al., Water Research 37, 4151 (2003)) using the two-sigma rule (lines 344-347 in the appendices).

      Given the high consistency between the model results and experiments shown in Figs. 2D-E and 3C-D, where all the key model parameters were estimated from experimental data in references, and considering that the rank-abundance curves shown in Fig. 3C-D include a wide range of ecological communities, there is no doubt that predator interference is significant enough to allow for the coexistence of natural predator populations within the parameter ranges estimated from experimental references.
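
      The lifespan check quoted above is simple arithmetic, and a short sketch makes it explicit. The function name is a hypothetical helper; the inputs are the values stated in the response (τ = 0.4 day, 𝐷<sub>2</sub> = 0.01) and the reported male and female lifespans of D. serrata.

```python
def expected_lifespan_days(tau_days: float, mortality_rate: float) -> float:
    """Expected lifespan in days, computed as tau / D."""
    if mortality_rate <= 0:
        raise ValueError("mortality_rate must be positive")
    return tau_days / mortality_rate


# Values quoted in the response for Fig. 2D.
model_lifespan = expected_lifespan_days(tau_days=0.4, mortality_rate=0.01)

# Reported average lifespans (days) for D. serrata males and females.
reported_male, reported_female = 34, 54
within_reported_range = reported_male <= model_lifespan <= reported_female
print(model_lifespan, within_reported_range)
```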

      Bifurcation analyses for the novel parameters of this model. Does the fact that prey can escape lead to qualitatively different model behaviors?

      Author response image 3.

      Bifurcation analyses for the separate rate d’<sub>i</sub> and escape rate d<sub>i</sub> (i = 1, 2) of our model in the case of two consumer species competing for one abiotic resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). (A) A 3D representation: the region above the blue surface signifies competitive exclusion where species C<sub>1</sub> goes extinct, while the region below the blue surface and above the red surface represents stable coexistence of the three species at constant population densities. (B) A 2D representation: the blue region represents stable coexistence at a steady state for the three species. Figure redrawn from Appendix-fig. 4C-D.

      We thank the reviewer for this suggestion. Following this suggestion, we have conducted bifurcation analyses for the separate rate d’<sub>i</sub> and escape rate d<sub>i</sub> of our model in the case where two consumer species compete for one resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). Both 2D and 3D representations of these results have been included in Appendix-fig. 4, and we redraw them here as Fig. R3. In Fig. R3, we set the mortality rate 𝐷<sub>i</sub> (i = 1, 2) as the only parameter that varies between the consumer species, and thus Δ = (𝐷<sub>1</sub>-𝐷<sub>2</sub>)/𝐷<sub>2</sub> represents the competitive difference between the two species.

      As shown in Fig. R3A-B, the smaller the escape rate d<sub>i</sub>, the larger the competitive difference Δ tolerated for species coexistence at steady state. A similar trend is observed for the separate rate d’<sub>i</sub>. However, there is an abrupt change in both the 2D and 3D representations where d’<sub>i</sub> = 0, since if d’<sub>i</sub> = 0, all consumer individuals would be trapped in interference pairs, and then no consumer species could exist. In contrast, there is no abrupt change in either representation where d<sub>i</sub> = 0, since even if d<sub>i</sub> = 0, the consumer individuals could still leave the chasing pair through the capture process.

      Figures: I found the 3D plots especially Appendix Figure 2 very difficult to interpret. I think 2D plots with multiple lines to represent predator densities would be more clear.

      We thank the reviewer for this suggestion. Following this suggestion, we have added a 2D diagram to Appendix-fig. 2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment 

      The work introduces a valuable new method for depleting the ribosomal RNA from bacterial single-cell RNA sequencing libraries and shows that this method is applicable to studying the heterogeneity in microbial biofilms. The evidence for a small subpopulation of cells at the bottom of the biofilm which upregulates PdeI expression is solid. However, more investigation into the unresolved functional relationship between PdeI and c-di-GMP levels with the help of other genes co-expressed in the same cluster would have made the conclusions more significant. 

      Many thanks for eLife’s assessment of our manuscript and the constructive feedback. We are encouraged by the recognition of our bacterial single-cell RNA-seq methodology as valuable and its efficacy in studying bacterial population heterogeneity. We appreciate the suggestion for additional investigation into the functional relationship between PdeI and c-di-GMP levels. We concur that such an exploration could substantially enhance the impact of our conclusions. To address this, we have implemented the following revisions: We have expanded our data analysis to identify and characterize genes co-expressed with PdeI within the same cellular cluster (Fig. 3F, G, Response Fig. 10); We conducted additional experiments to validate the functional relationships between PdeI and c-di-GMP, followed by detailed phenotypic analyses (Response Fig. 9B). Our analysis reveals that while other marker genes in this cluster are co-expressed, they do not significantly impact biofilm formation or directly relate to c-di-GMP or PdeI. We believe these revisions have substantially enhanced the comprehensiveness and context of our manuscript, thereby reinforcing the significance of our discoveries related to microbial biofilms. The expanded investigation provides a more thorough understanding of the PdeI-associated subpopulation and its role in biofilm formation, addressing the concerns raised in the initial assessment.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      In this manuscript, Yan and colleagues introduce a modification to the previously published PETRI-seq bacterial single-cell protocol to include a ribosomal depletion step based on a DNA probe set that selectively hybridizes with ribosome-derived (rRNA) cDNA fragments. They show that their modification of the PETRI-seq protocol increases the fraction of informative non-rRNA reads from ~4-10% to 54-92%. The authors apply their protocol to investigating heterogeneity in a biofilm model of E. coli, and convincingly show how their technology can detect minority subpopulations within a complex community. 

      Strengths: 

      The method the authors propose is a straightforward and inexpensive modification of an established split-pool single-cell RNA-seq protocol that greatly increases its utility, and should be of interest to a wide community working in the field of bacterial single-cell RNA-seq. 

      Weaknesses: 

      The manuscript is written in a very compressed style and many technical details of the evaluations conducted are unclear and processed data has not been made available for evaluation, limiting the ability of the reader to independently judge the merits of the method. 

      Thank you for your thoughtful and constructive review of our manuscript. We appreciate your recognition of the strengths of our work and the potential impact of our modified PETRI-seq protocol on the field of bacterial single-cell RNA-seq. We are grateful for the opportunity to address your concerns and improve the clarity and accessibility of our manuscript.

      We acknowledge your feedback regarding the compressed writing style and lack of technical details, which are constrained by the requirements of the Short Report format in eLife. We have addressed these issues in our revised manuscript as follows:

      (1) Expanded methodology section: We have provided a more comprehensive description of our experimental procedures, including detailed protocols for the ribosomal depletion step (lines 435-453) and data analysis pipeline (lines 471-528). This will enable readers to better understand and potentially replicate our methods.

      (2) Clarification of technical evaluations: We have elaborated on the specifics of our evaluations, including the criteria used for assessing the efficiency of ribosomal depletion (lines 99-120), and the methods employed for identifying and characterizing subpopulations (lines 155-159, 161-163 and 163-167).

      (3) Data availability: We apologize for the oversight in not making our processed data readily available. We have deposited all relevant datasets, including raw and source data, in appropriate public repositories (GEO: GSE260458) and provide clear instructions for accessing this data in the revised manuscript.

      (4) Supplementary information: To maintain the concise nature of the main text while providing necessary details, we have included additional supplementary information. This will cover extended methodology (lines 311-318, 321-323, 327-340, 450-453, 533, and 578-589), detailed statistical analyses (lines 492-493, 499-501 and 509-528), and comprehensive data tables to support our findings.

      We believe these changes significantly improved the clarity and reproducibility of our work, allowing readers to better evaluate the merits of our method.

      Reviewer #2 (Public Review): 

      Summary: 

      This work introduces a new method of depleting the ribosomal reads from the single-cell RNA sequencing library prepared with one of the prokaryotic scRNA-seq techniques, PETRI-seq. The advance is very useful since it allows broader access to the technology by lowering the cost of sequencing. It also allows more transcript recovery with fewer sequencing reads. The authors demonstrate the utility and performance of the method for three different model species and find a subpopulation of cells in the E.coli biofilm that express a protein, PdeI, which causes elevated c-di-GMP levels. These cells were shown to be in a state that promotes persister formation in response to ampicillin treatment. 

      Strengths: 

      The introduced rRNA depletion method is highly efficient, with the depletion for E.coli resulting in over 90% of reads containing mRNA. The method is ready to use with existing PETRI-seq libraries which is a large advantage, given that no other rRNA depletion methods were published for split-pool bacterial scRNA-seq methods. Therefore, the value of the method for the field is high. There is also evidence that a small number of cells at the bottom of a static biofilm express PdeI which is causing the elevated c-di-GMP levels that are associated with persister formation. Given that PdeI is a phosphodiesterase, which is supposed to promote hydrolysis of c-di-GMP, this finding is unexpected. 

      Weaknesses: 

      With the descriptions and writing of the manuscript, it is hard to place the findings about the PdeI into existing context (i.e. it is well known that c-di-GMP is involved in biofilm development and is heterogeneously distributed in several species' biofilms; it is also known that E.coli diesterases regulate this second messenger, i.e. https://journals.asm.org/doi/full/10.1128/jb.00604-15). 

      There is also no explanation for the apparently contradictory upregulation of c-di-GMP in cells expressing higher PdeI levels. Perhaps the examination of the rest of the genes in cluster 2 of the biofilm sample could be useful to explain the observed association. 

      Thank you for your thoughtful and constructive review of our manuscript. We are pleased that the reviewer recognizes the value and efficiency of our rRNA depletion method for PETRI-seq, as well as its potential impact on the field. We would like to address the points raised by the reviewer and provide additional context and clarification regarding the function of PdeI in c-di-GMP regulation.

      We acknowledge that c-di-GMP’s role in biofilm development and its heterogeneous distribution in bacterial biofilms are well studied. We appreciate the reviewer's observation regarding the seemingly contradictory relationship between increased PdeI expression and elevated c-di-GMP levels. This is indeed an intriguing finding that warrants further explanation.

      PdeI is predicted to function as a phosphodiesterase involved in c-di-GMP degradation, based on sequence analysis demonstrating the presence of an intact EAL domain, which is known for this function. However, it is important to note that PdeI also harbors a divergent GGDEF domain, typically associated with c-di-GMP synthesis. This dual-domain structure indicates that PdeI may play complex regulatory roles. Previous studies have shown that knocking out the major phosphodiesterase PdeH in E. coli results in the accumulation of c-di-GMP. Moreover, introducing a point mutation (G412S) in PdeI's divergent GGDEF domain within this PdeH knockout background led to decreased c-di-GMP levels (ref. 2). This finding implies that the wild-type GGDEF domain in PdeI contributes to maintaining or increasing cellular c-di-GMP levels.

      Importantly, our single-cell experiments demonstrated a positive correlation between PdeI expression levels and c-di-GMP levels (Figure 4D). In this revision, we also constructed a PdeI(G412S)-BFP mutation strain. Notably, our observations of this strain revealed that c-di-GMP levels remained constant despite an increase in BFP fluorescence, which serves as a proxy for PdeI(G412S) expression levels (Figure 4D). This experimental evidence, coupled with domain analyses, suggests that PdeI may also contribute to c-di-GMP synthesis, rebutting the notion that it acts solely as a phosphodiesterase. HPLC LC-MS/MS analysis further confirmed that the overexpression of PdeI, induced by arabinose, resulted in increased c-di-GMP levels (Fig. 4E). These findings strongly suggest that PdeI plays a pivotal role in upregulating c-di-GMP levels.

      Our further analysis indicated that PdeI contains a CHASE (cyclases/histidine kinase-associated sensory) domain. Combined with our experimental results showing that PdeI is a membrane-associated protein, we hypothesize that PdeI acts as a sensor, integrating environmental signals with c-di-GMP production under complex regulatory mechanisms.

      We understand your interest in the other genes present in cluster 2 of the biofilm and their potential relationship to PdeI and c-di-GMP. Upon careful analysis, we have determined that the other marker genes in this cluster do not significantly impact biofilm formation, nor have we identified any direct relationship between these genes, c-di-GMP, or PdeI. Our focus on PdeI within this cluster is justified by its unique and significant role in c-di-GMP regulation and biofilm formation, as demonstrated by our experimental results. While other genes in this cluster may be co-expressed, their functions appear unrelated to the PdeI-c-di-GMP pathway we are investigating. Therefore, we opted not to elaborate on these genes in our main discussion, as they do not contribute directly to our understanding of the PdeI-c-di-GMP association. However, we can include a brief mention of these genes in the manuscript, indicating their lack of relevance to the PdeI-c-di-GMP pathway. This addition will provide a more comprehensive view of the cluster's composition while maintaining our focus on the key findings related to PdeI and c-di-GMP.

      We have also included the aforementioned explanations and supporting experimental data within the manuscript to clarify this important point (lines 193-217). Thank you for highlighting this apparent contradiction, allowing us to provide a more detailed explanation of our findings.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Overall, I found the main text of the manuscript well written and easy to understand, though too compressed in parts to fully understand the details of the work presented, some examples are outlined below. The materials and methods appeared to be less carefully compiled and could use some careful proof-reading for spelling (e.g. repeated use of "minuts" for minutes, "datas" for data) and grammar and sentence fragments (e.g. "For exponential period E. coli data." Line 333). In general, the meaning is still clear enough to be understood. I also was unable to find figure captions for the supplementary figures, making these difficult to understand. 

      We appreciate your careful review, which has helped us improve the clarity and quality of our manuscript. We acknowledge that some parts of the main text may have been overly compressed due to the Short Report format in eLife. We have thoroughly reviewed the manuscript and expanded on key areas to provide more comprehensive explanations. We have carefully revised the Materials and Methods section as follows: we corrected all spelling errors, including "minuts" to "minutes" and "datas" to "data", and we corrected grammatical issues and sentence fragments throughout the section. We sincerely apologize for the omission of captions for the supplementary figures. We have now added detailed captions for all supplementary figures to ensure they are easily understandable. We believe these revisions address your concerns and enhance the overall readability and comprehension of our work.

      General comments: 

      (1) To evaluate the performance of RiboD-PETRI, it would be helpful to have more details in general, particularly to do with the development of the sequencing protocol and the statistics shown. Some examples: How many reads were sequenced in each experiment? Of these, how many are mapped to the bacterial genome? How many reads were recovered per cell? Have the authors performed some kind of subsampling analysis to determine if their sequencing has saturated the detection of expressed genes? The authors show e.g. correlations between classic PETRI-seq and RiboD-PETRI for E. coli in Figure 1, but also have similar data for C. crescentus and S. aureus - do these data behave similarly? These are just a few examples, but I'm sure the authors have asked themselves many similar questions while developing this project; more details, hard numbers, and comparisons would be very much appreciated. 

      Thank you for your valuable feedback. To address your concerns, we have added a table in the supplementary material that clarifies the details of sequencing.

      The correlation between PETRI-seq and RiboD-PETRI data in C. crescentus is relatively good. However, the correlation between PETRI-seq and RiboD-PETRI data in S. aureus is lower. The reason is that the sequencing depths of RiboD-PETRI and PETRI-seq differ, resulting in much higher gene expression counts in the RiboD-PETRI sequencing results than in PETRI-seq, and the calculated correlation coefficient is only about 0.47. This indicates that there is some positive correlation between the two sets of data, but it is not particularly strong. However, we counted the expression of 2763 genes in total, and even though the calculated correlation coefficient is relatively low, it still shows some consistency between the two groups of samples.

      Author response image 1.

      Assessment of the effect of rRNA depletion on transcriptional profiles of (A) C. crescentus (CC) and (B) S. aureus (SA). The Pearson correlation coefficient (r) of UMI counts per gene (log2 UMIs) between RiboD-PETRI and PETRI-seq was calculated for 4097 genes (A) and 2763 genes (B). The "ΔΔ" label represents the RiboD-PETRI protocol; the "Ctrl" label represents the classic PETRI-seq protocol we performed. Each point represents a gene.
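
      The per-gene comparison described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the pseudocount of 1 added before the log2 transform is an assumption (the response does not say how zero counts were handled), and the UMI totals are made-up toy values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def log2_umi_correlation(umis_a, umis_b, pseudocount=1.0):
    """Pearson r of log2(UMI + pseudocount) per gene between two protocols."""
    log_a = [math.log2(u + pseudocount) for u in umis_a]
    log_b = [math.log2(u + pseudocount) for u in umis_b]
    return pearson_r(log_a, log_b)


# Toy per-gene UMI totals (hypothetical), one entry per gene.
ctrl_umis = [3, 10, 0, 250, 42]
ribod_umis = [30, 95, 2, 2600, 400]
r = log2_umi_correlation(ctrl_umis, ribod_umis)
print(round(r, 3))
```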

      (2) Additionally, I think it is critical that the authors provide processed read counts per cell and gene in their supplementary information to allow others to investigate the performance of their method without going back to raw FASTQ files, as this can represent a significant hurdle for reanalysis. 

      Thank you for your suggestion. However, it's important to clarify that reads and UMIs (Unique Molecular Identifiers) are distinct concepts in single-cell RNA sequencing. Reads can be influenced by PCR amplification during library construction, making their quantity less stable. In contrast, UMIs serve as a more reliable indicator of the number of mRNA molecules detected after PCR amplification. Throughout our study, we primarily utilized UMI counts for quantification. To address your concern about data accessibility, we have included the UMI counts per cell and gene in our supplementary materials provided above (Tables S7-15; some of the files are too large and are therefore stored in GEO: GSE260458). This approach provides a more accurate representation of gene expression levels and allows for robust reanalysis without the need to process raw FASTQ files.
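
      The distinction between reads and UMIs drawn above can be illustrated with a small sketch. It assumes each aligned read has already been reduced to a (cell barcode, gene, UMI) triple and simply collapses exact duplicates. Real pipelines typically also merge UMIs that differ by sequencing errors, which is omitted here. All barcodes and gene names are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads to UMI counts per (cell, gene).

    `reads` is an iterable of (cell_barcode, gene, umi) tuples, one per read.
    Reads sharing all three fields are treated as PCR duplicates.
    """
    unique_molecules = set(reads)
    counts = defaultdict(int)
    for cell, gene, _umi in unique_molecules:
        counts[(cell, gene)] += 1
    return dict(counts)


reads = [
    ("cellA", "geneX", "AACGT"),
    ("cellA", "geneX", "AACGT"),  # PCR duplicate of the read above
    ("cellA", "geneX", "GGTCA"),
    ("cellB", "geneX", "AACGT"),  # same UMI, different cell: a separate molecule
]
umi_counts = count_umis(reads)
print(umi_counts)
```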

      (3) Finally, the authors should also discuss other approaches to ribosomal depletion in bacterial scRNA-seq. One of the figures appears to contain such a comparison, but it is never mentioned in the text that I can find, and one could read this manuscript and come away believing this is the first attempt to deplete rRNA from bacterial scRNA-seq. 

      We have addressed this concern by including a comparison of different methods for depleting rRNA from bacterial scRNA-seq in Table S4 and by adding a short comparison in the text, as follows: “Additionally, we compared our findings with other reported methods (Fig. 1B; Table S4). The original PETRI-seq protocol, which does not include an rRNA depletion step, exhibited an mRNA detection rate of approximately 5%. The MicroSPLiT-seq method, which utilizes Poly A Polymerase for mRNA enrichment, achieved a detection rate of 7%. Similarly, M3-seq and BacDrop-seq, which employ RNase H to digest rRNA post-DNA probe hybridization in cells, reported mRNA detection rates of 65% and 61%, respectively. MATQ-DASH, which utilizes Cas9-mediated targeted rRNA depletion, yielded a detection rate of 30%. Among these, RiboD-PETRI demonstrated superior performance in mRNA detection while requiring the least sequencing depth.” We have added this content in the main text (lines 110-120), specifically in relation to Figure 1B and Table S4. This addition provides context for our method and clarifies its position among existing techniques.

      Detailed comments: 

      Line 78: the authors describe the multiplet frequency, but it is not clear to me how this was determined, for which experiments, or where in the SI I should look to see this. Often this is done by mixing cultures of two distinct bacteria, but I see no evidence of this key experiment in the manuscript. 

      The multiplet frequency we discuss in the manuscript was not determined by experimentally mixing distinct bacterial cultures. The PETRI-seq and microSPLiT articles both performed experiments mixing two libraries to determine the single-cell rate, and both obtained good results. Our technique is derived from these two methods (mainly PETRI-seq), and the main difference lies in the downstream RiboD step, so we did not repeat this experiment. The multiplet frequencies reported here are therefore theoretical predictions based on our sequencing results, calculated using a Poisson distribution. We have made this distinction clearer in our manuscript (lines 93-97). The method is described in the Materials and Methods section (lines 520-528), and the data are available in Table S2. To elaborate:

      To assess the efficiency of single-cell capture in RiboD-PETRI, we calculated the multiplet frequency using a Poisson distribution based on our sequencing results.

      (1) Definition: In our study, multiplet frequency is defined as the probability of a non-empty barcode corresponding to more than one cell.

      (2) Calculation Method: We use a Poisson distribution-based approach to calculate the predicted multiplet frequency. The process involves several steps:

      We first calculate the proportion of barcodes corresponding to zero cells: P(0) = e<sup>-λ</sup>. Then, we calculate the proportion corresponding to one cell: P(1) = λe<sup>-λ</sup>. We derive the proportion for more than zero cells: P(≥1) = 1 - P(0). And for more than one cell: P(≥2) = 1 - P(1) - P(0). Finally, the multiplet frequency is calculated as: multiplet frequency = P(≥2) / P(≥1).

      (3) Parameter λ: This is the ratio of the number of cells to the total number of possible barcode combinations. For instance, when detecting 10,000 cells, λ = 10,000 / N, where N is the total number of possible barcode combinations.
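
      The Poisson-based steps above can be sketched as a short function. This is an illustrative reconstruction from the definition in point (1), namely the probability that a non-empty barcode corresponds to more than one cell, rather than the authors' code. The barcode space of 96 × 96 × 96 combinations is an assumption made only for the example and is not stated in this response.

```python
import math


def multiplet_frequency(n_cells: int, n_barcodes: int) -> float:
    """Predicted fraction of non-empty barcodes that contain more than one cell."""
    if n_cells <= 0 or n_barcodes <= 0:
        raise ValueError("n_cells and n_barcodes must be positive")
    lam = n_cells / n_barcodes   # expected number of cells per barcode
    p0 = math.exp(-lam)          # P(0): barcode receives no cell
    p1 = lam * p0                # P(1): barcode receives exactly one cell
    p_ge1 = 1.0 - p0             # P(>=1): barcode is non-empty
    p_ge2 = 1.0 - p0 - p1        # P(>=2): barcode holds a multiplet
    return p_ge2 / p_ge1


# Assumption for illustration: three barcoding rounds with 96 barcodes each.
ASSUMED_BARCODE_COMBINATIONS = 96 ** 3
freq = multiplet_frequency(10_000, ASSUMED_BARCODE_COMBINATIONS)
print(f"{freq:.4%}")
```

      For small λ the result is close to λ/2, so halving the number of cells loaded roughly halves the predicted multiplet frequency.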

      Line 94: the concept of "percentage of gene expression" is never clearly defined. Does this mean the authors detect 99.86% of genes expressed in some cells? How is "expressed" defined - is this just detecting a single UMI? 

      The term "percentage of gene expression" refers to the proportion of genes in the bacterial strain that were detected as expressed in the sequenced cell population. Specifically, in this context, it means that 99.86% of all genes in the bacterial strain were detected as expressed in at least one cell in our sequencing results. To define "expressed" more clearly: a gene is considered expressed if at least one UMI (Unique Molecular Identifier) is detected for it in any cell in the population. This definition allows for the detection of even low-level gene expression. To enhance clarity in the manuscript, we have rephrased the term as “transcriptome-wide gene coverage across the cell population”.
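
      Under the definition just given, population-level gene coverage can be computed as in the sketch below. The input layout (a mapping from (cell, gene) pairs to UMI counts) and all names are hypothetical choices for illustration.

```python
def gene_coverage(umi_counts, annotated_genes):
    """Fraction of annotated genes with at least one UMI in at least one cell."""
    annotated = set(annotated_genes)
    if not annotated:
        raise ValueError("annotated_genes must not be empty")
    detected = {gene for (_cell, gene), n in umi_counts.items() if n >= 1}
    return len(detected & annotated) / len(annotated)


# Toy data: g1 and g2 are detected; g3 has zero UMIs and g4 is never observed.
toy_counts = {("cellA", "g1"): 2, ("cellB", "g2"): 1, ("cellB", "g3"): 0}
coverage = gene_coverage(toy_counts, ["g1", "g2", "g3", "g4"])
print(coverage)
```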

      Line 98: The authors discuss the number of recovered UMIs throughout this paragraph, but there is no clear discussion of the number of detected expressed genes per cell. Could the authors include a discussion of this as well, as this is another important measure of sensitivity? 

      We appreciate your suggestion to include a discussion on the number of detected expressed genes per cell, as this is indeed another important measure of sensitivity. We would like to clarify that we have actually included statistics on the number of genes detected across all cells in the main text of our paper. This information is presented as percentages. However, we understand that you may be looking for a more detailed representation, similar to the UMI statistics we provided. To address this, we have now added a new analysis showing the number of genes detected per cell (lines 132-133, 138-139, 144-145 and 184-186, Fig. 2B, 3B and S2B). This additional result complements our existing UMI data and provides a more comprehensive view of the sensitivity of our method. We have included this new gene-per-cell statistical graph in the supplementary materials.

      Figure 1B: I presume ctrl and delta delta represent the classic PETRI-seq and RiboD protocols, respectively, but this is not specified. This should be clarified in the figure caption, or the names changed. 

      We appreciate you bringing this to our attention. We acknowledge that the labeling in the figure could have been clearer. We have now clarified this information in the figure caption. To provide more specificity: The "ΔΔ" label represents the RiboD-PETRI protocol; the "Ctrl" label represents the classic PETRI-seq protocol we performed. We have updated the figure caption to include these details, which should help readers better understand the protocols being compared in the figure.

      Line 104: the authors claim "This performance surpassed other reported bacterial scRNA-seq methods" with a long number of references to other methods. "Performance" is not clearly defined, and it is unclear what the exact claim being made is. The authors should clarify what they're claiming, and further discuss the other methods and comparisons they have made with them in a thorough and fair fashion. 

      We appreciate your request for clarification, and we acknowledge that our definition of "performance" should have been more explicit. We would like to clarify that in this context, we define performance primarily in terms of the proportion of mRNA captured. Our improved method demonstrates a significantly higher rate of rRNA removal compared to other bacterial single-cell library construction methods. This results in a higher proportion of mRNA in our sequencing data, which we consider a key performance metric for single-cell RNA sequencing in bacteria. Additionally, when compared to our previous method, PETRI-seq, our improved approach not only enhances rRNA removal but also reduces library construction costs. This dual improvement in both data quality and cost-effectiveness is what we intended to convey with our performance claim.

      We recognize that a more thorough and fair discussion of other methods and their comparisons would be beneficial. We have summarized the comparison in Table S4 and added a short discussion to the main text (lines 106-120). This addition provides context for our method and clarifies its position among existing techniques.

      Figure 1D: Do the authors have any explanation for the relatively lower performance of their C. crescentus depletion? 

      We appreciate your attention to detail and the opportunity to address this point. The lower efficiency of rRNA removal in C. crescentus compared to other species can be attributed to inherent differences between species. It's important to note that a single method for rRNA depletion may not be universally effective across all bacterial species due to variations in their genetic makeup and rRNA structures. Different bacterial species can have unique rRNA sequences, secondary structures, or associated proteins that may affect the efficiency of our depletion method. This species-specific variation highlights the challenges in developing a one-size-fits-all approach for bacterial rRNA depletion. While our method has shown high efficiency across several species, the results with C. crescentus underscore the need for continued refinement and possibly species-specific optimizations in rRNA depletion techniques. We thank you for bringing attention to this point, as it provides valuable insight into the complexities of bacterial rRNA depletion and areas for future improvement in our method.

      Line 118: The authors claim RiboD-PETRI has a "consistent ability to unveil within-population heterogeneity", however the preceding paragraph shows it detects potential heterogeneity, but provides no evidence this inferred heterogeneity reflects the reality of gene expression in individual cells. 

      We appreciate your careful reading and the opportunity to clarify this point. We acknowledge that our wording may have been too assertive given the evidence presented. We acknowledge that the subpopulations of cells identified in other species have not undergone experimental verification. Our intention in presenting these results was to demonstrate RiboD-PETRI's capability to detect “potential” heterogeneity consistently across different bacterial species, showcasing the method's sensitivity and potential utility in exploring within-population diversity. However, we agree that without further experimental validation, we cannot definitively claim that these detected differences represent true biological heterogeneity in all cases. We have revised this section to reflect the current state of our findings more accurately, emphasizing that while RiboD-PETRI consistently detects potential heterogeneity across species, further experimental validation would be required to confirm the biological significance of the observations (lines 169-171).

      Figure 1 H&I: I'm not entirely sure what I am meant to see in these figures, presumably some evidence for heterogeneity in gene expression. Are there better visualizations that could be used to communicate this? 

      We appreciate your suggestion for improving the visualization of gene expression heterogeneity. We have explored alternative visualization methods in the revised manuscript. Specifically, for the expression levels of marker genes shown in Figure 1H (which is Figure 2D now), we have created violin plots (Supplementary Fig. 4). These plots offer a more comprehensive view of the distribution of expression levels across different cell populations, making it easier to discern heterogeneity. However, due to the number of marker genes and the resulting volume of data, these violin plots are quite extensive and would occupy a significant amount of space. Given the space constraints of the main figure, we propose to include these violin plots as Fig. S4, immediately following Figure 1 H&I (which is Figure 2D&E now). This arrangement will allow readers to access more detailed information about these marker genes while maintaining the concise style of the main figure.

      Regarding the pathway enrichment figure (Figure 2E), we have also considered your suggestion for improvement. We attempted to use a dot plot to display the KEGG pathway enrichment of the genes. However, our analysis revealed that the genes were only enriched in a single pathway. As a result, the visual representation using a dot plot still did not produce a particularly aesthetically pleasing or informative figure.

      Line 124: The authors state no significant batch effect was observed, but in the methods on line 344 they specify batch effects were removed using Harmony. It's unclear what exactly S2 is showing without a figure caption, but the authors should clarify this discrepancy. 

      We apologize for any confusion caused by the lack of a clear figure caption for Figure S2 (which is Figure S3D now). To address your concern, in addition to adding figure captions for supplementary figure, we would also like to provide more context about the batch effect analysis. In Supplementary Fig. S3, Panel C represents the results without using Harmony for batch effect removal, while Panel D shows the results after applying Harmony. In both panels A and B, the distribution of samples one and two do not show substantial differences. Based on this observation, we concluded that there was no significant batch effect between the two samples. However, we acknowledge that even subtle batch effects could potentially influence downstream analyses. Therefore, out of an abundance of caution and to ensure the highest quality of our results, we decided to apply Harmony to remove any potential minor batch effects. This approach aligns with best practices in single-cell analysis, where even small technical variations are often accounted for to enhance the robustness of the results.

      To improve clarity, we have revised our manuscript to better explain this nuanced approach: 1. We have updated the statement to reflect that while no major batch effect was observed, we applied batch correction as a precautionary measure (lines 181-182). 2. We have added a detailed caption to Figure S3, explaining the comparison between non-corrected and batch-corrected data. 3. We have modified the methods section to clarify that Harmony was applied as a precautionary step, despite the absence of obvious batch effects (lines 492-493).

      Figure 2D: I found this panel fairly uninformative, is there a better way to communicate this finding? 

      Thank you for your feedback regarding Figure 2D. We have explored alternative ways to present this information, using a dot plot to display the enrichment pathways, as this is often an effective method for visualizing such data. Meanwhile, we also provided a more detailed textual description of the enrichment results in the main text, highlighting the most significant findings.

      Figure 2I: the figure itself and caption say GFP, but in the text and elsewhere the authors say this is a BFP fusion. 

      We appreciate your careful review of our manuscript and figures. We apologize for any confusion this may have caused. To clarify: Both GFP (Green Fluorescent Protein) and BFP (Blue Fluorescent Protein) were indeed used in our experiments, but for different purposes: 1. GFP was used for imaging to observe the location of PdeI in bacteria and persister cell growth, as shown in Figure 4C and 4K. 2. BFP was used for cell sorting, imaging of location in the biofilm, and detecting the proportion of persister cells, as shown in Figure 4D and 4F-J. To address this inconsistency and improve clarity, we have made the following corrections: 1. We have reviewed the main text to ensure that references to GFP and BFP are accurate and consistent with their respective uses in our experiments. 2. We have added a note in the figure caption for Figure 4C to explicitly state that this particular image shows GFP fluorescence for the location of PdeI. 3. In the methods section, we have provided a clear explanation of how both fluorescent proteins were used in different aspects of our study (lines 326-340).

      Line 156: The authors compare prices between RiboD and PETRI-seq. It would be helpful to provide a full cost breakdown, e.g. in supplementary information, as it is unclear exactly how the authors came to these numbers or where the major savings are (presumably in sequencing depth?) 

      We appreciate your suggestion to provide a more detailed cost breakdown, and we agree that this would enhance the transparency and reproducibility of our cost analysis. In response to your feedback, we have prepared a comprehensive cost breakdown that includes all materials and reagents used in the library preparation process. Additionally, we've factored in the sequencing depth (50G) and the unit price for sequencing (25¥/G). These calculations allow us to determine the cost per cell after sequencing. As you correctly surmised, a significant portion of the cost reduction is indeed related to sequencing depth. However, there are also savings in the library preparation steps that contribute to the overall cost-effectiveness of our method. We propose to include this detailed cost breakdown as a supplementary table (Table S6) in our paper. This table will provide a clear, itemized list of all expenses involved, including: 1. Reagents and materials for library preparation 2. Sequencing costs (depth and price per G) 3. Calculated cost per cell.
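
      The arithmetic behind such a breakdown can be sketched as follows. Only the sequencing depth (50 G) and unit price (25¥/G) come from the response above; the library preparation cost and cell count used in the example call are hypothetical placeholders, not values from Table S6.

```python
# Figures stated in the response: 50 G of sequencing at 25 yuan per G.
SEQUENCING_DEPTH_G = 50
PRICE_PER_G_YUAN = 25


def cost_per_cell(library_prep_cost_yuan, n_cells,
                  depth_g=SEQUENCING_DEPTH_G, price_per_g=PRICE_PER_G_YUAN):
    """Return (sequencing cost, total cost, cost per cell), all in yuan."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    sequencing_cost = depth_g * price_per_g
    total_cost = library_prep_cost_yuan + sequencing_cost
    return sequencing_cost, total_cost, total_cost / n_cells


# Hypothetical inputs for illustration only:
# 1,000 yuan of library-prep reagents and 30,000 recovered cells.
seq_cost, total, per_cell = cost_per_cell(1000, 30000)
print(seq_cost, total, per_cell)
```

      Keeping library preparation and sequencing as separate terms makes it easy to see which of the two dominates the per-cell cost for a given experiment.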

      Line 291: The design and production of the depletion probes are not clearly explained. How did the authors design them? How were they synthesized? Also, it appears the authors have separate probe sets for E. coli, C. crescentus, and S. aureus - this should be clarified, possibly in the main text.

      Thank you for your important questions regarding the design and production of our depletion probes. We included the detailed probe information in Supplementary Table S1; however, we did not clarify this information in the main text because of the constraints of the Short Report format in eLife. We appreciate the opportunity to provide clarifications.

      The core principle behind our probe design is that the probe sequences are reverse complementary to the r-cDNA sequences. This design allows for specific recognition of r-cDNA. The probes are then bound to magnetic beads, allowing the r-cDNA-probe-bead complexes to be separated from the rest of the library. To address your specific questions: 1. Probe Design: We designed separate probe sets for E. coli, C. crescentus, and S. aureus. Each set was specifically constructed to be reverse complementary to the r-cDNA sequences of its respective bacterial species. This species-specific approach ensures high efficiency and specificity in rRNA depletion for each organism. The hybrid DNA complex was then removed by Streptavidin magnetic beads. 2. Probe Synthesis: The probes were synthesized based on these design principles. 3. Species-Specific Probe Sets: You are correct in noting that we used separate probe sets for each bacterial species. We have clarified this important point in the main text to ensure readers understand the specificity of our approach. To further illustrate this process, we have created a schematic diagram showing the principle of rRNA removal and clarified the design principle in the figure legend of Fig. 1A.
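
      The reverse-complement principle described above can be illustrated with a minimal sketch. The sequence, window position, and function names below are hypothetical and chosen only for illustration; the actual probe sequences are those listed in Supplementary Table S1.

```python
# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_probe(rcdna, start, length):
    """Return a probe that is reverse complementary to one window of r-cDNA."""
    target = rcdna[start:start + length]
    if len(target) != length:
        raise ValueError("window extends past the end of the sequence")
    return reverse_complement(target)


def hybridizes(probe, rcdna):
    """True if the probe's reverse complement occurs exactly in the r-cDNA."""
    return reverse_complement(probe) in rcdna


# Toy r-cDNA fragment (hypothetical, not a real rRNA-derived sequence).
toy_rcdna = "ATGCGTACCGTTAGC"
probe = design_probe(toy_rcdna, start=3, length=8)
print(probe, hybridizes(probe, toy_rcdna))
```

      A species-specific probe set would then amount to repeating this design step over the r-cDNA sequences of each organism.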

      Line 362: I didn't see a description of the construction of the PdeI-BFP strain, I assume this would be important for anyone interested in the specific work on PdeI. 

      Thank you for your astute observation regarding the construction of the PdeI-BFP strain. We appreciate the opportunity to provide this important information. The PdeI-BFP strain was constructed as follows: 1. We cloned the pdeI gene along with its native promoter region (250bp) into a pBAD vector. 2. The original promoter region of the pBAD vector was removed to avoid any potential interference. 3. This construction enables the expression of the PdeI-BFP fusion protein to be regulated by the native promoter of pdeI, thus maintaining its physiological control mechanisms. 4. The BFP coding sequence was fused to the pdeI gene to create the PdeI-BFP fusion construct. We have added a detailed description of the PdeI-BFP strain construction to our methods section (lines 327-334).

      Reviewer #2 (Recommendations For The Authors): 

      (1) General remarks: 

      Reconsider using 'advanced' in the title. It is highly generic and misleading. Perhaps 'cost-efficient' would be a more precise substitute. 

      Thank you for your valuable suggestion. After careful consideration, we have decided to use "improved" in the title. Firstly, our method presents an efficient solution to a persistent challenge in bacterial single-cell RNA sequencing, specifically addressing rRNA abundance. Secondly, it facilitates precise exploration of bacterial population heterogeneity. We believe our method encompasses more than just cost-effectiveness, justifying the use of the term "improved."

      Consider expanding the introduction. The introduction does not explain the setup of the biological question or basic details such as the organism(s) for which the technique has been developed, or which species biofilms were studied. 

      Thank you for your valuable feedback regarding our introduction. We acknowledge that our writing was compressed because of the constraints of the Short Report format in eLife. We appreciate the opportunity to expand this crucial section, which improves the clarity and impact of the manuscript's introduction.

      We revised our introduction (lines 53-80) according to the following principles:

      (1) Initial Biological Question: We explained the initial biological question that motivated our research—understanding the heterogeneity in E. coli biofilms—to provide essential context for our technological development.

      (2) Limitations of Existing Techniques: We briefly described the limitations of current single-cell sequencing techniques for bacteria, particularly regarding their application in biofilm studies.

      (3) Introduction of Improved Technique: We introduced our improved technique, initially developed for E. coli.

      (4) Research Evolution: We highlighted how our research has evolved, demonstrating that our technique is applicable not only to E. coli but also to Gram-positive bacteria and other Gram-negative species, showcasing the broad applicability of our method.

      (5) Specific Organisms Studied: We provided examples of the specific organisms we studied, encompassing both Gram-positive and Gram-negative bacteria.

      (6) Potential Implications: Finally, we outlined the potential implications of our technique for studying bacterial heterogeneity across various species and contexts, extending beyond biofilms.

      (2) Writing remarks: 

      43-45 Reword: "Thus, we address a persistent challenge in bacterial single-cell RNA-seq regarding rRNA abundance, exemplifying the utility of this method in exploring biofilm heterogeneity.". 

      Thank you for highlighting this sentence and requesting a rewording. We appreciate the opportunity to improve the clarity and impact of our statement. We have reworded the sentence as: "Our method effectively tackles a long-standing issue in bacterial single-cell RNA-seq: the overwhelming abundance of rRNA. This advancement significantly enhances our ability to investigate the intricate heterogeneity within biofilms at unprecedented resolution." (lines 47-50)

      49 "Biofilms, comprising approximately 80% of chronic and recurrent microbial infections in the human body..." - probably meant 'contribute to'. 

      Thank you for catching this imprecision in our statement. We have reworded the sentence as: "Biofilms contribute to approximately 80% of chronic and recurrent microbial infections in the human body..."

      54-55 Please expand on "this". 

      Thank you for your request to expand on the use of "this" in the sentence. You're right that more clarity would be beneficial here. We have revised and expanded this section in lines 54-69.

      81-84 Unclear why these species samples were either at exponential or stationary phases. The growth stage can influence the proportion of rRNA and other transcripts in the population. 

      Thank you for raising this important point about the growth phases of the bacterial samples used in our study. We appreciate the opportunity to clarify our experimental design. To evaluate the performance of RiboD-PETRI, we designed a comprehensive assessment of rRNA depletion efficiency under diverse physiological conditions, specifically contrasting exponential and stationary phases. This approach allows us to understand how these different growth states impact rRNA depletion efficacy. Additionally, we included a variety of bacterial species, encompassing both gram-negative and gram-positive organisms, to ensure that our findings are broadly applicable across different types of bacteria. By incorporating these variables, we aim to provide insights into the robustness and reliability of the RiboD-PETRI method in various biological contexts. We have included this rationale in our result section (lines 99-106), providing readers with a clear understanding of our experimental design choices.

      86 "compared TO PETRI-seq " (typo). 

      We have corrected this typo in our manuscript.

      94 "gene expression collectively" rephrase. Probably this means coverage of the entire gene set across all cells. Same for downstream usage of the phrase. 

      Thank you for pointing out this ambiguity in our phrasing. Your interpretation of our intended meaning is accurate. We have rephrased the sentence as “transcriptome-wide gene coverage across the cell population”.

      97 What were the median UMIs for the 30,000 cell library {greater than or equal to}15 UMIs? Same question for the other datasets. This would reflect a more comparable statistic with previous studies than the top 3% of the cells for example, since the distributions of the single-cell UMIs typically have a long tail. 

      Thank you for this insightful question and for pointing out the importance of providing more comparable statistics. We agree that median values offer a more robust measure of central tendency, especially for datasets with long-tailed distributions, which are common in single-cell studies. The suggestion to include median Unique Molecular Identifier (UMI) counts would indeed provide a more comparable statistic with previous studies. We have analyzed the median UMIs for our libraries as follows and revised our manuscript according to the analysis (lines 126-130, 133-136, 139-142 and 175-180).

      (1) Median UMI count in Exponential Phase E. coli:

      Total: 102 UMIs per cell

      Top 1,000 cells: 462 UMIs per cell

      Top 5,000 cells: 259 UMIs per cell

      Top 10,000 cells: 193 UMIs per cell

      (2) Median UMI count in Stationary Phase S. aureus:

      Total: 142 UMIs per cell

      Top 1,000 cells: 378 UMIs per cell

      Top 5,000 cells: 207 UMIs per cell

      Top 8,000 cells: 167 UMIs per cell

      (3) Median UMI count in Exponential Phase C. crescentus:

      Total: 182 UMIs per cell

      Top 1,000 cells: 2,190 UMIs per cell

      Top 5,000 cells: 662 UMIs per cell

      Top 10,000 cells: 225 UMIs per cell

      (4) Median UMI count in Static E. coli Biofilm:

      Total of Replicate 1: 34 UMIs per cell

      Total of Replicate 2: 52 UMIs per cell

      Top 1,621 cells of Replicate 1: 283 UMIs per cell

      Top 3,999 cells of Replicate 2: 239 UMIs per cell

      104-105 The performance metric should again be the median UMIs of the majority of the cells passing the filter (15 mRNA UMIs is reasonable). The top 3-5% are always much higher in resolution because of the heavy tail of the single-cell UMI distribution. It is unclear if the performance surpasses the other methods using the comparable metric. Recommend removing this line. 

      We appreciate your suggestion regarding the use of median UMIs as a more appropriate performance metric, and we agree that comparing the top 3-5% of cells can be misleading due to the heavy tail of the single-cell UMI distribution. We have removed the line in question (104-105) that compares our method's performance based on the top 3-5% of cells in the revised manuscript. Instead, we focused on presenting the median UMI counts for cells passing the filter (≥15 mRNA UMIs) as the primary performance metric. This will provide a more representative and comparable measure of our method's performance. We have also revised the surrounding text to reflect this change, ensuring that our claims about performance are based on these more robust statistics (lines 126-130, 133-136, 139-142 and 175-180).
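
      The median-based summary discussed above can be sketched as follows. This is a minimal illustration that assumes per-cell mRNA UMI counts are available as a plain list and that the "Total" median is taken over cells passing the ≥15 UMI filter; the function name and the toy counts are hypothetical and are not the data reported in the manuscript.

```python
from statistics import median


def median_umi_summary(umis_per_cell, min_umis=15, top_n=()):
    """Median UMIs over cells passing the filter, and over the top-N of those cells."""
    # Keep only cells at or above the UMI threshold, ranked from highest to lowest.
    passing = sorted((u for u in umis_per_cell if u >= min_umis), reverse=True)
    if not passing:
        raise ValueError("no cells pass the UMI filter")
    summary = {"total": median(passing)}
    for n in top_n:
        # If fewer than n cells pass the filter, the slice uses all of them.
        summary[f"top_{n}"] = median(passing[:n])
    return summary


# Toy counts for eight cells (hypothetical values).
toy_counts = [3, 15, 20, 40, 100, 500, 8, 60]
summary = median_umi_summary(toy_counts, min_umis=15, top_n=(2, 4))
print(summary)
```

      Reporting the filtered median alongside the top-N medians makes the effect of the long tail explicit, since the top-N values sit well above the median of all passing cells.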

      106-108 The sequencing saturation of the libraries (in %), and downsampling analysis should be added to illustrate this point. 

      Thank you for your valuable suggestion. Your recommendation to add sequencing saturation and downsampling analysis is highly valuable and will help better illustrate our point. Based on your feedback, we have revised our manuscript by adding the following content:

      To provide a thorough evaluation of our sequencing depth and library quality, we performed sequencing saturation analysis on our sequencing samples. The findings reveal that our sequencing saturation is 100% (Fig. 8A & B), indicating that our sequencing depth is sufficient to capture the diversity of most transcripts. To further illustrate the impact of our downstream analysis on the datasets, we have shown the data distribution before and after applying our filtering criteria (Fig. S1B & C). These figures visualize the influence of our filtering process on data quality and distribution. After filtering, we obtain a more refined dataset with reduced noise and fewer outliers, which enhances the reliability of our downstream analyses.

      We have also ensured that a detailed description of the sequencing saturation method is included in the manuscript to provide readers with a comprehensive understanding of our methodology. We appreciate your feedback and believe these additions significantly improve our work.
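
      For readers unfamiliar with the metric, one common way to compute sequencing saturation and a downsampling curve is sketched below. This is an assumption-laden illustration, not the method used in the manuscript, which is described in the revised methods: it takes saturation as 1 minus the ratio of unique (cell barcode, gene, UMI) combinations to total reads, and the function names and toy reads are hypothetical.

```python
import random


def sequencing_saturation(reads):
    """Saturation = 1 - unique molecules / total reads (one common definition)."""
    reads = list(reads)
    if not reads:
        raise ValueError("no reads supplied")
    return 1 - len(set(reads)) / len(reads)


def downsampling_curve(reads, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Saturation after randomly subsampling each fraction of the reads."""
    reads = list(reads)
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for fraction in fractions:
        k = max(1, round(fraction * len(reads)))
        subsample = rng.sample(reads, k)
        curve.append((fraction, sequencing_saturation(subsample)))
    return curve


# Toy reads: each tuple is (cell barcode, gene, UMI); duplicates are repeat reads.
toy_reads = ([("c1", "g1", "u1")] * 3
             + [("c1", "g2", "u2")]
             + [("c2", "g1", "u3")] * 2)
curve = downsampling_curve(toy_reads)
print(sequencing_saturation(toy_reads), curve)
```

      A curve that flattens as the sampled fraction approaches 1 suggests that additional sequencing would recover few new molecules.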

      122: Please provide more details about the biofilm setup, including the media used. I did not find them in the methods. 

      We appreciate your attention to detail, and we agree that this information is crucial for the reproducibility of our experiments. We propose to add the following information to our methods section (lines 311-318):

      "For the biofilm setup, bacterial cultures were grown overnight. The next day, we diluted the culture 1:100 in a petri dish. We added 2ml of LB medium to the dish. If the bacteria contain a plasmid, the appropriate antibiotic needs to be added to LB. The petri dish was then incubated statically in a growth chamber for 24 hours. After incubation, we performed imaging directly under the microscope. The petri dishes used were glass-bottom dishes from Biosharp (catalog number BS-20-GJM), allowing for direct microscopic imaging without the need for cover slips or slides. This setup allowed us to grow and image the biofilms in situ, providing a more accurate representation of their natural structure and composition.​"

      125: "sequenced 1,563 reads" missing "with" 

      Thank you for correcting our grammar. We have revised the phrase as “sequenced with 1,563 reads”.

      126: "283/239 UMIs per cell" unclear. 283 and 239 UMIs per cell per replicate, respectively? 

      Thank you for pointing out this ambiguity. We have revised the phrase as “283 and 239 UMIs per cell per replicate, respectively” (line 184).

      Figure 1D: Please indicate where the comparison datasets are from. 

      We appreciate your question regarding the source of the comparison datasets in Figure 1D. All data presented in Figure 1D are from our own sequencing experiments. We did not use data from other publications for this comparison. Specifically, we performed sequencing on E. coli cells in the exponential growth phase using three different library preparation methods: RiboD-PETRI, PETRI-seq, and RNA-seq. The data shown in Figure 1D represent a comparison of UMIs and/or reads correlations obtained from these three methods. All sequencing results have been uploaded to the Gene Expression Omnibus (GEO) database. The accession number is GSE260458. We have updated the figure legend for Figure 1D to clearly state that all datasets are from our own experiments, specifying the different methods used.

      Figure 1I, 2D: Unable to interpret the color block in the data. 

      We apologize for any confusion regarding the interpretation of the color blocks in Figures 1I and 2D (which are Figure 2E, 3E now). The color blocks in these figures represent the p-values of the data points. The color scale ranges from red to blue. Red colors indicate smaller p-values, suggesting higher statistical significance and more reliable results. Blue colors indicate larger p-values, suggesting lower statistical significance and less reliable results. We have updated the figure legends for both Figure 2E and Figure 3E to include this explanation of the color scale. Additionally, we have added a color legend to each figure to make the interpretation more intuitive for readers.

      Figure1H and 2C: Gene names should be provided where possible. The locus tags are highly annotation-dependent and hard to interpret. Also, a larger size figure should be helpful. The clusters 2 and 3 in 2C are the most important, yet because they have few cells, very hard to see in this panel. 

      We appreciate your suggestions for improving the clarity and interpretability of Figures 1H and 2C (which are Figures 2D and 3D now). We have replaced the locus tags with gene names where possible in both figures. We have increased the size of both figures to improve visibility and readability. We have also made Clusters 2 and 3 in Figure 3D more prominent in the revised figure. Despite their smaller cell count, we recognize their importance and have adjusted the visualization to ensure they are clearly visible. We believe these modifications will significantly enhance the clarity and informativeness of Figures 2D and 3D.

      (3) Questions to consider further expanding on, by more analyses or experiments and in the discussion: 

      What are the explanations for the apparently contradictory upregulation of c-di-GMP in cells expressing higher PdeI levels? How could a phosphodiesterase lead to increased c-di-GMP levels? 

      We appreciate the reviewer's observation regarding the seemingly contradictory relationship between increased PdeI expression and elevated c-di-GMP levels. This is indeed an intriguing finding that warrants further explanation.

      PdeI was predicted to be a phosphodiesterase responsible for c-di-GMP degradation. This prediction is based on sequence analysis where PdeI contains an intact EAL domain known for degrading c-di-GMP. However, it is noteworthy that PdeI also contains a divergent GGDEF domain, which is typically associated with c-di-GMP synthesis (Fig. S8). This dual-domain architecture suggests that PdeI may engage in complex regulatory roles. Previous studies have shown that the knockout of the major phosphodiesterase PdeH in E. coli leads to the accumulation of c-di-GMP. Further, a point mutation on PdeI's divergent GGDEF domain (G412S) in this PdeH knockout strain resulted in decreased c-di-GMP levels (2), implying that the wild-type GGDEF domain in PdeI contributes to the maintenance or increase of c-di-GMP levels in the cell. Importantly, our single-cell experiments showed a positive correlation between PdeI expression levels and c-di-GMP levels (Response Fig. 9B). In this revision, we also constructed a PdeI(G412S)-BFP mutant strain. Notably, our observations of this strain revealed that c-di-GMP levels remained constant despite increasing BFP fluorescence, which serves as a proxy for PdeI(G412S) expression levels (Fig. 4D). This experimental evidence, along with domain analysis, suggests that PdeI could contribute to c-di-GMP synthesis, rebutting the notion that it solely functions as a phosphodiesterase. HPLC LC-MS/MS analysis further confirmed that PdeI overexpression, induced by arabinose, led to an upregulation of c-di-GMP levels (Fig. 4E). These results strongly suggest that PdeI plays a significant role in upregulating c-di-GMP levels. Our further analysis revealed that PdeI contains a CHASE (cyclases/histidine kinase-associated sensory) domain. Combined with our experimental results demonstrating that PdeI is a membrane-associated protein, we hypothesize that PdeI functions as a sensor that integrates environmental signals with c-di-GMP production under complex regulatory mechanisms.

      We have also included this explanation (lines 193-217) and the supporting experimental data (Fig. 4D & 4J) in our manuscript to clarify this important point. Thank you for highlighting this apparent contradiction, as it has allowed us to provide a more comprehensive explanation of our findings.

      What about the rest of the genes in cluster 2 of the biofilm? They should be used to help interpret the association between PdeI and c-di-GMP. 

      We understand your interest in the other genes present in cluster 2 of the biofilm and their potential relationship to PdeI and c-di-GMP. After careful analysis, we have determined that the other marker genes in this cluster do not have a significant impact on biofilm formation. Furthermore, we have not found any direct relationship between these genes and c-di-GMP or PdeI. Our focus on PdeI in this cluster is due to its unique and significant role in c-di-GMP regulation and biofilm formation, as demonstrated by our experimental results. While the other genes in this cluster may be co-expressed, their functions appear to be unrelated to the PdeI and c-di-GMP pathway we are investigating. We chose not to elaborate on these genes in our main discussion as they do not contribute directly to our understanding of the PdeI and c-di-GMP association. Instead, we could include a brief mention of these genes in the manuscript, noting that they were found to be unrelated to the PdeI-c-di-GMP pathway. This would provide a more comprehensive view of the cluster composition while maintaining focus on the key findings related to PdeI and c-di-GMP.

      Author response image 2.

      Protein-protein interactions of marker genes in cluster 2 of the 24-hour static E. coli biofilm data.

      A verification is needed that the protein fusion to PdeI functional/membrane localization is not due to protein interactions with fluorescent protein fusion. 

      We appreciate your concern regarding the potential impact of the fluorescent protein fusion on the functionality and membrane localization of PdeI. It is crucial to verify that the observed effects are attributable to PdeI itself and not an artifact of its fusion with the fluorescent protein. To address this matter, we have incorporated a control group expressing only the fluorescent protein BFP (without the PdeI fusion) under the same promoter. This experimental design allows us to differentiate between effects caused by PdeI and those potentially arising from the fluorescent protein alone.

      Our results revealed the following key observations:

      (1) Cellular Localization: The GFP alone exhibited a uniform distribution in the cytoplasm of bacterial cells, whereas the PdeI-GFP fusion protein was specifically localized to the membrane (Fig. 4C).

      (2) Localization in the Biofilm Matrix: BFP-positive cells were distributed throughout the entire biofilm community. In contrast, PdeI-BFP positive cells localized at the bottom of the biofilm, where cell-surface adhesion occurs (Fig. 4F).

      (3) c-di-GMP Levels: Cells with high levels of BFP displayed no increase in c-di-GMP levels. Conversely, cells with high levels of PdeI-BFP exhibited a significant increase in c-di-GMP levels (Fig. 4D).

      (4) Persister Cell Ratio: Cells expressing high levels of BFP showed no increase in persister ratios, while cells with elevated levels of PdeI-BFP demonstrated a marked increase in persister ratios (Fig. 4J).

      These findings from the control experiments have been included in our manuscript (lines 193-244, Fig. 4C, 4D, 4F, 4G and 4J), providing robust validation of our results concerning the PdeI fusion protein. They confirm that the observed effects are indeed due to PdeI and not merely artifacts of the fluorescent protein fusion.

      (1) Vrabioiu, A. M. & Berg, H. C. Signaling events that occur when cells of Escherichia coli encounter a glass surface. Proceedings of the National Academy of Sciences of the United States of America 119 (2022). https://doi.org/10.1073/pnas.2116830119

      (2) Reinders, A. et al. Expression and Genetic Activation of Cyclic Di-GMP-Specific Phosphodiesterases in Escherichia coli. J Bacteriol 198, 448-462 (2016). https://doi.org/10.1128/JB.00604-15

    1. Author Response

      The following is the authors’ response to the original reviews.

      Major comments (Public Reviews)

      Generality of grid cells

      We appreciate the reviewers’ concern regarding the generality of our approach, and in particular for analogies in nonlinear spaces. In that regard, there are at least two potential directions that could be pursued. One is to directly encode nonlinear structures (such as trees, rings, etc.) with grid cells, to which DPP-A could be applied as described in our model. The TEM model [1] suggests that grid cells in the medial entorhinal cortex may form a basis set that captures structural knowledge for such nonlinear spaces, such as social hierarchies and transitive inference when formalized as a connected graph. Another would be to use eigen-decomposition of the successor representation [2], a learnable predictive representation of possible future states that has been shown by Stachenfeld et al. [3] to provide an abstract structured representation of a space that is analogous to the grid cell code. This general-purpose mechanism could be applied to represent analogies in nonlinear spaces [4], for which there may not be a clear factorization in terms of grid cells (i.e., distinct frequencies and multiple phases within each frequency). Since the DPP-A mechanism, as we have described it, requires representations to be factored in this way, it would need to be modified for such purposes. Either of these approaches, if successful, would allow our model to be extended to domains containing nonlinear forms of structure. To the extent that different coding schemes (i.e., basis sets) are needed for different forms of structure, the question of how these are identified and engaged for use in a given setting is clearly an important one that is not addressed by the current work. We imagine that this is likely subserved by monitoring and selection mechanisms proposed to underlie the capacity for selective attention and cognitive control [5], though the specific computational mechanisms that underlie this function remain an important direction for future research. We have added a discussion of these issues in Section 6 of the updated manuscript.
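
      As a rough illustration of the second direction, the sketch below computes a successor representation for a random walk on a ring and eigen-decomposes it. It is not taken from the manuscript: the ring size, the discount factor gamma, and the function name successor_representation are assumptions chosen for the example.

```python
import numpy as np

def successor_representation(T, gamma):
    """Closed form for a fixed transition matrix T:
    M = sum_t gamma^t T^t = (I - gamma * T)^-1."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

# Unbiased random walk on a ring of n states (illustrative values).
n, gamma = 12, 0.9
T = np.zeros((n, n))
for s in range(n):
    T[s, (s - 1) % n] = 0.5
    T[s, (s + 1) % n] = 0.5

M = successor_representation(T, gamma)

# T is symmetric here, so M is symmetric and eigh applies.
# eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
```

      For this symmetric ring the eigenvectors of M are sinusoids over the states, which is the sense in which such a decomposition is periodic; whether a comparable factorization exists for other structures is exactly the open question raised above.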

      (1) Whittington, J.C., Muller, T.H., Mark, S., Chen, G., Barry, C., Burgess, N. and Behrens, T.E., 2020. The Tolman-Eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5), pp.1249-1263.

      (2) Dayan, P., 1993. Improving generalization for temporal difference learning: The successor representation. Neural computation, 5(4), pp.613-624.

      (3) Stachenfeld, K.L., Botvinick, M.M. and Gershman, S.J., 2017. The hippocampus as a predictive map. Nature neuroscience, 20(11), pp.1643-1653.

      (4) Frankland, S., Webb, T.W., Petrov, A.A., O'Reilly, R.C. and Cohen, J., 2019. Extracting and Utilizing Abstract, Structured Representations for Analogy. In CogSci (pp. 1766-1772).

      (5) Shenhav, A., Botvinick, M.M. and Cohen, J.D., 2013. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2), pp.217-240.

      Biological plausibility of DPP-A

      We appreciate the reviewers’ interest in the biological plausibility of our model, and in particular the question of whether and how DPP-A might be implemented in a neural network. In that regard, Bozkurt et al. [1] recently proposed a biologically plausible neural network algorithm using a weighted similarity matrix approach to implement a determinant maximization criterion, which is the core idea underlying the objective function we use for DPP-A, suggesting that the DPP-A mechanism we describe may also be biologically plausible. This could be tested experimentally by exposing individuals (e.g., rodents or humans) to a task that requires consistent exposure to a subregion, and evaluating the distribution of activity over the grid cells. Our model predicts that high frequency grid cells should increase their firing rate more than low frequency cells, since the high frequency grid cells maximize the determinant of the covariance matrix of the grid cell embeddings. It is also worth noting that Frankland et al. [2] have suggested that the use of DPPs may also help explain a mutual exclusivity bias observed in human word learning and reasoning. While this is not direct evidence of biological plausibility, it is consistent with the idea that the human brain selects representations for processing that maximize the volume of the representational space, which can be achieved by maximizing the DPP-A objective function defined in Equation 6. We have added a comment to this effect in Section 6 of the updated manuscript.

      (1) Bozkurt, B., Pehlevan, C. and Erdogan, A., 2022. Biologically-plausible determinant maximization neural networks for blind separation of correlated sources. Advances in Neural Information Processing Systems, 35, pp.13704-13717.

      (2) Frankland, S. and Cohen, J., 2020. Determinantal Point Processes for Memory and Structured Inference. In CogSci.

      Simplicity of analogical problem and comparison to other models using this task

      First, we would like to point out that analogical reasoning is a signature feature of human cognition, which supports flexible and efficient adaptation to novel inputs that remains a challenge for most current neural network architectures. While humans can exhibit complex and sophisticated forms of analogical reasoning [1, 2, 3], here we focused on a relatively simple form that was inspired by Rumelhart’s parallelogram model of analogy [4,5], which has been used to explain traditional human verbal analogies (e.g., “king is to what as man is to woman?”). Our model, like that one, seeks to explain analogical reasoning in terms of the computation of simple Euclidean distances (i.e., A - B = C - D, where A, B, C, D are vectors in 2D space). We have now noted this in Section 2.1.1 of the updated manuscript. It is worth noting that, despite the seeming simplicity of this construction, we show that standard neural network architectures (e.g., LSTMs and transformers) struggle to generalize on such tasks without the use of the DPP-A mechanism.

      Second, we are not aware of any previous work other than Frankland et al. [6] cited in the first paragraph of Section 2.2.1, that has examined the capacity of neural network architectures to perform even this simple form of analogy. The models in that study were hardcoded to perform analogical reasoning, whereas we trained models to learn to perform analogies. That said, clearly a useful line of future work would be to scale our model further to deal with more complex forms of representation and analogical reasoning tasks [1,2,3]. We have noted this in Section 6 of the updated manuscript.

      (1) Holyoak, K.J., 2012. Analogy and relational reasoning. The Oxford handbook of thinking and reasoning, pp.234-259.

      (2) Webb, T., Fu, S., Bihl, T., Holyoak, K.J. and Lu, H., 2023. Zero-shot visual reasoning through probabilistic analogical mapping. Nature Communications, 14(1), p.5144.

      (3) Lu, H., Ichien, N. and Holyoak, K.J., 2022. Probabilistic analogical mapping with semantic relation networks. Psychological review.

      (4) Rumelhart, D.E. and Abrahamson, A.A., 1973. A model for analogical reasoning. Cognitive Psychology, 5(1), pp.1-28.

      (5) Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

      (6) Frankland, S., Webb, T.W., Petrov, A.A., O'Reilly, R.C. and Cohen, J., 2019. Extracting and Utilizing Abstract, Structured Representations for Analogy. In CogSci (pp. 1766-1772).

      Clarification of DPP-A attentional modulation

      We would like to clarify several concerns regarding the DPP-A attentional modulation. First, we would like to make it clear that ω is not meant to correspond to synaptic weights, and thank the reviewer for noting the possibility for confusion on this point. It is also distinct from a biasing input, which is often added to the product of the input features and weights. Rather, in our model ω is a vector, and diag(ω) converts it into a matrix with ω on the diagonal and zeros in all remaining entries. In Equation 6, diag(ω) is matrix multiplied with the covariance matrix V, which results in elementwise multiplication of ω with the column vectors of V; ω hence acts more like a set of gates. We have noted this in Section 2.2.2 and have changed all instances of “weights (ω)” to “gates (ɡ)” in the updated manuscript. We have also rewritten the definition of Equation 6 and uses of it (as in Algorithm 1) to depict the application of a sigmoid nonlinearity (σ) to the gates, so that the resulting values are always between 0 and 1.
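
      The gating interpretation can be checked numerically. The following is a minimal sketch, not the manuscript's Equation 6: it shows only that multiplying by diag(ɡ) is the same as multiplying ɡ elementwise into every column of V, and that a sigmoid keeps the gates between 0 and 1. The array sizes and the variable names raw, gated and elementwise are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # hypothetical embeddings: 200 samples x 4 cells
V = np.cov(X, rowvar=False)          # 4 x 4 covariance matrix of the embeddings

raw = np.array([-2.0, 0.0, 1.0, 3.0])  # hypothetical unconstrained parameters
g = sigmoid(raw)                        # gates, each strictly between 0 and 1

gated = np.diag(g) @ V               # matrix product with diag(g)
elementwise = g[:, None] * V         # g multiplied elementwise into each column of V
```

      The two results agree entry by entry, which is why the vector behaves as a set of gates on V rather than as a weight matrix or an additive bias.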

      Second, we would like to clarify that we don’t compute the inner product between the gates ɡ and the grid cell embeddings x anywhere in our model. The gates within each frequency were optimized (independent of the task inputs), according to Equation 6, to compute the approximate maximum log determinant of the covariance matrix over the grid cell embeddings individually for each frequency. We then used the grid cell embeddings belonging to the frequency that had the maximum within-frequency log determinant for training the inference module, which always happened to be grid cells within the top three frequencies. Author response image 1 (also added to the Appendix, Section 7.10 of the updated manuscript) shows the approximate maximum log determinant (on the y-axis) for the different frequencies (on the x-axis).

      Author response image 1.

      Approximate maximum log determinant of the covariance matrix over the grid cell embeddings (y-axis) for each frequency (x-axis), obtained after maximizing Equation 6.
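
      The selection step described above (choose the frequency whose within-frequency covariance has the largest log determinant) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses synthetic Gaussian data in place of grid cell embeddings, omits the optimization of the gates in Equation 6, and the names logdet_cov, scales and best are hypothetical.

```python
import numpy as np

def logdet_cov(X):
    """Log-determinant of the sample covariance of X (samples x cells)."""
    sign, logdet = np.linalg.slogdet(np.cov(X, rowvar=False))
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(0)
n_samples, n_phases = 500, 4
scales = [0.5, 1.0, 2.0]   # hypothetical spread of the embeddings per frequency

# One (samples x phases) matrix of synthetic embeddings per frequency.
embeddings = [s * rng.normal(size=(n_samples, n_phases)) for s in scales]

scores = np.array([logdet_cov(X) for X in embeddings])
best = int(np.argmax(scores))   # index of the frequency that would be selected
```

      Using slogdet rather than taking the log of det avoids numerical underflow when the covariance matrix is large, which matters more as the number of cells per frequency grows.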

      Third, we would like to clarify our interpretation of why DPP-A identified grid cell embeddings corresponding to the highest spatial frequencies, and why this produced the best OOD generalization (i.e., extrapolation on our analogy tasks). It is because those grid cell embeddings exhibited greater variance over the training data than the lower frequency embeddings, while at the same time the correlations among those grid cell embeddings were lower than the correlations among the lower frequency grid cell embeddings. The determinant of the covariance matrix of the grid cell embeddings is maximized when the variances of the grid cell embeddings are high (they are “expressive”) and the correlation among the grid cell embeddings is low (they “cover the representational space”). As a result, the higher frequency grid cell embeddings more efficiently covered the representational space of the training data, allowing them to efficiently capture the same relational structure across training and test distributions which is required for OOD generalization. We have added some clarification to the second paragraph of Section 2.2.2 in the updated manuscript. Furthermore, to illustrate this graphically, Author response image 2 (added to the Appendix, Section 7.10 of the updated manuscript) shows the results after the summation of the multiplication of the grid cell embeddings over the 2d space of 1000x1000 locations, with their corresponding gates for 3 representative frequencies (left, middle and right panels showing results for the lowest, middle and highest grid cell frequencies, respectively, of the 9 used in the model), obtained after maximizing Equation 6 for each grid cell frequency. The color code indicates the responsiveness of the grid cells to different X and Y locations in the input space (lighter color corresponding to greater responsiveness). 
      Note that the dark blue area (denoting regions of least responsiveness to any grid cell) is greatest for the lowest frequency and nearly zero for the highest frequency, illustrating that grid cell embeddings belonging to the highest frequency more efficiently cover the representational space, which allows them to capture the same relational structure across training and test distributions, as required for OOD generalization.

      Author response image 2.

      Each panel shows the results after summation of the multiplication of the grid cell embeddings over the 2d space of 1000x1000 locations, with their corresponding gates for a particular frequency, obtained after maximizing Equation 6 for each grid cell frequency. The left, middle, and right panels show results for the lowest, middle, and highest grid cell frequencies, respectively, of the 9 used in the model. Lighter color in each panel corresponds to greater responsiveness of grid cells at that particular location in the 2d space.
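
      The variance and correlation argument made in the third point above can be seen in the two-cell case, where the determinant of the covariance matrix equals the product of the two variances times (1 - ρ²), with ρ the correlation. The sketch below is illustrative only; the standard deviations, correlations and helper names cov_2x2 and logdet are assumptions.

```python
import numpy as np

def cov_2x2(sd1, sd2, rho):
    """Covariance matrix of two cells with given standard deviations and correlation."""
    return np.array([[sd1 ** 2, rho * sd1 * sd2],
                     [rho * sd1 * sd2, sd2 ** 2]])

def logdet(C):
    sign, value = np.linalg.slogdet(C)
    return value

# Low variance and strongly correlated, versus high variance and weakly correlated.
low_var_high_corr = cov_2x2(0.5, 0.5, 0.9)
high_var_low_corr = cov_2x2(2.0, 2.0, 0.1)

# Closed form for the 2 x 2 case: det = sd1^2 * sd2^2 * (1 - rho^2).
closed_form = (2.0 ** 2) * (2.0 ** 2) * (1 - 0.1 ** 2)
```

      Raising either variance or lowering the correlation increases the determinant, so the second matrix has the larger log determinant.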

      Finally, we would like to clarify how the DPP-A attentional mechanism is different from the attentional mechanism in the transformer module, and why both are needed for strong OOD generalization. Use of the standard self-attention mechanism in transformers over the inputs (i.e., A, B, C, and D for the analogy task) in place of DPP-A would lead to weightings of grid cell embeddings over all frequencies and phases. The objective function for the DPP-A represents an inductive bias, that selectively assigns the greatest weight to all grid cell embeddings (i.e., for all phases) of the frequency for which the determinant of the covariance matrix is greatest computed over the training space. The transformer inference module then attends over the inputs with the selected grid cell embeddings based on the DPP-A objective. We have added a discussion of this point in Section 6 of the updated manuscript.

      We would like to thank the reviewers for their recommendations. We have tried our best to incorporate them into our updated manuscript. Below we provide a detailed response to each of the recommendations grouped for each reviewer.

      Reviewer #1 (Recommendations for the authors)

      (1) It would be helpful to see some equations for R in the main text.

      We thank the reviewer for this suggestion. We have now added some equations explaining the working of R in Section 2.2.3 of the updated manuscript.

      (2) Typo: p 11 'alongwith' -> 'along with'

      We have changed all instances of ‘alongwith’ to ‘along with’ in the updated manuscript.

      (3) Presumably, this is related to equivariant ML - it would be helpful to comment on this.

      Yes, this is related to equivariant ML, since the properties of equivariance hold for our model. Specifically, the probability distribution after applying softmax remains the same when the transformation (translation or scaling) is applied to the scores for each of the answer choices obtained from the output of the inference module, and when the same transformation is applied to the stimuli for the task and all the answer choices before presenting as input to the inference module to obtain the scores. We have commented on this in Section 2.2.3 of the updated manuscript.

      Reviewer #2 (Recommendations for the authors)

      (1) Page 2 - "Webb et al." temporal context - they should also cite and compare this to work by Marc Howard on generalization based on multi-scale temporal context.

      While we appreciate the important contributions that have been made by Marc Howard and his colleagues to temporal coding and its role in episodic memory and hippocampal function, we would like to clarify that his temporal context model is unrelated to the temporal context normalization developed by Webb et al. (2020) and mentioned on Page 2. The former (Temporal Context Model) is a computational model that proposes a role for temporal coding in the functions of the medial temporal lobe in support of episodic recall and spatial navigation. The latter (temporal context normalization) is a normalization procedure proposed for use in training a neural network, similar to batch normalization [1], in which tensor normalization is applied over the temporal instead of the batch dimension, which is shown to help with OOD generalization. We apologize for any confusion engendered by the similarity of these terms and for our failure to clarify the difference between them, which we have now attempted to do in a footnote on Page 2.

      (1) Ioffe, S. and Szegedy, C., 2015, June. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456). PMLR.
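
      The difference between the two normalization axes can be sketched as follows. This is a simplified illustration and not the implementation of Webb et al. (2020): it omits any learned gain and shift parameters, and the tensor layout (batch, time, features) and the helper name normalize are assumptions.

```python
import numpy as np

def normalize(x, axis, eps=1e-8):
    """Z-score x along the given axis."""
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 6, 4))   # (batch, time, features)

batch_normed = normalize(x, axis=0)      # statistics computed across the batch
temporal_normed = normalize(x, axis=1)   # statistics computed across time, per sequence
```

      With the temporal axis, each sequence is normalized relative to its own context, so a sequence whose values are shifted or rescaled as a whole yields nearly the same normalized input.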

      (2) page 3 - "known to be implemented in entorhinal" - It's odd that they seem to avoid citing the actual biology papers on grid cells. They should cite more of the grid cell recording papers when they mention the entorhinal cortex (i.e. Hafting et al., 2005; Barry et al., 2007; Stensola et al., 2012; Giocomo et al., 2011; Brandon et al., 2011).

      We have now cited the references mentioned below, on page 3 after the phrase “known to be implemented in entorhinal cortex”.

      (1) Barry, C., Hayman, R., Burgess, N. and Jeffery, K.J., 2007. Experience-dependent rescaling of entorhinal grids. Nature neuroscience, 10(6), pp.682-684.

      (2) Stensola, H., Stensola, T., Solstad, T., Frøland, K., Moser, M.B. and Moser, E.I., 2012. The entorhinal grid map is discretized. Nature, 492(7427), pp.72-78.

      (3) Giocomo, L.M., Hussaini, S.A., Zheng, F., Kandel, E.R., Moser, M.B. and Moser, E.I., 2011. Grid cells use HCN1 channels for spatial scaling. Cell, 147(5), pp.1159-1170.

      (4) Brandon, M.P., Bogaard, A.R., Libby, C.P., Connerney, M.A., Gupta, K. and Hasselmo, M.E., 2011. Reduction of theta rhythm dissociates grid cell spatial periodicity from directional tuning. Science, 332(6029), pp.595-599.

      (3) To enhance the connection to biological systems, they should cite more of the experimental and modeling work on grid cell coding (for example on page 2 where they mention relational coding by grid cells). Currently, they tend to cite studies of grid cell relational representations that are very indirect in their relationship to grid cell recordings (i.e. indirect fMRI measures by Constantinescu et al., 2016 or the very abstract models by Whittington et al., 2020). They should cite more papers on actual neurophysiological recordings of grid cells that suggest relational/metric representations, and they should cite more of the previous modeling papers that have addressed relational representations. This could include work on using grid cell relational coding to guide spatial behavior (e.g. Erdem and Hasselmo, 2014; Bush, Barry, Manson, Burgess, 2015). This could also include other papers on the grid cell code beyond the paper by Wei et al., 2015 - they could also cite work on the efficiency of coding by Sreenivasan and Fiete and by Mathis, Herz, and Stemmler.

      We thank the reviewer for bringing the additional references to our attention. We have cited the references mentioned below on page 2 of the updated manuscript.

      (1) Erdem, U.M. and Hasselmo, M.E., 2014. A biologically inspired hierarchical goal directed navigation model. Journal of Physiology-Paris, 108(1), pp.28-37.

      (2) Sreenivasan, S. and Fiete, I., 2011. Grid cells generate an analog error-correcting code for singularly precise neural computation. Nature neuroscience, 14(10), pp.1330-1337.

      (3) Mathis, A., Herz, A.V. and Stemmler, M., 2012. Optimal population codes for space: grid cells outperform place cells. Neural computation, 24(9), pp.2280-2317.

      (4) Bush, D., Barry, C., Manson, D. and Burgess, N., 2015. Using grid cells for navigation. Neuron, 87(3), pp.507-520.

      (4) Page 3 - "Determinantal Point Processes (DPPs)" - it is rather annoying that DPP is defined after DPP-A is defined. There ought to be a spot where the definition of DPP-A is clearly stated in a single location.

      We agree it makes more sense to define Determinantal Point Process (DPP) before DPP-A. We have now rephrased the sentences accordingly. In the “Abstract”, the sentence now reads “Second, we propose an attentional mechanism that operates over the grid cell code using Determinantal Point Process (DPP), which we call DPP attention (DPP-A) - a transformation that ensures maximum sparseness in the coverage of that space.” We have also modified the second paragraph of the “Introduction”. The modified portion now reads “b) an attentional objective inspired from Determinantal Point Processes (DPPs), which are probabilistic models of repulsion arising in quantum physics [1], to attend to abstract representations that have maximum variance and minimum correlation among them, over the training data. We refer to this as DPP attention or DPP-A.” Due to this change, we removed the last sentence of the fifth paragraph of the “Introduction”.

      (1) Macchi, O., 1975. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1), pp.83-122.

      (5) Page 3 - "the inference module R" - there should be some discussion about how this component using LSTM or transformers could relate to the function of actual brain regions interacting with entorhinal cortex. Or if there is no biological connection, they should state that this is not seen as a biological model and that only the grid cell code is considered biological.

      While we agree that the model is not intended to be specific about the implementation of the R module, we assume that, as a standard deep learning component, it is likely to map onto neocortical structures that interact with the entorhinal cortex and, in particular, regions of the prefrontal-posterior parietal network widely believed to be involved in abstract relational processes [1,2,3,4]. In particular, the role of the prefrontal cortex in the encoding and active maintenance of abstract information needed for task performance (such as rules and relations) has often been modeled using gated recurrent networks, such as LSTMs [5,6], and the posterior parietal cortex has long been known to support “maps” that may provide an important substrate for computing complex relations [4]. We have added some discussion about this in Section 2.2.3 of the updated manuscript.

      (1) Waltz, J.A., Knowlton, B.J., Holyoak, K.J., Boone, K.B., Mishkin, F.S., de Menezes Santos, M., Thomas, C.R. and Miller, B.L., 1999. A system for relational reasoning in human prefrontal cortex. Psychological science, 10(2), pp.119-125.

      (2) Christoff, K., Prabhakaran, V., Dorfman, J., Zhao, Z., Kroger, J.K., Holyoak, K.J. and Gabrieli, J.D., 2001. Rostrolateral prefrontal cortex involvement in relational integration during reasoning. Neuroimage, 14(5), pp.1136-1149.

      (3) Knowlton, B.J., Morrison, R.G., Hummel, J.E. and Holyoak, K.J., 2012. A neurocomputational system for relational reasoning. Trends in cognitive sciences, 16(7), pp.373-381.

      (4) Summerfield, C., Luyckx, F. and Sheahan, H., 2020. Structure learning and the posterior parietal cortex. Progress in neurobiology, 184, p.101717.

      (5) Frank, M.J., Loughry, B. and O’Reilly, R.C., 2001. Interactions between frontal cortex and basal ganglia in working memory: a computational model. Cognitive, Affective, & Behavioral Neuroscience, 1, pp.137-160.

      (6) Braver, T.S. and Cohen, J.D., 2000. On the control of control: The role of dopamine in regulating prefrontal function and working memory. Control of cognitive processes: Attention and performance XVIII.

      (6) Page 4 - "Learned weighting w" - it is somewhat confusing to use "w" as that is commonly used for synaptic weights, whereas I understand this to be an attentional modulation vector with the same dimensionality as the grid cell code. It seems more similar to a neural network bias input than a weight matrix.

      We refer to the first paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (7) Page 4 - "parameterization of w... by two loss functions over the training set." - I realize that this has been stated here, but to emphasize the significance to a naïve reader, I think they should emphasize that the learning is entirely focused on the initial training space, and there is NO training done in the test spaces. It's very impressive that the parameterization is allowing generalization to translated or scaled spaces without requiring ANY training on the translated or scaled spaces.

      We have added the sentence “Note that learning of parameter occurs only over the training space and is not further modified during testing (i.e. over the test spaces)” to the updated manuscript.

      (8) Page 4 - "The first," - This should be specific - "The first loss function"

      We have changed it to “The first loss function” in the updated manuscript.

      (9) Page 4 - The analogy task seems rather simplistic when first presented (i.e. just a spatial translation to different parts of a space, which has already been shown to work in simulations of spatial behavior such as Erdem and Hasselmo, 2014 or Bush, Barry, Manson, Burgess, 2015). To make the connection to analogy, they might provide a brief mention of how this relates to the analogy space created by word2vec applied to traditional human verbal analogies (i.e. king-man+woman=queen).

      We agree that the analogy task is simple, and recognize that grid cells can be used to navigate to different parts of space over which the test analogies are defined when those are explicitly specified, as shown by Erdem and Hasselmo (2014) and Bush, Barry, Manson, and Burgess (2015). However, for the analogy task, the appropriate set of grid cell embeddings must be identified that capture the same relational structure between training and test analogies to demonstrate strong OOD generalization, and that is achieved by the attentional mechanism DPP-A. As suggested by the reviewer’s comment, our analogy task is inspired by Rumelhart’s parallelogram model of analogy [1,2] (and therefore similar to traditional human verbal analogies) inasmuch as it involves differences (i.e., A - B = C - D, where A, B, C, D are vectors in 2D space). We have now noted this in Section 2.1.1 of the updated manuscript.

      (1) Rumelhart, D.E. and Abrahamson, A.A., 1973. A model for analogical reasoning. Cognitive Psychology, 5(1), pp.1-28.

      (2) Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

      (10) Page 5 - The variable "KM" is a bit confusing when it first appears. It would be good to re-iterate that K and M are separate points and KM is the vector between these points.

      We apologize for the confusion on this point. KM is meant to refer to an integer value, obtained by multiplying K and M, which is added to both dimensions of A, B, C and D, which are points in ℤ², to translate them to a different region of the space. K is an integer value ranging from 1 to 9 and M is also an integer value denoting the size of the training region, which in our implementation is 100. We have clarified this in Section 2.1.1 of the updated manuscript.
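
      A small worked example of the translation by KM may help. It is a sketch only: the specific points and the helper names is_analogy and translate are hypothetical, while M = 100 and the range of K follow the description above.

```python
import numpy as np

def is_analogy(A, B, C, D):
    """Parallelogram test: the analogy holds when A - B equals C - D."""
    return np.array_equal(A - B, C - D)

def translate(points, K, M=100):
    """Add the integer K * M to both dimensions of every point."""
    return [p + K * M for p in points]

# Hypothetical points inside the training region.
A, B, C, D = (np.array(p) for p in [(10, 20), (4, 15), (60, 70), (54, 65)])

# The same analogy moved to a test region with K = 3 (an offset of 300).
A_t, B_t, C_t, D_t = translate([A, B, C, D], K=3)
```

      Because the same offset is added to all four points, the differences A - B and C - D are unchanged, so the translated problem has the same relational structure even though none of its coordinates appear in the training region.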

      (11) Page 5 - "two continuous dimensions (Constantinescu et al.)" - this ought to give credit to the original study showing the abstract six-fold rotational symmetry for spatial coding (Doeller, Barry and Burgess).

      We have now cited the original work by Doeller et al. [1] along with Constantinescu et al. (2016) in the updated manuscript after the phrase “two continuous dimensions” on page 5.

      (1) Doeller, C.F., Barry, C. and Burgess, N., 2010. Evidence for grid cells in a human memory network. Nature, 463(7281), pp.657-661.

      (12) Page 6 - Np=100. This is done later, but it would be clearer if they right away stated that Np*Nf=900 in this first presentation.

      We have now added this sentence after Np=100. “Hence Np*Nf=900, which denotes the number of grid cells.”

      (13) Page 6 - They provide theorem 2.1 on the determinant of the covariance matrix of the grid code, but they ought to cite this the first time this is mentioned.

      We have cited Gillenwater et al. (2012) before mentioning theorem 2.1. The sentence just before that reads “We use the following theorem from Gillenwater et al. (2012) to construct :”

      (14) Page 6 - It would greatly enhance the impact of the paper if they could give neuroscientists some sense of how the maximization of the determinant of the covariance matrix of the grid cell code could be implemented by a biological circuit. OR at least to show an example of the output of this algorithm when it is used as an inner product with the grid cell code. This would require plotting the grid cell code in the spatial domain rather than the 900 element vector.

      We refer to our response above to the topic “Biological plausibility of DPP-A” and second, third, and fourth paragraphs of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contain our responses to this issue.

      (15) Page 6 - "That encode higher spatial frequencies..." This seems intuitive, but it would be nice to give a more intuitive description of how this is related to the determinant of the covariance matrix.

      We refer to the third paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (16) Page 7 - log of both sides... Nf is number of frequencies... Would be good to mention here that they are referring to equation 6 which is only mentioned later in the paragraph.

      As suggested, we now refer to Equation 6 in the updated manuscript. The sentence now reads “This is achieved by maximizing the determinant of the covariance matrix over the within frequency grid cell embeddings of the training data, and Equation 6 is obtained by applying the log on both sides of Theorem 2.1, and in our case where refers to grid cells of a particular frequency.”

      (17) Page 7 - Equation 6 - They should discuss how this is proposed to be implemented in brain circuits.

      We refer to our response above to the topic “Biological plausibility of DPP-A” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (18) Page 9 - "egeneralize" - presumably this is a typo?

      Yes. We have corrected it to “generalize” in the updated manuscript.

      (19) Page 9 - "biologically plausible encoding scheme" - This is valid for the grid cell code, but they should be clear that this is not valid for other parts of the model, or specify how other parts of the model such as DPP-A could be biologically plausible.

      We refer to our response above to the topic “Biological plausibility of DPP-A” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (20) Page 12 - Figure 7 - comparison to one-hots or smoothed one-hots. The text should indicate whether the smoothed one-hots are similar to place cell coding. This is the most relevant comparison of coding for those knowledgeable about biological coding schemes.

      Yes, smoothed one-hots are similar to place cell coding. We now mention this in Section 5.3 of the updated manuscript.

      (21) Page 12 - They could compare to a broader range of potential biological coding schemes for the overall space. This could include using coding based on the boundary vector cell coding of the space, band cell coding (one dimensional input to grid cells), or egocentric boundary cell coding.

      We appreciate these useful suggestions, which we now mention as potentially valuable directions for future work in the second paragraph of Section 6 of the updated manuscript.

      (22) Page 13 - "transformers are particularly instructive" - They mention this as a useful comparison, but they might discuss further why a much better function is obtained when attention is applied to the system twice (once by DPP-A and then by a transformer in the inference module).

      We refer to the last paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (23) Page 13 - "Section 5.1 for analogy and Section 5.2 for arithmetic" - it would be clearer if they perhaps also mentioned the specific figures (Figure 4 and Figure 6) presenting the results for the transformer rather than the LSTM.

      We have now rephrased to also refer to the figures in the updated manuscript. The phrase now reads “a transformer (Figure 4 in Section 5.1 for analogy and Figure 6 in Section 5.2 for arithmetic tasks) failed to achieve the same level of OOD generalization as the network that used DPP-A.”

      (24) Page 14 - "statistics of the training data" - The most exciting feature of this paper is that learning during the training space analogies can so effectively generalize to other spaces based on the right attention DPP-A, but this is not really made intuitive. Again, they should illustrate the result of the xT w inner product to demonstrate why this works so effectively!

      We refer to the second, third, and fourth paragraphs of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (25) Bibliography - Silver et al., go paper - journal name "nature" should be capitalized. There are other journal titles that should be capitalized. Also, I believe eLife lists family names first.

      We have made the changes to the bibliography of the updated manuscript suggested by the reviewer.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the editors and the reviewers for their time and constructive comments, which helped us to improve our manuscript “The Hungry Lens: Hunger Shifts Attention and Attribute Weighting in Dietary Choice” substantially. In the following we address the comments in depth:

      R1.1: First, in examining some of the model fits in the supplements, e.g. Figures S9, S10, S12, S13, it looks like the "taste weight" parameter is being constrained below 1. Theoretically, I understand why the authors imposed this constraint, but it might be unfairly penalizing these models. In theory, the taste weight could go above 1 if participants had a negative weight on health. This might occur if there is a negative correlation between attractiveness and health and the taste ratings do not completely account for attractiveness. I would recommend eliminating this constraint on the taste weight.

      We appreciate the reviewer’s suggestion to test a multi-attribute attentional drift-diffusion model (maaDDM) that does not constrain the taste and health weights to the range of 0 and 1. We tested two versions of such a model. First, we removed the phi-transformation, allowing the weight to take on any value (see Author response image 1). The results closely matched those found in the original model. Partially consistent with the reviewer’s comment, the health weight became slightly negative in some individuals in the hungry condition. However, this model had convergence issues with a maximal Rhat of 4.302. Therefore, we decided to run a second model in which we constrained the weights to be between -1 and 2. Again, we obtained effects that matched the ones found in the original model (see Author response image 2), but again we had convergence issues. These convergence issues could arise from the fact that the models become almost unidentifiable when both the attention parameters (theta and phi) and the weight parameters are unconstrained.
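
      The three weight parameterizations discussed above can be sketched as follows. This is a hedged illustration: it assumes that the "phi-transformation" refers to the standard normal cumulative distribution function (as the `phi()` function does in JAGS, and not the attention parameter phi), and the linear rescaling to the range -1 to 2 is an assumption about how such a constraint could be imposed, not a description of the fitted model. The function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def constrained_weight(raw, lower=0.0, upper=1.0):
    """Map an unconstrained value onto the interval [lower, upper]."""
    return lower + (upper - lower) * phi(raw)

raw_values = [-3.0, -1.0, 0.0, 1.0, 3.0]

# Original constraint: weight between 0 and 1.
print([round(constrained_weight(r), 3) for r in raw_values])

# Relaxed constraint (assumed rescaling): weight between -1 and 2.
print([round(constrained_weight(r, -1.0, 2.0), 3) for r in raw_values])

# Fully unconstrained: the raw value is used directly as the weight.
print(raw_values)
```

      Under the relaxed range, a taste weight above 1 implies a negative health weight if the two weights are assumed to sum to 1, which is the case the reviewer raised.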

      Author response image 1.

      Author response image 2.

      R1.2: Second, I'm not sure about the mediation model. Why should hunger change the dwell time on the chosen item? Shouldn't this model instead focus on the dwell time on the tasty option?

      We thank the reviewer for spotting this inconsistency. In our GLMMs and the mediation model, we indeed used the proportion of dwell time on the tasty option as predictors and mediator, respectively. The naming and description of this variable was inconsistent in our manuscript and the supplements. We have now rephrased both consistently.

      R1.3: Third, while I do appreciate the within-participant design, it does raise a small concern about potential demand effects. I think the authors' results would be more compelling if they replicated when only analyzing the first session from each participant. Along similar lines, it would be useful to know whether there was any effect of order.

      R3.2: On the interpretation side, previous work has shown that beliefs about the nourishing and hunger-killing effectiveness of drinks or substances influence subjective and objective markers of hunger, including value-based dietary decision-making, and attentional mechanisms approximated by computational models and the activation of cognitive control regions in the brain. The present study shows differences between the protein shake and a natural history condition (fasted, state). This experimental design, however, cannot rule between alternative interpretations of observed effects. Notably, effects could be due to (a) the drink's active, nourishing ingredients, (b) consuming a drink versus nothing, or (c) both. […]

      R3 Recommendation 1:

      Therefore, I recommend discussing potential confounds due to expectancy or placebo effects on hunger ratings, dietary decision-making, and attention. […] What were verbatim instructions given to the participants about the protein shake and the fasted, hungry condition? Did participants have full knowledge about the study goals (e.g. testing hunger versus satiation)? Adding the instructions to the supplement is insightful for fully harnessing the experimental design and frame.

      Both Reviewer 1 and Reviewer 3 raise potential demand/expectancy effects, which we addressed in several ways. First, we have translated participants’ instructions, in which we transparently communicate the two conditions to the participants, and added them to the supplements (SOM 6). Second, we have added a paragraph in the discussion section addressing potential expectancy/demand effects in our design:

      “The present results and supplementary analyses clearly support the two-fold effect of hunger state on the cognitive mechanisms underlying choice. However, we acknowledge potential demand effects arising from the within-subject Protein-shake manipulation. A recent study (Khalid et al., 2024) showed that labeling water to decrease or increase hunger affected participants’ subsequent hunger ratings and food valuations. For instance, participants expecting the water to decrease hunger showed less wanting for food items. DDM modeling suggested that this placebo manipulation affected both drift rate and starting point. The absence of a starting point effect in our data speaks against any prior bias in participants due to any demand effects. Yet, we cannot rule out that such effects affected the decision-making process, for example by increasing the taste weight (and thus the drift rate) in the hungry condition.”

      Third, we followed Reviewer 1’s suggestion and tested whether the order of testing affected the results. We did so by adding “order” to the main choice and response time (RT) GLMMs. We found an effect of order neither on choice (β_order = -0.001, SE = 0.163, p<.995) nor on RT (β_order = 0.106, SE = 0.205, p<.603), and the original effects remained stable (see Author response table 1a and Author response table 2a below). Further, we used two ANOVAs to compare models with and without the predictor “order”. The ANOVAs indicated that GLMMs without “order” better explained choice and RT (see Author response table 1b and Author response table 2b). Taken together, these results suggest that demand effects played a negligible role in our study.
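
      A comparison between nested models with and without a single predictor such as “order” can be illustrated with a likelihood-ratio test. This is a hedged sketch: it assumes the comparison amounts to a likelihood-ratio test on one degree of freedom, the function name is hypothetical, and the log-likelihood values are placeholders, not results from the study.

```python
import math

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Return (statistic, p_value) for nested models differing by one parameter."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # For a chi-square variable with 1 degree of freedom,
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods (placeholders, not values from the study).
loglik_without_order = -1520.40
loglik_with_order = -1520.35

statistic, p_value = likelihood_ratio_test_1df(loglik_without_order, loglik_with_order)
print(f"LR statistic = {statistic:.3f}, p = {p_value:.3f}")
```

      When the additional predictor barely changes the log-likelihood, the test gives no support for retaining it, and the simpler model is preferred.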

      Author response table 1.

      a) GLMM: Results of Tasty vs Healthy Choice Given Condition, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximation. Model equation: choice ~ condition + scale(rel_taste_DT) + order + (1 + condition | subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference

      b) Model Comparison

      Author response table 2.

      a) GLMM: Response Time Given Condition, Choice, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximation. Model equation: RT ~ choice + condition + scale(rel_taste_DT) + order + choice * scale(rel_taste_DT) + (1 + condition | subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference

      b) Model Comparison

      R1.4: Fourth, the authors report that tasty choices are faster. Is this a systematic effect, or simply due to the fact that tasty options were generally more attractive? To put this in the context of the DDM, was there a constant in the drift rate, and did this constant favor the tasty option?

      We thank the reviewer for their observant remark about faster tasty choices and potential links to the drift rate. While our starting point models show that there might be a small starting point bias towards the taste boundary, which would result in faster tasty decisions, we took a closer look at the simulated value differences as obtained in our posterior predictive checks to see if the drift rate was systematically more extreme for tasty choices (Author response image 3). In line with the reviewer’s suggestion that tasty options were generally more attractive, tasty decisions were associated with higher value differences (i.e., further away from 0) and consequently with faster decisions. This indicates that the main reason for faster tasty choices was a higher drift rate in those trials (as a consequence of the combination of attribute weights and attribute values rather than “a constant in the drift rate”), whereas a strong starting point bias played only a minor role.
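
      The link between larger value differences and faster decisions can be illustrated with a simple simulation. This is a minimal sketch, not the maaDDM reported in the paper: it assumes a drift rate that is a weighted sum of taste and health differences without attentional discounting, symmetric boundaries, and an unbiased starting point, and all function names and parameter values are hypothetical.

```python
import numpy as np

def drift_rate(taste_diff, health_diff, w_taste, w_health):
    """Drift as a weighted sum of attribute value differences (simplified)."""
    return w_taste * taste_diff + w_health * health_diff

def simulate_ddm(drift, boundary=1.0, noise=1.0, dt=0.001,
                 max_time=5.0, n_trials=2000, seed=0):
    """Simulate a diffusion process between -boundary and +boundary.

    Returns decision times (NaN if no boundary was reached in max_time)
    and a boolean array marking upper-boundary choices.
    """
    rng = np.random.default_rng(seed)
    evidence = np.zeros(n_trials)
    rts = np.full(n_trials, np.nan)
    upper = np.zeros(n_trials, dtype=bool)
    active = np.ones(n_trials, dtype=bool)
    n_steps = int(round(max_time / dt))
    for step in range(1, n_steps + 1):
        if not active.any():
            break
        # Euler step for the trials that have not yet reached a boundary.
        increments = drift * dt + noise * np.sqrt(dt) * rng.standard_normal(active.sum())
        evidence[active] += increments
        crossed = active & (np.abs(evidence) >= boundary)
        rts[crossed] = step * dt
        upper[crossed] = evidence[crossed] > 0
        active &= ~crossed
    return rts, upper

# Hypothetical value differences (tasty minus healthy option) and weights.
weak = drift_rate(taste_diff=0.5, health_diff=-0.5, w_taste=0.8, w_health=0.2)
strong = drift_rate(taste_diff=2.0, health_diff=-0.5, w_taste=0.8, w_health=0.2)

rt_weak, upper_weak = simulate_ddm(weak)
rt_strong, upper_strong = simulate_ddm(strong)
print(np.nanmean(rt_weak), np.nanmean(rt_strong))
```

      In this sketch, the trials with the larger weighted value difference reach a boundary sooner on average, without any constant in the drift rate or starting point bias.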

      Author response image 3.

      Note. Value Difference as obtained from Posterior Predictive Checks of the maaDDM2𝜙 in hungry and sated condition for healthy (green) and tasty (orange) choices.

      R1.5: Fifth, I wonder about the mtDDM. What are the units on the "starting time" parameters? Seconds? These seem like minuscule effects. Do they align with the eye-tracking data? In other words, which attributes did participants look at first? Was there a correlation between the first fixations and the relative starting times? If not, does that cast doubt on the mtDDM fits? Did the authors do any parameter recovery exercises on the mtDDM?

      We thank Reviewer 1 for their observant remarks about the mtDDM. In line with their suggestion, we have performed a parameter recovery which led to a good recovery of all parameters except relative starting time (rst). In addition, we had convergence issues of rst as revealed by parameter Rhats around 20. Together these results indicate potential limitations of the mtDDM when applied to tasks with substantially different visual representations of attributes leading to differences in dwell time for each attribute (see Figure 3b and Figure S6b). We have therefore decided not to report the mtDDM in the main paper, only leaving a remark about convergence and recovery issues.
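
      For readers unfamiliar with the Rhat diagnostic mentioned above, the following minimal sketch computes the classic Gelman-Rubin statistic from several chains. It is an assumption that this classic form approximates what the fitting software reports (many packages use a split-chain or rank-normalized variant), and the chains below are synthetic.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic Gelman-Rubin Rhat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()            # mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

# Synthetic chains, for illustration only.
rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))                           # chains agree
stuck = mixed + np.array([[0.0], [5.0], [10.0], [15.0]])     # chains disagree

print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```

      Values close to 1 indicate that the chains sample the same distribution, whereas values far above 1 indicate that they have settled in different regions, as can happen when a parameter is poorly identified.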

      R2: My main criticism, which doesn't affect the underlying results, is that the labeling of food choices as being taste- or health-driven is misleading. Participants were not cued to select health vs taste. Studies in which people were cued to select for taste vs health exist (and are cited here). Also, the label "healthy" is misleading, as here it seems to be strongly related to caloric density. A high-calorie food is not intrinsically unhealthy (even if people rate it as such). The suggestion that hunger impairs making healthy decisions is not quite the correct interpretation of the results here (even though everyone knows it to be true). Another interpretation is that hungry people in negative calorie balance simply prefer more calories.

      First, we agree with the reviewer that it should be tested to what extent participants’ choice behavior can be reduced to contrasting taste vs. health aspects of their dietary decisions (but note that prior to making decisions, they were asked to rate these aspects and thus likely primed to consider them in the choice task). Having this question in mind, we performed several analyses to demonstrate the suitability of framing decisions as contrasting taste vs. health aspects (including the PCA reported in the Supplemental Material).

      Second, we agree with the reviewer that, despite a negative correlation between caloric density and health (Author response image 4), high-caloric items are not intrinsically unhealthy. This may apply only to two stimuli in our study (nuts and dried fruit), which our participants also recognized as such.

      Finally, Reviewer 2’s alternative explanation, that hungry individuals prefer more calories, is tested in SOM5. In line with the reviewer’s interpretation, we show that hungry individuals are indeed more likely to select higher caloric options. This effect is even stronger than the effect of hunger state on tasty vs healthy choice. However, in this paper we were interested in the effect of hunger state on tasty vs healthy decisions, a contrast that is often used in modeling studies (e.g., Barakchian et al., 2021; Maier et al., 2020; Rramani et al., 2020; Sullivan & Huettel, 2021). In sum, we agree with Reviewer 2 in all aspects and have tested and provided evidence for their interpretation, which we do not see as conflicting with ours.

      Author response image 4.

      Note. Strong negative correlation between health ratings and objective caloric content in both the hungry (r = -.732, t(64) = -8.589, p<.001) and the sated condition (r = -.731, t(64) = -8.569, p<.001).
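
      The reported t-values follow from the correlation coefficients and the degrees of freedom via the standard formula t = r * sqrt(df) / sqrt(1 - r^2). The short check below uses the rounded coefficients from the note, so small deviations from the reported t-values are expected; the function name is hypothetical.

```python
import math

def t_from_r(r, df):
    """t-statistic for a Pearson correlation r with df = n - 2 degrees of freedom."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

t_hungry = t_from_r(-0.732, 64)
t_sated = t_from_r(-0.731, 64)
print(round(t_hungry, 2), round(t_sated, 2))
```

      With 64 degrees of freedom, the correlations are based on 66 observations, which matches the 66 food images in the stimulus set.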

      R3.1: On the positioning side, it does not seem like a 'bad' decision to replenish energy states when hungry by preferring tastier, more often caloric options. In this sense, it is unclear whether the observed behavior in the fasted state is a fallacy or a response to signals from the body. The introduction does mention these two aspects of preferring more caloric food when hungry. However, some ambiguity remains about whether the study results indeed reflect suboptimal choice behavior or a healthy adaptive behavior to restore energy stores.

      We thank Reviewer 3 for this remark, which encouraged us to interpret the results also from a slightly different perspective. We agree that choosing tasty over healthy options under hunger may be evolutionarily adaptive. We have now extended a paragraph in our discussion linking the cognitive mechanisms to neurobiological mechanisms:

      “From a neurobiological perspective, both homeostatic and hedonic mechanisms drive eating behaviour. While homeostatic mechanisms regulate eating behaviour based on energy needs, hedonic mechanisms operate independent of caloric deficit (Alonso-Alonso et al., 2015; Lowe & Butryn, 2007; Saper et al., 2002). Participants’ preference for tasty high caloric food options in the hungry condition aligns with a drive for energy restoration and could thus be taken as an adaptive response to signals from the body. On the other hand, our data shows that participants preferred less healthy options also in the sated condition. Here, hedonic drivers could predominate indicating potentially maladaptive decision-making that could lead to adverse health outcomes if sustained. Notably, our modeling analyses indicated that participants in the sated condition showed reduced attentional discounting of health information, which poses potential for attention-based intervention strategies to counter hedonic hunger. This has been investigated for example in behavioral (Barakchian et al., 2021; Bucher et al., 2016; Cheung et al., 2017; Sullivan & Huettel, 2021), eye-tracking (Schomaker et al., 2022; Vriens et al., 2020) and neuroimaging studies (Hare et al., 2011; Hutcherson & Tusche, 2022) showing that focusing attention on health aspects increased healthy choice. For example, Hutcherson and Tusche (2022) compellingly demonstrated that the mechanism through which health cues enhance healthy choice is shaped by increased value computations in the dorsolateral prefrontal cortex (dlPFC) when cue and choice are conflicting (i.e., health cue, tasty choice). In the context of hunger, these findings together with our analyses suggest that drawing people’s attention towards health information will promote healthy choice by mitigating the increased attentional discounting of such information in the presence of tempting food stimuli.”

      Recommendations for the authors:

      R1: The Results section needs to start with a brief description of the task. Otherwise, the subsequent text is difficult to understand.

      We included a paragraph at the beginning of the results section briefly describing the experimental design.

      R1/R2: In Figure 1a it might help the reader to have a translation of the rating scales in the figure legend.

      We have implemented an English rating scale in Figure 1a.

      R2: Were the ratings redone at each session? E.g. were all tastiness ratings for the sated session made while sated? This is relevant as one would expect the ratings of tastiness and wanting to be affected by the current fed state.

      The ratings were done at the respective sessions. As shown in S3a there is a high correlation of taste ratings across conditions. We decided to take the ratings of the respective sessions (rather than mean ratings across sessions) to define choice and taste/health value in the modeling analyses, for several reasons. First, by using mean ratings we might underestimate the impact of particularly high or low ratings that drove choice in the specific session (regression to the mean). Second, for the modeling analysis in particular, we want to model a decision-making process at a particular moment in time. Consequently, the subjective preferences in that moment are more accurate than mean preferences.

      R2: It would be helpful to have a diagram of the DDM showing the drifting information to the boundary, and the key parameters of the model (i.e. showing the nDT, drift rate, boundary, and other parameters). (Although it might be tricky to depict all 9 models).

      We thank the reviewer for their recommendation and have created Figure 6, which illustrates the decision-making process as depicted by the maaDDM2phi.

      R3.1: Past work has shown that prior preferences can bias/determine choices. This effect might have played a role during the choice task, which followed wanting, taste, health, and calorie ratings during which participants might have already formed their preferences. What are the authors' positions on such potential confound? How were the food images paired for the choice task in more detail?

      The data reported here were part of a larger experiment. Next to the food rating and choice task, participants also completed a social preference rating and choice task, as well as rating and choice tasks for intertemporal discounting. These tasks were counterbalanced such that first the three rating tasks were completed in counterbalanced order and second the three choice tasks were completed in the same order (e.g. food rating, social rating, intertemporal rating; food choice, social choice, intertemporal choice). This means that there were always two other tasks between the food rating and food choice task. In addition to the temporal delay between rating and choice tasks, our modeling analyses revealed that models including a starting point bias performed worse than those without the bias. Although we cannot rule out that participants might occasionally have tried to make their decision before the actual task (e.g., by keeping their most/least preferred option in mind and then automatically choosing/rejecting it in the choice task), we think that both our design and our modeling analyses speak against any systematic bias of preference in our choice task. The options were paired such that approximately half of the trials were random, while for the other half one option was rated healthier and the other option was rated tastier (e.g., Sullivan & Huettel, 2021).

      R3.2: In line with this thought, theoretically, the DDMs could also be fitted to reaction times and wanting ratings (binarized). This could be an excellent addition to corroborate the findings for choice behavior.

      We have implemented several alternative modeling analyses, including taste vs health as defined by Nutri-Score (Table S12 and Figures S22-S30) and higher wanted choice vs healthy choice (Table S13; Figure S30-34). Indeed, these models corroborate those reported in the main text demonstrating the robustness of our findings.

      R3.3: The principal component analysis was a good strategy for reducing the attribute space (taste, health, wanting, calories, Nutriscore, objective calories) into two components. Still, somehow, this part of the results added confusion to harnessing in which of the analyses the health attribute corresponded only to the healthiness ratings and taste to the tastiness ratings and if and when the components were used as attributes. This source of confusion could be mitigated by more clearly stating what health and taste corresponded to in each of the analyses.

      We thank the reviewer for this recommendation and have now reported the PCA before reporting the behavioural results to clarify that choices are binarized based on participants’ taste and health ratings, rather than the composite scores. We have chosen this approach, as it is closer to our hypotheses and improves interpretability.

      R3.4: From the methods, it seems that 66 food images were used, and 39 fell into A, B, C, and D Nutriscores. How were the remaining 27 images selected, and how healthy and tasty were the food stimuli overall?

      The selection of food stimuli was done in three steps. First, from Charbonnier and colleagues’ (2016) standardized food image database (available at osf.io/cx7tp/), we excluded food items that were not familiar in Germany or unavailable in regular German supermarkets. Second, we excluded products that we would not be able to incentivize easily (i.e., fast food, pastries, and items that required cooking, baking, or other types of preparation). Third, we added the Nutri-Scores to the remaining products, aiming to have an equal number of items for each Nutri-Score, of which approximately half were sweet and the other half savory. This resulted in a final stimulus set of 66 food images (13 items = A; 13 items = B; 12 items = C; 14 items = D; 14 items = E). The experiment, including the set of food stimuli used in our study, is also uploaded here: osf.io/pef9t/. With respect to the second question, we would like to point out that preference for food stimuli is very individual; therefore, we obtained the ratings (taste, health, wanting, and estimated caloric density) of each participant individually. However, we also added the objective total calories, which are positively correlated with subjective caloric density and negatively correlated with Nutri-Score (coded as A=5; B=4; C=3; D=2; E=1) and health ratings (see Figure S7).

      R3.5: It seems that the degrees of freedom for the paired t-test comparing the effects of the condition hungry versus satiated on hunger ratings were 63, although the participant sample counted 70. Please verify.

      This is correct and explained in the methods section under data analysis: “Due to missing values for one timepoint in six participants (these participants did not fill in the VAS and PANAS before the administration of the Protein Shake in the sated condition) the analyses of the hunger state manipulation had a sample size of 64.”

      R3.5: Please add the range of BMI and age of participants. Did all participants fall within a healthy BMI range?

      The BMI ranged from 17.306 to 48.684 (see Author response image 5), with the majority of participants falling within a normal BMI (i.e., between 18.5 and 24.9). In our sample, 3 participants had a BMI larger than 30. By using subject as a random intercept in our GLMMs, we accounted for potential deviations in their response.

      Author response image 5.

      R3.5: Defining the inference criterion used for the significance of the posterior parameter chains in more detail can be pedagogical for those new to or unfamiliar with inferences drawn from hierarchical Bayesian model estimations and Bayesian statistics.

      We have added an explanation of the highest density intervals and what they mean with respect to our data in the respective result section.
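
      As an illustration of how a highest density interval (HDI) can be obtained from posterior samples, the following minimal sketch finds the narrowest interval containing a given share of the samples. The 95% mass, the criterion that an effect is considered credible when the HDI excludes zero, the function name, and the synthetic samples are assumptions for illustration and are not taken from the manuscript.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float))
    n = sorted_samples.size
    window = int(np.ceil(mass * n))
    if window < 2 or window > n:
        raise ValueError("need enough samples and a mass between 0 and 1")
    # Width of every interval that spans `window` consecutive sorted samples.
    widths = sorted_samples[window - 1:] - sorted_samples[: n - window + 1]
    start = int(np.argmin(widths))
    return float(sorted_samples[start]), float(sorted_samples[start + window - 1])

# Synthetic posterior samples of a group difference, for illustration only.
rng = np.random.default_rng(2)
posterior_difference = rng.normal(loc=0.5, scale=0.1, size=20000)

lower, upper = hdi(posterior_difference)
excludes_zero = lower > 0 or upper < 0
print(lower, upper, excludes_zero)
```

      For a symmetric, unimodal posterior the HDI is close to the equal-tailed interval; for skewed posteriors the two can differ noticeably.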

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      In this study, Ger and colleagues present a valuable new technique that uses recurrent neural networks to distinguish between model misspecification and behavioral stochasticity when interpreting cognitive-behavioral model fits. Evidence for the usefulness of this technique, which is currently based primarily on a relatively simple toy problem, is considered incomplete but could be improved via comparisons to existing approaches and/or applications to other problems. This technique addresses a long-standing problem that is likely to be of interest to researchers pushing the limits of cognitive computational modeling.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Ger and colleagues address an issue that often impedes computational modeling: the inherent ambiguity between stochasticity in behavior and structural mismatch between the assumed and true model. They propose a solution to use RNNs to estimate the ceiling on explainable variation within a behavioral dataset. With this information in hand, it is possible to determine the extent to which "worse fits" result from behavioral stochasticity versus failures of the cognitive model to capture nuances in behavior (model misspecification). The authors demonstrate the efficacy of the approach in a synthetic toy problem and then use the method to show that poorer model fits to 2-step data in participants with low IQ are actually due to an increase in inherent stochasticity, rather than systemic mismatch between model and behavior.

      Strengths:

      Overall I found the ideas conveyed in the paper interesting and the paper to be extremely clear and well-written. The method itself is clever and intuitive and I believe it could be useful in certain circumstances, particularly ones where the sources of structure in behavioral data are unknown. In general, the support for the method is clear and compelling. The flexibility of the method also means that it can be applied to different types of behavioral data - without any hypotheses about the exact behavioral features that might be present in a given task.

      Thank you for taking the time to review our work and for the positive remarks regarding the manuscript. Below is a point-by-point response to the concerns raised.

      Weaknesses:

      That said, I have some concerns with the manuscript in its current form, largely related to the applicability of the proposed methods for problems of importance in computational cognitive neuroscience. This concern stems from the fact that the toy problem explored in the manuscript is somewhat simple, and the theoretical problem addressed in it could have been identified through other means (for example through the use of posterior predictive checking for model validation), and the actual behavioral data analyzed were interpreted as a null result (failure to reject the behavioral stochasticity hypothesis), rather than actual identification of model misspecification. I expand on these primary concerns and raise several smaller points below.

      A primary question I have about this work is whether the method described would actually provide any advantage for real cognitive modeling problems beyond what is typically done to minimize the chance of model misspecification (in particular, post-predictive checking). The toy problem examined in the manuscript is pretty extreme (two of the three synthetic agents are very far from what a human would do on the task, and the models deviate from one another to a degree that detecting the difference should not be difficult for any method). The issue posed in the toy data would easily be identified by following good modeling practices, which include using posterior predictive checking over summary measures to identify model insufficiencies, which in turn would call for the need for a broader set of models (See Wilson & Collins 2019). Thus, I am left wondering whether this method could actually identify model misspecification in real world data, particularly in situations where standard posterior predictive checking would fall short. The conclusions from the main empirical data set rest largely on a null result, and the utility of a method for detecting model misspecification seems like it should depend on its ability to detect its presence, not just its absence, in real data.

      Beyond the question of its advantage above and beyond data- and hypothesis-informed methods for identifying model misspecification, I am also concerned that if the method does identify a model insufficiency, then you still would need to use these other methods in order to understand what aspect of behavior deviated from model predictions in order to design a better model. In general, it seems that the authors should be clear that this is a tool that might be helpful in some situations, but that it will need to be used in combination with other well-described modeling techniques (posterior predictive checking for model validation and guiding cognitive model extensions to capture unexplained features of the data). A general stylistic concern I have with this manuscript is that it presents and characterizes a new tool to help with cognitive computational modeling, but it does not really adhere to best modeling practices (see Collins & Wilson, eLife), which involve looking at data to identify core behavioral features and simulating data from best-fitting models to confirm that these features are reproduced. One could take away from this paper that you would be better off fitting a neural network to your behavioral data rather than carefully comparing the predictions of your cognitive model to your actual data, but I think that would be a highly misleading takeaway since summary measures of behavior would just as easily have diagnosed the model misspecification in the toy problem, and have the added advantage that they provide information about which cognitive processes are missing in such cases.

      As a more minor point, it is also worth noting that this method could not distinguish behavioral stochasticity from the deterministic structure that is not repeated across training/test sets (for example, because a specific sequence is present in the training set but not the test set). This should be included in the discussion of method limitations. It was also not entirely clear to me whether the method could be applied to real behavioral data without extensive pretraining (on >500 participants) which would certainly limit its applicability for standard cases.

      The authors focus on model misspecification, but in reality, all of our models are misspecified to some degree since the true process generating behavior almost certainly deviates from our simple models (i.e., as George Box is frequently quoted, "all models are wrong, but some of them are useful"). It would be useful to have some more nuanced discussion of situations in which misspecification is and is not problematic.

      We thank the reviewer for these comments and have made changes to the manuscript to better describe these limitations. We agree with the reviewer and accept that fitting a neural network is by no means a substitute for careful and dedicated cognitive modeling. Cognitive modeling is aimed at describing the latent processes that are assumed to generate the observed data, and we agree that careful description of the data-generating mechanisms, including posterior predictive checks, is always required. However, even a well-defined cognitive model might still have little predictive accuracy, and it is difficult to know how many resources should be put into testing and developing new cognitive models to describe the data. We argue that RNNs can lead to some insights regarding this question, and highlight the following limitations that were mentioned by the reviewer:

      First, we accept that it is important to provide positive evidence for the existence of model misspecification. In that sense, a result where the network shows dramatic improvement over the best-fitting theoretical model is easier to interpret compared to when the network shows no (or very little) improvement in predictive accuracy. This is because there is always a possibility that the network, for some reason, was not flexible enough to learn the data-generating model, or that the data-generating mechanism has changed from training to test. We have now stated this more clearly in the limitation section. However, when it comes to our empirical results, we would like to emphasize that the network did in fact improve the predictive accuracy for all participants. The result shows support in favor of a "null" hypothesis in the sense that we seem to find evidence that the change in predictive accuracy between the theoretical model and RNN is not systematic across levels of IQ. This allows us to quantify evidence (using Bayesian statistics) for no systematic model misspecification as a function of IQ. While it is always possible that a different model might systematically improve the predictive accuracy of low vs high IQ individuals' data, this seems less likely given the flexibility of the current results.

      Second, we agree that our current study only applies to the RL models that we tested. In the context of RL, we have used a well-established and frequently applied paradigm and models. We emphasize in the discussion that simulations are required to further validate other uses for this method with other paradigms.  

      Third, we also accept that posterior predictive checks should always be used when possible, which is now emphasized in the discussion. However, we note that these are not always easy to interpret in a meaningful way and may not always provide details regarding model insufficiencies as described by the reviewer. It is very hard to determine what should be considered a good prediction, and since the generative model is always unknown, sometimes very low predictive accuracy can still be at the peak of possible model performance. This is because the data might be generated from a very noisy process, capping the possible predictive accuracy at a very low point. However, when strictly using theoretical modeling, it is very hard to determine what predictive accuracy to expect. Also, predictive checks are not always easy to interpret visually or otherwise. For example, in two-armed bandit tasks where there are only two actions, the prediction of choices is, in our opinion, easier to understand when described using a confusion matrix that summarizes the model's ability to predict the empirical behavior (which becomes similar to the predictive estimation we describe in Eq. 22).

      Finally, this approach indeed requires a large dataset, with at least three sessions for each participant (training, validation, and test). Further studies might shed more light on the use of optimal epochs as a proxy for noise/complexity that can be used with less data (i.e., training and validation, without a test set).
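
      As an illustration of the confusion-matrix summary mentioned in the third point above, the following is a minimal sketch for a two-action task. It is not the manuscript's Eq. 22, which is not reproduced here; the function names and the example choices are hypothetical and serve only to show how observed and model-predicted choices can be tabulated.

```python
from collections import Counter


def confusion_matrix(observed, predicted):
    """Return a 2x2 table: rows = observed choice (0/1), columns = predicted choice (0/1)."""
    counts = Counter(zip(observed, predicted))
    return [[counts[(o, p)] for p in (0, 1)] for o in (0, 1)]


def accuracy(matrix):
    """Proportion of trials on which the predicted choice matched the observed choice."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total


# Hypothetical choices, for illustration only (not data from the study).
observed = [0, 1, 1, 0, 1, 0, 1, 1]
predicted = [0, 1, 0, 0, 1, 1, 1, 1]

matrix = confusion_matrix(observed, predicted)
print(matrix)
print(accuracy(matrix))
```

      The diagonal cells count correct predictions for each action, and the off-diagonal cells show which action the model tends to mispredict.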

      Please see our changes at the end of this document.  

      Reviewer #2 (Public Review):

      SUMMARY:

      In this manuscript, Ger and colleagues propose two complementary analytical methods aimed at quantifying the model misspecification and irreducible stochasticity in human choice behavior. The first method involves fitting recurrent neural networks (RNNs) and theoretical models to human choices and interpreting the better performance of RNNs as providing evidence of the misspecifications of theoretical models. The second method involves estimating the number of training iterations for which the fitted RNN achieves the best prediction of human choice behavior in a separate, validation data set, following an approach known as "early stopping". This number is then interpreted as a proxy for the amount of explainable variability in behavior, such that fewer iterations (earlier stopping) correspond to a higher amount of irreducible stochasticity in the data. The authors validate the two methods using simulations of choice behavior in a two-stage task, where the simulated behavior is generated by different known models. Finally, the authors use their approach in a real data set of human choices in the two-stage task, concluding that low-IQ subjects exhibit greater levels of stochasticity than high-IQ subjects.
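
      The early-stopping idea summarized above can be sketched as follows. This is a minimal illustration that assumes a per-epoch list of validation losses is already available; the function name, the `patience` parameter, and the example loss curves are hypothetical and are not taken from the manuscript.

```python
def optimal_epoch(val_losses, patience=5):
    """Return the 1-based epoch with the lowest validation loss.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive epochs, mimicking early stopping during training.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical validation curves, for illustration only.
noisy_agent = [0.70, 0.69, 0.695, 0.70, 0.71, 0.72, 0.73]
structured_agent = [0.70, 0.65, 0.60, 0.56, 0.53, 0.52, 0.525, 0.53]

print(optimal_epoch(noisy_agent))
print(optimal_epoch(structured_agent))
```

      Under the interpretation described in the summary, the curve that bottoms out earlier would be read as reflecting more irreducible stochasticity, subject to the caveats raised in the weaknesses below.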

      STRENGTHS:

      The manuscript explores an extremely important topic to scientists interested in characterizing human decision-making. While it is generally acknowledged that any computational model of behavior will be limited in its ability to describe a particular data set, one should hope to understand whether these limitations arise due to model misspecification or due to irreducible stochasticity in the data. Evidence for the former suggests that better models ought to exist; evidence for the latter suggests they might not.

      To address this important topic, the authors elaborate carefully on the rationale of their proposed approach. They describe a variety of simulations - for which the ground truth models and the amount of behavioral stochasticity are known - to validate their approaches. This enables the reader to understand the benefits (and limitations) of these approaches when applied to the two-stage task, a task paradigm commonly used in the field. Through a set of convincing analyses, the authors demonstrate that their approach is capable of identifying situations where an alternative, untested computational model can outperform the set of tested models, before applying these techniques to a realistic data set.

      Thank you for reviewing our work and for the positive tone. Please find below a point-by-point response to the concerns you have raised.

      WEAKNESSES:

      The most significant weakness is that the paper rests on the implicit assumption that the fitted RNNs explain as much variance as possible, an assumption that is likely incorrect and which can result in incorrect conclusions. While in low-dimensional tasks RNNs can predict behavior as well as the data-generating models, this is not *always* the case, and the paper itself illustrates (in Figure 3) several cases where the fitted RNNs fall short of the ground-truth model. In such cases, we cannot conclude that a subject exhibiting a relatively poor RNN fit necessarily has a relatively high degree of behavioral stochasticity. Instead, it is at least conceivable that this subject's behavior is generated precisely (i.e., with low noise) by an alternative model that is poorly fit by an RNN - e.g., a model with long-term sequential dependencies, which RNNs are known to have difficulties in capturing.

      These situations could lead to incorrect conclusions for both of the proposed methods. First, the model misspecification analysis might show equal predictive performance for a particular theoretical model and for the RNN. While a scientist might be inclined to conclude that the theoretical model explains the maximum amount of explainable variance and therefore that no better model should exist, the scenario in the previous paragraph suggests that a superior model might nonetheless exist. Second, in the early-stopping analysis, a particular subject may achieve optimal validation performance with fewer epochs than another, leading the scientist to conclude that this subject exhibits higher behavioral noise. However, as before, this could again result from the fact that this subject's behavior is produced with little noise by a different model. Admittedly, the existence of such scenarios *in principle* does not mean that such scenarios are common, and the conclusions drawn in the paper are likely appropriate for the particular examples analyzed. However, it is much less obvious that the RNNs will provide optimal fits in other types of tasks, particularly those with more complex rules and long-term sequential dependencies, and in such scenarios, an ill-advised scientist might end up drawing incorrect conclusions from the application of the proposed approaches.

      Yes, we understand and agree. A negative result, where the RNN is unable to outperform the best-fitting theoretical model, would always leave room for doubt as to whether a different approach might yield better results. In contrast, a dramatic improvement in predictive accuracy for the RNN is easier to interpret since it implies that the theoretical model can be improved. We have made an effort to state this issue more clearly in the discussion. We specifically and directly mention in the discussion that “Equating RNN performance with the generative model should be avoided”.

      However, we would like to note that our empirical results provided a somewhat more nuanced scenario where we found that the RNN generally improved the predictive accuracy of most participants. Importantly, this improvement was found to be equal across participants with no systematic benefits for low vs high IQ participants. We understand that there is always the possibility that another model would show a systematic benefit for low vs. high IQ participants, however, we suggest that this is less likely given the current evidence. We have made an effort to clearly note these issues in the discussion.  

      In addition to this general limitation, the paper also makes a few additional claims that are not fully supported by the provided evidence. For example, Figure 4 highlights the relationship between the optimal epochs and agent noise. Yet, it is nonetheless possible that the optimal epoch is influenced by model parameters other than inverse temperature (e.g., learning rate). This could again lead to invalid conclusions, such as concluding that low-IQ is associated with optimal epoch when an alternative account might be that low-IQ is associated with low learning rate, which in turn is associated with optimal epoch. Yet additional factors such as the deep double-descent (Nakkiran et al., ICLR 2020) can also influence the optimal epoch value as computed by the authors.

      An additional issue is that Figure 4 reports an association between optimal epoch and noise, but noise is normalized by the true minimal/maximal inverse-temperature of hybrid agents (Eq. 23). It is thus possible that the relationship does not hold for more extreme values of inverse-temperature such as beta=0 (extremely noisy behavior) or beta=inf (deterministic behavior), two important special cases that should be incorporated in the current study. Finally, even taking the association in Figure 4 at face value, there are potential issues with inferring noise from the optimal epoch when their correlation is only r~=0.7. As shown in the figures, upon finding a very low optimal epoch for a particular subject, one might be compelled to infer high amounts of noise, even though several agents may exhibit a low optimal epoch despite having very little noise.

      Thank you for these comments. Indeed, there is much we do not yet fully understand about the factors that influence optimal epochs. Currently, it is clear to us that the number of optimal epochs is influenced by a variety of factors, including network size, the data size, and other cognitive parameters, such as the learning rate. We hope that our work serves as a proof-of-concept, suggesting that, in certain scenarios, the number of epochs can be utilized as an empirical estimate. Moreover, we maintain that, at least within the context of the current paradigm, the number of optimal epochs is primarily sensitive to the amount of true underlying noise, assuming the number of trials and network size are constant. We are therefore hopeful that this proof-of-concept will encourage research that will further examine the factors that influence the optimal epochs in different behavioral paradigms.

      To address the reviewer's justified concerns, we have made several amendments to the manuscript. First, we added an additional version of Figure 4 in the Supplementary Information material, where the noise parameter values are not scaled. We hope this adjustment clarifies that the parameters were tested across a broad spectrum of values (e.g., 0 to 10 for the hybrid model), spanning the two extremes of complete randomness and high determinism. Second, we included a linear regression analysis showing the association of all model parameters (including noise) with the optimal number of epochs. As anticipated by the reviewer, the learning rate was also found to be associated with the number of optimal epochs. Nonetheless, the noise parameter appears to maintain the most substantial association with the number of optimal epochs. We have also added a specific mention of these associations in the discussion, to inform readers that the association between the number of optimal epochs and model parameters should be examined using simulation for other paradigms/models. Lastly, we acknowledge in the discussion that the findings regarding the association between the number of optimal epochs and noise warrant further investigation, considering other factors that might influence the determination of the optimal epoch point and the fact that the correlation with noise is strong, but not perfect (in the range of 0.7).

      The discussion now includes the following:

      “Several limitations should be considered in our proposed approach. First, fitting a data-driven neural network is evidently not enough to produce a comprehensive theoretical description of the data generation mechanisms. Currently, best practices for cognitive modeling \citep{wilson2019ten} require identifying under what conditions the model struggles to predict the data (e.g., using posterior predictive checks), and describing a different theoretical model that could account for these disadvantages in prediction. However, identifying conditions where the model shortcomings in predictive accuracy are due to model misspecifications rather than noisier behavior is a challenging task. We propose leveraging data-driven RNNs as a supplementary tool, particularly when they significantly outperform existing theoretical models, followed by refined theoretical modeling to provide insights into what processes were mis-specified in the initial modeling effort.

      Second, although we observed a robust association between the optimal number of epochs and true noise across varying network sizes and dataset sizes (see Fig.~\ref{figS2}), additional factors such as network architecture and other model parameters (e.g., learning rate, see Fig.~\ref{figS7}) might influence this estimation. Further research is required to allow us to better understand how and why different factors change the number of optimal epochs for a given dataset before it can be applied with confidence to empirical investigations.

      Third, the empirical dataset used in our study consisted of data collected from human participants at a single time point, serving as the training set for our RNN. The test set data, collected after intervals of $\sim6$ and $\sim18$ months, introduced the possibility of changes in participants' decision-making strategies over time. In our analysis, we neglected any possible changes in participants' decision-making strategies during that time, changes that may lead to poorer generalization performance of our approach. Thus, further studies are needed to eliminate such possible explanations.

      Fourth, our simulations, albeit illustrative, were confined to known models, necessitating in-silico validation before extrapolating the efficacy of our approach to other model classes and tasks. Our aim was to showcase the potential benefits of using a data-driven approach, particularly when faced with unknown models. However, whether RNNs will provide optimal fits for tasks with more complex rules and long-term sequential dependencies remains uncertain.

      Finally, while positive outcomes where RNNs surpass theoretical models can prompt insightful model refinement, caution is warranted in directly equating RNN performance with that of the generative model, as seen in our simulations (e.g., Figure 3). We highlight that our empirical findings depict a more complex scenario, wherein the RNN enhanced the predictive accuracy for all participants uniformly. Notably, we also provide evidence supporting a null effect among individuals, with no consistent difference in RNN improvement over the theoretical model based on IQ. Although it remains conceivable that a different data-driven model could systematically heighten the predictive accuracy for individuals with lower IQs in this task, such a possibility seems less probable in light of the current findings.”

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      Is the t that gets fed as input to RNN just timestep?

      No, t is the last transition type (rare/common), not the timestep.

      Line 378: what does "optimal epochs" mean here?

      The number of training epochs that minimizes both underfitting and overfitting (defined around line 300).

      Line 443: I don't think "identical" is the right word here - surely the authors just mean that there is not an obvious systematic difference in the distributions.

      Fixed

      I was expecting to see ~500 points in Figure 7a, but there seem to be only 50... why weren't all datasets with at least 2 sessions used for this analysis?

      We used the ~500 subjects (only 2 datasets) to pre-train the RNN, and then fine-tuned the pre-trained RNN on the other 54 subjects that have 3 datasets. The correlation of IQ and optimal epoch also holds for the 500 subjects, as shown below.

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      Figure 3b: despite spending a long time trying to understand the meaning of each cell of the confusion matrix, I'm still unsure what they represent. Would be great if you could spell out the meaning of each cell individually, at least for the first matrix in the paper.

      We added a clarification to the Figure caption. 

      Figure 5: Why didn't the authors show this exact scenario using simulated data? It would be much easier to understand the predictions of this figure if they had been demonstrated in simulated data, such as individuals with different amounts of behavioral noise or different levels of model misspecifications.

      In Figure 5 the x-axis represents IQ. Replacing the x-axis with true noise would make what we present now as Figure 4. We have made an effort to emphasize the meaning of the axes in the caption. 

      Line 195 ("...in the action selection. Where"). Typo? No period is needed before "where".

      Fixed

      Line 213 ("K dominated-hand model"). I was intrigued by this model, but wasn't sure whether it has been used previously in the literature, or whether this is the first time it has been proposed.

      To our knowledge, this is the first time that this model has been used.

      Line 345 ("This suggests that RNN is flexible enough to approximate a wide range of different behavioral models"): Worth explaining why (i.e., because the GRUs are able to capture dependencies across longer delays than a k-order Logistic Regression model).

      Line 356 ("We were interested to test"): Suggestion: "We were interested in testing".

      Fixed

      Line 389 ("However, as long as the number of observations and the size of the network is the same between two datasets, the number of optimal epochs can be used to estimate whether the dataset of one participant is noisier compared with a second dataset."): This is an important claim that should ideally be demonstrated directly. The paper only illustrates this effect through a correlation and a scatter plot, where higher noise tends to predict a lower optimal epoch. However, is the claim here that, in some circumstances, optimal epoch can be used to *deterministically* estimate noise? If so, this would be a strong result and should ideally be included in the paper.

      We have now omitted this sentence and toned down our claims, suggesting that while we did find a strong association between noise and optimal epochs, future research is required to establish to what extent this could be differentiated from other factors (i.e., network size, number of observations).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors develop a method to fluorescently tag peptides loaded onto dendritic cells using a two-step method with a tetracysteine-motif-modified peptide and a labelling step done on the surface of live DC using a dye with high affinity for the added motif. The results are convincing in demonstrating in vitro and in vivo T cell activation and efficient label transfer to specific T cells in vivo. The label transfer technique will be useful to identify T cells that have recognised a DC presenting a specific peptide antigen to allow the isolation of the T cell and cloning of its TCR subunits, for example. It may also be useful as a general assay for in vitro or in vivo T-DC communication that can allow the detection of genetic or chemical modulators.

      Strengths:

      The study includes both in vitro and in vivo analysis including flow cytometry and two-photon laser scanning microscopy. The results are convincing and the level of T cell labelling with the fluorescent pMHC is surprisingly robust and suggests that the approach is potentially revealing something about fundamental mechanisms beyond the state of the art.

      Weaknesses:

      The method is demonstrated only at high pMHC density and it is not clear if it can operate at lower peptide doses where T cells normally operate. However, this doesn't limit the utility of the method for applications where the peptide of interest is known. It's not clear to me how it could be used to de-orphan known TCRs and this should be explained if they want to claim this as an application. Previous methods based on biotin-streptavidin and phycoerythrin had single pMHC sensitivity, but there were limitations to the PE-based probe so the use of organic dyes could offer advantages.

      We thank the reviewer for the valuable comments and suggestions. Indeed, we have shown and optimized this labeling technique for a commonly used peptide at rather high doses to provide a proof of principle for the possible use of tetracysteine tagged peptides for in vitro and in vivo studies. However, we completely agree that the studies that require different peptides and/or lower pMHC concentrations may require preliminary experiments if the use of biarsenical probes is attempted. We think it can help investigate the functional and biological properties of the peptides for TCRs deorphaned by techniques. Tetracysteine tagging of such peptides would provide a readily available antigen-specific reagent for the downstream assays and validation. Other possible uses for modified immunogenic peptides could be visualizing the dynamics of neoantigen vaccines or peptide delivery methods in vivo. For these additional uses, we recommend further optimization based on the needs of the prospective assay.

      Reviewer #2 (Public Review):

      Summary:

      The authors here develop a novel Ovalbumin model peptide that can be labeled with a site-specific FlAsH dye to track agonist peptides both in vitro and in vivo. The utility of this tool could allow better tracking of activated polyclonal T cells particularly in novel systems. The authors have provided solid evidence that peptides are functional, capable of activating OTII T cells, and that these peptides can undergo trogocytosis by cognate T cells only.

      Strengths:

      -An array of in vitro and in vivo studies are used to assess peptide functionality.

      -Nice use of cutting-edge intravital imaging.

      -Internal controls such as non-cognate T cells to improve the robustness of the results (such as Fig 5A-D).

      -One of the strengths is the direct labeling of the peptide and the potential utility in other systems.

      Weaknesses:

      1. What is the background signal from FlAsH? The baselines for Figure 1 flow plots are all quite different. Hard to follow. What does the background signal look like without FLASH (how much fluorescence shift is unlabeled cells to No antigen+FLASH?). How much of the FlAsH in cells is actually conjugated to the peptide? In Figure 2E, it doesn't look like it's very specific to pMHC complexes. Maybe you could double-stain with Ab for MHCII. Figure 4e suggests there is no background without MHCII but I'm not fully convinced. Potentially some MassSpec for FLASH-containing peptides.

      We thank the reviewer for pointing out a possible area of confusion. In fact, we have done extensive characterization of the background and found that it varies with the batch of FlAsH, TCEP, and cytometer, and also due to the oxidation-prone nature of the reagents. Because the Figure 1 subfigures have been derived from different experiments, a combination of the factors above has likely contributed to the inconsistent background. To display the background more objectively, we have now added the No antigen+FlAsH background to the revised Fig 1.

      It is also worth noting that nonspecific FlAsH incorporation can be toxic at increasing doses, and live cells that display high backgrounds may undergo early apoptotic changes in vitro. However, when these cells are adoptively transferred and tracked in vivo, the compromised cells with high background possibly undergo apoptosis and get cleared by macrophages in the lymph node. The lack of clearance in vitro further contributes to different backgrounds between in vitro and in vivo, which we think is also a possible cause for the inconsistent backgrounds throughout the manuscript. Altogether, comparison of absolute signal intensities from different experiments would be misleading and the relative differences within each experiment should be relied upon. We have added further discussion about this issue.

      2. On the flip side, how much of the variant peptides are getting conjugated in cells? I'd like to see some quantification (HPLC or MassSpec). If it's ~10% of peptides that get labeled, this could explain the low shifts in fluorescence and the similar T cell activation to native peptides if FlAsH has any deleterious effects on TCR recognition. But if it's a high rate of labeling, then it adds confidence to this system.

      We agree that mass spectrometry or, more specifically, tandem MS/MS, would be an excellent addition to support our claim about peptide labeling by FlAsH being reliable and non-disruptive. Therefore, we have recently undertaken a tandem MS/MS quantitation project with our collaborators. However, this would require significant time to determine the internal-standard-based calibration curves and to run both analytical and biological replicates. Hence, we have decided to pursue this as a follow-up study and added further discussion on quantification of the FlAsH-peptide conjugates by tandem MS/MS.

      3. Conceptually, what is the value of labeling peptides after loading with DCs? Why not preconjugate peptides with dye, before loading, so you have a cleaner, potentially higher fluorescence signal? If there is a potential utility, I do not see it being well exploited in this paper. There are some hints in the discussion of additional use cases, but it was not clear exactly how they would work. One mention was that the dye could be added in real-time in vivo to label complexes, but I believe this was not done here. Is that feasible to show?

      We have explored preconjugation as a possible avenue for labeling peptides. In our hands, preconjugation resulted in low FlAsH intensity overall in both the control and tetracysteine-tagged peptides (Author response image 1). While we don’t have a satisfactory answer as to why the signal was blunted by preconjugation, it could be that the tetracysteine-tagged peptides attract biarsenical compounds better intracellularly. It may be due to the redox potential of the intracellular environment, which limits disulfide bond formation (PMID: 18159092).

      Author response image 1.

      Preconjugation yields poor FlAsH signal. Splenic DCs were pulsed with peptide then treated with FlAsH or incubated with peptide-FlAsH preconjugates. Overlaid histograms show the FlAsH intensities on DCs following the two-step labeling (left) and preconjugation (right). Data are representative of two independent experiments, each performed with three biological replicates.

      4. In Figure 5D-F, the imaging data aren't fully convincing. For example, in 5F and 2G, the speeds for T cells with no Ag should be much higher (10-15 micron/min or 0.16-0.25 micron/sec). The fact that your speeds are much lower suggests technical or biological issues that might need to be acknowledged, or that other readouts such as flow cytometry should be used.

      We thank the reviewer for drawing attention to this technical point. We would like to point out that the imaging data in Figure 5D-F were obtained from agarose-embedded live lymph node sections. Briefly, the lymph nodes were removed, suspended in 2% low-melting-temperature agarose in DMEM, and cut into 200 µm sections with a vibrating microtome. Prior to imaging, tissue sections were incubated in complete RPMI medium at 37 °C for 2 h to resume cell mobility. Thus, we think that the cells resuming their typical speeds ex vivo may account for the slightly reduced T cell speeds overall, for both control and antigen-specific T cells (PMID: 32427565, PMID: 25083865). We have added text to remove the ambiguity about the technique used for dynamic imaging. The speeds in Figure 2G come from live imaging of DC-T cell cocultures, in which the basal cell movement could be hampered by the cell density. Additionally, glass-bottom dishes were coated with fibronectin to facilitate DC adhesion, which may be responsible for the lower average speeds of the T cells in vitro.
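
      For reference, the unit conversion behind the speed range quoted by the reviewer can be checked with a short sketch. The function name is hypothetical; the only inputs are the 10-15 micron/min values stated in the comment above.

```python
def um_per_min_to_um_per_sec(speed_um_per_min):
    """Convert a speed from microns per minute to microns per second."""
    return speed_um_per_min / 60.0


# Range quoted in the reviewer's comment: 10-15 micron/min.
low = um_per_min_to_um_per_sec(10)   # 10 / 60, roughly 0.167 micron/sec
high = um_per_min_to_um_per_sec(15)  # 15 / 60 = 0.25 micron/sec

print(round(low, 3), round(high, 3))
```

      The lower bound is 10/60, which the reviewer's comment gives truncated as 0.16.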

      Reviewer #1 (Recommendations For The Authors):

      Does the reaction of ReAsH with reactive sites on the surface of DC alter them functionally? Functions have been attributed to redox chemistry at the cell surface- could this alter this chemistry?

      We thank the reviewer for the insight. It is possible that the nonspecific binding of biarsenical compounds to cysteine residues, which we refer to as background throughout the manuscript, contributes to some alterations. One possible way biarsenicals could affect redox events in DCs is by reducing glutathione levels (PMID: 32802886). Glutathione depletion is known to impair DC maturation and antigen presentation (PMID: 20733204). To avoid toxicity, we carried out a stringent titration to optimize ReAsH and FlAsH concentrations for labeling and conducted experiments using doses that did not cause overt toxicity or alter DC function.

      Have the authors compared this to a straightforward approach where the peptide is just labelled with a similar dye and incubated with the cell to load pMHC using the MHC knockout to assess specificity? Why is this that involves exposing the DC to a high concentration of TCEP, better than just labelling the peptide? The Davis lab also arrived at a two-step method with biotinylated peptide and streptavidin-PE, but I still wonder if this was really necessary as the sensitivity will always come down to the ability to wash out the reagents that are not associated with the MHC.

      We agree with the reviewer that small, non-disruptive fluorochrome-labeled peptide alternatives would greatly improve the workflow and signal-to-noise ratio. In fact, we have been actively searching for such alternatives since we started working on the tetracysteine-containing peptides. So far, we have tried commercially available FITC- and TAMRA-conjugated OVA323-339 for loading the DCs, but these failed to elicit any discernible signal. We also have an ongoing study in which we are producing and testing various in-house modified OVA323-339 peptides with fluorogenic properties. Unfortunately, at this moment, the ones that gave us a crisp, bright signal upon loading were also incorporated into the DC membrane in a nonspecific fashion and were taken up by non-cognate T cells from double antigen-loaded DCs. We are actively pursuing this area of investigation and developing better optimized peptides with low or non-significant membrane incorporation.

      Lastly, we would like to point out that tetracysteine tags are visible by transmission electron microscopy without FlAsH treatment. Thus, this application could add a new dimension for addressing questions about the antigen/pMHCII loading compartments in future studies. We have now added more in-depth discussion about the setbacks and advantages of using tetracysteine labeled peptides in immune system studies.

      The peptide dosing at 5 µM is high compared to the likely sensitivity of the T cells. It would be helpful to titrate the system down to the EC50 for the peptide, which may be nM, and determine if the specific fluorescence signal can still be detected in the optimal conditions. This will not likely be useful in vivo, but it will be helpful to see if the labelling procedure would impact T cell responses when antigen is limited, which will be more of a test. At 5 µM it's likely the system is at a plateau and even a 10-fold reduction in potency might not impact the T cell response, but it would shift the EC50.

      We thank the reviewer for the comment and suggestion. We agree that it is possible to miss minimally disruptive effects at 5 µM, and that titrating the native vs. modified peptide down to nM doses would give us a clearer view. This can certainly be addressed in future studies, and also with other peptides with different affinity profiles. One reason we chose a relatively high dose for this study was that lowering the peptide dose cost us the specific FlAsH signal, so we proceeded with the lowest possible peptide concentration.

      In Fig 3b the level of background in the dsRed channel is very high after DC transfer. What cells is this associated with and does this appear to be debris? Also, I wonder where the ReAsH signal is in the experiments in general. I believe this is a red dye and it would likely be quite bright given the reduction of the FlAsH signal. Will this signal overlap with signals like dsRed and PKH-26 if the DC is also treated with this to reduce the FlAsH background?

      We have already shown that the ReAsH signal, together with DsRed, can be used for cell-tracking purposes, as neither is transferred to other cells during antigen-specific interactions (Author response image 2). In fact, combining their exceptionally bright fluorescence provided us with a robust signal for tracking the adoptively transferred DCs in the recipient mice. On the other hand, the lipophilic membrane dye PKH-26 is transferred by trogocytosis, while the remaining signal contributes to the red fluorescence used for tracking DCs. Therefore, the signal that we show to be transferred from DCs to T cells comes only from the lipophilic dye. To address this, we have added a sentence elaborating on this point in the results section. Regarding the reviewer’s comment on the DsRed background in Figure 3b, we agree that the number of cells outside the gate seems slightly higher in recipient mice than in control mice. This may suggest that macrophages clearing debris from apoptotic or dying DCs contribute to the background in the recipient lymph node. Nevertheless, it does not contribute to any DsRed/ReAsH signal in the antigen-specific T cells.

      Author response image 2.

      ReAsH and DsRed are not picked up by T cells during immune synapse formation. DsRed+ DCs were labeled with ReAsH, pulsed with 5 μM OVACACA, labeled with FlAsH and adoptively transferred into CD45.1 congenic mice (1-2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII and e670-labeled polyclonal CD4+ T cells were mixed 1:1 (0.25-0.5 × 10⁶ per T cell type) and injected i.v. Popliteal lymph nodes were removed at 42 h post-transfer and analyzed by flow cytometry. Overlaid histograms show the ReAsH/DsRed, MHCII and FlAsH intensities of the T cells. Data are representative of two independent experiments with n=2 mice per group.

      In Fig 5b there is a missing condition. If they look at Ea-specific T cells for DC with or without the Ova peptide, do they see no transfer of PKH-26 to the OTII T cells? Also, the FMI of the FlAsH signal transferred to the T cells seems very high compared to other experiments. Can the author estimate the number of peptides transferred (this should be possible) and would each T cell need to be collecting antigens from multiple DC? Could the debris from dead DC also contribute to this if picked up by other DC or even directly by the T cells? Maybe this could be tested by transferring DC that are killed (perhaps by sonication) prior to inoculation?

      To address the reviewer’s question on PKH-26 acquisition by T cells: Ea T cells pick up PKH-26 from Ea+OVA double-pulsed DCs, but not from unpulsed or single OVA-pulsed DCs. OTII T cells acquire PKH-26 from OVA-pulsed DCs, whereas Ea T cells don’t (as expected) and serve as an internal negative control for that condition. Regarding the reviewer’s comment on the high FlAsH signal intensity of T cells in Figure 5b, a plausible explanation is that the T cells accumulate pMHCII through serial engagements with APCs. In fact, a comparison of the T cell FlAsH intensities at 18 h and 36-48 h post-transfer demonstrates an increase (Author response image 3) and thus hints at a cumulative signal. As DCs are known to be short-lived after adoptive transfer, the debris of dying DCs, along with its peptide content, may indeed be passed on to macrophages, neighboring DCs and eventually back to T cells (or to T cells for the first time, depending on the T:DC ratio, which may not allow all T cells to contact the transferred DCs within the limited time frame). We agree that the number and quality of such contacts can be gauged using fluorescent peptides. However, we think peptides chemically conjugated to fluorochromes with optimized signal-to-noise profiles and a less oxidation-prone nature would be more suitable for quantification purposes.

      Author response image 3.

      FlAsH signal acquisition by antigen specific T cells becomes more prominent at 36-48 h post-transfer. DsRed+ splenic DCs were double-pulsed with 5 μM OVACACA and 5 μM OVA-biotin and adoptively transferred into CD45.1 recipients (2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII (1 × 10⁶ cells) and e670-labeled polyclonal T cells (1 × 10⁶ cells) were injected i.v. Popliteal lymph nodes were analyzed by flow cytometry at 18 h or 48 h post-transfer. Overlaid histograms show the T cell levels of OVACACA (FlAsH). Data are representative of three independent experiments with n=3 mice per time point.

      Reviewer #2 (Recommendations For The Authors):

      As mentioned in weaknesses 1 & 2, more validation of how much of the FlAsH fluorescence is on agonist peptides and how much is non-specific would improve the interpretation of the data. Another option would be to preconjugate peptides but that might be a significant effort to repeat the work.

      We agree that mass spectrometry would be the gold standard technique for measuring what percentage of the tetracysteine-tagged peptide is conjugated to FlAsH in DCs. However, due to the scope of such an endeavour, this can only be addressed in a separate follow-up study. As for preconjugation, we have tried it and unfortunately failed to get it to work (Reviewer Figure 1). Therefore, we have shifted our focus to generating in-house peptide probes that are chemically conjugated to stable and bright fluorophore derivatives. With these, we aim to circumvent the problems that the two-step FlAsH labeling poses.

      Along those lines, do you have any way to quantify how many peptides you are detecting based on fluorescence? Being able to quantify the actual number of peptides would push the significance up.

      We think the two-step procedure and the background would pose challenges to such quantification in this study. Although it would provide tremendous insight into antigen-specific T cell-APC interactions in vivo, we think it should be performed using peptides chemically conjugated to fluorochromes with optimized signal-to-noise profiles.

      In Figure 3D or 4 does the SA signal correlate with Flash signal on OT2 cells? Can you correlate Flash uptake with T cell activation, downstream of TCR, to validate peptide transfers?

      To answer the reviewer’s question about FlAsH and SA correlation, we have revised Figure 3d to show the correlation between OTII uptake of FlAsH, streptavidin and MHCII. We also thank the reviewer for the suggestion to correlate FlAsH uptake with T cell activation and/or signaling downstream of TCR activation. We have used proliferation and CD44 expression as proxies of activation (Fig 2, 6). Nevertheless, we agree that examining the early events that correspond to the initiation of the T-DC synapse and FlAsH uptake would be valuable for demonstrating the temporal relationship between peptide transfer and activation. Therefore, we have addressed this in the revised discussion.

      Author response image 4.

      FlAsH signal acquisition by antigen specific T cells correlates with the OVA-biotin (SA) and MHCII uptake. DsRed+ splenic DCs were double-pulsed with 5 μM OVACACA and 5 μM OVA-biotin and adoptively transferred into CD45.1 recipients (2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII (1 × 10⁶ cells) and e670-labeled polyclonal T cells (1 × 10⁶ cells) were injected i.v. Popliteal lymph nodes were analyzed by flow cytometry. Overlaid histograms show the T cell levels of OVACACA (FlAsH) at 48 h post-transfer. Data are representative of three independent experiments with n=3 mice.

      Minor:

      Figure 3F, 5D, and videos: Can you color-code polyclonal T cells a different color than magenta (possibly white or yellow), as they have the same look as the overlay regions of OT2-DC interactions (Blue+red = magenta).

      We apologize for the inconvenience about the color selection. We have had difficulty in assigning colors that are bright and distinct. Unfortunately, yellow and white have also been easily mixed up with the FlAsH signal inside red and blue cells respectively. We have now added yellow and white arrows to better point out the polyclonal vs. antigen specific cells in 3f and 5d.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study provides solid evidence that both psychiatric dimensions (e.g. anhedonia, apathy, or depression) and chronotype (i.e., being a morning or evening person) influence effort-based decision-making. Notably, the current study does not elucidate whether there may be interactive effects of chronotype and psychiatric dimensions on decision-making. This work is of importance to researchers and clinicians alike, who may make inferences about behaviour and cognition without taking into account whether the individual may be tested or observed out-of-sync with their phenotype.

      We thank the three reviewers for their comments, and the Editors at eLife. We have taken the opportunity to revise our manuscript considerably from its original form, not least because we feel a number of the reviewers’ suggested analyses strengthen our manuscript considerably (in one instance even clarifying our conclusions, leading us to change our title)—for which we are very appreciative indeed. 

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study uses an online cognitive task to assess how reward and effort are integrated in a motivated decision-making task. In particular the authors were looking to explore how neuropsychiatric symptoms, in particular apathy and anhedonia, and circadian rhythms affect behavior in this task. Amongst many results, they found that choice bias (the degree to which integrated reward and effort affects decisions) is reduced in individuals with greater neuropsychiatric symptoms, and late chronotypes (being an 'evening person').

      Strengths:

      The authors recruited participants to perform the cognitive task both in and out of sync with their chronotypes, allowing for the important insight that individuals with late chronotypes show a more reduced choice bias when tested in the morning. Overall, this is a well-designed and controlled online experimental study. The modelling approach is robust, with care being taken to both perform and explain to the readers the various tests used to ensure the models allow the authors to sufficiently test their hypotheses.

      Weaknesses:

      This study was not designed to test the interactions of neuropsychiatric symptoms and chronotypes on decision making, and thus can only make preliminary suggestions regarding how symptoms, chronotypes and time-of-assessment interact.

      We appreciate the Reviewer’s positive view of our research and agree with their assessment of its weaknesses; the study was not designed to assess chronotype-mental health interactions. We hope that our new title and contextualisation makes this clearer. We respond in more detail point-by-point below.

      Reviewer #2 (Public Review):

      Summary:

      The study combines computational modeling of choice behavior with an economic, effort-based decision-making task to assess how willingness to exert physical effort for a reward varies as a function of individual differences in apathy and anhedonia, or depression, as well as chronotype. They find an overall reduction in effort selection that scales with apathy and anhedonia and depression. They also find that later chronotypes are less likely to choose effort than earlier chronotypes and, interestingly, an interaction whereby later chronotypes are especially unwilling to exert effort in the morning versus the evening.

      Strengths:

      This study uses state-of-the-art tools for model fitting and validation and regression methods which rule out multicollinearity among symptom measures and Bayesian methods which estimate effects and uncertainty about those estimates. The replication of results across two different kinds of samples is another strength. Finally, the study provides new information about the effects not only of chronotype but also chronotype by timepoint interactions which are previously unknown in the subfield of effort-based decision-making.

      Weaknesses:

      The study has few weaknesses. One potential concern is that the range of models which were tested was narrow, and other models might have been considered. For example, the Authors might have also tried to fit models with an overall inverse temperature parameter to capture decision noise. One reason for doing so is that some variance in the bias parameter might be attributed to noise, which was not modeled here. Another concern is that the manuscript discusses effort-based choice as a transdiagnostic feature - and there is evidence in other studies that effort deficits are a transdiagnostic feature of multiple disorders. However, because the present study does not investigate multiple diagnostic categories, it doesn't provide evidence for transdiagnosticity, per se.

      We appreciate Reviewer 2’s assessment of our research and agree generally with its weaknesses. We have now addressed the Reviewer’s comments regarding transdiagnosticity in the discussion of our revised version and have addressed their detailed recommendations below (see point-by-point responses).

      In addition to the below specific changes, in our Discussion section, we now have also added the following (lines 538 – 540):

      “Finally, we would like to note that our study is based on a general population sample, rather than a clinical one. Hence, we cannot speak to transdiagnosticity on the level of multiple diagnostic categories.”

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, Mehrhof and Nord study a large dataset of participants collected online (n=958 after exclusions) who performed a simple effort-based choice task. They report that the level of effort and reward influence choices in a way that is expected from prior work. They then relate choice preferences to neuropsychiatric syndromes and, in a smaller sample (n<200), to people's circadian preferences, i.e., whether they are a morning-preferring or evening-preferring chronotype. They find relationships between the choice bias (a model parameter capturing the likelihood to accept effort-reward challenges, like an intercept) and anhedonia and apathy, as well as chronotype. People with higher anhedonia and apathy and an evening chronotype are less likely to accept challenges (more negative choice bias). People with an evening chronotype are also more reward sensitive and more likely to accept challenges in the evening, compared to the morning.

      Strengths:

      This is an interesting and well-written manuscript which replicates some known results and introduces a new consideration related to potential chronotype relationships which have not been explored before. It uses a large sample size and includes analyses related to transdiagnostic as well as diagnostic criteria. I have some suggestions for improvements.

      Weaknesses:

      (1) The novel findings in this manuscript are those pertaining to transdiagnostic and circadian phenotypes. The authors report two separate but "overlapping" effects: individuals high on anhedonia/apathy are less willing to accept offers in the task, and similarly, individuals tested off their chronotype are less willing to accept offers in the task. The authors claim that the latter has implications for studying the former. In other words, because individuals high on anhedonia/apathy predominantly have a late chronotype (but might be tested early in the day), they might accept fewer offers, which could spuriously look like a link between anhedonia/apathy and choices but might in fact be an effect of the interaction between chronotype and time-of-testing. The authors therefore argue that chronotype needs to be accounted for when studying links between depression and effort tasks.

      The authors argue that, if X is associated with Y and Z is associated with Y, X and Z might confound each other. That is possible, but not necessarily true. It would need to be tested explicitly by having X (anhedonia/apathy) and Z (chronotype) in the same regression model. Does the effect of anhedonia/apathy on choices disappear when accounting for chronotype (and time-of-testing)? Similarly, when adding the interaction between anhedonia/apathy, chronotype, and time-of-testing, within the subsample of people tested off their chronotype, is there a residual effect of anhedonia/apathy on choices or not?

      If the effect of anhedonia/apathy disappeared (or got weaker) while accounting for chronotype, this result would suggest that chronotype mediates the effect of anhedonia/apathy on effort choices. However, I am not sure it renders the direct effect of anhedonia/apathy on choices entirely spurious. Late chronotype might be a feature (induced by other symptoms) of depression (such as fatigue and insomnia), and the association between anhedonia/apathy and effort choices might be a true and meaningful one. For example, if the effect of anhedonia/apathy on effort choices was mediated by altered connectivity of the dorsal ACC, we would not say that ACC connectivity renders the link between depression and effort choices "spurious", but we would speak of a mechanism that explains this effect. The authors should discuss in a more nuanced way what a significant mediation by the chronotype/time-of-testing congruency means for interpreting effects of depression in computational psychiatry.

      We thank the Reviewer for pointing out this crucial weakness in the original version of our manuscript. We have now thought deeply about this and agree with the Reviewer that our original results did not warrant our interpretation that reported effects of anhedonia and apathy on measures of effort-based decision-making could potentially be spurious. At the Reviewer’s suggestion, we decided to test this explicitly in our revised version—a decision that has now deepened our understanding of our results, and changed our interpretation thereof.  

      To investigate how the effects of neuropsychiatric symptoms and the effects of circadian measures relate to each other, we have followed the Reviewer’s advice and conducted an additional series of analyses (see below). Surprisingly (to us, but perhaps not the Reviewer) we discovered that all three symptom measures (two of anhedonia, one of apathy) have separable effects from circadian measures on the decision to expend effort (note we have also re-named our key parameter ‘motivational tendency’ to address this Reviewer’s next comment that the term ‘choice bias’ was unclear). In model comparisons (based on the leave-one-out information criterion, which penalises for model complexity) the models including both circadian and psychiatric measures always win against the models including only circadian or only psychiatric measures. In essence, this strengthens our claims about the importance of measuring circadian rhythm in effort-based tasks generally, as circadian rhythm clearly plays an important role even when considering neuropsychiatric symptoms, but crucially does not support the idea of spurious effects: statistically, circadian measures contribute separably from neuropsychiatric symptoms to the variance in effort-based decision-making. We think this is very interesting indeed, and it certainly clarifies (and corrects the inaccuracy in) our original interpretation. We can only express our thanks to the Reviewer for helping us understand our effect more fully.

      In response to these new insights, we have made numerous edits to our manuscript. First, we changed the title from “Overlapping effects of neuropsychiatric symptoms and circadian rhythm on effort-based decision-making” to “Both neuropsychiatric symptoms and circadian rhythm alter effort-based decision-making”. In the remaining manuscript we now refrain from using the word ‘overlapping’ (which could be interpreted as overlapping in explained variance), and instead describe the effects as parallel. We hope our new analyses, title, and clarified/improved interpretations together address the Reviewer’s valid concern about our manuscript’s main weakness.

      We detail these new analyses in the Methods section as follows (lines 800 – 814):

      “4.5.2. Differentiating between the effects of neuropsychiatric symptoms and circadian measures on motivational tendency

      To investigate how the effects of neuropsychiatric symptoms on motivational tendency (2.3.1) relate to effects of chronotype and time-of-day on motivational tendency we conducted exploratory analyses. In the subsamples of participants with an early or late chronotype (including additionally collected data), we first ran Bayesian GLMs with neuropsychiatric questionnaire scores (SHAPS, DARS, AES respectively) predicting motivational tendency, controlling for age and gender. We next added an interaction term of chronotype and time-of-day into the GLMs, testing how this changes previously observed neuropsychiatric and circadian effects on motivational tendency. Finally, we conducted a model comparison using LOO, comparing between motivational tendency predicted by a neuropsychiatric questionnaire, motivational tendency predicted by chronotype and time-of-day, and motivational tendency predicted by a neuropsychiatric questionnaire, chronotype and time-of-day (for each neuropsychiatric questionnaire, and controlling for age and gender).”

      Results of the outlined analyses are reported in the results section as follows (lines 356 – 383):

      “2.5.2.1 Neuropsychiatric symptoms and circadian measures have separable effects on motivational tendency

      Exploratory analyses testing for the effects of neuropsychiatric questionnaires on motivational tendency in the subsamples of early and late chronotypes confirmed the predictive value of the SHAPS (M=-0.24, 95% HDI=[-0.42,-0.06]), the DARS (M=-0.16, 95% HDI=[-0.31,-0.01]), and the AES (M=-0.18, 95% HDI=[-0.32,-0.02]) on motivational tendency.

      For the SHAPS, we find that when adding the measures of chronotype and time-of-day back into the GLMs, the main effect of the SHAPS (M=-0.26, 95% HDI=[-0.43,-0.07]), the main effect of chronotype (M=-0.11, 95% HDI=[-0.22,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remain. Model comparison by LOOIC reveals motivational tendency is best predicted by the model including the SHAPS, chronotype and time-of-day as predictors, followed by the model including only the SHAPS. Note that this approach to model comparison penalizes models for increasing complexity.

      Repeating these steps with the DARS, the main effect of the DARS is found numerically, but the 95% HDI just includes 0 (M=-0.15, 95% HDI=[-0.30,0.002]). The main effect of chronotype (M=-0.11, 95% HDI=[-0.21,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.18, 95% HDI=[0.05,0.33]) on motivational tendency remain. Model comparison identifies the model including the DARS and circadian measures as the best model, followed by the model including only the DARS.

      For the AES, the main effect of the AES is found (M=-0.19, 95% HDI=[-0.35,-0.04]). For the main effect of chronotype, the 95% HDI narrowly includes 0 (M=-0.10, 95% HDI=[-0.21,0.002]), while the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remains. Model comparison identifies the model including the AES and circadian measures as the best model, followed by the model including only the AES.”
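
      The reading of the intervals above (an effect is treated as present when its 95% HDI excludes 0) can be illustrated with a minimal sketch. The interval bounds are copied from the results quoted above for the models that include circadian measures; the function name `excludes_zero` and the dictionary layout are illustrative assumptions, not part of the authors' analysis code.

```python
def excludes_zero(lower: float, upper: float) -> bool:
    """Return True when the interval [lower, upper] lies entirely on one side of 0."""
    return lower > 0 or upper < 0


# 95% HDI bounds as reported above, keyed by (questionnaire model, term).
reported_hdis = {
    ("SHAPS", "questionnaire"): (-0.43, -0.07),
    ("SHAPS", "chronotype"): (-0.22, -0.01),
    ("SHAPS", "chronotype x time-of-day"): (0.07, 0.34),
    ("DARS", "questionnaire"): (-0.30, 0.002),
    ("DARS", "chronotype"): (-0.21, -0.01),
    ("DARS", "chronotype x time-of-day"): (0.05, 0.33),
    ("AES", "questionnaire"): (-0.35, -0.04),
    ("AES", "chronotype"): (-0.21, 0.002),
    ("AES", "chronotype x time-of-day"): (0.07, 0.34),
}

for (model, term), (lower, upper) in reported_hdis.items():
    verdict = "excludes 0" if excludes_zero(lower, upper) else "includes 0"
    print(f"{model:5s} | {term:24s} | [{lower}, {upper}] {verdict}")
```

      Under this reading, only the DARS main effect and the chronotype main effect in the AES model have intervals that include 0, matching the description in the quoted text.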

      We have now edited parts of our Discussion to discuss and reflect these new insights, including the following.

      Lines 399 – 402:

      “Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. However, research has rarely investigated how transdiagnostic mechanisms underlying neuropsychiatric conditions may relate to inter-individual differences in circadian rhythm.”

      Lines 475 – 480:

      “It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making largely are paralleled by circadian effects on the same neurocomputational parameter. Exploratory analyses predicting motivational tendency by neuropsychiatric symptoms and circadian measures simultaneously indicate the effects go beyond recapitulating each other, but rather explain separable parts of the variance in motivational tendency.”

      Lines 528 – 532:

      “Our reported analyses investigating neuropsychiatric and circadian effects on effort-based decision-making simultaneously are exploratory, as our study design was not ideally set out to examine this. Further work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making.”

      Lines 543 – 550:

      “We demonstrate that neuropsychiatric effects on effort-based decision-making are paralleled by effects of circadian rhythm and time-of-day. Exploratory analyses suggest these effects account for separable parts of the variance in effort-based decision-making. It is unlikely that the neuropsychiatric effects on effort-based decision-making reported here and in previous literature are a spurious result due to multicollinearity with chronotype. Yet, not accounting for chronotype and time of testing, which is the predominant practice in the field, could affect results.”

      (2) It seems that all key results relate to the choice bias in the model (as opposed to reward or effort sensitivity). It would therefore be helpful to understand what fundamental process the choice bias is really capturing in this task. This is not discussed, and the direction of effects is not discussed either, but potentially quite important. It seems that the choice bias captures how many effortful reward challenges are accepted overall which maybe captures general motivation or task engagement. Maybe it is then quite expected that this could be linked with questionnaires measuring general motivation/pleasure/task engagement. Formally, the choice bias is the constant term or intercept in the model for p(accept), but the authors never comment on what its sign means. If I'm not mistaken, people with higher anhedonia but also higher apathy are less likely to accept challenges and thus engage in the task (more negative choice bias). I could not find any discussion or even mention of what these results mean. This similarly pertains to the results on chronotype. In general, "choice bias" may not be the most intuitive term and the authors may want to consider renaming it. Also, given the sign of what the choice bias means could be flipped with a simple sign flip in the model equation (i.e., equating to accepting more vs accepting less offers), it would be helpful to show some basic plots to illustrate the identified differences (e.g., plotting the % accepted for people in the upper and lower tertile for the SHAPS score etc).

      We apologise that this was not made clear previously: the meaning and directionality of “choice bias” is indeed central to our results. We also thank the Reviewer for pointing out that the previously used term “choice bias” itself might not be intuitive. We have now changed this to ‘motivational tendency’ (see below) and added substantial details on this parameter to the manuscript, including additional explanations and visualisations of the model as suggested by the Reviewer (new Figure 3) and model-agnostic results to aid interpretation (new Figure S3). Note the latter is complex due to our staircasing procedure (see new figure panel D further detailing our staircasing procedure in Figure 2). This shows that participants with more pronounced anhedonia are less likely to accept offers than those with low anhedonia (Fig. S3A), a model-agnostic version of our central result.

      Our changes are detailed below:

      After careful evaluation we have decided to term the parameter “motivational tendency”, hoping that this will present a more intuitive description of the parameter.

      To aid with the understanding and interpretation of the model parameters, and motivational tendency in particular, we have added the following explanation to the main text:

      Lines 149 – 155:

      “The models posit efforts and rewards are joined into a subjective value (SV), weighed by individual effort and reward sensitivity parameters. The subjective value is then integrated with an individual motivational tendency (𝛼) parameter to guide decision-making. Specifically, the motivational tendency parameter determines the range at which subjective values are translated to acceptance probabilities: the same subjective value will translate to a higher acceptance probability the higher the motivational tendency.”

      Further, we have included a new figure, visualizing the model. This demonstrates how the different model parameters contribute to the model (A), and how different values on each parameter affect the model (B-D).
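
As a minimal numerical sketch of the statement that the same subjective value yields a higher acceptance probability under a higher motivational tendency, the snippet below assumes a logistic link with the motivational tendency entering as an additive intercept. The function name and the example values are hypothetical and are not taken from the manuscript.

```python
import math


def acceptance_probability(subjective_value: float, motivational_tendency: float) -> float:
    """Map a subjective value to p(accept), assuming a logistic link
    with the motivational tendency as an additive intercept."""
    return 1.0 / (1.0 + math.exp(-(motivational_tendency + subjective_value)))


# One fixed subjective value, three hypothetical motivational tendencies.
sv = 0.5
tendencies = (-1.0, 0.0, 1.0)
probs = [acceptance_probability(sv, a) for a in tendencies]

for a, p in zip(tendencies, probs):
    print(f"motivational tendency {a:+.1f} -> p(accept) = {p:.3f}")
```

Under this assumed form, a sign flip of the intercept in the model equation would reverse the ordering, which is why the direction of the parameter needs to be stated explicitly.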

      We agree that plotting model agnostic effects in our data may help the reader gain intuition of what our task results mean. We hope to address this with our added section on “Model agnostic task measures relating to questionnaires”. We first followed the reviewer’s suggestion of extracting subsamples with higher and low anhedonia (as measured with the SHAPS, highest and lowest quantile) and plotted the acceptance proportion across effort and reward levels (panel A in figure below). However, due to our implemented task design, this only shows part of the picture: the staircasing procedure individualises which effort-reward combination a participant is presented with. Therefore, group differences in choice behaviour will lead to differences in the development of the staircases implemented in our task. Thus, we plotted the count of offered effort-reward combinations for the subsamples of participants with high vs. low SHAPS scores by the end of the task, averaged across staircases and participants.

      As the aspect of task development due to the implemented staircasing may not have been explained sufficiently in the main text, we have included panel (D) in figure 2.

      Further, we have added the following figure reference to the main text (lines 189 – 193):

      “The development of offered effort and reward levels across trials is shown in figure 2D; this shows that as participants generally tend to accept challenges rather than reject them, the implemented staircasing procedure develops toward higher effort and lower reward challenges.”

      To statistically test effects of model-agnostic task measures on the neuropsychiatric questionnaires, we performed Bayesian GLMs with the proportion of accepted trials predicted by SHAPS and AES. This is reported in the text as follows.

      Supplement, lines 172 – 189:

      “To explore the relationship between model agnostic task measures to questionnaire measures of neuropsychiatric symptoms, we conducted Bayesian GLMs, with the proportion of accepted trials predicted by SHAPS scores, controlling for age and gender. The proportion of accepted trials averaged across effort and reward levels was predicted by the Snaith-Hamilton Pleasure Scale (SHAPS) sum scores (M=-0.07; 95%HDI=[-0.12,-0.03]) and the Apathy Evaluation Scale (AES) sum scores (M=-0.05; 95%HDI=[-0.10,-0.002]). Note that this was not driven only by higher effort levels; even confining data to the lowest two effort levels, SHAPS has a predictive value for the proportion of accepted trials: M=-0.05; 95%HDI=[-0.07,-0.02].

      A visualisation of model agnostic task measures relating to symptoms is given in Fig. S4, comparing subgroups of participants scoring in the highest and lowest quartile on the SHAPS. This shows that participants with a high SHAPS score (i.e., more pronounced anhedonia) are less likely to accept offers than those with a low SHAPS score (Fig. S4A). Due to the implemented staircasing procedure, group differences can also be seen in the effort-reward combinations offered per trial. While for both groups, the staircasing procedure seems to devolve towards high effort – low reward offers, this is more pronounced in the subgroup of participants with a lower SHAPS score (Fig S4B).”

      (3) None of the key effects relate to effort or reward sensitivity which is somewhat surprising given the previous literature and also means that it is hard to know if choice bias results would be equally found in tasks without any effort component. (The only analysis related to effort sensitivity is exploratory and in a subsample of N=56 per group looking at people meeting criteria for MDD vs matched controls.) Were stimuli constructed such that effort and reward sensitivity could be separated (i.e., are uncorrelated/orthogonal)? Maybe it would be worth looking at the % accepted in the largest or two largest effort value bins in an exploratory analysis. It seems the lowest and 2nd lowest effort level generally lead to accepting the challenge pretty much all the time, so including those effort levels might not be sensitive to individual difference analyses?

      We too were initially surprised by the lack of effect of neuropsychiatric symptoms on reward and effort sensitivity. To address the Reviewer’s first comment, the nature of the ‘choice bias’ parameter (now motivational tendency) is its critical importance in the context of effort-based decision-making: it is not modelled or measured explicitly in tasks without effort (such as typical reward tasks), so it would be impossible to test this in tasks without an effort component. 

      For the Reviewer’s second comment, the exploratory MDD analysis is not our only one related to effort sensitivity: the effort sensitivity parameter is included in all of our central analyses, and (like reward sensitivity), does not relate to our measured neuropsychiatric symptoms (e.g., see page 15). Note most previous effort tasks do not include a ‘choice bias’/motivational tendency parameter, potentially explaining this discrepancy. However, our model was quantitatively superior to models without this parameter, for example with only effort- and reward-sensitivity (page 11, Fig. 3).

      Our three model parameters (reward sensitivity, effort sensitivity, and choice bias/motivational tendency) were indeed uncorrelated/orthogonal to one another (see parameter orthogonality analyses below), making it unlikely that the variance and effect captured by our motivational tendency parameter (previously termed “choice bias”) should really be attributed to reward sensitivity. As per the Reviewer’s suggestion, we also examined whether the lowest two effort levels might not be sensitive to individual differences; in fact, we found that the proportion of accepted trials on the lowest effort levels alone was nevertheless predicted by anhedonia (see ceiling effect analyses below).

      Specifically, in terms of parameter orthogonality:

      When developing our task design and computational modelling approach we were careful to ensure that meaningful neurocomputational parameters could be estimated and that no spurious correlations between parameters would be introduced by modelling. By conducting parameter recoveries for all models, we showed that our modelling approach could reliably estimate parameters, and that estimated parameters are orthogonal to the other underlying parameters (as can be seen in Figure S1 in the supplement). It is thus unlikely that the variance and effect captured by our motivational tendency parameter (previously termed “choice bias”) should really be attributed to reward sensitivity.

      And finally, regarding the possibility of a ceiling effect for low effort levels:

      We agree that visual inspection of the proportion of accepted results across effort and reward values can lead to the belief that a ceiling effect prevents the two lowest effort levels from capturing any inter-individual differences. To test whether this is the case, we ran a Bayesian GLM with the SHAPS sum score predicting the proportion of accepted trials (controlling for age and gender), in a subset of the data including only trials with an effort level of 1 or 2. We found the SHAPS has a predictive value for the proportion of accepted trials in the lowest two effort levels: M=-0.05; 95%HDI=[-0.07,-0.02]. This is noted in the text as follows.

      Supplement, lines 175 – 180:

      “The proportion of accepted trials averaged across effort and reward levels was predicted by the Snaith-Hamilton Pleasure Scale (SHAPS) sum scores (M=-0.07; 95%HDI=[-0.12,-0.03]) and the Apathy Evaluation Scale (AES) sum scores (M=-0.05; 95%HDI=[-0.10,-0.002]). Note that this was not driven only by higher effort levels; even confining data to the lowest two effort levels, SHAPS has a predictive value for the proportion of accepted trials: M=-0.05; 95%HDI=[-0.07,-0.02].”

      (4) The abstract and discussion seem overstated (implications for the school system and statements on circadian rhythms which were not measured here). They should be toned down to reflect conclusions supported by the data.

      We thank the Reviewer for pointing this out, and have now removed these claims from the abstract and Discussion; we hope they now better reflect conclusions supported by these data directly.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Suggestions for improved or additional experiments, data or analyses.

      - For a non-computational audience, it would be useful to unpack the influence of the choice bias on behavior, as it is less clear how this would affect decision-making than sensitivity to effort or reward. Perhaps a figure showing accept/reject decisions when sensitivities are held and choice bias is high would be beneficial.

      We thank the Reviewer for suggesting additional explanations of the choice bias parameter to aid interpretation for non-computational readers; as per the Reviewer’s suggestion, we have now included additional explanations and visualisations (Figure 3) to make this as clear as possible. Please note also that, in response to one of the other Reviewers and after careful considerations, we have decided to rename the “choice bias” parameter to “motivational tendency”, hoping this will prove more intuitive.

      To aid with the understanding and interpretation of this and the other model parameters, we have added the following explanation to the main text.

      Lines 149 – 155:

      “The models posit efforts and rewards are joined into a subjective value (SV), weighed by individual effort and reward sensitivity parameters. The subjective value is then integrated with an individual motivational tendency (𝛼) parameter to guide decision-making. Specifically, the motivational tendency parameter determines the range at which subjective values are translated to acceptance probabilities: the same subjective value will translate to a higher acceptance probability the higher the motivational tendency.”

      Additionally, we add the following explanation to the Methods section.

      Lines 698 – 709:

      First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):

      with and for reward and effort sensitivity, and ℛ and 𝐸 for reward and effort. Higher effort and reward sensitivity mean the SV is more strongly influenced by changes in effort and reward, respectively (Fig. 3B-C). Hence, low effort and reward sensitivity mean the SV, and with that decision-making, is less guided by effort and reward offers, as would be in random decision-making.

      This SV is then transformed to an acceptance probability by a softmax function:

      with for the predicted acceptance probability and 𝛼 for the intercept representing motivational tendency. A high motivational tendency means a subject has a tendency, or bias, to accept rather than reject offers (Fig. 3D).
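
As a hedged sketch of the general form that a model of this kind can take, the two steps described above (a cost function producing SV, then a softmax with an intercept) can be written as follows. The symbols $\beta_R$ and $\beta_E$ for reward and effort sensitivity and the unspecified effort cost function $f$ are assumptions made for illustration; they are not the manuscript's own notation, and the manuscript's exact functional form is not reproduced here.

```latex
\begin{align}
  SV &= \beta_R \,\mathcal{R} \;-\; \beta_E \, f(E), \\
  p(\text{accept}) &= \frac{1}{1 + e^{-(\alpha + SV)}} .
\end{align}
```

Under this assumed form, $p(\text{accept})$ increases monotonically in $\alpha$ for any fixed SV, which matches the verbal description of a high motivational tendency as a bias towards accepting offers.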

      Our new figure (panels A-D in figure 3) visualizes the model. This demonstrates how the different model parameters come into play in the model (A), and how different values on each parameter affect the model (B-D).

      - The early and late chronotype groups have significant differences in ages and gender. Additional supplementary analysis here may mitigate any concerns from readers.

      The Reviewer is right to notice that our subsamples of early and late chronotypes differ significantly in age and gender, but it is important to note that all our analyses comparing these two groups take this into account, statistically controlling for age and gender. We regret that this was previously only mentioned in the Methods section, so this information was not accessible where most relevant. To remedy this, we have amended the Results section as follows.

      Lines 317 – 323:

      “Bayesian GLMs, controlling for age and gender, predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (i.e. those with a late chronotype had a higher reward sensitivity; M= 0.325, 95% HDI=[0.19,0.46]) and motivational tendency (higher in early chronotypes; M=-0.248, 95% HDI=[-0.37,-0.11]), as well as an interaction between chronotype and time-of-day on motivational tendency (M=0.309, 95% HDI=[0.15,0.48]).”

      (2) Recommendations for improving the writing and presentation.

      - I found the term 'overlapping' a little jarring. I think the authors use it to mean both neuropsychiatric symptoms and chronotypes affect task parameters, but they are not tested to be 'separable', nor is an interaction tested. Perhaps being upfront about how interactions are not being tested here (in the introduction, and not waiting until the discussion) would give an opportunity to operationalize this term.

      We agree with the Reviewer that our previously-used term “overlapping” was not ideal: it may have been misleading, and was not necessarily reflective of the nature of our findings. We now state explicitly that we are not testing an interaction between neuropsychiatric symptoms and chronotypes in our primary analyses. Additionally, following suggestions made by Reviewer 3, we ran new exploratory analyses to investigate how the effects of neuropsychiatric symptoms and circadian measures on motivational tendency relate to one another. These results in fact show that all three symptom measures have separable effects from circadian measures on motivational tendency. This supports the Reviewer’s view that ‘overlapping’ was entirely the wrong word—although it nevertheless shows the important contribution of circadian rhythm as well as neuropsychiatric symptoms in effort-based decision-making. We have changed the manuscript throughout to better describe this important, more accurate interpretation of our findings, including replacing the term “overlapping”. We changed the title from “Overlapping effects of neuropsychiatric symptoms and circadian rhythm on effort-based decision-making” to “Both neuropsychiatric symptoms and circadian rhythm alter effort-based decision-making”.

      To clarify the intention of our primary analyses, we have added the following to the last paragraph of the introduction.

      Lines 107 – 112:

      “Next, we pre-registered a follow-up experiment to directly investigate how circadian preference interacts with time-of-day on motivational decision-making, using the same task and computational modelling approach. While this allows us to test how circadian effects on motivational decision-making compare to neuropsychiatric effects, we do not test for possible interactions between neuropsychiatric symptoms and chronobiology.”

      We detail our new analyses in the Methods section as follows.

      Lines 800 – 814:

      “4.5.2 Differentiating between the effects of neuropsychiatric symptoms and circadian measures on motivational tendency

      To investigate how the effects of neuropsychiatric symptoms on motivational tendency (2.3.1) relate to effects of chronotype and time-of-day on motivational tendency we conducted exploratory analyses. In the subsamples of participants with an early or late chronotype (including additionally collected data), we first ran Bayesian GLMs with neuropsychiatric questionnaire scores (SHAPS, DARS, AES respectively) predicting motivational tendency, controlling for age and gender. We next added an interaction term of chronotype and time-of-day into the GLMs, testing how this changes previously observed neuropsychiatric and circadian effects on motivational tendency. Finally, we conducted a model comparison using LOO, comparing between motivational tendency predicted by a neuropsychiatric questionnaire, motivational tendency predicted by chronotype and time-of-day, and motivational tendency predicted by a neuropsychiatric questionnaire and time-of-day (for each neuropsychiatric questionnaire, and controlling for age and gender).”

      Results of the outlined analyses are reported in the Results section as follows.

      Lines 356 – 383:

      “2.5.2.1 Neuropsychiatric symptoms and circadian measures have separable effects on motivational tendency

      Exploratory analyses testing for the effects of neuropsychiatric questionnaires on motivational tendency in the subsamples of early and late chronotypes confirmed the predictive value of the SHAPS (M=-0.24, 95% HDI=[-0.42,-0.06]), the DARS (M=-0.16, 95% HDI=[-0.31,-0.01]), and the AES (M=-0.18, 95% HDI=[-0.32,-0.02]) on motivational tendency.

      For the SHAPS, we find that when adding the measures of chronotype and time-of-day back into the GLMs, the main effect of the SHAPS (M=-0.26, 95% HDI=[-0.43,-0.07]), the main effect of chronotype (M=-0.11, 95% HDI=[-0.22,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remain. Model comparison by LOOIC reveals motivational tendency is best predicted by the model including the SHAPS, chronotype and time-of-day as predictors, followed by the model including only the SHAPS. Note that this approach to model comparison penalizes models for increasing complexity.

      Repeating these steps with the DARS, the main effect of the DARS is found numerically, but the 95% HDI just includes 0 (M=-0.15, 95% HDI=[-0.30,0.002]). The main effect of chronotype (M=-0.11, 95% HDI=[-0.21,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.18, 95% HDI=[0.05,0.33]) on motivational tendency remain. Model comparison identifies the model including the DARS and circadian measures as the best model, followed by the model including only the DARS.

      For the AES, the main effect of the AES is found (M=-0.19, 95% HDI=[-0.35,-0.04]). For the main effect of chronotype, the 95% HDI narrowly includes 0 (M=-0.10, 95% HDI=[-0.21,0.002]), while the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remains. Model comparison identifies the model including the AES and circadian measures as the best model, followed by the model including only the AES.”

      In addition to the title change, we edited our Discussion to discuss and reflect these new insights, including the following.

      Lines 399 – 402:

      “Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. However, research has rarely investigated how transdiagnostic mechanisms underlying neuropsychiatric conditions may relate to inter-individual differences in circadian rhythm.”

      Lines 475 – 480:

      “It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making largely are paralleled by circadian effects on the same neurocomputational parameter. Exploratory analyses predicting motivational tendency by neuropsychiatric symptoms and circadian measures simultaneously indicate the effects go beyond recapitulating each other, but rather explain separable parts of the variance in motivational tendency.”

      Lines 528 – 532:

      “Our reported analyses investigating neuropsychiatric and circadian effects on effort-based decision-making simultaneously are exploratory, as our study design was not ideally set out to examine this. Further work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making.”

      Lines 543 – 550:

      “We demonstrate that neuropsychiatric effects on effort-based decision-making are paralleled by effects of circadian rhythm and time-of-day. Exploratory analyses suggest these effects account for separable parts of the variance in effort-based decision-making. It is unlikely that the neuropsychiatric effects on effort-based decision-making reported here and in previous literature are a spurious result due to multicollinearity with chronotype. Yet, not accounting for chronotype and time of testing, which is the predominant practice in the field, could affect results.”

      - A minor point, but it could be made clearer that many neurotransmitters have circadian rhythms (and not just dopamine).

      We agree this should have been made clearer, and have added the following to the Introduction.

      Lines 83 – 84:

      “Bi-directional links between chronobiology and several neurotransmitter systems have been reported, including dopamine47.

      (47) Kiehn, J.-T., Faltraco, F., Palm, D., Thome, J. & Oster, H. Circadian Clocks in the Regulation of Neurotransmitter Systems. Pharmacopsychiatry 56, 108–117 (2023).”

      - Making reference to other studies which have explored circadian rhythms in cognitive tasks would allow interested readers to explore the broader field. One such paper is: Bedder, R. L., Vaghi, M. M., Dolan, R. J., & Rutledge, R. B. (2023). Risk taking for potential losses but not gains increases with time of day. Scientific reports, 13(1), 5534, which also includes references to other similar studies in the discussion.

      We thank the Reviewer for pointing out that we failed to cite this relevant work. We have now included it in the Introduction as follows.

      Lines 97 – 98:

      “A circadian effect on decision-making under risk is reported, with the sensitivity to losses decreasing with time-of-day66.

      (66) Bedder, R. L., Vaghi, M. M., Dolan, R. J. & Rutledge, R. B. Risk taking for potential losses but not gains increases with time of day. Sci Rep 13, 5534 (2023).”

      (3) Minor corrections to the text and figures.

      None, clearly written and structured. Figures are high quality and significantly aid understanding.

      Reviewer #2 (Recommendations For The Authors):

      I did have a few more minor comments:

      - The manuscript doesn't clarify whether trials had time limits - so that participants might fail to earn points - or instead they did not and participants had to continue exerting effort until they were done. This is important to know since it impacts on decision-strategies and behavioral outcomes that might be analyzed. For example, if there is no time limit, it might be useful to examine the amount of time it took participants to complete their effort - and whether that had any relationship to choice patterns or symptomatology. Or, if they did, it might be interesting to test whether the relationship between choices and exerted effort depended on symptoms. For example, someone with depression might be less willing to choose effort, but just as, if not more likely to successfully complete a trial once it is selected.

      We thank the Reviewer for pointing out this important detail in the task design, which we should have made clearer. The trials did indeed have a time limit which was dependent on the effort level. To clarify this in the manuscript, we have made changes to Figure 2 and the Methods section. We agree it would be interesting to explore whether the exerted effort in the task related to symptoms. We explored this in our data by predicting the participant average proportion of accepted but failed trials by SHAPS score (controlling for age and gender). We found no relationship: M=0.01, 95% HDI=[-0.001,0.02]. However, it should be noted that the measure of proportion of failed trials may not be suitable here, as there are only few accepted but failed trials (M = 1.3% trials failed, SD = 3.50). This results from several task design characteristics aimed at preventing subjects from failing accepted trials, to avoid confounding of effort discounting with risk discounting. As an alternative measure, we explored the extent to which participants went “above and beyond” the target in accepted trials. Specifically, considering only accepted and succeeded trials, we computed the factor by which the required number of clicks was exceeded (i.e., if a subject clicked 15 times when 10 clicks were required the factor would be 1.5), averaging across effort and reward level. We then conducted a Bayesian GLM to test whether this subject-wise click-exceedance measure can be predicted by apathy or anhedonia, controlling for age and gender. We found neither the SHAPS (M=-0.14, 95% HDI=[-0.43,0.17]) nor the AES (M=0.07, 95% HDI=[-0.26,0.41]) had a predictive value for the extent to which subjects exert “extra effort”. We have now added this to the manuscript.

      In Figure 2, which explains the task design in the results section, we have added the following to the figure description.

      Lines 161 – 165:

      “Each trial consists of an offer with a reward (2,3,4, or 5 points) and an effort level (1,2,3, or 4, scaled to the required clicking speed and time the clicking must be sustained for) that subjects accept or reject. If accepted, a challenge at the respective effort level must be fulfilled for the required time to win the points.”

      In the Methods section, we have added the following.

      Lines 617 – 622:

      “We used four effort-levels, corresponding to a clicking speed at 30% of a participant’s maximal capacity for 8 seconds (level 1), 50% for 11 seconds (level 2), 70% for 14 seconds (level 3), and 90% for 17 seconds (level 4). Therefore, in each trial, participants had to fulfil a certain number of mouse clicks (dependent on their capacity and the effort level) in a specific time (dependent on the effort level).”
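
To make the mapping from effort level to a concrete click target tangible, the following is a minimal sketch. It assumes that a participant's maximal capacity is expressed in clicks per second and that the target is rounded up to a whole click; both choices, and the names `EFFORT_LEVELS` and `required_clicks`, are assumptions for illustration rather than details confirmed by the text.

```python
import math

# Effort level -> (percent of maximal clicking speed, duration in seconds),
# using the four levels stated in the Methods excerpt.
EFFORT_LEVELS = {1: (30, 8), 2: (50, 11), 3: (70, 14), 4: (90, 17)}


def required_clicks(max_clicks_per_second: float, effort_level: int) -> int:
    """Number of clicks needed within the trial's time limit (assumed rounding: up)."""
    percent, duration = EFFORT_LEVELS[effort_level]
    return math.ceil(percent * max_clicks_per_second * duration / 100)


# Hypothetical participant with a maximal capacity of 5 clicks per second.
targets = [required_clicks(5.0, level) for level in (1, 2, 3, 4)]
print(targets)
```

Because both the required speed and the duration grow with the effort level, the click target rises faster than linearly across levels under these assumptions.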

      In the Supplement, we have added the additional analyses suggested by the Reviewer.

      Lines 195 – 213:

      “3.2 Proportion of accepted but failed trials

      For each participant, we computed the proportion of trials in which an offer was accepted, but the required effort then not fulfilled (i.e., failed trials). There was no relationship between average proportion of accepted but failed trials and SHAPS score (controlling for age and gender): M=0.01, 95% HDI=[-0.001,0.02]. However, there are intentionally few accepted but failed trials (M = 1.3% trials failed, SD = 3.50). This results from several task design characteristics aimed at preventing subjects from failing accepted trials, to avoid confounding of effort discounting with risk discounting.”

      “3.3 Exertion of “extra effort”

      We also explored the extent to which participants went “above and beyond” the target in accepted trials. Specifically, considering only accepted and succeeded trials, we computed the factor by which the required number of clicks was exceeded (i.e., if a subject clicked 15 times when 10 clicks were required the factor would be 1.5), averaging across effort and reward level. We then conducted a Bayesian GLM to test whether this subject-wise click-exceedance measure can be predicted by apathy or anhedonia, controlling for age and gender. We found neither the SHAPS (M=-0.14, 95% HDI=[-0.43,0.17]) nor the AES (M=0.07, 95% HDI=[-0.26,0.41]) had a predictive value for the extent to which subjects exert “extra effort”.”
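
The click-exceedance measure can be sketched as below. The trial record layout (keys `accepted`, `succeeded`, `clicks`, `required`) is hypothetical, and for simplicity the sketch averages over trials directly rather than first averaging within each effort-reward cell.

```python
from statistics import mean


def exceedance_factor(trials):
    """Mean ratio of clicks made to clicks required, over accepted and
    successfully completed trials only. Returns None if there are none."""
    factors = [
        t["clicks"] / t["required"]
        for t in trials
        if t["accepted"] and t["succeeded"]
    ]
    return mean(factors) if factors else None


# Hypothetical trials for one participant.
example_trials = [
    {"accepted": True, "succeeded": True, "clicks": 12, "required": 10},
    {"accepted": True, "succeeded": True, "clicks": 11, "required": 10},
    {"accepted": True, "succeeded": False, "clicks": 7, "required": 10},   # failed: excluded
    {"accepted": False, "succeeded": False, "clicks": 0, "required": 20},  # rejected: excluded
]
factor = exceedance_factor(example_trials)
print(factor)
```

A factor of 1 would mean the participant clicked exactly as often as required; values above 1 indicate effort beyond the target.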

      - Perhaps relatedly, there is evidence that people with depression show less of an optimism bias in their predictions about future outcomes. As such, they show more "rational" choices in probabilistic decision tasks. I'm curious whether the Authors think that a weaker choice bias among those with stronger depression/anhedonia/apathy might be related. Also, are choices better matched with actual effort production among those with depression?

      We think this is a very interesting comment, but unfortunately feel our manuscript cannot properly speak to it: as in our response to the previous comment, our exploratory analysis linking the proportion of accepted but failed trials to anhedonia symptoms (i.e. less anhedonic people making more optimistic judgments of their likelihood of success) did not show a relationship between the two. However, this null finding may be the result of our task design, which is not laid out to capture such an effect (and in fact is designed to minimize trials of this nature). We have added the following to the Discussion section.

      Lines 442 – 445:

      “It is possible that a higher motivational tendency reflects a more optimistic assessment of future task success, in line with work on the optimism bias95; however our task intentionally minimized unsuccessful trials by titrating effort and reward; future studies should explore this more directly.

      (95) Korn, C. W., Sharot, T., Walter, H., Heekeren, H. R. & Dolan, R. J. Depression is related to an absence of optimistically biased belief updating about future life events. Psychological Medicine 44, 579–592 (2014).”

      - The manuscript does not clarify: How did the Authors ensure that each subject received each effort-reward combination at least once if a given subject always accepted or always rejected offers?

      We have made the following edit to the Methods section to better explain this aspect of our task design.

      Lines 642 – 655:

      “For each subject, trial-by-trial presentation of effort-reward combinations was made semi-adaptively by 16 randomly interleaved staircases. Each of the 16 possible offers (4 effort-levels x 4 reward-levels) served as the starting point of one of the 16 staircases. Within each staircase, after a subject accepted a challenge, the next trial’s offer on that staircase was adjusted (by increasing effort or decreasing reward). After a subject rejected a challenge, the next offer on that staircase was adjusted by decreasing effort or increasing reward. This ensured subjects received each effort-reward combination at least once (as each participant completed all 16 staircases), while individualizing trial presentation to maximize the trials’ informative value. Therefore, in practice, even in the case of a subject rejecting all offers (and hence the staircasing procedures always adapting by decreasing effort or increasing reward), the full range of effort-reward combinations will be represented in the task across the starting points of all staircases (and therefore before adaptation takes place).”
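
The staircasing logic described above can be sketched as follows. Several details are assumptions made for illustration only: whether effort or reward is adjusted is chosen at random with equal probability, levels are clamped at the ends of their range, and each staircase runs for a hypothetical four trials. The names `next_offer` and `run_task` are likewise hypothetical.

```python
import random
from itertools import product

EFFORTS = (1, 2, 3, 4)
REWARDS = (2, 3, 4, 5)


def clamp(value, levels):
    """Keep a level inside the available range."""
    return max(levels[0], min(levels[-1], value))


def next_offer(offer, accepted, rng):
    """After an accept, make the offer less attractive (more effort or less
    reward); after a reject, make it more attractive."""
    effort, reward = offer
    if rng.random() < 0.5:  # assumed: adjust effort or reward with equal probability
        effort = clamp(effort + (1 if accepted else -1), EFFORTS)
    else:
        reward = clamp(reward + (-1 if accepted else 1), REWARDS)
    return effort, reward


def run_task(choose, trials_per_staircase=4, seed=0):
    """Run 16 randomly interleaved staircases; `choose(offer)` returns True to accept."""
    rng = random.Random(seed)
    staircases = list(product(EFFORTS, REWARDS))  # one staircase per starting offer
    order = [i for i in range(len(staircases)) for _ in range(trials_per_staircase)]
    rng.shuffle(order)
    presented = []
    for i in order:
        offer = staircases[i]
        presented.append(offer)
        staircases[i] = next_offer(offer, choose(offer), rng)
    return presented


# A subject who rejects every offer still sees all 16 combinations,
# because each staircase presents its starting point before adapting.
offers_all_rejected = run_task(lambda offer: False)
print(len(set(offers_all_rejected)))
```

The same holds for a subject who accepts every offer, since coverage comes from the starting points rather than from the adaptive steps.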

      - The word "metabolic" is misspelled in Table 1

      - Figure 2 is missing panel label "C"

      - The word "effort" is repeated on line 448.

      We thank the Reviewer for their attentive reading of our manuscript and have corrected the mistakes mentioned.

      Reviewer #3 (Recommendations For The Authors):

      It is a bit difficult to get a sense of people's discounting from the plots provided. Could the authors show a few example individuals and their fits (i.e., how steep was effort discounting on average and how much variance was there across individuals; maybe they could show the mean discount function or some examples etc)

      We appreciate very much the Reviewer's suggestion to visualise our parameter estimates within and across individuals. We have implemented this in Figure S2.

      It would be helpful if correlations between the various markers used as dependent variables (SHAPS, DARS, AES, chronotype etc) could plotted as part of each related figure (e.g., next to the relevant effects shown).

      We agree with the Reviewer that a visual representation of the various correlations between dependent variables would be a better and more accessible communication than our current paragraph listing the correlations. We have implemented this by adding a new figure plotting all correlations in a heat map, with asterisks indicating significance.

      The authors use the term "meaningful relationship" - how is this defined? If undefined, maybe consider changing (do they mean significant?)

      We understand how our use of the term “(no) meaningful relationship” was confusing here. As we conducted most analyses in a Bayesian fashion, this is a formal definition of ‘meaningful’: the 95% highest density interval does not span across 0. However, we do not want this to be misunderstood as frequentist “significance” and agree clarity can be improved here. To avoid confusion, we have amended the manuscript where relevant (i.e., we now state “we found a (/no) relationship / effect” rather than “we found a meaningful relationship”).
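      The decision rule stated here (a 95% highest density interval that does not span 0) can be sketched as follows. This is a generic illustration, not the authors' analysis code. The posterior draws are simulated, and the helper names `hdi` and `excludes_zero` are hypothetical.

```python
import numpy as np


def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the posterior samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.ceil(prob * n))            # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


def excludes_zero(samples, prob=0.95):
    """True when the whole interval lies on one side of zero."""
    lo, hi = hdi(samples, prob)
    return lo > 0 or hi < 0


rng = np.random.default_rng(0)
effect = rng.normal(-0.11, 0.03, size=4000)   # simulated draws, clearly below 0
null = rng.normal(0.0, 0.05, size=4000)       # simulated draws centred on 0
print(excludes_zero(effect), excludes_zero(null))  # → True False
```

      Unlike a p-value, this rule is a statement about where the bulk of the posterior mass lies, which is why the response avoids the word "significance".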

      The authors do not include an inverse temperature parameter in their discounting models-can they motivate why? If a participant chose nearly randomly, which set of parameter values would they get assigned?

      Our decision to not include an inverse temperature parameter was made after an extensive simulation-based investigation of different models and task designs. A series of parameter recovery studies including models with an inverse temperature parameter revealed the inverse temperature parameter could not be distinguished from the reward sensitivity parameter. Specifically, inverse temperature seemed to capture the variance of the true underlying reward sensitivity parameter, leading to confounding between the two. Hence, including both reward sensitivity and inverse temperature would not have allowed us to reliably estimate either parameter. As our pre-registered hypotheses related to the reward sensitivity parameter, we opted to include models with the reward sensitivity parameter rather than the inverse temperature parameter in our model space. We have now added these simulations to our supplement.

      Nevertheless, we believe our models can capture random decision-making. The parameters of effort and reward sensitivity capture how sensitive one is to changes in effort/reward level. Hence, random decision-making can be interpreted as low effort and reward sensitivity, such that one’s decision-making is not guided by changes in effort and reward magnitude. With low effort/reward sensitivity, the motivational tendency parameter (previously “choice bias”) would capture to what extent this random decision-making is biased toward accepting or rejecting offers.

      The simulation results are now detailed in the Supplement.

      Lines 25 – 46:

      “1.2.1 Parameter recoveries including inverse temperature

      In the process of task and model space development, we also considered models incorporating an inverse temperature parameter. To this end, we conducted parameter recoveries for four models, defined in Table S3.

      Parameter recoveries indicated that parameters can be recovered reliably in model 1, which includes only effort sensitivity ( ) and inverse temperature as free parameters (on-diagonal correlations: .98 > r > .89, off-diagonal correlations: .04 > |r| > .004). However, as a reward sensitivity parameter is added to the model (model 2), parameter recovery seems to be compromised, as parameters are estimated less accurately (on-diagonal correlations: .80 > r > .68), and spurious correlations between parameters emerge (off-diagonal correlations: .40 > |r| > .17). This issue remains when motivational tendency is added to the model (model 4; on-diagonal correlations: .90 > r > .65; off-diagonal correlations: .28 > |r| > .03), but not when inverse temperature is modelled with effort sensitivity and motivational tendency, but not reward sensitivity (model 3; on-diagonal correlations: .96 > r > .73; off-diagonal correlations: .05 > |r| > .003).

      As our pre-registered hypotheses related to the reward sensitivity parameter, we opted to include models with the reward sensitivity parameter rather than the inverse temperature parameter in our model space.”
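      The on-diagonal and off-diagonal correlations reported in this passage come from comparing the parameters used to simulate data with the parameters recovered by model fitting. A minimal sketch of that comparison is shown below. It uses simulated values rather than the authors' recovery results, and the two parameters, the noise levels, and the function name `recovery_matrix` are assumptions for illustration.

```python
import numpy as np


def recovery_matrix(true_params, recovered_params):
    """Pearson correlations between true (rows) and recovered (columns) parameters."""
    k = true_params.shape[1]
    full = np.corrcoef(true_params, recovered_params, rowvar=False)
    return full[:k, k:]


rng = np.random.default_rng(1)
n_subjects = 500
true = rng.normal(size=(n_subjects, 2))        # two hypothetical parameters

# Well-behaved recovery: each estimate is its true value plus small noise.
good = true + rng.normal(scale=0.2, size=true.shape)

# Confounded recovery: the estimate of parameter 0 absorbs part of parameter 1.
bad = good.copy()
bad[:, 0] = 0.6 * true[:, 0] + 0.6 * true[:, 1] + rng.normal(scale=0.2, size=n_subjects)

for label, est in [("good", good), ("confounded", bad)]:
    r = recovery_matrix(true, est)
    on_diag = np.diag(r)
    off_diag = r[~np.eye(2, dtype=bool)]
    print(label, "on-diagonal:", on_diag.round(2), "off-diagonal:", off_diag.round(2))
```

      High on-diagonal values with near-zero off-diagonal values indicate identifiable parameters. Lower on-diagonal values together with sizeable off-diagonal values are the pattern the response describes for the models that combine reward sensitivity with inverse temperature.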

      And we now discuss random decision-making specifically in the Methods section.

      Lines 698 – 709:

      “First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):

      with and for reward and effort sensitivity, and  and  for reward and effort. Higher effort and reward sensitivity mean the SV is more strongly influenced by changes in effort and reward, respectively (Fig. 3B-C). Hence, low effort and reward sensitivity mean the SV, and with that decision-making, is less guided by effort and reward offers, as would be in random decision-making.

      This SV is then transformed to an acceptance probability by a softmax function:

      with for the predicted acceptance probability and  for the intercept representing motivational tendency. A high motivational tendency means a subject has a tendency, or bias, to accept rather than reject offers (Fig. 3D).”
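      The two modelling steps described in this passage (a cost function producing a subjective value, then a softmax producing an acceptance probability) can be sketched as below. The equations themselves did not survive in this copy of the response, so the functional form is an assumption: a parabolic effort cost is used because the response elsewhere names a parabolic model as the winning one, and the two-option softmax is written as a logistic function. All parameter names are hypothetical.

```python
import math


def subjective_value(reward, effort, reward_sens, effort_sens):
    """Assumed parabolic form: reward scaled linearly, effort cost scaled quadratically."""
    return reward_sens * reward - effort_sens * effort ** 2


def p_accept(reward, effort, reward_sens, effort_sens, tendency):
    """Logistic (two-option softmax) transform of intercept plus subjective value."""
    sv = subjective_value(reward, effort, reward_sens, effort_sens)
    return 1.0 / (1.0 + math.exp(-(tendency + sv)))


# With both sensitivities at zero, offers no longer influence choice and only
# the intercept (motivational tendency) sets the acceptance probability.
for reward, effort in [(1, 4), (4, 1)]:
    print(p_accept(reward, effort, reward_sens=0.0, effort_sens=0.0, tendency=0.0))
# prints 0.5 twice
```

      Under these assumptions, a participant choosing at random corresponds to sensitivities near zero, and a nonzero intercept shifts that random responding toward accepting or rejecting, as the response argues.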

      The pre-registration mentions effects of BMI and risk of metabolic disease-those are briefly reported in the factor loadings, but not discussed afterwards-although the authors stated hypotheses regarding these measures in their preregistration. Were those hypotheses supported?

      We reported these results (albeit only briefly) in the factor loadings resulting from our PLS regression and results from follow-up GLMs (see below). We have now amended the Discussion to enable further elaboration on whether they confirmed our hypotheses (this evidence was unclear, but we have subsequently followed up in a sample with type-2 diabetes, who also show reduced motivational tendency).

      Lines 258 – 261:

      “For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no relationship with motivational tendency was found, consistent with the smaller magnitude of reported component loadings from the PLS regression.”

      We have added the following paragraph to our discussion.

      Lines 491 – 502:

      “To our surprise, we did not find statistical evidence for a relationship between effort-based decision-making and measures of metabolic health (BMI and risk for type-2 diabetes). Our analyses linking BMI to motivational tendency reveal a numeric effect in line with our hypothesis: a higher BMI relating to a lower motivational tendency. However, the 95% HDI for this effect narrowly included zero (95%HDI=[-0.19,0.01]). Possibly, our sample did not have sufficient variance in metabolic health to detect dimensional metabolic effects in a current general population sample. A recent study by our group investigates the same neurocomputational parameters of effort-based decision-making in participants with type-2 diabetes and non-diabetic controls matched by age, gender, and physical activity105. We report a group effect on the motivational tendency parameter, with type-2 diabetic patients showing a lower tendency to exert effort for reward.”

      “(105) Mehrhof, S. Z., Fleming, H. A. & Nord, C. A cognitive signature of metabolic health in effort-based decision-making. Preprint at https://doi.org/10.31234/osf.io/4bkm9 (2024).”

      R-values are indicated as a range (e.g., from 0.07-0.72 for the last one in 2.1 which is a large range). As mentioned above, the full correlation matrix should be reported in figures as heatmaps.

      We agree with the Reviewer that a heatmap is a better way of conveying this information – see Figure 1 in response to their previous comment.  

      The answer on whether data was already collected is missing on the second preregistration link. Maybe this is worth commenting on somewhere in the manuscript.

      This question appears missing because, as detailed in the manuscript, we felt that technically some data *was* already collected by the time our second pre-registration was posted. This is because the second pre-registration detailed an additional data collection, with the goal of extending data from the original dataset to include extreme chronotypes and increase precision of analyses. To avoid any confusion regarding the lack of reply to this question in the pre-registration, we have added the following disclaimer to the description of the second pre-registration:

      “Please note the lack of response to the question regarding already collected data. This is because the data collection in the current pre-registration extends data from the original dataset to increase the precision of analyses. While this original data is already collected, none of the data collection described here has taken place.”

      Some referencing is not reflective of the current state of the field (e.g., for effort discounting: Sugiwaka et al., 2004 is cited). There are multiple labs that have published on this since then including Philippe Tobler's and Sven Bestmann's groups (e.g., Hartmann et al., 2013; Klein-Flügge et al., Plos CB, 2015).

      We agree absolutely, and have added additional, more recent references on effort discounting.

      Lines 67 – 68:

      “Higher costs devalue associated rewards, an effect referred to as effort-discounting33–37.”

      (33) Sugiwaka, H. & Okouchi, H. Reformative self-control and discounting of reward value by delay or effort. Japanese Psychological Research 46, 1–9 (2004).

      (34) Hartmann, M. N., Hager, O. M., Tobler, P. N. & Kaiser, S. Parabolic discounting of monetary rewards by physical effort. Behavioural Processes 100, 192–196 (2013).

      (35) Klein-Flügge, M. C., Kennerley, S. W., Saraiva, A. C., Penny, W. D. & Bestmann, S. Behavioral Modeling of Human Choices Reveals Dissociable Effects of Physical Effort and Temporal Delay on Reward Devaluation. PLOS Computational Biology 11, e1004116 (2015).

      (36) Białaszek, W., Marcowski, P. & Ostaszewski, P. Physical and cognitive effort discounting across different reward magnitudes: Tests of discounting models. PLOS ONE 12, e0182353 (2017).

      (37) Ostaszewski, P., Bąbel, P. & Swebodziński, B. Physical and cognitive effort discounting of hypothetical monetary rewards. Japanese Psychological Research 55, 329–337 (2013).

      There are lots of typos throughout (e.g., Supplementary martial, Mornignness etc)

      We thank the Reviewer for their attentive reading of our manuscript and have corrected our mistakes.

      In Table 1, it is not clear what the numbers given in parentheses are. The figure note mentions SD, IQR, and those are explicitly specified for some rows, but not all.

      After reviewing Table 1 we understand the comment regarding the clarity of the number in parentheses. In our original manuscript, for some variables, numbers were given per category (e.g. for gender and ethnicity), rather than per row, in which case the parenthetical statistic was indicated in the header row only. However, we now see that the clarity of the table would have been improved by adding the reported statistic for each row—we have corrected this.

      In Figure 1C, it would be much more helpful if the different panels were combined into one single panel (using differently coloured dots/lines instead of bars).

      We agree visualizing the proportion of accepted trials across effort and reward levels in one single panel aids interpretability. We have implemented it in the following plot (now Figure 2C).

      In Sections 2.2.1 and 4.2.1, the authors mention "mixed-effects analysis of variance (ANOVA) of repeated measures" (same in the preregistration). It is not clear if this is a standard RM-ANOVA (aggregating data per participant per condition) or a mixed-effects model (analysing data on a trial-by-trial level). This model seems to only include within-subjects variable, so it isn't a "mixed ANOVA" mixing within and between subjects effects.

      We apologise that our use of the term "mixed-effects analysis of variance (ANOVA) of repeated measures" is indeed incorrectly applied here. We aggregate data per participant and effort-by-reward combination, meaning there are no between-subject effects tested. We have corrected this to “repeated measures ANOVA”.

      In Section 2.2.2, the authors write "R-hats>1.002" but probably mean "R-hats < 1.002". ESS is hard to evaluate unless the total number of samples is given.

      We thank the Reviewer for noticing this mistake and have corrected it in the manuscript.

      In Section 2.3, the inference criterion is unclear. The authors first report "factor loadings" and then perform a permutation test that is not further explained. Which of these factors are actually needed for predicting choice bias out of chance? The permutation test suggests that the null hypothesis is just "none of these measures contributes anything to predicting choice bias", which is already falsified if only one of them shows an association with choice bias. It would be relevant to know for which measures this is the case. Specifically, it would be relevant to know whether adding circadian measures into a model that already contains apathy/anhedonia improves predictive performance.

      We understand the Reviewer’s concerns regarding the detail of explanation we have provided for this part of our analysis, but we believe there may have been a misunderstanding regarding the partial least squares (PLS) regression. Rather than identifying a number of factors to predict the outcome variable, a PLS regression identifies a model with one or multiple components, with various factor loadings of differing magnitude. In our case, the PLS regression identified a model with one component to best predict our outcome variable (motivational tendency, which in our previous version we called choice bias). This one component had factor loadings of our questionnaire-based measures, with measures of apathy and anhedonia having highest weights, followed by lesser weighted factor loadings by measures of circadian rhythm and metabolic health. The permutation test tests whether this component (consisting of the combination of factor loadings) can predict the outcome variable out of sample.

      We hope we have improved clarity on this in the manuscript by making the following edits to the Results section.

      Lines 248 – 251:

      “Permutation testing indicated the predictive value of the resulting component (with factor loadings described above) was significant out-of-sample (root-mean-squared error [RMSE]=0.203, p=.001).”

      Further, we hope to provide a more in-depth explanation of these results in the Methods section.

      Lines 755 – 759:

      “Statistical significance of obtained effects (i.e., the predictive accuracy of the identified component and factor loadings) was assessed by permutation tests, probing the proportion of root-mean-squared errors (RMSEs) indicating stronger or equally strong predictive accuracy under the null hypothesis.”
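      The permutation logic described in this passage can be sketched as follows. This is not the authors' pipeline: ordinary least squares stands in for the PLS regression to keep the example dependency-free, the data are simulated, and the fold count, number of permutations, and function names are assumptions.

```python
import numpy as np


def cv_rmse(X, y, n_folds=5):
    """Out-of-sample RMSE of a least-squares fit, estimated by k-fold cross-validation."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test_idx)), X[test_idx]]) @ coef
        errors.append(y[test_idx] - pred)
    return float(np.sqrt(np.mean(np.concatenate(errors) ** 2)))


def permutation_p(X, y, n_perm=200, seed=0):
    """Proportion of outcome-shuffled RMSEs that are as low as, or lower than, the observed one."""
    rng = np.random.default_rng(seed)
    observed = cv_rmse(X, y)
    null = np.array([cv_rmse(X, rng.permutation(y)) for _ in range(n_perm)])
    # Adding 1 to numerator and denominator keeps p from being exactly zero.
    p = (np.sum(null <= observed) + 1) / (n_perm + 1)
    return observed, p


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                        # simulated questionnaire scores
y = 0.5 * X[:, 0] + rng.normal(scale=0.5, size=200)  # simulated outcome
observed, p = permutation_p(X, y)
print(round(observed, 3), round(p, 3))
```

      Shuffling the outcome breaks any real link to the predictors, so the shuffled RMSEs show what predictive accuracy looks like under the null hypothesis. A small p means few shuffled fits predicted as well as the real one.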

      In Section 2.5, the authors simply report "that chronotype showed effects of chronotype on reward sensitivity", but the direction of the effect (higher reward sensitivity in early vs. late chronotype) remains unclear.

      We thank the Reviewer for pointing this out. While we did report the direction of effect, this was only presented in the subsequent parentheticals and could have been made much clearer. To assist with this, we have made the following addition to the text.

      Lines 317 – 320:

      “Bayesian GLMs, controlling for age and gender, predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (i.e. those with a late chronotype had a higher reward sensitivity; M= 0.325, 95% HDI=[0.19,0.46])”

      In Section 4.2, the authors write that they "implemented a previously-described procedure using Prolific pre-screeners", but no reference to this previous description is given.

      We thank the Reviewer for bringing our attention to this missing reference, which has now been added to the manuscript.

      In Supplementary Table S2, only the "on-diagonal correlations" are given, but off-diagonal correlations (indicative of trade-offs between parameters) would also be informative.

      We agree with the Reviewer that off-diagonal correlations between underlying and recovered parameters are crucial to assess confounding between parameters during model estimation. We reported this in Figure S1D, where we present the full correlation matrix between underlying and recovered parameters in a heatmap. We have now noticed that this plot was missing axis labels, which have been added now.

      I found it somewhat difficult to follow the results section without having read the methods section beforehand. At the beginning of the Results section, could the authors briefly sketch the outline of their study? Also, given they have a pre-registration, could the authors introduce each section with a statement of what they expected to find, and close with whether the data confirmed their expectations? In the current version of the manuscript, many results are presented without much context of what they mean.

      We agree a brief outline of the study procedure before reporting the results would be beneficial to following the subsequent text and have added the following to the end of our Introduction.

      Lines 101 – 106:

      “Here, we tested the relationship between motivational decision-making and three key neuropsychiatric syndromes: anhedonia, apathy, and depression, taking both a transdiagnostic and categorical (diagnostic) approach. To do this, we validate a newly developed effort-expenditure task, designed for online testing, and gamified to increase engagement. Participants completed the effort-expenditure task online, followed by a series of self-report questionnaires.”

      We have added references to our pre-registered hypotheses at multiple points in our manuscript.

      Lines 185 – 187:

      “In line with our pre-registered hypotheses, we found significant main effects for effort (F(1,14367)=4961.07, p<.0001) and reward (F(1,14367)=3037.91, p<.001), and a significant interaction between the two (F(1,14367)=1703.24, p<.001).”

      Lines 215 – 221:

      “Model comparison by out-of-sample predictive accuracy identified the model implementing three parameters (motivational tendency a, reward sensitivity , and effort sensitivity ), with a parabolic cost function (subsequently referred to as the full parabolic model) as the winning model (leave-one-out information criterion [LOOIC; lower is better] = 29734.8; expected log posterior density [ELPD; higher is better] = -14867.4; Fig. 31ED). This was in line with our pre-registered hypotheses.”

      Lines 252 – 258:

      “Bayesian GLMs confirmed evidence for psychiatric questionnaire measures predicting motivational tendency (SHAPS: M=-0.109; 95% highest density interval (HDI)=[-0.17,-0.04]; AES: M=-0.096; 95%HDI=[-0.15,-0.03]; DARS: M=-0.061; 95%HDI=[-0.13,-0.01]; Fig. 4A). Post-hoc GLMs on DARS sub-scales showed an effect for the sensory subscale (M=-0.050; 95%HDI=[-0.10,-0.01]). This result of neuropsychiatric symptoms predicting a lower motivational tendency is in line with our pre-registered hypothesis.”

      Lines 258 – 263:

      “For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no relationship with motivational tendency was found, consistent with the smaller magnitude of reported component loadings from the PLS regression. This null finding for dimensional measures of circadian rhythm and metabolic health was not in line with our pre-registered hypotheses.”

      Lines 268 – 270:

      “For reward sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE. This result was not in line with our pre-registered expectations.”

      Lines 295 – 298:

      “As in our transdiagnostic analyses of continuous neuropsychiatric measures (Results 2.3), we found evidence for a lower motivational tendency parameter in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.20,-0.03]) (Fig. 4B). This result confirmed our pre-registered hypothesis.”

      Lines 344 – 355:

      “Late chronotypes showed a lower motivational tendency than early chronotypes (M=-0.11, 95% HDI=[-0.22,-0.02])—comparable to effects of transdiagnostic measures of apathy and anhedonia, as well as diagnostic criteria for depression. Crucially, we found motivational tendency was modulated by an interaction between chronotype and time-of-day (M=0.19, 95% HDI=[0.05,0.33]): post-hoc GLMs in each chronotype group showed this was driven by a time-of-day effect within late, rather than early, chronotype participants (M=0.12, 95% HDI=[0.02,0.22], such that late chronotype participants showed a lower motivational tendency in the morning testing sessions, and a higher motivational tendency in the evening testing sessions; early chronotype: 95% HDI=[-0.16,0.04]) (Fig. 5A). These results of a main effect and an interaction effect of chronotype on motivational tendency confirmed our pre-registered hypothesis.”

      Lines 390 – 393:

      “Participants with an early chronotype had a lower reward sensitivity parameter than those with a late chronotype (M=0.27, 95% HDI=[0.16,0.38]). We found no effect of time-of-day on reward sensitivity (95%HDI=[-0.09,0.11]) (Fig. 5B). These results were in line with our pre-registered hypotheses.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors used structural and biophysical methods to provide insight into Parkin regulation. The breadth of data supporting their findings was impressive and generally well-orchestrated. Still, the impact of their results builds on recent structural studies and the stated impact is based on these prior works.

      Strengths:

      (1) After reading through the paper, the major findings are:

      - RING2 and pUbl compete for binding to RING0.

      - Parkin can dimerize.

      - ACT plays an important role in enzyme kinetics.

      (2) The use of molecular scissors in their construct represents a creative approach to examining inter-domain interactions.

      (3) From my assessment, the experiments are well-conceived and executed.

      We thank the reviewer for their positive remark and extremely helpful suggestions.

      Weaknesses:

      The manuscript, as written, is NOT for a general audience. Admittedly, I am not an expert on Parkin structure and function, but I had to do a lot of homework to try to understand the underlying rationale and impact. This reflects, I think, that the work generally represents an incremental advance on recent structural findings.

      To this point, it is hard to understand the impact of this work without more information highlighting the novelty. There are several structures of Parkin in various auto-inhibited states, and it was hard to delineate how this is different.

      For the sake of the general audience, we have included all the details of Parkin structures and conformations seen (Extended Fig. 1). The structures in the present study are to validate the biophysical/biochemical experiments, highlighting key findings. For example, we solved the phospho-Parkin (complex with pUb) structure after treatment with 3C protease (Fig. 2C), which washes off the pUbl-linker, as shown in Fig 2B. The structure of the pUbl-linker depleted phospho-Parkin-pUb complex showed that RING2 returned to the closed state (Fig. 2C), which is confirmation of the SEC assay in Fig. 2B. Similarly, the structure of the pUbl-linker depleted phospho-Parkin R163D/K211N-pUb complex (Fig. 3C), was done to validate the SEC data showing displacement of pUbl-linker is independent of pUbl interaction with the basic patch on RING0 (Fig. 3B). In addition, the latter structure also revealed a new donor ubiquitin binding pocket in the linker (connecting REP and RING2) region of Parkin (Fig. 9). Similarly, trans-complex structure of phospho-Parkin (Fig. 4D) was done to validate the biophysical data (Fig. 4A-C, Fig. 5A-D) showing trans-complex between phospho-Parkin and native Parkin. The latter also confirmed that the trans-complex was mediated by interactions between pUbl and the basic patch on RING0 (Fig. 4D). Furthermore, we noticed that the ACT region was disordered in the trans-complex between phospho-Parkin (1-140 + 141-382 + pUb) (Fig. 8A) which had ACT from the trans molecule, indicating ACT might be present in the cis molecule. The latter was validated from the structure of trans-complex between phospho-Parkin with cis ACT (1-76 + 77-382 + pUb) (Fig. 8C), showing the ordered ACT region. The structural finding was further validated by biochemical assays (Fig. 8 D-F, Extended Data Fig. 9C-E).

      The structure of TEV-treated R0RBR (TEV) (Extended Data Fig. 4C) was done to ensure that the inclusion of TEV and treatment with TEV protease did not perturb Parkin folding, an important control for our biophysical experiments.

      As noted, I appreciated the use of protease sites in the fusion protein construct. It is unclear how the loop region might affect the protein structure and function. The authors worked to demonstrate that this did not introduce artifacts, but the biological context is missing.

      We thank the reviewer for appreciating the use of protease sites in the fusion protein construct. Protease sites were used to overcome the competing mode of binding that makes interactions very transient and beyond the detection limit of methods such as ITC or SEC. While these interactions are quite transient in nature, they could still be useful for the activation of various Parkin isoforms that lack either the Ubl domain or RING2 domain (Extended Data Fig. 6, Fig. 10). Our Parkin localization assays also suggest an important role of these interactions in the recruitment of Parkin molecules to the damaged mitochondria (Fig. 6).

      While it is likely that the binding is competitive between the Ubl and RING2 domains, the data is not quantitative. Is it known whether the folding of the distinct domains is independent? Or are there interactions that alter folding? It seems plausible that conformational rearrangements may invoke an orientation of domains that would be incompatible. The biological context for the importance of this interaction was not clear to me.

      This is a great point. In the revised manuscript, we have included quantitative data between phospho-Parkin and untethered ∆Ubl-Parkin (TEV) (Fig. 5B) showing similar interactions using phospho-Parkin K211N and untethered ∆Ubl-Parkin (TEV) (Fig. 4B). Folding of the Ubl domain or of various combinations of RING domains lacking Ubl seems okay. Also, folding of the RING2 domain on its own appears to be fine. However, human Parkin lacking the RING2 domain seems to have some folding issues, mainly due to exposure of the hydrophobic pocket on RING0, as also suggested by previous efforts (Gladkova et al., ref. 24; Sauve et al., ref. 29). The latter could be overcome by co-expression of the RING2-lacking Parkin construct with PINK1 (Sauve et al., ref. 29), as phospho-Ubl binds the same hydrophobic pocket on RING0 where RING2 binds. A drastic reduction in the melting temperature of phospho-Parkin (Gladkova et al., ref. 24), very likely due to exposure of the hydrophobic surface between RING0 and RING2, correlates with the folding issues of RING0-exposed human Parkin constructs.

      From the biological context, the competing nature between phospho-Ubl and RING2 domains could block the non-specific interaction of phosphorylated-ubiquitin-like proteins (phospho-Ub or phospho-NEDD8) with RING0 (Lenka et al. ref. 33), during Parkin activation. 

      (5) What is the rationale for mutating Lys211 to Asn? Were other mutations tried? Glu? Ala? Just missing the rationale. I think this may have been identified previously in the field, but not clear what this mutation represents biologically.

      Lys211Asn is a Parkinson’s disease mutation; therefore, we decided to use the same mutation for biophysical studies.  

      I was confused about how the phospho-proteins were generated. After looking through the methods, there appear to be phosphorylation experiments, but it is unclear what the efficiency was for each protein (i.e. what % gets modified). In the text, the authors refer to phospho-Parkin (T270R, C431A), but not clear how these mutations might influence this process. I gather that these are catalytically inactive, but it is unclear to me how this is catalyzing the ubiquitination in the assay.

      This is an excellent question. Because different phosphorylation statuses would affect the analysis, we ensured complete phosphorylation status using Phos-Tag SDS-PAGE, as shown below.

      Author response image 1.

      Our biophysical experiments in Fig. 5C show that trans complex formation is mediated by interactions between the basic patch (comprising K161, R163, K211) on RING0 and phospho-Ubl domain in trans. These interactions result in the displacement of RING2 (Fig. 5C). Parkin activation is mediated by displacement of RING2 and exposure of catalytic C431 on RING2. While phospho-Parkin T270R/C431A is catalytically dead, the phospho-Ubl domain of phospho-Parkin T270R/C431A would bind to the basic patch on RING0 of WT-Parkin resulting in activation of WT-Parkin as shown in Fig. 5E. A schematic figure is shown below to explain the same.

      Author response image 2.

      (7) The authors note that "ACT can be complemented in trans; however, it is more efficient in cis", but it is unclear whether both would be important or if the favored interaction is dominant in a biological context.

      First, this is an excellent question about the biological context of ACT and needs further exploration. While due to the flexible nature of ACT, it can be complemented both in cis and trans, we can only speculate cis interactions between ACT and RING0 could be more relevant from the biological context as during protein synthesis and folding, ACT would be translated before RING2, and thus ACT would occupy the small hydrophobic patch on RING0 in cis. Unpublished data shows the replacement of the ACT region by Biogen compounds to activate Parkin (https://doi.org/10.21203/rs.3.rs-4119143/v1). The latter finding further suggests the flexibility in this region.        

      (8) The authors repeatedly note that this study could aid in the development of small-molecule regulators against Parkin to treat PD, but this is a long way off. And it is not clear from their manuscript how this would be achieved. As stated, this is conjecture.

      As suggested by this reviewer, we have removed this point in the revised manuscript.

      Reviewer #2 (Public Review):

      This manuscript uses biochemistry and X-ray crystallography to further probe the molecular mechanism of Parkin regulation and activation. Using a construct that incorporates cleavage sites between different Parkin domains to increase the local concentration of specific domains (i.e., molecular scissors), the authors suggest that competitive binding between the p-Ubl and RING2 domains for the RING0 domain regulates Parkin activity. Further, they demonstrate that this competition can occur in trans, with a p-Ubl domain of one Parkin molecule binding the RING0 domain of a second monomer, thus activating the catalytic RING1 domain. In addition, they suggest that the ACT domain can similarly bind and activate Parkin in trans, albeit at a lower efficiency than that observed for p-Ubl. The authors also suggest from crystal structure analysis and some biochemical experiments that the linker region between RING2 and repressor elements interacts with the donor ubiquitin to enhance Parkin activity.

      Ultimately this manuscript challenges previous work suggesting that the p-Ubl domain does not bind to the Parkin core in the mechanism of Parkin activation. The use of the 'molecular scissors' approach to probe these effects is an interesting approach to probe this type of competitive binding. However, there are issues with the experimental approach manuscript that detract from the overall quality and potential impact of the work.

      We thank the reviewer for their positive remark and constructive suggestions.

      The competitive binding between p-Ubl and RING2 domains for the Parkin core could have been better defined using biophysical and biochemical approaches that explicitly define the relative affinities that dictate these interactions. A better understanding of these affinities could provide more insight into the relative bindings of these domains, especially as it relates to the in trans interactions.

      This is an excellent point regarding the relative affinities of pUbl and RING2 for the Parkin core (lacking Ubl and RING2). While we could purify p-Ubl, we failed to purify human Parkin lacking RING2 and phospho-Ubl. These folding issues were likely due to the exposure of a highly hydrophobic surface on RING0 (as shown below) in the absence of pUbl and RING2 in the R0RB construct. Likewise, RING2 with an exposed hydrophobic surface would be prone to folding issues, which makes it unsuitable for affinity measurements. The drastic reduction in the melting temperature of phospho-Parkin (Gladkova et al., ref. 24) also highlights the importance of the hydrophobic surface between RING0 and RING2 for Parkin folding and stability. A separate study would be required to test these Parkin constructs from different species and to ensure proper folding before using them for affinity measurements.

      Author response image 3.

      I also have concerns about the results of using molecular scissors to 'increase local concentrations' and allow for binding to be observed. These experiments are done primarily using proteolytic cleavage of different domains followed by size exclusion chromatography. ITC experiments suggest that the binding constants for these interactions are in the µM range, although these experiments are problematic as the authors indicate in the text that protein precipitation was observed during these experiments. This type of binding could easily be measured in other assays. My issue relates to the ability of a protein complex (comprising the core and cleaved domains) with a Kd of 1 µM to be maintained in an SEC experiment. The off-rates for these complexes must be exceedingly slow, which doesn't really correspond to the low µM binding constants discussed in the text. How do the authors explain this? What is driving the Koff to levels sufficiently slow to prevent dissociation by SEC? Considering that the authors are challenging previous work describing the lack of binding between the p-Ubl domain and the core, these issues should be better resolved in this current manuscript. Further, it's important to have a more detailed understanding of relative affinities when considering the functional implications of this competition in the context of full-length Parkin. Similar comments could be made about the ACT experiments described in the text.

      This is a great point. In the revised manuscript, we repeated the ITC measurements in a different buffer system, which yielded good-quality ITC data. We have also performed ITC measurements using native phospho-Parkin. Phospho-Parkin and untethered ∆Ubl-Parkin (TEV) (Fig. 5B) show affinities similar to those seen between phospho-Parkin K211N and untethered ∆Ubl-Parkin (TEV) (Fig. 4B). However, the Kd values remained in the range of 1.0 ± 0.4 µM, which does not address the reviewer’s point regarding a slow off-rate. The crystal structure of the trans-complex of phospho-Parkin shows several hydrophobic and ionic interactions between p-Ubl and the Parkin core, suggesting a strong interaction and thus justifying the co-elution on SEC. Additionally, ITC measurements between E2-Ub and P-Parkin-pUb show a similar affinity (Kd = 0.9 ± 0.2 µM), and yet they co-elute on SEC (Kumar et al., 2015, EMBO J.).
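
      For reference, the relation between Kd and off-rate that this exchange turns on can be sketched numerically. The snippet below is only an illustration, not a measurement from the manuscript: it uses the standard relation Kd = koff / kon together with the reported Kd of about 1 µM, and it assumes association rate constants of 1e5 to 1e6 per molar per second, a range often quoted for protein-protein binding that was not determined in this study. The function names are hypothetical.

```python
import math


def off_rate(kd_molar, kon_per_molar_per_s):
    """Dissociation rate constant (1/s) from Kd = koff / kon."""
    return kd_molar * kon_per_molar_per_s


def half_life_s(koff_per_s):
    """Half-life (s) of a complex that dissociates with first-order rate koff."""
    return math.log(2) / koff_per_s


# Kd reported in the response (about 1 micromolar), expressed in molar.
KD_MOLAR = 1.0e-6

# ASSUMPTION: illustrative association rate constants, not measured here.
ASSUMED_KON_VALUES = (1.0e5, 1.0e6)

for kon in ASSUMED_KON_VALUES:
    koff = off_rate(KD_MOLAR, kon)
    print(f"kon = {kon:.0e} /M/s -> koff = {koff:.2g} /s, "
          f"half-life = {half_life_s(koff):.2g} s")
```

      Under these assumed on-rates, the complex half-life would be on the order of seconds, which is the tension the reviewer raises; a slower on-rate would lengthen it proportionally.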

      Ultimately, this work does suggest additional insights into the mechanism of Parkin activation that could contribute to the field. There is a lot of information included in this manuscript, giving it breadth, albeit at the cost of depth for the study of specific interactions. Further, I felt that the authors oversold some of their data in the text, and I'd recommend being a bit more careful when claiming an experiment 'confirms' a specific model. In many cases, there are other models that could explain similar results. For example, in Figure 1C, the authors state that their crystal structure 'confirms' that "RING2 is transiently displaced from the RING0 domain and returns to its original position after washing off the p-Ubl linker". However, it isn't clear to me that RING2 ever dissociated when prepared this way. While there are issues with the work that I feel should be further addressed with additional experiments, there are interesting mechanistic details suggested by this work that could improve our understanding of Parkin activation. However, the full impact of this work won't be fully appreciated until there is a more thorough understanding of the regulation and competitive binding between p-Ubl and RING2 to R0RB both in cis and in trans.

      We thank the reviewer for their positive comment. In the revised manuscript, we have included the reviewer’s suggestion. The conformational changes in phospho-Parkin were established from the SEC assay (Fig. 2A and Fig. 2B), which show displacement/association of phospho-Ubl or RING2 after treatment of phospho-Parkin with 3C and TEV, respectively. For crystallization, we first phosphorylated Parkin, where RING2 is displaced due to phospho-Ubl (as shown in SEC), followed by treatment with 3C protease, which led to pUbl wash-off. The Parkin core separated from phospho-Ubl on SEC was used for crystallization and structure determination in Fig. 2C, where RING2 returned to the RING0 pocket, which confirms SEC data (Fig. 2B).

      Reviewer #3 (Public Review):

      Summary:

      In their manuscript "Additional feedforward mechanism of Parkin activation via binding of phospho-UBL and RING0 in trans", Lenka et al present data that could suggest an "in trans" model of Parkin ubiquitination activity. Parkin is an intensely studied E3 ligase implicated in mitophagy, whereby missense mutations to the PARK2 gene are known to cause autosomal recessive juvenile parkinsonism. From a mechanistic point of view, Parkin is extremely complex. Its activity is tightly controlled by several modes of auto-inhibition that must be released by cues of mitochondrial damage. While the general overview of Parkin activation has been mapped out in recent years, several details have remained murky. In particular, whether Parkin dimerizes as part of its feed-forward signaling mechanism, and whether said dimerization can facilitate ligase activation, has remained unclear. Here, Lenka et al. use various truncation mutants of Parkin in an attempt to understand the likelihood of dimerization (in support of an "in trans" model for catalysis).

      Strengths:

      The results are bolstered by several distinct approaches including analytical SEC with cleavable Parkin constructs, ITC interaction studies, ubiquitination assays, protein crystallography, and cellular localization studies.

      We thank the reviewer for their positive remark.

      Weaknesses:

      As presented, however, the storyline is very confusing to follow and several lines of experimentation felt like distractions from the primary message. Furthermore, many experiments could only indirectly support the authors' conclusions, and therefore the final picture of what new features can be firmly added to the model of Parkin activation and function is unclear.

      We thank the reviewer for their constructive criticism, which has helped us to improve the quality of this manuscript.

      Major concerns:

      (1) This manuscript solves numerous crystal structures of various Parkin components to help support their idea of in trans transfer. The way these structures are presented more resemble models and it is unclear from the figures that these are new complexes solved in this work, and what new insights can be gleaned from them.

      The structures in the present study are to validate the biophysical/biochemical experiments highlighting key findings. For example, we solved the phospho-Parkin (complex with pUb) structure after treatment with 3C protease (Fig. 2C), which washes off the pUbl-linker, as shown in Fig. 2B. The structure of pUbl-linker depleted phospho-Parkin-pUb complex showed that RING2 returned to the closed state (Fig. 2C), which is confirmation of the SEC assay in Fig. 2B. Similarly, the structure of the pUbl-linker depleted phospho-Parkin R163D/K211N-pUb complex (Fig. 3C), was done to validate the SEC data showing displacement of pUbl-linker is independent of pUbl interaction with the basic patch on RING0 (Fig. 3B). In addition, the latter structure also revealed a new donor ubiquitin binding pocket in the linker (connecting REP and RING2) region of Parkin (Fig. 9). Similarly, trans-complex structure of phospho-Parkin (Fig. 4D) was done to validate the biophysical data (Fig. 4A-C, Fig. 5A-D) showing trans-complex between phospho-Parkin and native Parkin. The latter also confirmed that the trans-complex was mediated by interactions between pUbl and the basic patch on RING0 (Fig. 4D). Furthermore, we noticed that the ACT region was disordered in the trans-complex between phospho-Parkin (1-140 + 141-382 + pUb) (Fig. 8A) which had ACT from the trans molecule, indicating ACT might be present in the cis molecule. The latter was validated from the structure of trans-complex between phospho-Parkin with cis ACT (1-76 + 77-382 + pUb) (Fig. 8C), showing the ordered ACT region. The structural finding was further validated by biochemical assays (Fig. 8 D-F, Extended Data Fig. 9C-E).

      The structure of TEV-treated R0RBR (TEV) (Extended Data Fig. 4C) was done to ensure that the inclusion of TEV and treatment with TEV protease did not perturb Parkin folding, an important control for our biophysical experiments.

      (2) There are no experiments that definitively show the in trans activation of Parkin. The binding experiments and size exclusion chromatography are a good start, but the way these experiments are performed, they'd be better suited as support for a stronger experiment showing Parkin dimerization. In addition, the rationale for an in trans activation model is not convincingly explained until the concept of Parkin isoforms is introduced in the Discussion. The authors should consider expanding this concept into other parts of the manuscript.

      We thank the reviewer for appreciating the Parkin dimerization data. Our biophysical data in Fig. 5C show that Parkin dimerization is mediated by interactions between phospho-Ubl and RING0 in trans, leading to the displacement of RING2. However, the Parkin K211N mutation (on RING0) perturbs the interaction with phospho-Parkin and leads to loss of Parkin dimerization and loss of RING2 displacement (Fig. 5C). The interaction between pUbl and the K211 pocket on RING0 displaces RING2, resulting in Parkin activation, as the catalytic residue C431 on RING2 is exposed for catalysis. The biophysical experiment is further confirmed by a biochemical experiment in which the addition of catalytically inactive phospho-Parkin T270R/C431A activates autoinhibited WT-Parkin in trans using the mechanism discussed (a schematic representation is also shown in Author response image 2).

      We thank the reviewer for the suggestion regarding Parkin isoforms. In the revised manuscript, we have also included Parkin isoforms in the results section.

      (2a) For the in trans activation experiment using wt Parkin and pParkin (T270R/C431A) (Figure 3D), there needs to be a large excess of pParkin to stimulate the catalytic activity of wt Parkin. This experiment has low cellular relevance as these point mutations are unlikely to occur together to create this nonfunctional pParkin protein. In the case of pParkin activating wt Parkin (regardless of artificial point mutations inserted to study specifically the in trans activation), if there needs to be much more pParkin around to fully activate wt Parkin, isn't it just more likely that the pParkin would activate in cis?

      To test phospho-Parkin as an activator of Parkin in trans, we wanted to use the catalytically inactive version of phospho-Parkin to avoid the background activity of p-Parkin. While it is true that a large excess of pParkin (T270R/C431A) is required to activate WT-Parkin in the in vitro set-up, it is not very surprising as in WT-Parkin, the unphosphorylated Ubl domain would block the E2 binding site on RING1. Also, due to interactions between pParkin (T270R/C431A) molecules, the net concentration of pParkin (T270R/C431A) as an activator would be much lower. However, the Ubl blocking E2 binding site on RING1 won’t be an issue between phospho-Parkin molecules or between Parkin isoforms (lacking Ubl domain or RING2).

      (2ai) Another underlying issue with this experiment is that the authors do not consider the possibility that the increased activity observed is a result of increased "substrate" for auto-ubiquitination, as opposed to any role in catalytic activation. Have the authors considered looking at Miro as a substrate in order to control for this?

      This is quite an interesting point. However, this would only be possible if Parkin is ubiquitinated in trans, as auto-ubiquitination is possible with active Parkin and not with catalytically dead (phospho-Parkin T270R/C431A) or autoinhibited (WT-Parkin) Parkin. Also, in the previous version of the manuscript, where we used only phospho-Ubl as an activator of Parkin in trans, we tested Miro1 ubiquitination and auto-ubiquitination, and the results were the same (Author response image 4).

      Author response image 4.

      (2b) The authors mention a "higher net concentration" of the "fused domains" with RING0, and use this to justify artificially cleaving the Ubl or RING2 domains from the Parkin core. This fact should be moot. In cells, it is expected there will only be a 1:1 ratio of the Parkin core with the Ubl or RING2 domains. To date, there is no evidence suggesting multiple pUbls or multiple RING2s can bind the RING0 binding site. In fact, the authors here even show that either the RING2 or pUbl needs to be displaced to permit the binding of the other domain. That being said, there would be no "higher net concentration" because there would always be the same molar equivalents of Ubl, RING2, and the Parkin core.

      We apologize for the confusion. “Higher net concentration” refers to fused domains versus a domain provided in trans. Because pUbl and RING2 compete for RING0, the interactions are too transient to be detected by the biophysical techniques. While domains are fused in the same polypeptide (for example, RING0-RING2), their effective concentrations are much higher than those of a domain provided in trans (for example, pUbl); thus, biophysical methods fail to detect the interaction. Treatment with protease resolves this issue, which arises from the higher net concentration of the fused domain, so that trans interactions can be measured using biophysical techniques. However, these interactions and conformational changes are very transient, as the data also suggest. Therefore, Parkin molecules will never remain associated; rather, Parkin will transiently interact with and activate Parkin molecules in trans.

      (2c) A larger issue remaining in terms of Parkin activation is the lack of clarity surrounding the role of the linker (77-140); particularly whether its primary role is to tether the Ubl to the cis Parkin molecule versus a role in permitting distal interactions to a trans molecule. The way the authors have conducted the experiments presented in Figure 2 limits the possible interactions that the activated pUbl could have by (a) ablating the binding site in the cis molecule with the K211N mutation; (b) further blocking the binding site in the cis molecule by keeping the RING2 domain intact. These restrictions to the cis parkin molecule effectively force the pUbl to bind in trans. A competition experiment to demonstrate the likelihood of cis or trans activation in direct comparison with each other would provide stronger evidence for trans activation.

      This is an excellent point. In the revised manuscript, we have performed experiments using native phospho-Parkin (Revised Figure 5), and the results are consistent with those in Figure 2 (Revised Figure 4), where we used the K211N mutation.

      (3) A major limitation of this study is that the authors interpret structural flexibility from experiments that do not report directly on flexibility. The analytical SEC experiments report on binding affinity and more specifically off-rates. By removing the interdomain linkages, the accompanying on-rate would be drastically impacted, and thus the observations are disconnected from a native scenario. Likewise, observations from protein crystallography can be consistent with flexibility, but certainly should not be directly interpreted in this manner. Rigorous determination of linker and/or domain flexibility would require alternative methods that measure this directly.

      We agree with the reviewer that these methods do not directly capture structural flexibility, and that rigorous determination of linker flexibility would require alternative methods that measure it directly. However, due to the complex nature of the interactions and technical limitations, breaking the interdomain linkages was the best possible way to capture interactions in trans. Interestingly, all previous methods that report cis interactions between pUbl and RING0 also used a similar approach (Gladkova et al., ref. 24; Sauve et al., ref. 29).

      (4) The analysis of the ACT element comes across as incomplete. The authors make a point of a competing interaction with Lys48 of the Ubl domain, but the significance of this is unclear. It is possible that this observation could be an overinterpretation of the crystal structures. Additionally, the rationale for why the ACT element should or shouldn't contribute to in trans activation of different Parkin constructs is not clear. Lastly, the conclusion that this work explains the evolutionary nature of this element in chordates is highly overstated.

      We agree with the reviewer that the significance of Lys48 is unclear. We have presented this just as one of the observations from the crystal structure. As the reviewer suggested, we have removed the sentence about the evolutionary nature of this element from the revised manuscript.

      (5) The analysis of the REP linker element also seems incomplete. The authors identify contacts to a neighboring pUb molecule in their crystal structure, but the connection between this interface (which could be a crystallization artifact) and their biochemical activity data is not straightforward. The analysis of flexibility within this region using crystallographic and AlphaFold modeling observations is very indirect. The authors also draw parallels with linker regions in other RBR ligases that are involved in recognizing the E2-loaded Ub. Firstly, it is not clear from the text or figures whether the "conserved" hydrophobic within the linker region is involved in these alternative Ub interfaces. And secondly, the authors appear to jump to the conclusion that the Parkin linker region also binds an E2-loaded Ub, even though their original observation from the crystal structure seems inconsistent with this. The entire analysis feels very preliminary and also comes across as tangential to the primary storyline of in trans Parkin activation.

      We agree with the reviewer that crystal structure data and biochemical data are not directly linked. In the revised manuscript, we have also highlighted the conserved hydrophobic in the linker region at the ubiquitin interface (Fig. 9C and Extended Data Fig. 11A), which was somehow missed in the original manuscript. We want to add that a very similar analysis and supporting experiments identified donor ubiquitin-binding sites on the IBR and helix connecting RING1-IBR (Kumar et al., Nature Str. and Mol. Biol., 2017), which several other groups later confirmed. In the mentioned study, the Ubl domain of Parkin from the symmetry mate Parkin molecule was identified as a mimic of “donor ubiquitin” on IBR and helix connecting RING1-IBR.

      In the present study, a neighboring pUb molecule in the crystal structure is identified as a donor ubiquitin mimic (Fig. 9C) by supporting biophysical/biochemical experiments. First, we show that mutation of I411A in the REP linker of Parkin perturbs Parkin interaction with E2~Ub (donor) (Fig. 9F). Another supporting experiment was performed using a Ubiquitin-VS probe assay, which is independent of E2. Assays using Ubiquitin-VS show that I411A mutation in the REP-RING2 linker perturbs Parkin charging with Ubiquitin-VS (Extended Data Fig. 11 B). Furthermore, the biophysical data showing loss of Parkin interaction with donor ubiquitin is further supported by ubiquitination assays. Mutations in the REP-RING2 linker perturb the Parkin activity (Fig. 9E), confirming biophysical data. This is further confirmed by mutations (L71A or L73A) on ubiquitin (Extended Data Fig. 11C), resulting in loss of Parkin activity. The above experiments nicely establish the role of the REP-RING2 linker in interaction with donor ubiquitin, which is consistent with other RBRs (Extended Data Fig. 11A).

      While we agree with the reviewer that this appears tangential to the primary storyline of in trans Parkin activation, we decided to include this data because it could be of interest to the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) For clarity, a schematic of the domain architecture of Parkin would be helpful at the outset in the main figures. This will help with the introduction to better understand the protein organization. This is lost in the Extended Figure in my opinion.

      We thank the reviewer for suggesting this, which we have included in Figure 1 of the revised manuscript.

      (2) Related to the competition between the Ubl and RING2 domains, can competition be shown through another method? SPR, ITC, etc? ITC was used in other experiments, but only in the context of mutations (Lys211Asn)? Can this be done with WT sequence?

      This is an excellent suggestion. In the revised Figure 5, we have performed an ITC experiment using WT Parkin, and the results are consistent with what we observed using Lys211Asn Parkin.

      (3) The authors also note that "the AlphaFold model shows a helical structure in the linker region of Parkin (Extended Data Figure 10C), further confirming the flexible nature of this region"... but the secondary structure would not be inherently flexible. This is confusing.

      The flexibility is in terms of the conformation of this linker region observed under the open or closed state of Parkin. In the revised manuscript, we have explained this point more clearly.

      (4) The manuscript needs extensive revision to improve its readability. Minor grammatical mistakes were prevalent throughout.

      We thank the reviewer for pointing this out, and we have corrected these mistakes in the revised manuscript.

      (5) The confocal images are nice, but inset panels may help highlight the regions of interest (ROIs).

      This is corrected in the revised manuscript.

      (6) Trans is misspelled ("tans") towards the end of the second paragraph on page 16.

      This is corrected in the revised manuscript.

      (7) The schematics are helpful, but some of the lettering in Figure 2 is very small.

      This is corrected in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      (1) A significant portion of the results section refers to the supplement, making the overall readability very difficult.

      We acknowledge this issue: a lot of relevant data could not be added to the main figures and thus ended up in the supplement. In the revised manuscript, we have moved some of the supplementary figures to the main figures.

      (2) Interpretation of the experiments utilizing many different Parkin constructs and cleavage scenarios (particularly the SEC and crystallography experiments) is extremely difficult. The work would benefit from a layout of the Parkin model system, highlighting cleavage sites, key domain terminology, and mutations used in the study, presented together and early on in the manuscript. Using this to identify a simpler system of referencing Parkin constructs would also be a large improvement.

      This is a great suggestion. We have included these points in the revised manuscript, which has improved the readability.

      (3) Lines 81-83; the authors say they "demonstrate the conformational changes in Parkin during the activation process", but fail to show any actual conformational changes. Further, much of what is demonstrated in this work (in terms of crystal structures) corroborates existing literature. The authors should use caution not to overstate their original conclusions in light of the large body of work in this area.

      We thank the reviewer for pointing this out. We have corrected the above statement in the revised manuscript to indicate that we meant it in the context of trans conformational changes.

      (4) Line 446 and 434; there is a discrepancy about which amino acid is present at residue 409. Is this a K408 typo? The authors also present mutational work on K416, but this residue is not shown in the structure panel.

      We thank the reviewer for pointing this out. In the revised manuscript, we have corrected these typos.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer 1 (Public Review):

      I want to reiterate my comment from the first round of reviews: that I am insufficiently familiar with the intricacies of Maxwell’s equations to assess the validity of the assumptions and the equations being used by WETCOW. The work ideally needs assessing by someone more versed in that area, especially given the potential impact of this method if valid.

      We appreciate the reviewer’s candor. Unfortunately, familiarity with Maxwell’s equations is an essential prerequisite for assessing the veracity of our approach and our claims.

      Effort has been made in these revisions to improve explanations of the proposed approach (a lot of new text has been added) and to add new simulations. However, the authors have still not compared their method on real data with existing standard approaches for reconstructing data from sensor to physical space. Refusing to do so because existing approaches are deemed inappropriate (i.e. they “are solving a different problem”) is illogical.

      Without understanding the importance of our model for brain wave activity (cited in the paper) derived from Maxwell’s equations in inhomogeneous and anisotropic brain tissue, it is not possible to critically evaluate the fundamental difference between our method and the standard so-called “source localization” method which the Reviewer feels it is important to compare our results with. Our method is not “source localization” which is a class of techniques based on an inappropriate model for static brain activity (static dipoles sprinkled sparsely in user-defined areas of interest). Just because a method is “standard” does not make it correct. Rather, we are reconstructing a whole brain, time dependent electric field potential based upon a model for brain wave activity derived from first principles. It is comparing two methods that are “solving different problems” that is, by definition, illogical.

      Similarly, refusing to compare their method with existing standard approaches for spatio-temporally describing brain activity, just because existing approaches are deemed inappropriate, is illogical.

      Contrary to the Reviewer’s assertion, we do compare our results with three existing methods for describing spatiotemporal variations of brain activity.

      First, Figures 1, 2, and 6 compare the spatiotemporal variations in brain activity between our method and fMRI, the recognized standard for spatiotemporal localization of brain activity. The statistical comparison in Fig 3 is a quantitative demonstration of the similarity of the activation patterns. It is important to note that these data are simultaneous EEG/fMRI in order to eliminate a variety of potential confounds related to differences in experimental conditions.

      Second, Fig 4 (A-D) compares our method with the most reasonable “standard” spatiotemporal localization method for EEG: mapping of fields in the outer cortical regions of the brain detected at the surface electrodes to the surface of the skull. The consistency of both the location and sign of the activity changes detected by both methods in a “standard” attention paradigm is clearly evident. Further confirmation is provided by comparison of our results with simultaneous EEG/fMRI spatial reconstructions (E-F) where the consistency of our reconstructions between subjects is shown in Fig 5.

      Third, measurements from intra-cranial electrodes, the most direct method for validation, are compared with spatiotemporal estimates derived from surface electrodes and shown to be highly correlated.

      For example, the authors say that “it’s not even clear what one would compare [between the new method and standard approaches]”. How about:

      (1) Qualitatively: compare EEG activation maps. I.e. compare what you would report to a researcher about the brain activity found in a standard experimental task dataset (e.g. their gambling task). People simply want to be able to judge, at least qualitatively on the same data, what the most equivalent output would be from the two approaches. Note, both approaches do not need to be done at the same spatial resolution if there are constraints on this for the comparison to be useful.

      (2) Quantitatively: compare the correlation scores between EEG activation maps and fMRI activation maps

      These comparisons were performed and are already in the paper.

      (1) Fig 4 compares the results with a standard attention paradigm (data and interpretation from Co-author Dr Martinez, who is an expert in both EEG and attention). Additionally, Fig 12 shows detected regions of increased activity in a well-known brain circuit from an experimental task (’reward’) with data provided by Co-author Dr Krigolson, an expert in reward circuitry.

      (2) Correlation scores between EEG and fMRI are shown in Fig 3.

      (3) Very high correlation between the directly measured field from intra-cranial electrodes in an epilepsy patient and those estimated from only the surface electrodes is shown in Fig 9.
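
      As a point of reference for the correlation scores mentioned above, a correlation between two activation maps sampled at the same locations could be computed as in the following minimal sketch. This is an assumption about the general form of such a comparison, not the statistic or pipeline actually used in the paper; the function name and the toy map values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally sized lists of map values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("maps must have the same length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0.0:
        raise ValueError("correlation is undefined for a constant map")
    return numerator / denominator


# Hypothetical toy values standing in for flattened EEG and fMRI maps
# sampled on the same voxels; these are not data from the paper.
eeg_map = [0.1, 0.4, 0.9, 0.3, 0.0]
fmri_map = [0.2, 0.5, 1.0, 0.2, 0.1]

print(round(pearson(eeg_map, fmri_map), 3))
```

      Both maps must first be resampled to a common grid for such a voxel-wise score to be meaningful.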

      There are an awful lot of typos in the new text in the paper. I would expect a paper to have been proof read before submitting.

      We have cleaned up the typos.

      The abstract claims that there is a “direct comparison with standard state-of-the-art EEG analysis in a well-established attention paradigm”, but no actual comparison appears to have been completed in the paper.

      On the contrary, as mentioned above, Fig 4 compares the results of our method with the state-of-the-art surface spatial mapping analysis, with the state-of-the-art time-frequency analysis, and with the state-of-the-art fMRI analysis.

      Reviewer 2 (Public Review):

      This is a major rewrite of the paper. The authors have improved the discourse vastly.

      There is now a lot of didactics included but they are not always relevant to the paper.

      The technique described in the paper does in fact leverage several novel methods we have developed over the years for analyzing multimodal space-time imaging data. Each of these techniques has been described in detail in separate publications cited in the current paper. However, the Reviewers’ criticisms stated that the methods were non-standard and that they were unfamiliar with them. In lieu of the Reviewers reading the original publications, we added a significant amount of text that was indeed intended to be didactic. However, we can assure the Reviewer that nothing presented was irrelevant to the paper. We certainly had no desire to make the paper any longer than it needed to be.

      The section on Maxwell’s equation does a disservice to the literature in prior work in bioelectromagnetism and does not even address the issues raised in classic text books by Plonsey et al. There is no logical “backwardness” in the literature. They are based on the relative values of constants in biological tissues.

      This criticism highlights the crux of our paper. Contrary to the assertion that we have ignored the work of Plonsey, we have referenced it in the new additional text detailing how we have constructed Maxwell’s Equations appropriate for brain tissue, based on the model suggested by Plonsey that allows the temporal variations of the magnetic field to be ignored but not the time dependence of the electric fields.

      However, the assumption ubiquitous in the vast prior literature of bioelectricity in the brain that the electric field dynamics can be “based on the relative values of constants in biological tissues”, as the Reviewer correctly summarizes, is precisely the problem. Using relative average tissue properties does not take into account the tissue anisotropy necessary to properly account for correct expressions for the electric fields. As our prior publications have demonstrated in detail, taking into account the inhomogeneity and anisotropy of brain tissue in the solution to Maxwell’s Equations is necessary for properly characterizing brain electrical fields, and serves as the foundation of our brain wave theory. This led to the discovery of a new class of brain waves (weakly evanescent transverse cortical waves, WETCOW).

      It is this brain wave model that is used to estimate the dynamic electric field potential from the measurements made by the EEG electrode array. The standard model that ignores these tissue details leads to the ubiquitous “quasi-static approximation”, which in turn leads to the conclusion that the EEG signal cannot be spatially reconstructed. It is indeed this critical gap in the existing literature that is the central new idea in the paper.

      There are reinventions of many standard ideas in terms of physics discourses, like Bayesian theory or PCA etc.

      The discussion of Bayesian theory and PCA is in response to the Reviewer’s complaint that they were unfamiliar with our entropy field decomposition (EFD) method and the request that we compare it with other “standard” methods. Again, we have published extensively on this method (as referenced in the manuscript) and therefore felt that extensive elaboration was unnecessary. Having been asked to provide such elaboration and then being pilloried for it therefore feels somewhat inappropriate in our view. This is particularly disappointing as the Reviewer claims we are presenting “standard” ideas when in fact the EFD is a new general framework we developed to overcome the deficiencies in standard “statistical” and probabilistic data analysis methods, which are insufficient for characterizing the non-linear, non-periodic, interacting fields that are the rule, rather than the exception, in complex dynamical systems such as brain electric fields (or weather, or oceans, or ...).

      The EFD is indeed a Bayesian framework, as this is the fundamental starting point for probability theory, but it is developed in a unique and more general fashion than previous data analysis methods. (Again, this is detailed in several references in the paper’s bibliography. The Reviewers requested that an explanation be included in the present paper, however, so we did so.) First, Bayes’ Theorem is expressed in terms of a field theory that allows an arbitrary number of field orders and coupling terms. This generality comes with a penalty, which is that it is unclear how to assess the significance of the essentially infinite number of terms. The second feature is the introduction of a method by which to determine the significant number of terms automatically from the data itself, via our theory of entropy spectrum pathways (ESP), which is also detailed in a cited publication, and which produces ranked spatiotemporal modes from the data. Rather than being “reinventions of many standard ideas”, these are novel theoretical and computational methods that are central to the EEG reconstruction method presented in the paper.

      I think that the paper remains quite opaque and many of the original criticisms remain, especially as they relate to multimodal datasets. The overall algorithm still remains poorly described. benchmarks.

      It’s not clear how to assess the criticisms that the algorithm is poorly described yet there is too much detail provided that is mistakenly assessed as “standard”. Certainly the central wave equations that are estimated from the data are precisely described, so it’s not clear exactly what the Reviewer is referring to.

      The comparisons to benchmark remain unaddressed and the authors state that they couldn’t get Loreta to work and so aborted that. The figures are largely unaltered, although they have added a few more, and do not clearly depict the ideas. Again, no benchmark comparisons are provided to evaluate the results and the performance in comparison to other benchmarks.

      As we have tried to emphasize in the paper, and in the Response to Reviewers, the standard so-called “source localization” methods are NOT a benchmark, as they are solving an inappropriate model for brain activity. Once again, static dipole “sources” arbitrarily sprinkled on pre-defined regions of interest bear little resemblance to observed brain waves, nor to the dynamic electric field wave equations produced by our brain wave theory derived from a proper solution to Maxwell’s equations in the anisotropic and inhomogeneous complex morphology of the brain.

      The comparison with Loreta was not abandoned because we couldn’t get it to work, but because we could not get it to run under conditions that were remotely similar to the whole-brain activity described by our theory, or, more importantly, by any rational theory of dynamic brain activity that might reproduce the exceedingly complex electric field activity observed in numerous neuroscience experiments.

      We take issue with the rather dismissive mention of “a few more” figures that “do not clearly depict the idea” when in fact the figures that have been added have demonstrated additional quantitative validation of the method.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1 (Public Review):

      The paper proposes a new source reconstruction method for electroencephalography (EEG) data and claims that it can provide far superior spatial resolution than existing approaches and also superior spatial resolution to fMRI. This primarily stems from abandoning the established quasi-static approximation to Maxwell’s equations.

      The proposed method brings together some very interesting ideas, and the potential impact is high. However, the work does not provide the evaluations expected when validating a new source reconstruction approach. I cannot judge the success or impact of the approach based on the current set of results. This is very important to rectify, especially given that the work is challenging some long-standing and fundamental assumptions made in the field.

      We appreciate the Reviewer’s efforts in reviewing this paper and have included a significant amount of new text to address their concerns.

      I also find that the clarity of the description of the methods, and how they link to what is shown in the main results hard to follow.

      We have added significantly more detail on the methods, including more accessible explanations of the technical details, and schematic diagrams to visualize the key processing components.

      I am insufficiently familiar with the intricacies of Maxwell’s equations to assess the validity of the assumptions and the equations being used by WETCOW. The work therefore needs assessing by someone more versed in that area. That said, how do we know that the new terms in Maxwell’s equations, i.e. the time-dependent terms that are normally missing from established quasi-static-based approaches, are large enough to need to be considered? Where is the evidence for this?

      The fact that the time-dependent terms are large enough to be considered is essentially the entire focus of the original papers [7,8]. Time-dependent terms in Maxwell’s equations are generally not important for brain electrodynamics at physiological frequencies in homogeneous tissues, but this is not true for areas with strong inhomogeneity and anisotropy.

      I have not come across EFD, and I am not sure many in the EEG field will have. To require the reader to appreciate the contributions of WETCOW only through the lens of the unfamiliar (and far from trivial) approach of EFD is frustrating. In particular, what impact do the assumptions of WETCOW make compared to the assumptions of EFD on the overall performance of SPECTRE?

      We have added an entire new section in the Appendix that provides a very basic introduction to EFD and relates it to more commonly known methods, such as Fourier and Independent Components Analyses.

      The paper needs to provide results showing the improvements obtained when WETCOW or EFD are combined with more established and familiar approaches. For example, EFD can be replaced by a first-order vector autoregressive (VAR) model, i.e. y<sub>t</sub> = Ay<sub>t−1</sub> + e<sub>t</sub> (where y<sub>t</sub> is [num<sub>gridpoints</sub> ∗ 1] and A is [num<sub>gridpoints</sub> ∗ num<sub>gridpoints</sub>] of autoregressive parameters).

      The development of EFD, which is independent of WETCOW, stemmed from the necessity of developing a general method for the probabilistic analysis of finitely sampled non-linear interacting fields, which are ubiquitous in measurements of physical systems, of which functional neuroimaging data (fMRI, EEG) are excellent examples. Standard methods (such as VAR) are inadequate in such cases, as discussed in great detail in our EFD publications (e.g., [12,37]). The new appendix on EFD reviews these arguments. It does not make sense to compare EFD with methods which are inappropriate for the data.
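
      For readers unfamiliar with the baseline the Reviewer proposes, the first-order VAR model y<sub>t</sub> = Ay<sub>t−1</sub> + e<sub>t</sub> can be sketched as follows. This is only a minimal illustration of the standard least-squares VAR(1) fit on a toy two-dimensional system; the function names, the toy matrix `A_true`, and the noise level are assumptions made here for illustration and are not part of SPECTRE, EFD, or the manuscript.

```python
import numpy as np


def simulate_var1(A, n_steps, noise_std=0.1, seed=0):
    """Simulate y_t = A y_{t-1} + e_t with Gaussian noise e_t (toy data)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    y = np.zeros((n_steps, n))
    for t in range(1, n_steps):
        y[t] = A @ y[t - 1] + noise_std * rng.standard_normal(n)
    return y


def fit_var1(y):
    """Least-squares estimate of A from a time series y of shape (time, gridpoints)."""
    past, present = y[:-1], y[1:]
    # Solve past @ A.T ~= present in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(past, present, rcond=None)
    return A_T.T


# Hypothetical 2-gridpoint example; a real grid would make A very large.
A_true = np.array([[0.5, 0.2], [-0.1, 0.7]])
y = simulate_var1(A_true, n_steps=5000)
A_hat = fit_var1(y)
print(np.round(A_hat, 2))
```

      Note that A has num<sub>gridpoints</sub> × num<sub>gridpoints</sub> entries, so the model is linear in the state and grows quadratically with the grid size.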

      The authors’ decision not to include any comparisons with established source reconstruction approaches does not make sense to me. They attempt to justify this by saying that the spatial resolution of LORETA would need to be very low compared to the resolution being used in SPECTRE, to avoid compute problems. But how does this stop them from using a spatial resolution typically used by the field that has no compute problems, and comparing with that? This would be very informative. There are also more computationally efficient methods than LORETA that are very popular, such as beamforming or minimum norm.

      The primary reason for not comparing with ’source reconstruction’ (SR) methods is that we are not doing source reconstruction. Our view of brain activity is that it involves continuous, dynamical, non-linear interacting fields throughout the entire brain. Formulating EEG analysis in terms of reconstructing sources is, in our view, like asking ’what are the point sources of a sea of ocean waves’. It’s just not an appropriate physical model. A pre-chosen limited distribution of static dipoles is just a very bad model for brain activity, so much so that it’s not even clear what one would compare. In our view, as manifest in our computational implementation, one needs to have a very high density of computational locations throughout the entire brain, including white matter, and the reconstructed modes are waves whose extent can be across the entire brain. Our comments about the low resolution of computational methods for SR techniques really express the more overarching concern that they are not capable of, or even designed for, detecting time-dependent fields of non-linear interacting waves that exist everywhere throughout the brain. Moreover, the SR methods always give some answer, but in our view the initial conditions upon which those methods are based (pre-selected regions of activity with a pre-selected number of ’sources’) are a highly influential but artificial set of strong computational constraints that will almost always provide an answer consistent with (i.e., biased toward) the expectations of the person formulating the problem, and is therefore potentially misleading.

      In short, something like the following methods needs to be compared:

      (1) Full SPECTRE (EFD plus WETCOW)

      (2) WETCOW + VAR or standard (“simple regression”) techniques

      (3) Beamformer/min norm plus EFD

      (4) Beamformer/min norm plus VAR or standard (“simple regression”) techniques

      The reason that no one has previously ever been able to solve the EEG inverse problem is due to the ubiquitous use of methods that are too ’simple’, i.e., are poor physical models of brain activity. We have spent a decade carefully elucidating the details of this statement in numerous highly technical and careful publications. It therefore serves no purpose to return to the use of these ’simple’ methods for comparison. We do agree, however, that a clearer overview of the advantages of our methods is warranted and have added significant additional text in this revision towards that purpose.

      This would also allow for more illuminating and quantitative comparisons of the real data. For example, a metric of similarity between EEG maps and fMRI can be computed to compare the performance of these methods. At the moment, the fMRI-EEG analysis amounts to just showing fairly similar maps.

      We disagree with this assessment. The correlation coefficient between the spatially localized activation maps is a conservative sufficient statistic for the measure of statistically significant similarity. These numbers were/are reported in the caption to Figure 5, and have now also been moved to, and highlighted in, the main text.
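
      As an illustration of the kind of statistic referred to here, the Pearson correlation between two co-registered activation maps within a region of interest can be sketched as below. This is a generic sketch, not the analysis code used for the manuscript; the function name `map_correlation`, the synthetic volumes, and the mask are assumptions introduced purely for illustration.

```python
import numpy as np


def map_correlation(map_a, map_b, mask=None):
    """Pearson correlation between two volumes of equal shape, optionally within a mask."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("maps must be on the same grid")
    if mask is None:
        mask = np.ones(a.shape, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    # Flatten the voxels inside the mask and correlate them.
    return float(np.corrcoef(a[mask], b[mask])[0, 1])


# Synthetic stand-ins for an fMRI map and an EEG-derived map on the same grid.
rng = np.random.default_rng(0)
fmri_map = rng.standard_normal((8, 8, 8))
eeg_map = fmri_map + 0.5 * rng.standard_normal((8, 8, 8))
roi = np.zeros((8, 8, 8), dtype=bool)
roi[2:6, 2:6, 2:6] = True  # hypothetical region of interest

r = map_correlation(fmri_map, eeg_map, mask=roi)
print(round(r, 3))
```

      Restricting the computation to a mask corresponds to comparing specific activation regions rather than whole volumes.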

      There are no results provided on simulated data. Simulations are needed to provide quantitative comparisons of the different methods, to show face validity, and to demonstrate unequivocally the new information that SPECTRE can ’potentially’ provide on real data compared to established methods. The paper ideally needs at least 3 types of simulations, where one thing is changed at a time, e.g.:

      (1) Data simulated using WETCOW plus EFD assumptions

      (2) Data simulated using WETCOW plus e.g. VAR assumptions

      (3) Data simulated using standard lead fields (based on the quasi-static Maxwell solutions) plus e.g. VAR assumptions

      These should be assessed with the multiple methods specified earlier. Crucially the assessment should be quantitative showing the ability to recover the ground truth over multiple realisations of realistic noise. This type of assessment of a new source reconstruction method is the expected standard.

      We have now provided results on simulated data, along with a discussion on what entails a meaningful simulation comparison. In short, our original paper on the WETCOW theory included a significant number of simulations of predicted results on several spatial and temporal scales. The most relevant simulation data to compare with the SPECTRE imaging results are the cortical wave loops predicted by WETCOW theory and demonstrated via numerical simulation in a realistic brain model derived from high resolution anatomical (HRA) MRI data. The most relevant data with which to compare these simulations are the SPECTRE reconstructions from the data that provide the closest approximation to a “Gold Standard”: reconstructions from intra-cranial EEG (iEEG). We have now included results (new Fig 8) that demonstrate the ability of SPECTRE to reconstruct dynamically evolving cortical wave loops in iEEG data acquired in an epilepsy patient that match the loops predicted theoretically by WETCOW and demonstrated in realistic numerical simulations.

      The suggested comparison with simple regression techniques serves no purpose, as stated above, since that class of analysis techniques was not designed for the non-linear, non-Gaussian, coupled interacting fields predicted by the WETCOW model. The explication of this statement is provided in great detail in our publications on the EFD approach and in the new appendix material provided in this revision. The suggested simulation of the dipole (i.e., quasi-static) model of brain activity also serves no purpose, as our WETCOW papers have demonstrated in great detail that it is not a reasonable model for dynamic brain activity.

      Reviewer 2 (Public Review):

      Strengths:

      If true and convincing, the proposed theoretical framework and reconstruction algorithm can revolutionize the use of EEG source reconstructions.

      Weaknesses:

      There is very little actual information in the paper about either the forward model or the novel method of reconstruction. Only citations to prior work by the authors are cited with absolutely no benchmark comparisons, making the manuscript difficult to read and interpret in isolation from their prior body of work.

      We have now added a significant amount of material detailing the forward model, our solution to the inverse problem, and the method of reconstruction, in order to remedy this deficit in the previous version of the paper.

      Recommendations for the authors:

      Reviewer 1 (Recommendations):

      It is not at all clear from the main text (section 3.1) and the caption, what is being shown in the activity patterns in Figures 1 and 2. What frequency bands and time points etc? How are the values shown in the figures calculated from the equations in the methods?

      We have added detailed information on the frequency bands reconstructed and the activity pattern generation and meaning. Additional information on the simultaneous EEG/fMRI acquisition details has been added to the Appendix.

      How have the activity maps been thresholded? Where are the color bars in Figures 1 and 2?

      We have now included that information in new versions of the figures. In addition, the quantitative comparison between fMRI and EEG is now presented in a new Figure 2 (now Figure 3).

      P30 “This term is ignored in the current paper”. Why is this term ignored, but other (time-dependent) terms are not?

      These terms are ignored because they represent higher order terms that complicate the processing (and interpretation) but do not substantially change the main results. A note to this effect has been added to the text.

      The concepts and equations in the EFD section are not very accessible (e.g. to someone unfamiliar with IFT).

      We have added a lengthy general and more accessible description of the EFD method in the Appendix.

      Variables in equation 1, and the following equation, are not always defined in a clear, accessible manner. What is ?

      We have added additional information on how Eqn 1 (now Eqn 3) is derived, and the variables therein.

      In the EFD section, what do you mean conceptually by α, i.e. “the coupled parameters α”?

      This sentence has been eliminated, as it was superfluous and confusing.

      How are the EFD and WETCOW sections linked mathematically? What is ψ (in eqn 2) linked to in the WETCOW section (presumably ϕ<sub>ω</sub>?) ?

      We have added more introductory detail at the beginning of the Results to describe the WETCOW theory and how this is related to the inverse problem for EEG.

      What is the difference between data d and signal s in section 6.1.3? How are they related?

      We have added a much more detailed Appendix A where this (and other) details are provided.

      What assumptions have been made to get the form for the information Hamiltonian in eqn3?

      Eq 3 (now Eqn A.5) is actually very general. The approximations come in when constructing the interaction Hamiltonian H<sub>i</sub>.

      P33 “using coupling between different spatio-temporal points that is available from the data itself” I do not understand what is meant by this.

      This was a poorly worded sentence, but this section has now been replaced by Appendix A, which now contains the sentence that prior information “is contained within the data itself”. This refers to the fact that the prior information consists of correlations in the data, rather than some other measurements independent of the original data. This point is emphasized because in many Bayesian applications, prior information consists of knowledge of quantities that were acquired independently from the data at hand (e.g., mean values from previous experiments).

      Reviewer 2 (Recommendations):

      Abstract

      The first part presents validation from simultaneous EEG/fMRI data, iEEG data, and comparisons with standard EEG analyses of an attention paradigm. Exactly what constitutes adequate validation or what metrics were used to assess performance is surprisingly absent.

      Subsequently, the manuscript examines a large cohort of subjects performing a gambling task and engaging in reward circuits. The claim is that this method offers an alternative to fMRI.

      Introduction

      Provocative statements require strong backing and evidence. In the first paragraph, the “quasi-static” assumption which is dominant in the field of EEG and MEG imaging is questioned with some classic citations that support this assumption. Instead of delving into why exactly the assumption cannot be relaxed, the authors claim that because the assumption was proved with average tissue properties rather than exact, it is wrong. This does not make sense. Citations to the WETCOW papers are insufficient to question the quasi-static assumption.

      The introduction purports to validate a novel theory and inverse modeling method but poorly outlines the exact foundations of both the theory (WETCOW) and the inverse modeling (SPECTRE) work.

      We have added a new introductory subsection (“A physical theory of brain waves”) to the Results section that provides a brief overview of the foundations of the WETCOW theory and an explicit description of why the quasi-static approximation can be abandoned. We have expanded the subsequent subsection (“Solution to the inverse EEG problem”) to more clearly detail the inverse modeling (SPECTRE) method.

      Section 3.2 Validation with fMRI

      Figure 1 supposedly is a validation of this promising novel theoretical approach that defies the existing body of literature in this field. Shockingly, a single subject data is shown in a qualitative manner with absolutely no quantitative comparison anywhere to be found in the manuscript. While there are similarities, there are also differences in reconstructions. What to make out of these discrepancies? Are there distortions that may occur with SPECTRE reconstructions? What are its tradeoffs? How does it deal with noise in the data?

      It is certainly not the case that there are no quantitative comparisons. Correlation coefficients, which are the sufficient statistics for comparison of activation regions, are given in Figure 5 for very specific activation regions. Figure 9 (now Figure 11) shows a t-statistic demonstrating the very high significance of the comparison between multiple subjects. And we have now added a new Figure 7 demonstrating the strongly correlated estimates for full vs surface intra-cranial EEG reconstructions. To make this more clear, we have added a new section “Statistical Significance of the Results”.

      We note that a discussion of the discrepancies between fMRI and EEG was already presented in the Supplementary Material. Therein we discuss the main point that fMRI and EEG are measuring different physical quantities and so should not be expected to be identical. We also highlight the fact that fMRI is prone to significant geometrical distortions due to magnetic field inhomogeneities, and to physiological noise. To provide more visibility for this important issue, we have moved this text into the Discussion section.

      We do note that geometric distortions in fMRI data due to suboptimal acquisitions and corrections are all too common. This, coupled with the paucity of open source simultaneous fMRI-EEG data, made it difficult to find good data for comparison. The data on which we performed the quantitative statistical comparison between fMRI and EEG (Fig 5) was collected by co-author Dr Martinez, and was of the highest quality and therefore sufficient for comparison. The data used in Fig 1 and 2 was a well publicized open source dataset but had significant fMRI distortions that made quantitative comparison (i.e., correlation coefficients between subregions in the Harvard-Oxford atlas) suboptimal. Nevertheless, we wanted to demonstrate the method in more than one source, and feel that visual similarity is a reasonable measure for this data.

      Section 3.2 Validation with fMRI

      Figure 2 Are the sample slices being shown? How to address discrepancies? How to assume that these are validations when there are such a level of discrepancies?

      It’s not clear what “sample slices” means. The issue of discrepancies is addressed in the response to the previous query.

      Section 3.2 Validation with fMRI

      Figure 3 Similar arguments can be made for Figure 3. Here too, a comparison with source localization benchmarks is warranted because many papers have examined similar attention data.

      Regarding the fMRI/EEG comparison, these data are compared quantitatively in the text and in Figure 5.

      Regarding the suggestion to perform standard ’source localization’ analysis, see responses to Reviewer 1.

      Section 3.2 Validation with fMRI

      Figure 4 While there is consistency across 5 subjects, there are also subtle and not-so-subtle differences.

      What to make out of them?

      Discrepancies in activation patterns between individuals are a complex neuroscience question that we feel is well beyond the scope of this paper.

      Section 3.2 Validation with fMRI

      Figures 5 & 6 Figure 5 is also a qualitative figure from two subjects with no appropriate quantification of results across subjects. The same is true for Figure 6.

      On the contrary, Figure 5 contains a quantitative comparison, which is now also described in the text. A quantitative comparison for the epilepsy data in Fig 6 (and C.4-C.6) is now shown in Fig 7.

      Section 3.2 Validation with fMRI

      Given the absence of appropriate “validation” of the proposed model and method, it is unclear how much one can trust results in Section 4.

      We believe that the quantitative comparisons extant in the original text (and apparently missed by the Reviewer) along with the additional quantitative comparisons are sufficient to merit trust in Section 4.

      Section 3.2 Validation with fMRI

      What are the thresholds used in maps for Figure 7? Was correction for multiple comparisons performed? The final arguments at the end of section 4 do not make sense. Is the claim that all results of reconstructions from SPECTRE shown here are significant with no reason for multiple comparison corrections to control for false positives? Why so?

      We agree that the last line in Section 4 is misleading and have removed it.

      Section 3.2 Validation with fMRI

      Discussion is woefully inadequate in addition to the inconclusive findings presented here.

      We have added a significant amount of text to the Discussion to address the points brought up by the Reviewer. And, contrary to the comments of this Reviewer, we believe the statistically significant results presented are not “inconclusive”.

      Supplementary Materials

      This reviewer had an incredibly difficult time understanding the inverse model solution. Even though this has been described in a prior publication by the authors, it is important and imperative that all details be provided here to make the current manuscript complete. The notation itself is so nonstandard. What is Σ<sup>ij</sup>, δ<sup>ij</sup>? Where is the reference for equation (1)? What about the equation for <sup>ˆ</sup>(R)? There are very few details provided on the exact implementation details for the Fourier-space pseudo-spectral approach. What are the dimensions of the problem involved? How were different tissue compartments etc. handled? Equation 1 holds for the entire volume but the measurements are only made on the surface. How was this handled? What is the WETCOW brain wave model? I don’t see any entropy term defined anywhere - where is it?

      We have added more detail on the theoretical and numerical aspects of the inverse problem in two new subsections “Theory” and “Numerical Implementation” in the new section “Solution to the inverse EEG problem”.

      Supplementary Materials

      So, how can one understand even at a high conceptual level what is being done with SPECTRE?

      We have added a new subsection “Summary of SPECTRE” that provides a high conceptual level overview of the SPECTRE method outlined in the preceding sections.

      Supplementary Materials

      In order to understand what was being presented here, it required the reader to go on a tour of the many publications by the authors where the difficulty in understanding what they actually did in terms of inverse modeling remains highly obscure and presents a huge problem for replicability or reproducibility of the current work.

      We have now included more basic material from our previous papers, and simplified the presentation to be more accessible. In particular, we have now moved the key aspects of the theoretic and numerical methods, in a more readable form, from the Supplementary Material to the main text, and added a new Appendix that provides a more intuitive and accessible overview of our estimation procedures.

      Supplementary Materials

      How were conductivity values for different tissue types assigned? Is there an assumption that the conductivity tensor is the same as the diffusion tensor? What does it mean that “in the present study only HRA data were used in the estimation procedure?” Does that mean that diffusion MRI data was not used? What is SYMREG? If this refers to the MRM paper from the authors in 2018, that paper does not include EEG data at all. So, things are unclear here.

      The conductivity tensor is not exactly the same as the diffusion tensor in brain tissues, but they are closely related. While both tensors describe transport properties in brain tissue, they represent different physical processes. The conductivity tensor is often assumed to share the same eigenvectors as the diffusion tensor. There is a strong linear relationship between the conductivity and diffusion tensor eigenvalues, as supported by theoretical models and experimental measurements. For the current study we used only the anatomical data for estimation and assignment of different tissue types, and no diffusion MRI data was used. To register between different modalities, including MNI, HRA, functional MRI, etc., and to transform the tissue assignment into an appropriate space, we used the SYMREG registration method. A comment to this effect has been added to the text.
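
      The shared-eigenvector assumption described above can be made concrete with a short sketch. It is purely illustrative, since no diffusion MRI data was used in the present study: the function name, the example tensor, and the value of `scale` are hypothetical placeholders, not measured or published constants.

```python
import numpy as np


def conductivity_from_diffusion(D, scale):
    """Build a conductivity tensor sharing eigenvectors with the diffusion tensor D.

    Eigenvalues are mapped linearly: sigma_i = scale * d_i (assumed relationship).
    """
    D = np.asarray(D, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(D)  # D is symmetric, so eigh applies
    sigma_vals = scale * eigvals          # linear eigenvalue map (placeholder scale)
    return eigvecs @ np.diag(sigma_vals) @ eigvecs.T


# Hypothetical anisotropic diffusion tensor (arbitrary units).
D = np.array([[1.7, 0.2, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.1, 0.3]])
scale = 0.5  # placeholder, not a measured constant
sigma = conductivity_from_diffusion(D, scale)
print(np.round(sigma, 3))
```

      With a purely proportional eigenvalue map the result reduces to `scale * D`; writing it through the eigendecomposition simply makes it easy to substitute a different eigenvalue relationship while keeping the eigenvectors fixed.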

      Supplementary Materials

      How can reconstructed volumetric time-series of potential be thought of as the EM equivalent of an fMRI dataset? This sentence doesn’t make sense.

      This sentence indeed did not make sense and has been removed.

      Supplementary Materials

      Typical Bayesian inference does not include entropy terms, and entropy estimation doesn’t always lend to computing full posterior distributions. What is an “entropy spectrum pathway”? What is µ∗? Why can’t things be made clear to the reader, instead of incredible jargon used here? How does section 6.1.2 relate back to the previous section?

      It is correct that Bayesian inference typically does not include entropy terms. We believe that their introduction via the theory of entropy spectrum pathways (ESP) is a significant advance in Bayesian estimation, as it provides highly relevant prior information from within the data itself (and therefore always available in spatiotemporal data) that facilitates a practical methodology for the analysis of complex non-linear dynamical systems, as contained in the entropy field decomposition (EFD).

      Section 6.1.3 has now been replaced by a new Appendix A that discusses ESP in a much more intuitive and conceptual manner.

      Supplementary Materials

      Section 6.1.3 describes entropy field decomposition in very general terms. What is “non-period”? This section is incomprehensible. Without reference to exactly where in the process this procedure is deployed it is extremely difficult to follow. There seems to be an abuse of notation of using ϕ for eigenvectors in equation (5) and potentials earlier. How do equations 9-11 relate back to the original problem being solved in section 6.1.1? What are multiple modalities being described here that require JESTER?

      Section 6.1.3 has now been replaced by a new Appendix A that covers this material in a much more intuitive and conceptual manner.

      Supplementary Materials

      Section 6.3 discusses source localization methods. While most forward lead-field models assume quasistatic approximations to Maxwell’s equations, these are perfectly valid for the frequency content of brain activity being measured with EEG or MEG. Even with quasi-static lead fields, the solutions can have frequency dependence due to the data having frequency dependence. Solutions do not have to be insensitive to detailed spatially variable electrical properties of the tissues. For instance, if a FEM model was used to compute the forward model, this model will indeed be sensitive to the spatially variable and anisotropic electrical properties. This issue is not even acknowledged.

      The frequency dependence of the tissue properties is not the issue. Our theoretical work demonstrates that taking into account the anisotropy and inhomogeneity of the tissue is necessary in order to derive the existence of the weakly evanescent transverse cortical waves (WETCOW) that SPECTRE is detecting. We have added more details about the WETCOW model in the new Section “A physical theory of brain wave” to emphasize this point.

      Supplementary Materials

      Arguments to disambiguate deep vs shallow sources can be achieved with some but not all source localization algorithms and do not require a non-quasi-static formulation. LORETA is not even the main standard algorithm for comparison. It is disappointing that there are no comparisons to source localization and that this is dismissed away due to some coding issues.

      Again, we are not doing ‘source localization’. The concept of localized dipole sources is anathema to our brain wave model, and so in our view comparing SPECTRE to such methods only propagates the misleading idea that they are doing the same thing. So they are definitely not dismissed due to coding issues. However, because of repeated requests to compare SPECTRE with such methods, we attempted to run a standard source localization method with parameters that would at least provide the closest approximation to what we were doing. This attempt highlighted a serious computational issue in source localization methods that is a direct consequence of the fact that they are not attempting to do what SPECTRE is doing - describing a time-varying wave field, in the technical definition of a ‘field’ as an object that has a value at every point in space-time.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      Bennion and colleagues present a careful examination of how an earlier set of memories can either interfere with or facilitate memories formed later. This impressive work is a companion piece to an earlier paper by Antony and colleagues (2022) in which a similar experimental design was used to examine how a later set of memories can either interfere with or facilitate memories formed earlier. This study makes contact with an experimental literature spanning 100 years, which is concerned with the nature of forgetting, and the ways in which memories for particular experiences can interact with other memories. These ideas are fundamental to modern theories of human memory, for example, paired-associate studies like this one are central to the theoretical idea that interference between memories is a much bigger contributor to forgetting than any sort of passive decay. 

      Strengths: 

      At the heart of the current investigation is a proposal made by Osgood in the 1940s regarding how paired associates are learned and remembered. In these experiments, one learns a pair of items, A-B (cue-target), and then later learns another pair that is related in some way, either A'-B (changing the cue, delta-cue), or A-B' (changing the target, delta-target), or A'-B' (changing both, delta-both), where the prime indicates that item has been modified, and may be semantically related to the original item. The authors refer to the critical to-be-remembered pairs as base pairs. Osgood proposed that when the changed item is very different from the original item there will be interference, and when the changed item is similar to the original item there will be facilitation. Osgood proposed a graphical depiction of his theory in which performance was summarized as a surface, with one axis indicating changes to the cue item of a pair and the other indicating changes to the target item, and the surface itself necessary to visualize the consequences of changing both. 

      In the decades since Osgood's proposal, there have been many studies examining slivers of the proposal, e.g., just changing targets in one experiment, just changing cues in another experiment. Because any pair of experiments uses different methods, this has made it difficult to draw clear conclusions about the effects of particular manipulations. 

      The current paper is a potential landmark, in that the authors manipulate multiple fundamental experimental characteristics using the same general experimental design. Importantly, they manipulate the semantic relatedness of the changed item to the original item, the delay between the study experience and the test, and which aspect of the pair is changed. Furthermore, they include both a positive control condition (where the exact same pair is studied twice), and a negative control condition (where a pair is only studied once, in the same phase as the critical base pairs). This allows them to determine when the prior learning exhibits an interfering effect relative to the negative control condition and also allows them to determine how close any facilitative effects come to matching the positive control. 

      The results are interpreted in terms of a set of existing theories, most prominently the memory-for-change framework, which proposes a mechanism (recursive reminding) potentially responsible for the facilitative effects examined here. One of the central results is the finding that a stronger semantic relationship between a base pair and an earlier pair has a facilitative effect on both the rate of learning of the base pair and the durability of the memory for the base pair. This is consistent with the memory-for-change framework, which proposes that this semantic relationship prompts retrieval of the earlier pair, and the two pairs are integrated into a common memory structure that contains information about which pair was studied in which phase of the experiment. When semantic relatedness is lower, they more often show interference effects, with the idea being that competition between the stored memories makes it more difficult to remember the base pair. 

      This work represents a major methodological and empirical advance for our understanding of paired-associates learning, and it sets a laudably high bar for future work seeking to extend this knowledge further. By manipulating so many factors within one set of experiments, it fills a gap in the prior literature regarding the cognitive validity of an 80-year-old proposal by Osgood. The reader can see where the observed results match Osgood's theory and where they are inconclusive. This gives us insight, for example, into the necessity of including a long delay in one's experiment, to observe potential facilitative effects. This point is theoretically interesting, but it is also a boon for future methodological development, in that it establishes the experimental conditions necessary for examining one or another of these facilitation or interference effects more closely. 

      We thank the reviewer for their thorough and positive comments -- thank you so much!

      Weaknesses: 

      One minor weakness of the work is that the overarching theoretical framing does not necessarily specify the expected result for each and every one of the many effects examined. For example, with a narrower set of semantic associations being considered (all of which are relatively high associations) and a long delay, varying the semantic relatedness of the target item did not reliably affect the memorability of that pair. However, the same analysis showed a significant effect when the wider set of semantic associations was used. The positive result is consistent with the memory-for-change framework, but the null result isn't clearly informative to the theory. I call this a minor weakness because I think the value of this work will grow with time, as memory researchers and theorists use it as a benchmark for new theory development. For example, the data from these experiments will undoubtedly be used to develop and constrain a new generation of computational models of paired-associates learning. 

      We thank the reviewer for this constructive critique. We agree that the experiments with a narrower set of semantic associations are less informative; in fact, we thought about removing these experiments from the current study, but given that we found results in the ΔBoth condition in Antony et al. (2022) using these stimuli that we did NOT find in the wider set, we thought it was worth including for a thorough comparison. We hope that the analyses combining the two experiment sets (Fig 6-Supp 1) are informative for contextualizing the results in the ‘narrower’ experiments and, as the reviewer notes, for informing future researchers.

      Reviewer #2 (Public Review): 

      Summary: 

      The study focuses on how relatedness with existing memories affects the formation and retention of new memories. Of core interest were the conditions that determine when prior memories facilitate new learning or interfere with it. Across a set of experiments that varied the degree of relatedness across memories as well as retention interval, the study compellingly shows that relatedness typically leads to proactive facilitation of new learning, with interference only observed under specific conditions and immediate test and being thus an exception rather than a rule. 

      Strengths: 

      The study uses a well-established word-pair learning paradigm to study interference and facilitation of overlapping memories. However it goes more in-depth than a typical interference study in the systematic variation of several factors: (1) which elements of an association are overlapping and which are altered (change target, change cue, change both, change neither); (2) how much the changed element differs from the original (word relatedness, with two ranges of relatedness considered); (3) retention period (immediate test, 2-day delay). Furthermore, each experiment has a large N sample size, so both significant effects as well as null effects are robust and informative. 

      The results show the benefits of relatedness, but also replicate interference effects in the "change target" condition when the new target is not related to the old target and when the test is immediate. This provides a reconciliation of some existing seemingly contradictory results on the effect of overlap on memory. Here, the whole range of conditions is mapped to convincingly show how the direction of the effect can flip across the surface of relatedness values. 

      Additional strength comes from supporting analyses, such as analyses of learning data, demonstrating that relatedness leads to both better final memory and also faster initial learning. 

      More broadly, the study informs our understanding of memory integration, demonstrating how the interdependence of memory for related information increases with relatedness. Together with a prior study of retroactive interference and facilitation, the results provide new insights into the role of reminding in memory formation.

      In summary, this is a highly rigorous body of work that sets a great model for future studies and improves our understanding of memory organization. 

      We thank the reviewer for their thorough summary and very supportive words!

      Weaknesses: 

      The evidence for the proactive facilitation driven by relatedness is very convincing. However, in the finer scale results, the continuous relationship between the degree of relatedness and the degree of proactive facilitation/interference is less clear. This could be improved with some additional analyses and/or context and discussion. In the narrower range, the measure used was AS, with values ranging from 0.03-0.98, where even 0.03 still denotes clearly related words (pious - holy). Within this range from "related" to "related a lot", no relationship to the degree of facilitation was found. The wider range results are reported using a different scale, GloVe, with values from -0.14 to 0.95, where the lower end includes unrelated words (sap - laugh). It is possible that any results of facilitation/interference observed in the wider range may be better understood as a somewhat binary effect of relatedness (yes or no) rather than the degree of relatedness, given the results from the narrower condition. These two options could be more explicitly discussed. The report would benefit from providing clearer information about these measures and their range and how they relate to each other (e.g., not a linear transformation). It would be also helpful to know how the values reported on the AS scale would end up if expressed in the GloVe scale (and potentially vice-versa) and how that affects the results. Currently, it is difficult to assess whether the relationship between relatedness and memory is qualitative or quantitative. This is less of a problem with interdependence analyses where the results converge across a narrow and wider range. 

      We thank the reviewer for this point. While other analyses do show differences across the range of AS values we used, we agree that in the case of the memorability analysis in the narrower stimulus set, 48-hr experiment (or when combining across the narrower and wider stimulus sets), there could be a stronger influence of binary (yes/no) relatedness. We have now made this point explicitly (p. 26):

      “Altogether, these results show that PI can still occur with low relatedness, like in other studies finding PI in ΔTarget (A-B, A-D) paradigms (for a review, see Anderson & Neely, 1996), but PF occurs with higher relatedness. In fact, the absence of low relatedness pairs in the narrower stimulus set likely led to the strong overall PF in this condition across all pairs (positive y-intercept in the upper right of Fig 3A). In this particular instance, there may have been a stronger influence of a binary factor (whether they are related or not), though this remains speculative and is not the case for other analyses in our paper.”

      Additionally, we have also emphasized that the two relatedness metrics are not linear transforms of each other. Finally, as in addressing both your and reviewer #3’s comment below, we now graph relatedness values under a common GloVe metric in Fig 1-Supp 1C (p. 9):

      “Please note that GloVe is an entirely different relatedness metric and is not a linear transformation of AS (see Fig 1-Supp 1C for how the two stimulus sets compare using the common GloVe metric).”
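
      To make the embedding-based metric more concrete, here is a minimal sketch of how relatedness between two words is commonly scored from word vectors, namely as cosine similarity. We assume here that the GloVe relatedness values are cosine similarities between embedding vectors, which is consistent with their reported range but is our assumption; the function name and the toy inputs are hypothetical and are not real GloVe embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(u @ v / denom)

# Toy 3-dimensional vectors standing in for real embeddings
vec_a = [0.2, 0.9, 0.1]
vec_b = [0.25, 0.8, 0.0]
print(cosine_similarity(vec_a, vec_b))
```

      Unlike an association-norm measure, a score of this kind can be negative, which is one way to see that the two scales are not simple rescalings of each other.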

      A smaller weakness is generalizability beyond the word set used here. Using a carefully crafted stimulus set and repeating the same word pairings across participants and conditions was important for memorability calculations and some of the other analyses. However, highlighting the inherently noisy item-by-item results, especially in the Osgood-style surface figures, makes it challenging to imagine how the results would generalize to new stimuli, even within the same relatedness ranges as the current stimulus sets. 

      We thank the reviewer for this critique. We have added this caveat in the limitations to suggest that future studies should replicate these general findings with different stimulus sets (p. 28):

      “Finally, future studies could ensure these effects are not limited to these stimuli and generalize to other word stimuli in addition to testing other domains (Baek & Papaj, 2024; Holding, 1976).”

      Reviewer #3 (Public Review): 

      Summary: 

      Bennion et al. investigate how semantic relatedness proactively benefits the learning of new word pairs. The authors draw predictions from Osgood (1949), which posits that the degree of proactive interference (PI) and proactive facilitation (PF) of previously learned items on to-be-learned items depends on the semantic relationships between the old and new information. In the current study, participants learn a set of word pairs ("supplemental pairs"), followed by a second set of pairs ("base pairs"), in which the cue, target, or both words are changed, or the pair is identical. Pairs were drawn from either a narrower or wider stimulus set and were tested after either a 5-minute or 48-hour delay. The results show that semantic relatedness overwhelmingly produces PF and greater memory interdependence between base and supplemental pairs, except in the case of unrelated pairs in a wider stimulus set after a short delay, which produced PI. In their final analyses, the authors compare their current results to previous work from their group studying the analogous retroactive effects of semantic relatedness on memory. These comparisons show generally similar, if slightly weaker, patterns of results. The authors interpret their results in the framework of recursive reminders (Hintzman, 2011), which posits that the semantic relationships between new and old word pairs promote reminders of the old information during the learning of the new to-be-learned information. These reminders help to integrate the old and new information and result in additional retrieval practice opportunities that in turn improve later recall. 

      Strengths: 

      Overall, I thought that the analyses were thorough and well-thought-out and the results were incredibly well-situated in the literature. In particular, I found that the large sample size, inclusion of a wide range of semantic relatedness across the two stimulus sets, variable delays, and the ability to directly compare the current results to their prior results on the retroactive effects of semantic relatedness were particular strengths of the authors' approach and make this an impressive contribution to the existing literature. I thought that their interpretations and conclusions were mostly reasonable and included appropriate caveats (where applicable). 

      We thank the reviewer for this kind, effective summary and highlight of the paper’s strengths!

      Weaknesses: 

      Although I found that the paper was very strong overall, I have three main questions and concerns about the analyses. 

      My first concern lies in the use of the narrow versus wider stimulus sets. I understand why the initial narrow stimulus set was defined using associative similarity (especially in the context of their previous paper on the retroactive effects of semantic similarity), and I also understand their rationale for including an additional wider stimulus set. What I am less clear on, however, is the theoretical justification for separating the datasets. The authors include a section combining them and show in a control analysis that there were no directional effects in the narrow stimulus set. The authors seem to imply in the Discussion that they believe there are global effects of the lower average relatedness on differing patterns of PI vs PF across stimulus sets (lines 549-553), but I wonder if an alternative explanation for some of their conflicting results could be that PI only occurs with pairs of low semantic relatedness between the supplemental and base pair and that because the narrower stimulus set does not include the truly semantically unrelated pairs, there was no evidence of PI. 

      We agree with the reviewer’s interpretation here, and we have now directly stated this in the discussion section (p. 26):

      “Altogether, these results show that PI can still occur with low relatedness, like in other studies finding PI in ΔTarget (A-B, A-D) paradigms (for a review, see Anderson & Neely, 1996), but PF occurs with higher relatedness. In fact, the absence of low relatedness pairs in the narrower stimulus set likely led to the strong overall PF in this condition across all pairs (positive y-intercept in the upper right of Fig 3A).”

      As for the remainder of this concern, please see our response to your elaboration on the critique below.

      My next concern comes from the additive change in both measures (change in Cue + change in Target). This measure is simply a measure of overall change, in which a pair where the cue changes a great deal but the target doesn't change is treated equivalently to a pair where the target changes a lot, but the cue does not change at all, which in turn are treated equivalently to a pair where the cue and target both change moderate amounts. Given that the authors speculate that there are different processes occurring with the changes in cue and target and the lack of relationship between cue+target relatedness and memorability, it might be important to tease apart the relative impact of the changes to the different aspects of the pair. 

      We thank the reviewer for this great point. First, we should clarify that we only added cue and target similarity values in the ΔBoth condition, which means that all instances of equivalence relate to non-zero values for both cue and target similarity. However, it is certainly possible cue and target similarity separately influence memorability or interdependence. We have now run this analysis separately for cue and target similarity (but within the ΔBoth condition). For memorability, neither cue nor target similarity independently predicted memorability within the ΔBoth condition in any of the four main experiments (all p > 0.23). Conversely, there were some relationships with interdependence. In the narrower stimulus set, 48-hr delay experiment, both cue and target similarity significantly or marginally predicted base-secondary pair interdependence (Cue: r = 0.30, p = 0.04; Target: r = 0.29, p = 0.054). Notably, both survived partial correlation analyses partialing out the other factor (Cue: r = 0.33, p = 0.03; Target: r = 0.32, p = 0.04). In the wider stimulus set, 48-hr delay experiment, only target similarity predicted interdependence (Cue: r = 0.09, p = 0.55; Target: r = 0.34, p = 0.02), and target similarity also predicted interdependence after partialing out cue similarity (r = 0.34, p = 0.02). Similarly, in the narrower stimulus set, 5-min delay experiment, only target similarity predicted interdependence (Cue: r = 0.01, p = 0.93; Target: r = 0.41, p = 0.005), and target similarity also predicted interdependence after partialing out cue similarity (r = 0.42, p = 0.005). Neither predicted interdependence in the wider stimulus set, 5-min delay experiment (Cue: r = -0.14, p = 0.36; Target: r = 0.09, p = 0.54). We have opted to leave this out of the paper for now, but we could include it if the reviewer believes it is worthwhile.
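
      For readers less familiar with the partial correlation analyses mentioned above, the following is a minimal sketch of one standard way to compute them: regress the controlled variable out of both variables of interest and correlate the residuals. The function name and the synthetic data are illustrative assumptions; this is not the analysis code used in the study, and it does not compute p-values.

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation between x and y after removing the linear effect of control."""
    x, y, control = (np.asarray(a, dtype=float) for a in (x, y, control))
    # Design matrix with an intercept column and the control variable
    Z = np.column_stack([np.ones_like(control), control])
    # Residuals of x and y after least-squares regression on Z
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic example: two predictors that overlap, and one outcome
rng = np.random.default_rng(0)
cue_sim = rng.normal(size=50)
target_sim = 0.5 * cue_sim + rng.normal(size=50)
interdependence = 0.4 * target_sim + rng.normal(size=50)
print(partial_corr(target_sim, interdependence, cue_sim))
```

      With a single control variable, this residual-based estimate agrees with the closed-form expression based on the three pairwise correlations.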

      Note that we address the multiple regression point raised by the reviewer in the critique below.

      Finally, it is unclear to me whether there was any online spell-checking that occurred during the free recall in the learning phase. If there wasn't, I could imagine a case where words might have accidentally received additional retrieval opportunities during learning - take for example, a case where a participant misspelled "razor" as "razer." In this example, they likely still successfully learned the word pair but if there was no spell-checking that occurred during the learning phase, this would not be considered correct, and the participant would have had an additional learning opportunity for that pair. 

      We did not use online spell checking. We agree that misspellings would be considered successful instances of learning (meaning that for those words, they would essentially have successful retrieval more than once). However, we do not have a reason to think that this would meaningfully differ across conditions, so the main learning results would still hold. We have included this in the Methods (p. 29-30):

      “We did not use spell checking during learning, meaning that in some cases pairs could have been essentially retrieved more than once. However, we do not believe this would differ across conditions to affect learning results.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      In terms of the framing of the paper, I think the paper would benefit from a clearer explication of the different theories at play in the introductory section. There are a few theories being examined. Memory-for-change is described in most detail in the discussion, it would help to describe it more deliberately in the intro. The authors refer to a PI account, and this is contrasted with the memory-for-change account, but it seems to me that these theories are not mutually exclusive. In the discussion, several theories are mentioned in passing without being named, e.g., I believe the authors are referring to the fan effect when they mention the difference between delta-cue and delta-target conditions. Perhaps this could be addressed with a more detailed account of the theory underlying Osgood's predictions, which I believe arise from an associative account of paired-associates memory. Osgood's work took place when there was a big debate between unlearning and interference. The current work isn't designed to speak directly to that old debate. But it may be possible to develop the theory a bit more in the intro, which would go a long way towards scaffolding the many results for the reader, by giving them a better sense up front of the theoretical implications. 

      We thank the reviewer for this comment and the nudge to clarify these points. First, we have now made the memory-for-change and remindings accounts more explicit in the introduction, as well as the fact that we are combining the two in forming predictions for the current study (p. 3):

      “Conversely, in favor of the PF account, we consider two main, related theories. The first is the importance of “remindings” in memory, which involve reinstating representations from an earlier study phase during later learning (Hintzman, 2011). This idea centers study-phase retrieval, which involves being able to mentally recall prior information and is usually applied to exact repetitions of the same material (Benjamin & Tullis, 2010; Hintzman et al., 1975; Siegel & Kahana, 2014; Thios & D’Agostino, 1976; Zou et al., 2023). However, remindings can occur upon the presentation of related (but not identical) material and can result in better memory for both prior and new information when memory for the linked events becomes more interdependent (Hintzman, 2011; Hintzman et al., 1975; McKinley et al., 2019; McKinley & Benjamin, 2020; Schlichting & Preston, 2017; Tullis et al., 2014; Wahlheim & Zacks, 2019). The second is the memory-for-change framework, which builds upon these ideas and argues that humans often retrieve prior experiences during new learning, either spontaneously by noticing changes from what was learned previously or by instruction (Jacoby et al., 2015; Jacoby & Wahlheim, 2013). The key advance of this framework is that recollecting changes is necessary for PF, whereas PI occurs without recollection. This framework has been applied to paradigms including stimulus changes, including common paired associate paradigms (e.g., A-B, A-D) that we cover extensively later. Because humans may be more likely to notice and recall prior information when it is more related to new information, these two accounts would predict that semantic relatedness instead promotes successful remindings, which would create PF and interdependence among the traces.”

      Second, as the reviewer suggests, we were referring to the fan effect in the discussion, and we have now made that more explicit (p. 26):

      “We believe these effects arise from the competing processes of impairments between competing responses at retrieval that have not been integrated versus retrieval benefits when that integration has occurred (which occurs especially often with high target relatedness). These types of competing processes appear operative in various associative learning paradigms such as retrieval-induced forgetting (Anderson & McCulloch, 1999; Carroll et al., 2007), and the fan effect (Moeser, 1979; Reder & Anderson, 1980).”

      Finally, our reading of Osgood’s proposal is as an attempt to summarize the qualitative effects of the scattered literature (as of 1949) and did not discuss many theories. For this reason, we generally focus on the directional predictions relating to Osgood’s surface, but we couch it in theories proposed since then.

      It strikes me that the advantage seen for items in the retroactive study compared to the proactive study is consistent with classic findings examining spontaneous recovery. These classic studies found that first-learned materials tended to recover to a level above second-learned materials as time passed. This could be consistent with the memory-for-change proposal presented in the text. The memory-for-change proposal provides a potential cognitive mechanism for the effect, here I'm just suggesting a connection that could be made with the spontaneous recovery literature. 

      We thank the reviewer for this suggestion. Indeed, we agree there is a meaningful point of connection here. We have added the following to the Discussion (p. 27):

      “Additionally, these effects partially resemble those on spontaneous recovery, whereby original associations tend to face interference after new, conflicting learning, but slowly recover over time (either absolutely or relative to the new learning) and often eventually eclipse memory for the new information (Barnes & Underwood, 1959; Postman et al., 1969; Wheeler, 1995). In both cases, original associations appear more robust to change over time, though it is unclear whether these similar outcomes stem from similar mechanisms.”

      Minor recommendations 

      Line 89: relative existing -> relative to existing. 

      Line 132: "line from an unrelated and identical target" -> from an unrelated to identical target (take a look, just needs rephrasing). 

      Line 340: (e.g. peace-shaverazor) I wasn't clear whether this was a typographical error, or whether the intent was to typographically indicate a unified representation.

      Line 383: effects on relatedness -> effects of relatedness.

      We thank the reviewer for catching these errors. We have fixed them, and for the third comment, we have clarified that we indeed meant to indicate a unified representation (p. 12):

      “[e.g., peace-shaverazor (written jointly to emphasize the unification)]”

      Page 24: Figure 8. I think the statistical tests in this figure are just being done between the pairs of the same color? Like in the top left panel, delta-cue pro and delta-target retro are adjacent and look equivalent, but there is no n.s. marking for this pair. Could consider keeping the connecting line between the linked conditions and removing the connecting lines that span different conditions. 

      Indeed, we were only comparing conditions with the same color. We have changed the connecting lines to reflect this.

      Page 26 line 612: I think this is the first mention that the remindings account is referred to as the memory-for-change framework, consider mentioning this in the introduction. 

      Thank you – we have now mentioned this in the introduction.

      Lines 627-630. Is this sentence referring to the fan effect? If so it could help the reader to name it explicitly. 

      We have now named this explicitly.

      Reviewer #2 (Recommendations For The Authors): 

      This is a matter of personal preference, but I would prefer PI and PF spelled out instead of the abbreviations. This was also true for RI and RF which are defined early but then not used for 20 pages before being re-used again. In contrast, the naming of the within-subject conditions was very intuitive. 

      We appreciate this perspective. However, we prefer to keep the terms PI and PF for the sake of brevity. We now re-introduce terms that do not return until later in the manuscript.

      Osgood surface in Figure 1A could be easier to read if slightly reformatted. For example, target and cue relatedness sides are very disproportional and I kept wondering if that was intentional. The z-axis could be slightly more exaggerated so it's easier to see the critical messages in that figure (e.g., flip from + to - effect along the one dimension). The example word pairs were extremely helpful. 

      Figures 1C and 1D were also very helpful. It would be great if they could be a little bigger as the current version is hard to read. 

      Figure 1B took a while to decipher and could use a little more anticipation in the body of the text. Any reason to plot the x-axis from high to low on this figure? It is confusing (and not done in the actual results figures). I believe the supplemental GloVe equivalent in the supplement also has a confusing x-axis. 

      We thank the reviewer for this feedback. We have modified Figure 1A to reduce the disproportionality and accentuate the z-axis changes. We have also made the text in C and D larger. Finally, we have flipped around the x-axis in B and in the supplement.

      The description of relatedness values was rather confusing. It is not intuitive to accept that AS values from 0.03-0.96 are "narrow", as that seems to cover almost the whole theoretical range. I do understand that 0.03 is still a value showing relatedness, but more explanation would be helpful. It is also not clear how the GloVe values compare to the AS values. If I am understanding the measures and ranges correctly, the "narrow" condition could also be called "related only" while the "wide" condition could be called "related and unrelated". This is somewhat verbalized but could be clearer. In general, please provide a straightforward way for a reader to explicitly or implicitly compare those conditions, or even plot the "narrow" condition using both AS values and GloVe values so one can really compare narrow and wider conditions comparing apples with apples. 

      We thank the reviewer for this critique. First, we have now sought to clarify this in the Introduction (p. 11-12):

      “Across the first four experiments, we manipulated two factors: range of relatedness among the pairs and retention interval before the final test. The narrower range of relatedness used direct AS between pairs using free association norms, such that all pairs had between 0.03-0.96 association strength. Though this encompasses what appears to be a full range of relatedness values, pairs with even low AS are still related in the context of all possible associations (e.g., pious-holy has AS = 0.03 but would generally be considered related) (Fig 1B). The stimuli using a wider range of relatedness spanned the full range of global vector similarity (Pennington et al., 2014) that included many associations that would truly be considered unrelated (Fig 1-Supp 1A). One can see the range of the wider relatedness values in Fig 1-Supp 1B and comparisons between narrower and wider relatedness values in Fig 1-Supp 1C.”

      Additionally, as noted in the text above, we have added a new subfigure to Fig 1-Supp 1 that compares the relatedness values in the narrower and wider stimulus sets using the common GloVe metric.

      Considering a relationship other than linear may also be beneficial (e.g., the difference between AS of 0.03 and 0.13 may not be equal to AS of .83 and .93; same with GloVe). I am assuming that AS and GloVe are not linear transforms of each other. Thus, it is not clear whether one should expect a linear (rather than curvilinear or another monotonic) relationship with both of them. It could be as simple as considering rank-order correlation rather than linear correlation, but just wanted to put this out for consideration. The linear approach is still clearly fruitful (e.g., interdependence), but limits further the utility of having both narrow and wide conditions without a straightforward way to compare them. 

      We thank the reviewer for this point. Indeed, AS and GloVe are not linear transforms of each other, but metrics derived from different sources (AS comes from human free associations; GloVe comes from a learned vector space language model). (We noted this in the text and in our response to your above comment.) However, we do have the ability to put all the word pairs into the GloVe metric, which we do in the Results section, “Re-assessing proactive memory and interdependence effects using a common metric”. In this analysis, we used a linear correlation that combined data sets with a similar retention interval and replicated our main findings earlier in the paper (p. 5):

      “In the 48-hr delay experiment, correlations between memorability and cue relatedness in the ΔCue condition [r2(44) > 0.29, p < 0.001] and target relatedness in the ΔTarget condition [r2(44) = 0.2, p < 0.001] were significant, whereas cue+target relatedness in the ΔBoth condition was not [r2(44) = 0.01, p = 0.58]. In all three conditions, interdependence increased with relatedness [all r2(44) > 0.16, p < 0.001].”

      Following the reviewer suggestion to test things out using rank order, we also re-created the combined analysis using rank order based on GloVe values rather than the raw GloVe values. The ranks now span 1-90 (because there were 45 pairs in each of the narrower and wider stimulus sets). All results qualitatively held.
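      The rank-order variant described above can be sketched as follows. This is a minimal illustration, not our analysis code: the function names and toy inputs are hypothetical, and it assumes that only the GloVe relatedness values are converted to ranks before a linear correlation with memorability is computed. Ranking both variables instead would give a Spearman correlation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_based_r(glove_values, memorability):
    """Correlate the ranks of GloVe relatedness with raw memorability."""
    return pearson_r(average_ranks(glove_values), memorability)
```

      With 45 pairs from each of the two stimulus sets, `average_ranks` would return values spanning 1 to 90 when there are no ties.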

      Author response image 1.

      Rank order results.

      Author response image 2.

      And the raw results in Fig 6-Supp 1 (as a reference).

      Reviewer #3 (Recommendations For The Authors):

      In regards to my first concern, the authors could potentially test whether the stimulus sets are different by specifically looking at pairs from the wider stimulus set that overlap with the range of relatedness from the narrow set and see if they replicate the results from the narrow stimulus set. If the results do not differ, the authors could simplify their results section by collapsing across stimulus sets (as they did in the analyses presented in Figure 6 - Supplementary Figure 1). If the authors opt to keep the stimulus sets separate, it would be helpful to include a version of Figure 1b/Figure 1 - Supplementary Figure 1 where the coverage of the two stimulus sets are plotted on the same figure using GloVe similarity so it is easier to interpret the results. 

      We have conducted this analysis in two ways, though we note that we ultimately chose to keep the stimulus sets separate. First, we examined memorability between the data sets by removing one pair at a time from the wider stimulus set until there was no significant difference (p > 0.05). We did this at the long delay because that was more informative for most of our analyses. Even after reducing the wider stimulus set, the narrow stimulus set still had significantly or marginally higher memorability in all three conditions (p < 0.001 for ΔCue; p < 0.001 for ΔTarget; p = 0.08 for ΔBoth). We reasoned that this was likely because the AS values still differed (all p < 0.001), which would present a clear way for participants to associate words that may not be as strongly similar in vector space (perhaps due to polysemy for individual words). When we ran the analysis a different way that equated AS, we no longer found significant memorability differences (p = 0.13 for ΔCue; p = 0.50 for ΔTarget; p = 0.18 for ΔBoth). However, equating the two data sets in this analysis required us to drop so many pairs from the wider stimulus set (because only a few had a direct AS connection; 3, 5, and 1 pairs were kept in the ΔCue, ΔTarget, and ΔBoth conditions, respectively) that we would prefer not to report this result.

      Additionally, we now plot the two stimulus sets on the same plot (Reviewer 2 also suggested this).

      In regards to my second concern, one potential way the authors could disambiguate the effects of change in cue vs change in target might be to run a multiple linear regression with change in Cue, change in Target, and the change in Cue*change in Target interaction (potentially with random effects of subject identity and word pair identity to combine experiments and control for pair memorability/counterbalancing), which has the additional bonus of potentially allowing the authors to include all word pairs in a single model and better describe the Osgood-style spaces in Figure 6.

      This is a very interesting idea. We set this analysis up as the reviewer suggested, using fixed effects for ΔCue, ΔTarget, and ΔCue*ΔTarget, and random effects for subject and word ID. Because we had a binary outcome variable, we used mixed effects logistic regression. For a given pair, if it had the same cue or target, the corresponding change column received a 0, and if it had a different cue or target, it received a graded value (1 - GloVe value between the new and old cue or target). Because we designed this analysis to indicate a treatment away from a repeat (as in the No Δ condition, which had no change for either cues or targets), we omitted control items. For items in the ΔBoth condition, we initially used positive values in both the Cue and Target columns too, with the multiplied ΔCue*ΔTarget value in its own column. We focused these analyses on the 48-hr delay experiments. In both experiments, running it this way resulted in highly significant negative effects of ΔCue and ΔTarget (both p < 0.001), but positive effects of ΔCue*ΔTarget (p < 0.001), presumably because after accounting for the negative independent predictions of both ΔCue and ΔTarget, ΔCue*ΔTarget values actually were better than expected.

      We thought that those results were a little strange given that generally there did not appear to be interactions with ΔCue*ΔTarget values, and the positive result was simply due to the other predictors in the model. To show that this is the case, we changed the predictors so that items in the ΔBoth condition had 0 in ΔCue and ΔTarget columns alongside their ΔCue*ΔTarget value. In this case, all three factors negatively predicted memory (all p < 0.001).
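      The two ways of coding the distance predictors described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than our analysis code: the function name, the condition labels, and the `zero_main_in_both` flag are hypothetical, and the sketch covers only how the predictor columns are filled, not the mixed effects logistic regression itself.

```python
def distance_predictors(condition, cue_glove=None, target_glove=None,
                        zero_main_in_both=False):
    """Build the change predictors for one word pair.

    condition: "no_change" (No Δ), "cue" (ΔCue), "target" (ΔTarget),
        or "both" (ΔBoth). Control items are omitted from this analysis.
    cue_glove / target_glove: GloVe similarity between the old and new
        cue or target; only needed when that element changed.
    zero_main_in_both: if True, ΔBoth items get 0 in the cue and target
        columns and keep only their interaction value (second coding).
    """
    if condition not in ("no_change", "cue", "target", "both"):
        raise ValueError("control items are omitted from this analysis")

    cue_changed = condition in ("cue", "both")
    target_changed = condition in ("target", "both")

    # Unchanged elements get 0; changed elements get a graded distance.
    d_cue = 1.0 - cue_glove if cue_changed else 0.0
    d_target = 1.0 - target_glove if target_changed else 0.0

    # The interaction is nonzero only when both elements changed.
    interaction = d_cue * d_target

    if condition == "both" and zero_main_in_both:
        d_cue = d_target = 0.0

    return {"d_cue": d_cue, "d_target": d_target, "interaction": interaction}
```

      Under the first coding, a ΔBoth item contributes to all three columns; under the second, it contributes only to the interaction column, which is why the sign of the interaction estimate differs between the two models.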

      We don't necessarily see this second approach as better, partly because it seems clear to us that any direction you go from identity is just hurting memory, and we felt the need to drop the control condition. We next flipped around the analysis to more closely resemble how we ran the other analyses, using similarity instead of distance. Here, identity along any dimension indicated a 1, a change in any part of the pair involved using that pair’s GloVe value (rather than the 1 – the GloVe value from above), and the control condition simply had zeros in all the columns. In this case, if we code the cue and target similarity values as themselves in the ΔBoth condition, in both 48-hr experiments, cue and target similarity significantly positively predicted memory (narrower set: cue similarity had p = 0.006, target similarity had p < 0.001; wider set: both p < 0.001) and the interaction term negatively predicted memory (p < 0.001 in both). If we code cue and target similarity values as 0s in the ΔBoth condition, all three factors tend to be positive (narrower, Cue: p = 0.11, Target and Interaction: p < 0.001; wider, Cue and Target p < 0.001; Interaction: p = 0.07).

      Ultimately, we would prefer to leave this out of the manuscript in the interest of simplicity and because we largely find that these analyses support our prior conclusions. However, we could include them if the reviewer prefers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      In this study, Alejandro Rosell et al. uncover the immunoregulatory functions of the RAS-p110α pathway in macrophages, including the extravasation of monocytes from the bloodstream and subsequent lysosomal digestion. Disrupting the RAS-p110α pathway by mouse genetic tools or by pharmacological intervention hampers the inflammatory response, leading to delayed resolution and more severe acute inflammatory reactions. The authors proposed that activating p110α using small molecules could be a promising approach for treating chronic inflammation. This study provides insights into the roles and mechanisms of p110α on macrophage function and the inflammatory response, while some conclusions are still questionable because of several issues described below. 

      (1) Fig. 1B showed that disruption of RAS-p110α causes a decrease in the activation of NF-κB, which is a crucial transcription factor that regulates the expression of proinflammatory genes. However, the authors observed that disruption of RAS-p110α interaction results in an exacerbated inflammatory state in vivo, in both localized paw inflammation and systemic inflammatory mediator levels. Also, the authors stated that "this disruption leads to a change in macrophage polarization, favoring a more proinflammatory M1 state" in the introduction according to reference 12. The conclusions drawn from the signaling and the models seemed contradictory and puzzling. Besides, it is not clear why the protein level of p65 was decreased at 10' and 30'. Was it attributed to the degradation of p65 or experimental variation? 

      We thank the reviewer for this insightful comment and apologize for not previously explaining the implications of the observed decrease in NF-κB activation. We found a decrease in NF-κB activation in response to LPS + IFN-γ stimulation in macrophages lacking RAS-PI3K interaction. As the reviewer pointed out, NF-κB is a key transcription factor that regulates the expression of various proinflammatory genes. To better characterize whether the decrease in p-p65 would lead to a reduction in the expression of specific cytokines, we performed a cytokine array using unstimulated and LPS + IFN-γ stimulated macrophages. The results indicated a small number of cytokines with altered expression, validating that RAS-p110α activation of p-p65 regulates the expression of some inflammatory cytokines. These results have been added to the manuscript and to Figure 1 (panels C and D). In brief, the data suggest an impairment in recruitment factors and inflammatory regulators following the disruption of RAS-p110α signaling in macrophages, which aligns with the observed in vivo phenotype. 

      Our findings indicate that the disruption of RAS-p110α signaling has a complex and multifaceted role in BMDMs. Specifically, monocytes lacking RAS-PI3K are unable to reach the inflamed area due to an impaired ability to extravasate, caused by altered actin cytoskeleton dynamics. Consequently, inflammation is sustained over time, continuously releasing inflammatory mediators. Moreover, we have shown that macrophages deficient in RAS-p110α interaction fail to mount a full inflammatory response due to decreased activation of p-p65, leading to reduced production of a set of inflammatory regulators. Additionally, these macrophages are unable to effectively process phagocytosed material and activate the resolutive phase of inflammation. As a result of these defects, an exacerbated and sustained inflammatory response occurs. 

      Our in vivo data, showing an increase in systemic inflammatory mediators, might be a consequence of the accumulation of monocytes produced by bone marrow progenitors in response to sensed inflammatory stimuli, but unable to extravasate.

      Regarding the sentence in the introduction: "this disruption leads to a change in macrophage polarization, favoring a more proinflammatory M1 state" (reference 12), this was observed in an oncogenic context, which might differ from the role of RAS-p110α in a non-oncogenic situation, as analyzed in this work. We introduced these results as an example to establish the role of RAS-p110α in macrophages, demonstrating its participation in macrophage-dependent responses. Together with our study, these findings clearly indicate that p110α signaling is critical when analyzing full immune responses. Previously, little was known about the role of this PI3K isoform in immune responses. Our data, along with those presented by Murillo et al. (ref. 12), demonstrate that p110α plays a significant role in macrophage function in both oncogenic and inflammatory contexts. Additionally, our results suggest that this role is complex and multifaceted, warranting further investigation to fully understand the complexity of p110α signaling in macrophages.

      Regarding the decreased levels of p65 at 10’ and 30’ in RBD cells, we are still uncertain about the possible molecular mechanism leading to the observed decrease. No changes in p65 mRNA levels were observed after 30 minutes of LPS+IFNγ treatment, as shown in Author response image 1.

      Author response image 1.

      Preliminary data not shown here suggest that treating macrophages with BYL exhibits a similar effect, indicating a potential pathway for investigation. Considering that the decrease in protein levels is not due to lower mRNA expression, we may infer that post-translational mechanisms are leading to early protein degradation in RAS-p110α deficient macrophages. This could explain the observed decrease in protein activation. However, the specific molecular mechanism responsible for this degradation remains unclear, and further research is necessary to elucidate it. 

      (2) In Fig 3, the authors used bone-marrow derived macrophages (BMDMs) instead of isolated monocytes to evaluate the ability of monocyte transendothelial migration, which is not sufficiently convincing. In Fig. 3B, the authors evaluated the migration in Pik3caWT/- BMDMs, and Pik3caWT/WT BMDMs treated with BYL-719'. Given that the dose effect of gene expression, the best control is Pik3caWT/- BMDMs treated with BYL-719. 

      We thank the reviewer for this comment. While we agree that using BMDMs might not be the most conventional approach for studying monocyte migration, there are several reasons why we still consider them a valid model. Although isolated monocytes are the initial cell type involved in transendothelial migration, bone marrow-derived macrophages (BMDMs) provide a relevant and practical model for studying this process. BMDMs are differentiated from the same bone marrow precursors as monocytes and retain the ability to respond to chemotactic signals, adhere to endothelial cells, and migrate through the endothelium. This makes them a suitable tool for examining the cellular and molecular mechanisms underlying monocyte migration and subsequent macrophage infiltration into tissues. Additionally, BMDMs offer experimental consistency and are easier to manipulate in vitro, enabling more controlled and reproducible studies. 

      In response to the comment regarding Fig. 3B, we appreciate the suggestion to use Pik3ca WT/- BMDMs treated with BYL-719 as a control. However, our rationale for using Pik3ca WT/WT BMDMs treated with BYL-719 was based on a conceptual approach rather than a purely experimental control. The BYL-719 treatment in Pik3ca WT/WT cells was intended to simulate the inhibition of p110α in a fully functional, wild-type context. This allows us to directly assess the impact of p110α inhibition under normal physiological conditions, which is more representative of what would occur in an organism where the full dose of Pik3ca is present. Using Pik3ca WT/- BMDMs treated with BYL-719 as a control may not accurately reflect the in vivo scenario, where any therapeutic intervention would likely occur in the context of a fully functional, wild-type background. Our approach aims to provide a clearer understanding of how p110α inhibition affects cell functionality in a wild-type setting, which is relevant for potential therapeutic applications. Therefore, we considered the use of Pik3ca WT/WT BMDMs with BYL-719 treatment to be a more appropriate control for testing the effects of p110α inhibition in normal conditions.

      (3) In Fig. 4E-4G, the authors observed that elevated levels of serine 3 phosphorylated Cofilin in Pik3caRBD/- BMDMs both in unstimulated and in proinflammatory conditions, and phosphorylation of Cofilin at Ser3 increase actin stabilization, it is not clear why disruption of RAS-p110α binding caused a decrease in the F-actin pool in unstimulated BMDMs? 

      We thank the reviewer for this insightful comment. During the review process, we have carefully quantified all the Western blots conducted. While we did observe an increase in phospho-Cofilin (Ser3) levels in RBD BMDMs, this increase did not reach statistical significance. As a result, we cannot confidently attribute the observed increase in F-actin to this proposed mechanism. We apologize for any confusion this may have caused. Consequently, we have removed these data from Figure 4G and the associated discussion.

      Unfortunately, we have not yet identified the underlying mechanism responsible for this phenotype. Future experiments will focus on exploring potential alterations in other actin-nucleating, regulating, and stabilizing proteins that could account for the observed changes in F-actin levels.

      Reviewer #2 (Public Review): 

      Summary: 

      Cell intrinsic signaling pathways controlling the function of macrophages in inflammatory processes, including in response to infection, injury or in the resolution of inflammation are incompletely understood. In this study, Rosell et al. investigate the contribution of RAS-p110α signaling to macrophage activity. p110α is a ubiquitously expressed catalytic subunit of PI3K with previously described roles in multiple biological processes including in epithelial cell growth and survival, and carcinogenesis. While previous studies have already suggested a role for RAS-p110α signaling in macrophages function, the cell intrinsic impact of disrupting the interaction between RAS and p110α in this central myeloid cell subset is not known. 

      Strengths: 

      Exploiting a sound previously described genetically mouse model that allows tamoxifen-inducible disruption of the RAS-p110α pathway and using different readouts of macrophage activity in vitro and in vivo, the authors provide data consistent with their conclusion that alteration in RAS-p110α signaling impairs the function of macrophages in a cell intrinsic manner. The study is well designed, clearly written with overall high-quality figures. 

      Weaknesses: 

      My main concern is that for many of the readouts, the difference between wild-type and mutant macrophages in vitro or between wild-type and Pik3caRBD mice in vivo is rather modest, even if statistically significant (e.g. Figure 1A, 1C, 2A, 2F, 3B, 4B, 4C). In other cases, such as for the analysis of the H&E images (Figure 1D-E, S1E), the images are not quantified, and it is hard to appreciate what the phenotype in samples from Pik3caRBD mice is or whether this is consistently observed across different animals. Also, the authors claim there is a 'notable decrease' in Akt activation but 'no discernible change' in ERK activation based on the western blot data presented in Figure 1A. I do not think the data shown supports this conclusion. 

      We appreciate the reviewer's careful examination of our data and their observation regarding the modest differences between wild-type and mutant macrophages in vitro, as well as between wild-type and Pik3caRBD mice in vivo. Although the differences observed in Figures 1A, 1C, 2A, 2F, 3B, 4B, and 4C are modest, they are statistically significant, and our data demonstrate that they are biologically relevant and should be interpreted within the specific nature of our model. Our study focuses on the disruption of the RAS-p110α interaction, but it should be noted that alternative pathways for p110α activation, independent of RAS, remain functional in this model. Additionally, the model retains the expression of other p110 isoforms, such as p110β, p110γ, and p110δ, which are known to have significant roles in immune responses. Given the overlapping functions of these p110 isoforms, and the fact that our model involves a subtle modification that specifically affects the RAS-p110α interaction without completely abrogating p110α activity, it is understandable that only modest effects are observed in some readouts. The redundancy and compensation by other p110 isoforms likely mitigate the impact of disrupting RAS-mediated p110α activation.

      However, despite these modest in vitro differences, it is crucial to highlight that the in vivo effects on inflammation are both clear and consistent. The persistence of inflammation in our model suggests that the RAS-p110α interaction plays a specific, non-redundant role in resolving inflammation, which cannot be fully compensated by other signaling pathways or p110 isoforms. These findings underscore the importance of RAS-p110α signaling in immune homeostasis and suggest that even subtle disruptions in this pathway can lead to significant physiological consequences over time, particularly in the context of inflammation. The modest differences observed may represent early or subtle alterations that could lead to more pronounced phenotypes under specific stress or stimulation conditions. This could be tested across all the figures mentioned. For instance, in Fig. 1A, the Western blot for AKT has been quantified, demonstrating a significant decrease in AKT levels; in Fig. 1C, although the difference in paw inflammation was only a few millimeters in thickness, considering the size of a mouse paw, those millimeters were very noticeable by eye. In addition, pathological examination of the tissue consistently showed an increase in inflammation in RBD mice. Furthermore, the consistency of the observed differences across different readouts and experimental setups reinforces the reliability and robustness of our findings. Even modest changes that are consistently observed across different assays and conditions are indicative of genuine biological effects. The statistical significance of the differences indicates that they are unlikely to be due to random variation. This statistical rigor supports the conclusion that the observed effects, albeit modest, are real and warrant further exploration.

      Regarding the analysis of H&E images, we have now quantified the changes with the assistance of the pathologist, Mª Carmen García Macías, who has been added to the author list. We removed the colored arrows from the images and instead quantified fibrin and chromatin remnants as markers of inflammation staging. Loose chromatin, which increases as a consequence of cell death, is higher in the early phases of inflammation and decreases as macrophages phagocytose cell debris to initiate tissue healing. Chromatin content was scored on a scale from 1 to 3, where 1 represents the lowest amount and 3 the highest. The scoring was based on the area within the acute inflammatory abscess where chromatin could be found: 3 for less than 30%, 2 for 30-60%, and 1 for over 60%. Graphs corresponding to this quantification have now been added to Figure 1 and an explanation of the scale has been added to Material and Methods. 

      To further substantiate the extent of macrophage function alteration upon disruption of RAS-p110α signaling, the manuscript would benefit from testing macrophage activity in vitro and in vivo across other key macrophage activities such as bacteria phagocytosis, cytokine/chemokine production in response to titrating amounts of different PAMPs, inflammasome function, etc. This would be generally important overall but also useful to determine whether the defects in monocyte motility or macrophage lysosomal function are selectively controlled downstream of RAS-p110α signaling.  

      We thank reviewer #2 for this comment. In order to better address the role of RAS-PI3K in macrophage function, we have performed some additional experiments, some of which have been added to the revised version of the manuscript. 

      (1) We have performed cytokine microarrays of RAS-p110α deficient macrophages, both unstimulated and stimulated with LPS + IFN-γ. Results have been added to the manuscript and to Supplementary Figure S1E and S1F. In brief, the data obtained suggest an impairment in recruitment factors, as well as in inflammatory regulators, after disruption of RAS-p110α signaling in macrophages, which aligns with the phenotype observed in vivo. 

      (2) We also conducted phagocytosis assays to analyze the ability of RAS-p110α deficient macrophages to phagocytose 1 µm Sepharose beads, Borrelia burgdorferi, and apoptotic cells. The data reveal varied behavior of RAS-p110α deficient bone marrow-derived macrophages (BMDMs) depending on the target: 

      • Engulfment of Non-biological Particles: RAS-p110α deficient macrophages showed a decreased ability to engulf 1 µm Sepharose beads. This suggests that RAS-p110α signaling is important for the effective phagocytosis of non-biological particles. These findings have now been added to the text and figures have been added to supplementary Fig. S4A.

      • Response to Bacterial Pathogens: When exposed to Borrelia burgdorferi, RAS-p110α deficient macrophages did not exhibit a change in bacterial uptake. This indicates that RAS-p110α may not play a critical role in the initial phagocytosis of this bacterial pathogen. The observed increase in the phagocytic index, although not statistically significant, might imply a compensatory mechanism or a more complex interaction that warrants further investigation. These findings have now been added to the text and figures have been added to supplementary Fig. S4B. These experiments were performed in collaboration with Dr. Anguita, from CIC bioGUNE (Bilbao, Spain) and, as a consequence, he has been added as an author in the paper. 

      • Phagocytosis of Apoptotic Cells: There were no differences in the phagocytosis rate of apoptotic cells between RAS-p110α deficient and control macrophages at early time points. However, the accumulation of engulfed material at later time points suggests a possible delay in the processing and degradation of apoptotic cells in the absence of RAS-p110α signaling.

      These findings highlight the complexity of RAS-p110α's involvement in phagocytic processes and suggest that its role may vary with different types of phagocytic targets. 

      Furthermore, given the key role of other myeloid cells besides macrophages in inflammation and immunity it remains unclear whether the phenotype observed in vivo can be attributed to impaired macrophage function. Is the function of neutrophils, dendritic cells or other key innate immune cells not affected? 

      Thank you for this insightful comment. We understand the key role of other myeloid cells in inflammation and immunity. However, our study specifically focuses on the role of macrophages. Our data show that disruption of RAS-PI3K leads to a clear defect in macrophage extravasation, and our in vitro data demonstrate issues in macrophage cytoskeleton and phagocytosis, aligning with the in vivo phenotype.

      Experiments investigating the role of RAS-PI3K in neutrophils, dendritic cells, or other innate immune cells are beyond the scope of this study. Understanding these interactions would indeed require separate, comprehensive studies and the generation of new mouse models to disrupt RAS-PI3K exclusively in specific cell types.

      Furthermore, during paw inflammation experiments, polymorphonuclear cells were present from the initial phases of the inflammatory response. What caught our attention was the prolonged presence of these cells. Our in-house pathologist pointed out the lack of macrophages available to remove dead polymorphonuclear cells in our RAS-PI3K mutant mice. Specific staining for macrophages confirmed the absence of macrophages in the inflamed node of mutant mice.

      We acknowledge that further research is necessary to elucidate the effects on other myeloid cells. However, our current findings provide clear evidence of a decrease in inflammatory monocytes and defective macrophage responses to inflammation, both in vivo and in vitro. We believe these results significantly contribute to understanding the role of RAS-PI3K in macrophage function during inflammation.

      Compelling proof of concept data that targeting RAS-p110α signalling constitutes indeed a putative approach for modulation of chronic inflammation is lacking. Addressing this further would increase the conceptual advance of the manuscript and provide extra support to the authors' suggestion that p110α inhibition or activation constitute promising approaches to manage inflammation. 

      We thank Reviewer #2 for this insightful comment. In our manuscript, we have demonstrated through multiple experiments that the inhibition of p110α, either by disrupting RAS-p110α signaling or through the use of Alpelisib (BYL-719), has a modulatory effect on inflammatory responses. However, we acknowledge that we have not activated the pathway due to the unavailability of a suitable p110α activator until the concluding phase of our study.

      We recognize the importance of this point and are eager to investigate both the inhibition and activation of p110α as potential approaches to managing inflammation in well-established inflammatory disease models. We believe that such comprehensive studies would significantly enhance the conceptual advance and translational relevance of our findings.

      However, it is essential to note that the primary aim of our current work was to demonstrate the role of RAS-p110α in the inflammatory responses of macrophages. We have successfully shown that RAS-p110α influences macrophage behavior and inflammatory signaling. Expanding the scope to include disease models and pathway activation studies would be an extensive project that goes beyond the current objectives of this manuscript. While our present study establishes the foundational role of RAS-p110α in macrophage-mediated inflammatory responses, we agree that further investigation into both p110α inhibition and activation in disease models is crucial. We are keen to pursue this line of research in future studies, which we believe will provide robust evidence supporting the therapeutic potential of targeting RAS-p110α signaling in chronic inflammation.

      Finally, the analysis by FACS should also include information about the total number of cells, not just the percentage, which is affected by the relative change in other populations. On this point, Figure S2B shows a substantial, albeit not significant (with fewer mice analysed), increase in the percentage of CD3+ cells. Is there an increase in the absolute number of T cells or does this apparent relative increase reflect a reduction in myeloid cells?

      We thank the reviewer for this comment, which we have addressed in the revised version of the manuscript. Regarding the total number of cells analyzed, we have added to the Materials and Methods section that in all our studies, a total of 50,000 cells were analyzed (line 749). The percentages of cells are related to these 50,000 events. Additionally, we have increased the number of mice analyzed by including new mice for CD3+ cell analysis. Despite this, the results remain not significant.
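
      The reviewer's point about relative percentages can be made concrete with a small arithmetic sketch. The counts below are hypothetical and are not data from the study; the population names and the helper `percentages` are assumptions introduced only for illustration. The sketch shows how the percentage of one population can rise while its absolute count stays constant, simply because another population shrinks.

```python
def percentages(counts):
    """Convert absolute cell counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}

# Hypothetical absolute counts per sample (not measured values).
control = {"CD3+": 10_000, "myeloid": 30_000, "other": 10_000}
mutant = {"CD3+": 10_000, "myeloid": 15_000, "other": 10_000}

control_pct = percentages(control)
mutant_pct = percentages(mutant)

# The CD3+ count is identical in both samples, yet its share of the
# total is larger in the sample with fewer myeloid cells.
print(round(control_pct["CD3+"], 1), round(mutant_pct["CD3+"], 1))
```

      Under these assumed counts, the CD3+ share differs between the two samples only because the myeloid count differs, which is why absolute numbers help interpret percentage data.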

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors):   

      (1) It is recommended to provide a graphical abstract to summarize the multiple functions of RAS-p110α pathway in monocyte/macrophages that the authors proposed 

      We thank the reviewer for this useful recommendation. A graphical abstract has now been added to the study.

      (2) Western blots in this paper need quantification and a measure of reproducibility 

      We have now added a graph with the quantification of the western blots performed in this work as a measure of reproducibility. 

      (3) Representative flow data and gating strategy should be included

      We have now added a description of the gating strategy to the Materials and Methods section.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This work provides a new dataset of 71,688 images of different ape species across a variety of environmental and behavioral conditions, along with pose annotations per image. The authors demonstrate the value of their dataset by training pose estimation networks (HRNet-W48) on both their own dataset and other primate datasets (OpenMonkeyPose for monkeys, COCO for humans), ultimately showing that the model trained on their dataset had the best performance (performance measured by PCK and AUC). In addition to their ablation studies where they train pose estimation models with either specific species removed or a certain percentage of the images removed, they provide solid evidence that their large, specialized dataset is uniquely positioned to aid in the task of pose estimation for ape species.

      The diversity and size of the dataset make it particularly useful, as it covers a wide range of ape species and poses, making it particularly suitable for training off-the-shelf pose estimation networks or for contributing to the training of a large foundational pose estimation model. In conjunction with new tools focused on extracting behavioral dynamics from pose, this dataset can be especially useful in understanding the basis of ape behaviors using pose.

      We thank the reviewer for the kind comments.

      Since the dataset provided is the first large, public dataset of its kind exclusively for ape species, more details should be provided on how the data were annotated, as well as summaries of the dataset statistics. In addition, the authors should provide the full list of hyperparameters for each model that was used for evaluation (e.g., mmpose config files, textual descriptions of augmentation/optimization parameters).

      We have added more details on the annotation process and have included the list of instructions sent to the annotators. We have also included mmpose configs with the code provided. The following files include the relevant details:

      File including the list of instructions sent to the annotators: OpenMonkeyWild Photograph Rubric.pdf

      Mmpose configs:

      i) TopDownOAPDataset.py

      ii) animal_oap_dataset.py

      iii) __init__.py

      iv) hrnet_w48_oap_256x192_full.py

      Anaconda environment files:

      i) OpenApePose.yml

      ii) requirements.txt

      Overall this work is a terrific contribution to the field and is likely to have a significant impact on both computer vision and animal behavior.

      Strengths:

      • Open source dataset with excellent annotations on the format, as well as example code provided for working with it.

      • Properties of the dataset are mostly well described.

      • Comparison to pose estimation models trained on humans vs monkeys, finding that models trained on human data generalized better to apes than the ones trained on monkeys, in accordance with phylogenetic similarity. This provides evidence for an important consideration in the field: how well can we expect pose estimation models to generalize to new species when using data from closely or distantly related ones?

      • Sample efficiency experiments reflect an important property of pose estimation systems, which indicates how much data would be necessary to generate similar datasets in other species, as well as how much data may be required for fine-tuning these types of models (also characterized via ablation experiments where some species are left out).

      • The sample efficiency experiments also reveal important insights about scaling properties of different model architectures, finding that HRNet saturates in performance improvements as a function of dataset size sooner than other architectures like CPMs (even though HRNets still perform better overall).

      We thank the reviewer for the kind comments.

      Weaknesses:

      • More details on training hyperparameters used (preferably full config if trained via mmpose).

      We have now included mmpose configs and anaconda environment files that allow researchers to use the dataset with specific versions of mmpose and other packages we trained our models with. The list of files is provided above.

      • Should include dataset datasheet, as described in Gebru et al 2021 (arXiv:1803.09010).

      We have included a datasheet for our dataset in the appendix lines 621-764.

      • Should include crowdsourced annotation datasheet, as described in Diaz et al 2022 (arXiv:2206.08931). Alternatively, the specific instructions that were provided to Hive/annotators would be highly relevant to convey what annotation protocols were employed here.

      We have included the list of instructions sent to the Hive annotators in the supplementary materials. File: OpenMonkeyWild Photograph Rubric.pdf

      • Should include model cards, as described in Mitchell et al (arXiv:1810.03993).

      We have included a model card for the included model in the results section line 359. See Author response image 1.

      Author response image 1.

      • It would be useful to include more information on the source of the data as they are collected from many different sites and from many different individuals, some of which may introduce structural biases such as lighting conditions due to geography and time of year.

      We agree that the source could introduce structural biases. This is why we included images from so many different sources and captured images at different times from the same source—in hopes that a large variety of background and lighting conditions are represented. However, doing so limits our ability to document each source background and lighting condition separately.

      • Is there a reason not to use OKS? This incorporates several factors such as landmark visibility, scale, and landmark type-specific annotation variability as in Ronchi & Perona 2017 (arXiv:1707.05388). The latter (variability) could use the human pose values (for landmarks types that are shared), the least variable keypoint class in humans (eyes) as a conservative estimate of accuracy, or leverage a unique aspect of this work (crowdsourced annotations) which affords the ability to estimate these values empirically.

      The focus of this work is on overall keypoint localization accuracy, so we wanted a metric that is easy to interpret and implement. We therefore used PCK (Percentage of Correct Keypoints), a simple and widely used metric that measures the percentage of keypoints localized within a certain distance threshold of their corresponding ground-truth keypoints.
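
      The metric described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the paper: the function name `pck`, the choice of normalization length, and the default threshold of 0.2 are assumptions made for the example.

```python
import math

def pck(pred, gt, visible, norm_length, threshold=0.2):
    """Percentage of Correct Keypoints for one instance, as a fraction.

    pred, gt:     lists of (x, y) keypoint coordinates
    visible:      list of booleans, True where the keypoint is annotated
    norm_length:  normalization length (assumed here: bounding-box size)
    threshold:    fraction of norm_length within which a keypoint counts
    """
    correct = 0
    counted = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # unannotated keypoints do not enter the score
        counted += 1
        if math.hypot(px - gx, py - gy) <= threshold * norm_length:
            correct += 1
    return correct / counted if counted else float("nan")

# Two keypoints, one predicted 1 px off and one 10 px off, with a
# normalization length of 10 px (so the tolerance is 2 px).
score = pck([(1, 0), (10, 20)], [(0, 0), (10, 10)], [True, True], 10)
print(score)
```

      Because the score depends only on a distance threshold, it is easy to interpret, but unlike OKS it does not weight keypoints by type-specific annotation variability.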

      • A reporting of the scales present in the dataset would be useful (e.g., histogram of unnormalized bounding boxes) and would align well with existing pose dataset papers such as MS-COCO (arXiv:1405.0312) which reports the distribution of instance sizes and instance density per image.

      We have now included a histogram of unnormalized bounding boxes in the manuscript (see Author response image 2).

      Author response image 2.

      Reviewer #2 (Public Review):

      The authors present the OpenApePose database constituting a collection of over 70000 ape images which will be important for many applications within primatology and the behavioural sciences. The authors have also rigorously tested the utility of this database in comparison to available Pose image databases for monkeys and humans to clearly demonstrate its solid potential.

      We thank the reviewer for the kind comments.

      However, the variation in the database with regards to individuals, background, source/setting is not clearly articulated and would be beneficial information for those wishing to make use of this resource in the future. At present, there is also a lack of clarity as to how this image database can be extrapolated to aid video data analyses which would be highly beneficial as well.

      I have two major concerns with regard to the manuscript as it currently stands which I think if addressed would aid the clarity and utility of this database for readers.

      1) Human annotators are mentioned as annotating the 16 landmarks manually for all images but there is no assessment of inter-observer reliability or the like. I think something to this end is currently missing, along with how many annotators there were. This will be essential for others to know who may want to use this database in the future.

      We thank the reviewer for pointing this out. Inter-observer reliability is important for ensuring the quality of the annotations. We first used Amazon MTurk to crowdsource annotations and found that the inter-observer reliability and the annotation quality were poor. This was the reason for choosing a commercial service such as Hive AI. As the crowdsourcing and quality control are managed by Hive through their internal procedures, we do not have access to data that would allow us to assess inter-observer reliability. However, the annotation quality was assessed by first author ND through manual inspection of the annotations visualized on all of the images in the database. Additionally, our ablation experiments, which show high out-of-sample performance, further validate the quality of the annotations.

      Relevant to this comment, in your description of the database, a table or such could be included, providing the number of images from each source/setting per species and/or number of individuals. Something to give a brief overview of the variation beyond species. (subspecies would also be of benefit for example).

      Our goal was to obtain as many images as possible from the most commonly studied ape species. In order to ensure a large enough database, we focused only on the species and combined images from as many sources as possible to reach our goal of ~10,000 images per species. With the wide range of people involved in obtaining the images, we could not ensure that all the photographers had the necessary expertise to differentiate individuals and subspecies of the subjects they were photographing. We could only ensure that the right species was being photographed. Hence, we cannot include more detailed information.

      2) You mention around line 195 that you used a specific function for splitting up the dataset into training, validation, and test but there is no information given as to whether this was simply random or if an attempt to balance across species, individuals, background/source was made. I would actually think that a balanced approach would be more appropriate/useful here so whether or not this was done, and the reasoning behind that must be justified.

      This is especially relevant given that in one test you report balancing across species (for the sample size subsampling procedure).

      We created the training set to reflect the species composition of the whole dataset, but used test sets balanced by species. This was done to give a sense of the performance of a model trained on the entire dataset, in which the species are not fully balanced. We believe that researchers interested in training models using this dataset for behavior tracking applications would use the entire dataset to fully leverage the variation in the dataset. However, for those interested in training models with balanced species, we provide an annotation file with all the images included, which allows researchers to create their own training and test sets that meet their specific needs. We have added this justification to the manuscript to guide users with different needs. Lines 530-534: “We did not balance our training set for the species as we wanted to utilize the full variation in the dataset and assess models trained with the proportion of species as reflected in the dataset. We provide annotations including the entire dataset to allow others to create their own training/validation/test sets that suit their needs.”
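
      For readers who want to build their own splits from the full annotation file, the idea of a species-balanced test set with an unbalanced training set can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used to build the released splits: the function `split_by_species`, the parameter `test_per_species`, and the annotation schema (dicts with a "species" key) are hypothetical.

```python
import random
from collections import Counter, defaultdict

def split_by_species(annotations, test_per_species, seed=0):
    """Return (train, test).

    test holds the same number of images for every species; train keeps
    all remaining images, so it stays unbalanced and roughly follows the
    species composition of the full pool.

    annotations: list of dicts with a "species" key (hypothetical schema)
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for ann in annotations:
        by_species[ann["species"]].append(ann)

    train, test = [], []
    for species in sorted(by_species):
        items = by_species[species][:]
        rng.shuffle(items)
        if len(items) <= test_per_species:
            raise ValueError(f"not enough images for {species}")
        test.extend(items[:test_per_species])
        train.extend(items[test_per_species:])
    return train, test

# Toy pool: 10 images of one species and 4 of another.
pool = ([{"id": i, "species": "chimpanzee"} for i in range(10)]
        + [{"id": 10 + i, "species": "gibbon"} for i in range(4)])
train, test = split_by_species(pool, test_per_species=2)
print(Counter(a["species"] for a in train), Counter(a["species"] for a in test))
```

      A fixed seed keeps the split reproducible; a validation set could be carved out of the training portion in the same way.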

      And another perhaps major concern that I think should also be addressed somewhere is the fact that this is an image database tested on images while the abstract and manuscript mention the importance of pose estimation for video datasets, yet the current manuscript does not provide any clear test of video datasets nor engage with the practicalities associated with using this image-based database for applications to video datasets. Somewhere this needs to be added to clarify its practical utility.

      We thank the reviewer for this important suggestion. Since we can separate a video into its constituent frames, one can indeed use the provided model or other models trained using this dataset for inference on the frames, thus allowing video tracking applications. We now include a short video clip of a chimpanzee with inferences from the provided model visualized in the supplementary materials.

      Reviewer #1 (Recommendations For The Authors):

      • Please provide a more thorough description of the annotation procedure (i.e., the instructions given to crowd workers)! See public review for reference on dataset annotation reporting cards.

      We have included the list of instructions for Hive annotators in the supplementary materials.

      • An estimate of the crowd worker accuracy and variability would be super valuable!

      While we agree that this is useful, we do not have access to Hive internal data on crowd worker IDs that could allow us to estimate these metrics. Furthermore, we assessed each image manually to ensure good annotation quality.

      • In the methods section it is reported that images were discarded because they were either too blurry, small, or highly occluded. Further quantification could be provided. How many images were discarded per species?

      It’s not really clear to us why this is interesting or important. We used a large number of photographers and annotators, some of whom gave a high ratio of great images; some of whom gave a poor ratio. But it’s not clear what those ratios tell us.

      • Placing the numerical values at the end of the bars would make the graphs more readable in Figures 4 and 5.

      We thank the reviewer for this suggestion. While we agree that this can help, we do not have space to include the number in a font size that would be readable. Smaller font sizes that are likely to fit may not be readable for all readers. We have included the numerical values in the main text in the results section for those interested and hope that the figures provide a qualitative sense of the results to the readers.

    1. Author response:

      eLife Assessment

      This valuable short paper is an ingenious use of clinical patient data to address an issue in imaging neuroscience. The authors clarify the role of face-selectivity in human fusiform gyrus by measuring both BOLD fMRI and depth electrode recordings in the same individuals; furthermore, by comparing responses in different brain regions in the two patients, they suggested that the suppression of blood oxygenation is associated with a decrease in local neural activity. While the methods are compelling and provide a rare dataset of potentially general importance, the presentation of the data in its current form is incomplete.

      We thank the Reviewing editor and Senior editor at eLife for their positive assessment of our paper. After reading the reviewers’ comments – to which we reply below – we agree that the presentation of the data could be more complete. We provide additional presentation of data in the responses below and we will slightly modify Figure 2 of the paper. However, to keep the short format of the paper, the revised version will have the same number of figures, which support the claims made in the paper.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. We agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – were found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus), and in the same hemisphere. We clearly acknowledge this limitation in the revised version of the manuscript (p.8: Material and Methods section).

      In the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the manuscript (p.5).

      As the reviewer is certainly aware, the number and position of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring (macro-micro hybrid) electrodes in the same patient, one with microelectrodes ending up in the lateral MidFG and the other in the medial MidFG. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and face-negative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. We respectfully disagree that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. We agree, however, that the absolute suppression could be better documented; it was not, owing to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time domain, relative to face onset, the response drops below baseline level. This means that the neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. With this technique, we cannot tell whether neurons showing relative vs. absolute suppression of responses are spatially segregated (e.g., forming two separate sub-regions) or intermingled. Nor can we tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" will be rephrased in the revised manuscript as follows: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain".

      Reviewer #3 (Public review):

      In this paper the authors conduct two experiments: an fMRI experiment and intracranial recordings of neurons in two patients, P1 and P2. In both experiments, they employ an SSVEP paradigm in which they show images at a fast rate (e.g. 6 Hz), with face images appearing at a slower rate (e.g. 1.2 Hz) and the rest of the images being a variety of object images. In the first patient, they record from neurons over a region in the mid fusiform gyrus that is face-selective, and in the second patient, they record neurons from a more medial region that is not face-selective (it responds more strongly to objects than faces). The results show similar selectivity in the electrophysiology data and the fMRI data, in that the location which shows a higher fMRI response to faces also contains face-selective neurons, and the location which shows a preference for non-faces also contains non-face-preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category selectivity measured from fMRI. The data is unique as it contains a lot of single and multiunit recordings (245 units) from the human fusiform gyrus, which, as the authors point out, is a hominoid-specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; Thus, more information (results and methods) is warranted; and in particular there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means a paucity of data presentation, we agree and provide more below, although the methods and results appear complete to us. The comparison between fMRI and SEEG is there, but it can only be indirect (i.e., the data were collected at different times and cannot be related on a trial-by-trial basis, for instance). In addition, our manuscript aims to provide a short empirical contribution that furthers our understanding of the relationship between neural responses and the BOLD signal, not a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated in Figure 2. We mention this in the manuscript, but we agree that we do not illustrate this decrease in the time domain for single units (SU), which we initially did not consider necessary (please see below for such a presentation).

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We respond to the more specific requests for additional evidence below, but the reviewer should be aware that this is a short report that already reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data will be made available upon publication of the final version of the paper.

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above, the comparison between fMRI and SEEG is indirect (i.e., the data were collected at different times and cannot be related on a trial-by-trial basis, for instance) and would not allow us to build such a model.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different than the average response across neurons; This raises a concern about interpreting the average response across neurons in panel F which seems different than the single neuron responses

      The reviewer is correct, and we apologize for the confusion. The difference arises because the average data in panel F have been notch-filtered to remove the 6 Hz response and its harmonics, as indicated in the methods (p.11): ‘a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)’.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For the sake of clarity, we prefer to present the notch-filtered data in the paper, but the revised version will make it clear in the figure caption that the average data have been notch-filtered.

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogeneous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      - in EEG: Rossion et al., 2015: https://doi.org/10.1167/15.1.18

      - in SEEG: Jonas et al., 2016: https://doi.org/10.1073/pnas.1522033113

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.

      (g) Related to (e): In total, the authors recorded data from 245 units (some single units and some multiunits), and they found that in both the face-selective and non-face-selective regions most of the recorded neurons exhibited face-selectivity, which this reader found confusing. They write: “Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)”. Is the face selectivity in P1 an increase in response to faces and in P2 a reduction in response to faces, or is it an increase in response to faces in both?

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and P2 a reduction in response to faces.

      (1) Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also applied the FFT to spike trains that had been temporally smoothed by convolving them with a 20 ms square window (Le Cam et al., 2023, cited in the paper). This did not change the outcome of the frequency analyses in the frequency range of interest.
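
      As a minimal sketch of why a Fourier transform can be applied directly to discrete spike trains, the Python example below builds a synthetic binary spike train whose firing probability is locked to a 6 Hz stimulation cycle and compares the Fourier amplitude at 6 Hz with that at a frequency unrelated to the stimulation. It is an illustration under stated assumptions, not our analysis pipeline: the 1 kHz sampling rate (instead of 30 kHz, to keep the example fast), the 10 s duration, the firing probabilities, and the function names `synthetic_spike_train` and `dft_amplitude` are all hypothetical.

```python
import cmath
import math
import random


def synthetic_spike_train(duration_s, sample_rate_hz, stim_hz, seed=0):
    """Binary spike train (0/1 per sample) from a hypothetical unit that
    fires mostly during the first 20% of each stimulation cycle."""
    rng = random.Random(seed)
    n_samples = int(round(duration_s * sample_rate_hz))
    train = []
    for k in range(n_samples):
        phase = (stim_hz * k / sample_rate_hz) % 1.0  # position within the cycle
        p_spike = 0.2 if phase < 0.2 else 0.005       # assumed firing probabilities
        train.append(1 if rng.random() < p_spike else 0)
    return train


def dft_amplitude(signal, sample_rate_hz, freq_hz):
    """Normalized amplitude of a single discrete Fourier component."""
    acc = 0j
    for k, x in enumerate(signal):
        if x:  # only samples that contain a spike contribute to the sum
            acc += x * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate_hz)
    return abs(acc) / len(signal)


spikes = synthetic_spike_train(duration_s=10.0, sample_rate_hz=1000.0, stim_hz=6.0)
amp_stim = dft_amplitude(spikes, 1000.0, 6.0)   # stimulation frequency
amp_other = dft_amplitude(spikes, 1000.0, 5.0)  # frequency unrelated to the stimulation
print(f"6 Hz: {amp_stim:.4f}   5 Hz: {amp_other:.4f}")
```

      In this toy example, the spikes cluster at a consistent phase of the 6 Hz cycle, so their contributions add up at 6 Hz and largely cancel at 5 Hz. A full FFT computes the same quantity for all frequency bins at once.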

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33 ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0 ms, the stimulus has 0% contrast). Full (100%) contrast is reached at half a stimulation cycle, which is 83.33 ms here, but a response is likely triggered before 100% contrast is reached. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences with either a sinewave contrast modulation (black line) or a squarewave (i.e., abrupt) contrast modulation (red line). The 33 ms value is based on the LFP data obtained in response to sinewave and squarewave stimulation in the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33 ms per frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).
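
      The arithmetic behind these numbers can be checked with a short Python sketch. It assumes that the sinewave contrast modulation follows a raised-cosine profile, contrast(t) = (1 − cos(2π·6·t))/2, which starts at 0% and peaks at half a cycle; this profile and the function name `sinewave_contrast` are assumptions made for illustration.

```python
import math

STIM_HZ = 6.0       # image presentation rate
REFRESH_HZ = 120.0  # screen refresh rate


def sinewave_contrast(t_s, stim_hz=STIM_HZ):
    """Contrast (0 to 1) at time t_s after cycle onset, assuming a
    raised-cosine modulation that starts at 0% contrast."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * stim_hz * t_s))


frame_s = 1.0 / REFRESH_HZ     # duration of one refresh frame (s)
shift_s = 4 * frame_s          # onset shift of 4 frames (s)
half_cycle_s = 0.5 / STIM_HZ   # time of peak contrast (s)

print(f"one frame:          {frame_s * 1000:.2f} ms")
print(f"4-frame shift:      {shift_s * 1000:.2f} ms")
print(f"contrast at shift:  {sinewave_contrast(shift_s) * 100:.1f} %")
print(f"peak contrast time: {half_cycle_s * 1000:.2f} ms")
```

      Under this assumed profile, four frames at 120 Hz correspond to about 33 ms and to roughly 35% of full contrast, in agreement with the values given above.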

      Author response image 4.

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions: faces and objects and has no baseline; In other words, responses to faces are measured relative to the baseline response to objects so that any region that contains neurons that have a lower firing rate to faces than objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g. blank screen, with no visual stimulation) this experimental design cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper concerns mechanisms of foraging behavior in C. elegans. Upon removal from food, C. elegans first executes a stereotypical local search behavior in which it explores a small area by executing many random, undirected reversals and turns called "reorientations." If the worm fails to find food, it transitions to a global search in which it explores larger areas by suppressing reorientations and executing long forward runs (Hills et al., 2004). At the population level, the reorientation rate declines gradually. Nevertheless, about 50% of individual worms appear to exhibit an abrupt transition between local and global search, which is evident as a discrete transition from high to low reorientation rate (Lopez-Cruz et al., 2019). This observation has given rise to the hypothesis that local and global search correspond to separate internal states with the possibility of sudden transitions between them (Calhoun et al., 2014). The main conclusion of the paper is that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rates. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Strengths:

      The strength of the paper is the demonstration that a more parsimonious model explains abrupt transitions in the reorientation rate.

      Weaknesses:

      (1) Use of the Gillespie algorithm is not well justified. A conventional model with a fixed dt and an exponentially decaying reorientation rate would be adequate and far easier to explain. It would also be sufficiently accurate - given the appropriate choice of dt - to support the main claims of the paper, which are merely qualitative. In some respects, the whole point of the paper - that discrete transitions are an epiphenomenon of stochastic behavior - can be made with the authors' version of the model having a constant reorientation rate (Figure 2f).

      We apologize, but we are not sure what the reviewer means by “fixed dt”. If the reviewer means taking discrete steps in time (dt), and modeling whether a reorientation occurs, we would argue that the Gillespie algorithm is a better way to do this because it provides floating-point precision time resolution, rather than a time resolution limited by dt, which we hopefully explain in the comments below.

      The reviewer is correct that discrete transitions are an epiphenomenon of stochastic behavior, as we show in Figure 2f. However, abrupt stochastic jumps that occur at a constant rate do not produce persistent changes in the observed rate because the rate is, by definition, constant. The theory that there are local and global searches is based on the observation that individual worms often abruptly change their rates. But this observation holds for only a fraction of worms. We argue that abrupt changes are not observed for all, or even most, worms because they result from stochastic sampling, not from a sudden change in search strategy.

      (2) In the manuscript, the Gillespie algorithm is very poorly explained, even for readers who already understand the algorithm; for those who do not it will be essentially impossible to comprehend. To take just a few examples: in Equation (1), omega is defined as reorientations instead of cumulative reorientations; it is unclear how (4) follows from (2) and (3); notation in (5), line 133, and (7) is idiosyncratic. Figure 1a does not help, partly because the notation is unexplained. For example, what do the arrows mean, what does "*" mean?

      We apologize for this. You are correct: Ω is cumulative reorientations, and we will edit the text as follows:

      Experimentally, reorientation rate is measured as the number of reorientation events that occurred in an observational window. However, these are discrete stochastic events, so we should describe them in terms of propensity, i.e. the probability of observing a transitional event (in this case, a reorientation) is:

      Here, P(W+1,t) is the probability of observing a reorientation event at time t, and a<sub>1</sub> is the propensity for this event to occur. Observationally, the frequency of reorientations observed decays over time, so we can define the propensity as:

      Where α is the initial propensity at t=0.

      We can model this decay as the reorientation propensity coupled to a decaying factor (M):

      Where the propensity of this event (a<sub>2</sub>) is:

      Since M is a first-order decay process, when integrated, the cumulative M observed is:

      We can couple the probability of observing a reorientation to this decay by redefining a<sub>1</sub> as:

      So that now:

      A critical detail should be noted. While reorientations are modeled as discrete events, the amount of M at time t=0 is chosen to be large (M<sub>0</sub>←1,000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      To model both processes, we can create the master equation:

      Since these are both Poisson processes, the probability density function for a state change i occurring after a time interval τ is:

      The probability that an event will not occur in time interval τ is:

      The probability that no events will occur for ALL transitions in this time interval is:

      We can draw a random number (r<sub>1</sub> ∈[0,1]) that represents the probability of no events in time interval τ, so that this time interval can be assigned by rearranging equation 11:

      where:

      This is the time interval for any event (W+1 or M-1) happening at t + τ. The probability of which event occurs is proportional to its propensity:

      We can draw a second number (r<sub>2</sub> ∈[0,1]) that represents this probability so that which event occurs at time t + τ is determined by the smallest n that satisfies:

      so that:

      The elegant efficiency of the Gillespie algorithm is two-fold. First, it models all transitions simultaneously, not separately. Second, it provides floating-point time resolution. Rather than drawing a random number, and using a cumulative probability distribution of interval-times to decide whether an event occurs at discrete steps in time, the Gillespie algorithm uses this distribution to draw the interval-time itself. The time resolution of the prior approach is limited by step size, whereas the Gillespie algorithm’s time resolution is limited by the floating-point precision of the random number that is drawn.
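
      To make the two steps concrete (draw the waiting time from the total propensity, then choose which event occurs in proportion to its propensity), here is a minimal Python sketch of the standard Gillespie direct method applied to a two-event model of this kind. It is a sketch under stated assumptions rather than the exact implementation used in the manuscript: the coupling a<sub>1</sub> = α·M/M<sub>0</sub>, the decay propensity a<sub>2</sub> = γ·M, the parameter values, and the function name `gillespie_reorientations` are hypothetical choices made for illustration. Only M<sub>0</sub> = 1,000 and the 40-minute timescale are taken from the text above.

```python
import math
import random


def gillespie_reorientations(alpha, gamma, m0=1000, t_end=40.0, seed=None):
    """Simulate reorientation times (in minutes) for one hypothetical worm.

    Assumed propensities (illustrative, not taken from the manuscript):
      a1 = alpha * M / m0   reorientation event (W -> W + 1)
      a2 = gamma * M        first-order decay event (M -> M - 1)
    """
    rng = random.Random(seed)
    t, m = 0.0, m0
    reorientation_times = []
    while True:
        a1 = alpha * m / m0
        a2 = gamma * m
        a0 = a1 + a2                  # total propensity of all events
        if a0 == 0.0:
            break                     # M is exhausted, nothing more can happen
        r1 = 1.0 - rng.random()       # uniform in (0, 1], avoids log of zero
        tau = math.log(1.0 / r1) / a0 # waiting time until the next event
        t += tau
        if t > t_end:
            break
        r2 = rng.random()
        if r2 * a0 < a1:              # event chosen in proportion to propensity
            reorientation_times.append(t)
        else:
            m -= 1
    return reorientation_times


times = gillespie_reorientations(alpha=2.0, gamma=0.1, seed=1)
print(f"{len(times)} reorientations in 40 minutes")
```

      Note that the waiting time `tau` is a floating-point number drawn directly from the exponential distribution, so no time step has to be chosen. With the assumed values, the expected reorientation rate decays approximately as α·exp(−γt), and repeated runs with different seeds give individual trajectories that differ only through stochastic sampling.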

      We are happy to add this text to improve clarity.

      We apologize for the arrow notation confusion. Arrow notation is commonly used in pseudocode to indicate variable assignment, and so we used it to indicate variable assignment updates in the algorithm.

      We added Figure 2a to help explain the Gillespie algorithm for people who are unfamiliar with it, but you are correct, some notation, such as the probabilities, was left unexplained. We will address this to improve clarity.

      (3) In the model, the reorientation rate dΩ⁄dt declines to zero but the empirical rate clearly does not. This is a major flaw. It would have been easy to fix by adding a constant to the exponentially declining rate in (1). Perhaps fixing this obvious problem would mitigate the discrepancies between the data and the model in Figure 2d.

      You are correct that the model deviates slightly at longer times, but this result is consistent with Klein et al., who show a continuous decline in reorientations. However, we could add a constant to the model, since an infinite run length is likely not physiological.

      (4) Evidence that the model fits the data (Figure 2d) is unconvincing. I would like to have seen the proportion of runs in which the model generated one as opposed to multiple or no transitions in reorientation rate; in the real data, the proportion is 50% (Lopez). It is claimed that the "model demonstrated a continuum of switching to non-switching behavior" as seen in the experimental data but no evidence is provided.

      We should clarify that the 50% proportion cited by López-Cruz was based on an arbitrary difference in slopes and on visual assessment of the data. We sought to avoid this subjective assessment by plotting the distributions of slopes and transition times produced by the method used in López-Cruz. We should also clarify what we meant by “a continuum of switching and non-switching” behavior. Neither the transition-time distribution nor the slope-difference distribution appears to be the result of two distributions. This is unlike roaming and dwelling on food, where two distinct distributions of behavioral metrics can be identified based on speed and angular speed (Flavell et al, 2009, Fig S2a). We will add a permutation test to verify that the mean differences in slopes and transition times between the experiment and the model are not significant.
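
      One possible form of such a permutation test is sketched below in Python. It is an assumption-laden illustration, not the final analysis: the two-sided test on the absolute difference in means, the number of permutations, the function name `permutation_test_mean_diff`, and the placeholder values are all hypothetical.

```python
import random
from statistics import mean


def permutation_test_mean_diff(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference in means of x and y."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if diff >= observed:
            n_extreme += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    return (n_extreme + 1) / (n_perm + 1)


# Placeholder values standing in for transition times (minutes); not real data.
experiment = [8.0, 12.5, 10.1, 15.3, 9.7, 11.2, 14.8, 13.0]
model = [9.1, 11.8, 10.9, 14.2, 10.4, 12.6, 13.9, 12.2]
print(f"p = {permutation_test_mean_diff(experiment, model, n_perm=2000):.3f}")
```

      A large p-value would indicate that the observed difference in means is compatible with random relabeling of experimental and simulated values. The same function could be applied to the slope differences.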

      (5) The explanation for the poor fit between the model and data (lines 166-174) is unclear. Why would externally triggered collisions cause a shift in the transition distribution?

      Thank you, we should rewrite the text to make this clearer. There were no externally triggered collisions; 10 animals were used per experiment, and they occasionally collided with one another, but these collisions were excluded from the data that were provided. However, worms are also known to increase reorientations when they encounter a pheromone trail, and it is unknown (from this dataset) which reorientations may have resulted from this phenomenon.

      (6) The discussion of Levy walks and the accompanying figure are off-topic and should be deleted.

      Thank you, we agree that this topic is tangential, and we will remove it.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors build a statistical model that stochastically samples from a time-interval distribution of reorientation rates. The form of the distribution is extracted from a large array of behavioral data, and is then used to describe not only the dynamics of individual worms (including the inter-individual variability in behavior), but also the aggregate population behavior. The authors note that the model does not require assumptions about behavioral state transitions, or evidence accumulation, as has been done previously, but rather that the stochastic nature of behavior is "simply the product of stochastic sampling from an exponential function".

      Strengths:

      This model provides a strong juxtaposition to other foraging models in the worm. Rather than invoking a behavioral transition function (that might arise from a change in internal state or the activity of a cell type in the network), or evidence accumulation (which again maps onto a cell type, or the activity of a network) - this model explains behavior via the stochastic sampling of a function of an exponential decay. The underlying model and the dynamics being simulated, as well as the process of stochastic sampling, are well described and the model fits the exponential function (Equation 1) to data on a large array of worms exhibiting diverse behaviors (1600+ worms from Lopez-Cruz et al). The work of this study is able to explain or describe the inter-individual diversity of worm behavior across a large population. The model is also able to capture two aspects of the reorientations, including the dynamics (to switch or not to switch) and the kinetics (slow vs fast reorientations). The authors also work to compare their model to a few others including the Levy walk (whose construction arises from a Markov process) to a simple exponential distribution, all of which have been used to study foraging and search behaviors.

      Weaknesses:

      This manuscript has two weaknesses that dampen the enthusiasm for the results. First, in all of the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references, for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) is distinct from the behavior being studied. In this manuscript, the distribution being sampled is the exponential decay function of the reorientation rate (lines 100-102). This appears to be tautological - a decay function fitted to the reorientation data is then sampled to generate the distributions of the reorientation data. That the model performs well and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      Thank you, we apologize that this was not clearer. In the Lotka-Volterra model, the densities of predators and prey are being modeled, with the underlying assumption that rates of birth and death are inherently stochastic. In our model, the number of reorientations is being modeled, with the assumption (based on the experiments) that the occurrence of reorientations is stochastic, just as the occurrence (birth) of a prey animal is stochastic. However, the decay in M is phenomenological, and we speculate about the nature of M later in the manuscript.

      You are absolutely right that the decay function for M was fitted to the population average of reorientations and then sampled to generate the distributions of the reorientation data. This was intentional, to show that the parameters chosen to match the population average would produce individual trajectories with stochastic “switching” comparable to that in the experimental data. All we are trying to show is that observed sudden changes in reorientation rate that appear persistent can be produced by a stochastic process without resorting to binary state assignments. In Calhoun et al., 2014, it is reported that all animals produced switch-like behavior, but in Klein et al., 2017, it is reported that no animals showed abrupt transitions. López-Cruz et al. seem to show a mix of these results, which can be easily explained by an underlying stochastic process.

      The second weakness is somewhat related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides. Stochastic sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework, and the authors indeed point this out: "simple stochastic models should be sufficient to explain observably stochastic behaviors." (Line 233-234). But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale; which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left wanting: The mechanisms suggested, including loss of sensory stimuli, alterations in motor integration, ionotropic glutamate signaling, dopamine, and neuropeptides are all suggested: these are basically all of the possible biological sources that can govern behavior, and one is left not knowing what insight the model provides. The array of biological processes listed is so variable in dynamics and meaning, that their explanation of what governs M is at best unsatisfying. Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection, especially taken in aggregate with the previous weakness.

      Providing a roadmap of how to think about the processes generating M, the meaning of those processes in search, and potential frameworks that are more constrained and with more precise biological underpinning (beyond the array of possibilities described) would go a long way to assuaging the weaknesses.

      Thank you, these are all excellent points. We should clarify that López-Cruz et al. claim that only 50% of the animals fit a local/global search paradigm. We are simply proposing that there is no need to designate local and global searches if the data do not really support it. The underlying behavior is stochastic, so the sudden switches sometimes observed can be explained by a stochastic process where the underlying rate is slowing down, thus producing the persistently slow reorientation rate when an apparent “switch” occurs. What we hope to convey is that foraging does not appear to follow a decision paradigm, but instead a gradual change in reorientations which, for individual worms, can occasionally produce reorientation trajectories that appear switch-like.

      As for M, you are correct, we should be more explicit. A decay in reorientation rate, rather than a sudden change, is consistent with observations made by López-Cruz et al. They found that the neurons AIA and ADE redundantly suppress reorientations, and that silencing either one was sufficient to restore the large number of reorientations during early foraging. The synaptic output of AIA and ADE was inhibited over long timescales (tens of minutes) by presynaptic glutamate binding to MGL-1, a slow G-protein-coupled receptor expressed in AIA and ADE. Their results support a model where sensory neurons suppress the synaptic output of AIA and ADE, which in turn leads to a large number of reorientations early in foraging. As time passes, glutamatergic input from the sensory neurons decreases, which leads to disinhibition of AIA and ADE, and a subsequent suppression of reorientations.

      The sensory inputs into AIA and ADE are sequestered into two separate circuits, with AIA receiving chemosensory input and ADE receiving mechanosensory input. Since the suppression of either AIA or ADE is sufficient to increase reorientations, the decay in reorientations is likely due to the synaptic output of both of these neurons decaying in time. This correlates with an observed decrease in sensory neuron activity as well, so the timescale of reorientation decay could be tied to the timescale of sensory neuron activity, which in turn is influencing the timescale of AIA/ADE reorientation suppression. This implies that our factor “M” is likely the sum of several different sensory inputs decaying in time.

      The molecular basis of which sensory neuron signaling factors contribute to decreased AIA and ADE activity is made more complicated by the observation that the glutamatergic input provided by the sensory neurons was not essential, and that additional factors besides glutamate contribute to the signaling to AIA and ADE. In addition, it is not simply the sensory neuron activity that decays in time, but also the sensitivity of AIA and ADE to sensory neuron input. Simply depolarizing sensory neurons after the animals had starved for 30 minutes was insufficient to rescue the reorientation rates observed earlier in the foraging assay. This observation could be due to decreased presynaptic vesicle release, and/or decreased receptor localization on the postsynaptic side.

      In summary, there are two neuronal properties that appear to be decaying in time. One is sensory neuron activity, and the other is decreased potentiation of presynaptic input onto AIA and ADE. Our factor “M” is a phenomenological manifestation of these numerous decaying factors.

      Reviewer #3 (Public review):

      Summary:

      This intriguing paper addresses a special case of a fundamental statistical question: how to distinguish between stochastic point processes that derive from a single "state" (or single process) and more than one state/process. In the language of the paper, a "state" (perhaps more intuitively called a strategy/process) refers to a set of rules that determine the temporal statistics of the system. The rules give rise to probability distributions (here, the probability for turning events). The difficulty arises when the sampling time is finite, and hence, the empirical data is finite, and affected by the sampling of the underlying distribution(s). The specific problem being tackled is the foraging behavior of C. elegans nematodes, removed from food. Such foraging has been studied for decades, and described by a transition over time from 'local'/'area-restricted search' (roughly in the initial 10-30 minutes of the experiments, in which animals execute frequent turns) to 'dispersion', or 'global search' (characterized by a low frequency of turns). The authors propose an alternative to this two-state description - a potentially more parsimonious single 'state' with time-changing parameters, which they claim can account for the full time course of these observations.

      Figure 1a shows the mean rate of turning events as a function of time (averaged across the population). Here, we see a rapid transient, followed by a gradual 4-5 fold decay in the rate, which then levels off. This picture seems consistent with the two-state description. However, the authors demonstrate that individual animals exhibit different "transition" statistics (Figure 1e) and wish to explain this. They do so by fitting this mean with a single function (Equations 1-3).

      Strengths:

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Weaknesses:

      (1) The authors claim that only about half the animals tested exhibit discontinuity in turning rates. Can they automatically separate the empirical and model population into these two subpopulations (with the same method), and compare the results?

      Thank you, we should clarify that the observation that about half the animals exhibit discontinuity was not made by us, but by López-Cruz et al. The observed fraction of 50% was based on a visual assessment of the dual regression method we described. To make the process more objective, we decided to simply plot the distributions of the metrics they used for this assessment to see if two distinct populations could be observed. However, the distributions of slope differences and transition times do not produce two distinct populations. Our stochastic approach, which does not assume abrupt state-transitions, also produces comparable distributions. To quantify this, we will perform permutation tests on the mean and variance differences between experimental and model data.
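
      A permutation test of this kind can be sketched as follows. This is a minimal illustration, not the analysis code: the function name `permutation_test`, the default number of permutations, and the choice of a two-sided absolute-difference statistic are assumptions made here.

```python
import random
from statistics import mean, pvariance


def permutation_test(x, y, stat, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in `stat` between x and y.

    `stat` is any function of a sample, e.g. `mean` or `pvariance`.
    Returns a p-value with the usual +1 correction so it is never zero.
    """
    rng = random.Random(seed)
    observed = abs(stat(x) - stat(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels by shuffling the pooled sample.
        rng.shuffle(pooled)
        diff = abs(stat(pooled[:n_x]) - stat(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

      Calling `permutation_test(experimental, model, mean)` and `permutation_test(experimental, model, pvariance)` would give the two comparisons described above, where `experimental` and `model` are hypothetical lists of per-animal metrics such as slope differences or transition times.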

      (2) The equations consider an exponentially decaying rate of turning events. If so, Figure 2b should be shown on a semi-logarithmic scale.

      We are happy to add this panel as well.

      (3) The variables in Equations 1-3 and the methods for simulating them are not well defined, making the method difficult to follow. Assuming my reading is correct, Omega should be defined as the cumulative number of turning events over time (Omega(t)), not as a "turn" or "reorientation", which has no derivative. The relevant entity in Figure 1a is apparently <Omega (t)>, i.e. the mean number of events across a population which can be modelled by an expectation value. The time derivative would then give the expected rate of turning events as a function of time.

      Thank you, you are correct. Please see response to Reviewer #1.

      (4) Equations 1-3 are cryptic. The authors need to spell out up front that they are using a pair of coupled stochastic processes, sampling a hidden state M (to model the dynamic turning rate) and the actual turn events, Omega(t), separately, as described in Figure 2a. In this case, the model no longer appears more parsimonious than the original 2-state model. What then is its benefit or explanatory power (especially since the process involving M is not observable experimentally)?

      Thank you, yes we see how as written this was confusing. In our response to Reviewer #1, we added an important detail:

      While reorientations are modeled as discrete events, which is observationally true, the amount of M at time t=0 is chosen to be large (M<sub>0</sub>←1,000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      However you are correct that if M was chosen to have a binary value of 0 or 1, then this would indeed be the two state model. Adding this as an additional model would be a good idea to compare how this matches the experimental data, and we are happy to add it.
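
      The coupled process described in this exchange can be sketched as a small Gillespie-style simulation. This is an illustration under stated assumptions, not the manuscript's Equations 1-3: it assumes M decays one unit at a time at rate gamma·M and that reorientations occur at rate alpha·M/M<sub>0</sub>, and the function name and the parameter values in the usage note are hypothetical.

```python
import random


def simulate_worm(alpha, gamma, m0, t_end, rng):
    """Simulate one animal and return its reorientation times in [0, t_end).

    Assumed event channels (illustrative):
      decay: M -> M - 1      at rate gamma * M
      turn:  one reorientation at rate alpha * M / m0
    """
    t = 0.0
    m = m0
    turns = []
    while m > 0:
        decay_rate = gamma * m
        turn_rate = alpha * m / m0
        total = decay_rate + turn_rate
        # Waiting time to the next event of either kind.
        t += rng.expovariate(total)
        if t >= t_end:
            break
        # Choose which event occurred, in proportion to its rate.
        if rng.random() < turn_rate / total:
            turns.append(t)
        else:
            m -= 1
    return turns
```

      For example, `simulate_worm(2.0, 0.1, 1000, 40.0, random.Random(1))` uses a large M<sub>0</sub>, so the turning rate decays almost smoothly. Setting `m0=1` in the same function makes M binary, which reduces the sketch to a two-state model with a single abrupt transition.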

      (5) Further, as currently stated in the paper, Equations 1-3 are only for the mean rate of events. However, the expectation value is not a complete description of a stochastic system. Instead, the authors need to formulate the equations for the probability of events, from which they can extract any moment (they write something in Figure 2a, but the notation there is unclear, and this needs to be incorporated here).

      Thank you, yes please see our response to Reviewer #1.

      (6) Equations 1-3 have three constants (alpha and gamma which were fit to the data, and M0 which was presumably set to 1000). How does the choice of M0 affect the results?

      Thank you, this is a good question. We will test this down to a binary state of M as mentioned in comment #4.

      (7) M decays to near 0 over 40 minutes, abolishing omega turns by the end of the simulations. Are omega turns entirely abolished in worms after 30-40 minutes off food? How do the authors reconcile this decay with the leveling of the turning rate in Figure 1a?

      Yes, Reviewer #1 recommended adding a baseline reorientation rate, which is likely more biologically plausible. However, we should also note that Klein et al. observed a continuous decay over 50 minutes.

      (8) The fit given in Figure 2b does not look convincing. No statistical test was used to compare the two functions (empirical and fit). No error bars were given (to either). These should be added. In the discussion, the authors explain the discrepancy away as experimental limitations. This is not unreasonable, but on the flip side, makes the argument inconclusive. If the authors could model and simulate these limitations, and show that they account for the discrepancies with the data, the model would be much more compelling. To do this, I would imagine that the authors would need to take the output of their model (lists of turning times) and convert them into simulated trajectories over time. These trajectories could be used to detect boundary events (for a given size of arena), collisions between individuals, etc. in their simulations and to see their effects on the turn statistics.

      Thank you, we will add error bars and perform a permutation test on the mean and variance differences between experiment and model over the 40 minute window.

      (9) The other figures similarly lack any statistical tests and by eye, they do not look convincing. The exception is the 6 anecdotal examples in Figure 2e. Those anecdotal examples match remarkably closely, almost suspiciously so. I'm not sure I understood this though - the caption refers to "different" models of M decay (and at least one of the 6 examples clearly shows a much shallower exponential). If different M models are allowed for each animal, this is no longer parsimonious. Are the results in Figure 2d for a single M model? Can Figure 2e explain the data with a single (stochastic) M model?

      Thank you, yes, we will perform permutation tests on the mean and variance differences in the observed distributions in Figure 2d. We certainly don’t want the panels in Figure 2e to be suspicious! These comparisons were drawn from calculating the correlations between all model traces and all experimental traces, and then choosing the top hits. Every time we run the simulation, we arrive at a different set of examples. Since it was recommended we add a baseline rate, these examples will be a completely different set when we run the simulation again.

      We apologize for the confusion regarding M. Since the worms do not all start out with identical reorientation rates, we drew the initial M value from a distribution centered on M0 and a variance to match the initial distribution of observed experimental rates.

      (10) The left axes of Figure 2e should be reverted to cumulative counts (without the normalization).

      Thank you, we will add this. We want to clarify that we normalized it because we chose these examples based on correlation to show that the same types of sudden changes in search strategy can occur with a model that doesn’t rely on sudden rate changes.

      (11) The authors give an alternative model of a Levy flight, but do not give the obvious alternative models:

      a) the 1-state model in which P(t) = alpha exp (-gamma t) dt (i.e. a single stochastic process, without a hidden M, collapsing equations 1-3 into a single equation).

      b) the originally proposed 2-state model (with 3 parameters, a high turn rate, a low turn rate, and the local-to-global search transition time, which can be taken from the data, or sampled from the empirical probability distributions). Why not? The former seems necessary to justify the more complicated 2-process model, and the latter seems necessary since it's the model they are trying to replace. Including these two controls would allow them to compare the number of free parameters as well as the model results. I am also surprised by the Levy model since Levy is a family of models. How were the parameters of the Levy walk chosen?

      Thank you, we will remove this section completely, as it is tangential to the main point of the paper.
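
      For reference, the single-process alternative in comment (11a), with turning probability alpha·exp(-gamma·t)·dt and no hidden M, can be simulated by thinning a homogeneous Poisson process. The sketch below is illustrative only; the function name and the parameter values in the usage note are assumptions.

```python
import math
import random


def sample_turn_times(alpha, gamma, t_end, rng):
    """Draw turn times from a Poisson process with rate alpha * exp(-gamma * t).

    Uses thinning: candidates are generated at the maximum rate `alpha`
    and each is kept with probability rate(t) / alpha = exp(-gamma * t).
    """
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(alpha)
        if t >= t_end:
            break
        if rng.random() < math.exp(-gamma * t):
            times.append(t)
    return times
```

      With, say, `alpha=2.0` per minute and `gamma=0.1` per minute over 40 minutes, each call returns one animal's list of turn times, which can be compared against the coupled model and the two-state model using the same summary statistics.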

      (12) One point that is entirely missing in the discussion is the individuality of worms. It is by now well known that individual animals have individual behaviors. Some are slow/fast, and similarly, their turn rates vary. This makes this problem even harder. Combined with the tiny number of events concerned (typically 20-40 per experiment), it seems daunting to determine the underlying model from behavioral statistics alone.

      Thank you, yes we should have been more explicit in the reasoning behind drawing the initial M from a distribution (response to comment #9). We assume that not every worm starts out with the same reorientation rate, but that some start out fast (high M) and some start out slow (low M). However, we do assume M decays with the same kinetics, which seems sufficient to produce the observed phenomena.

      (13) That said, it's well-known which neurons underpin the suppression of turning events (starting already with Gray et al 2005, which, strangely, was not cited here). Some discussion of the neuronal predictions for each of the two (or more) models would be appropriate.

      Thank you, yes we will add Gray et al, but also the more detailed response to Reviewer #2.

      (14) An additional point is the reliance entirely on simulations. A rigorous formulation (of the probability distribution rather than just the mean) should be analytically tractable (at least for the first moment, and possibly higher moments). If higher moments are not obtainable analytically, then the equations should be numerically integrable. It seems strange not to do this.

      Thank you for suggesting this, we will add these analyses.
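
      As an indication of what such an analysis looks like in the simplest case, assume the turning rate is the deterministic function alpha·exp(-gamma·t) from comment (11a), which may differ from the manuscript's Equations 1-3. The number of turns Omega(t) is then an inhomogeneous Poisson count with cumulative rate Lambda(t):

```latex
\begin{aligned}
\Lambda(t) &= \int_0^t \alpha e^{-\gamma s}\,ds
            = \frac{\alpha}{\gamma}\left(1 - e^{-\gamma t}\right),\\
P\{\Omega(t) = n\} &= \frac{\Lambda(t)^{n}}{n!}\,e^{-\Lambda(t)},\\
\langle \Omega(t) \rangle &= \operatorname{Var}[\Omega(t)] = \Lambda(t).
\end{aligned}
```

      If the rate is itself driven by a stochastic M, the mean keeps this form whenever the expected rate is alpha·exp(-gamma·t), but the variance becomes the mean plus the variance of the cumulative rate, so it is larger than in the deterministic case.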

      In summary, while sample simulations do nicely match the examples in the data (of discontinuous vs continuous turning rates), this is not sufficient to demonstrate that the transition from ARS to dispersion in C. elegans is, in fact, likely to be a single 'state', or this (eq 1-3) single state. Of course, the model can be made more complicated to better match the data, but the approach of the authors, seeking an elegant and parsimonious model, is in principle valid, i.e. avoiding a many-parameter model-fitting exercise.

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Thank you, we agree that this is a generic phenomenon, which is partly why we did this. The data from López-Cruz et al. seem to agree in part with both Calhoun et al., who claim that abrupt transitions occur, and Klein et al., who claim that they do not. Since the underlying phenomenon is stochastic, we propose that the mixed observations of sudden and gradual changes in search strategy are simply the result of a stochastic process, which can produce both phenomena for individual observations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors propose a new technique which they name "Multi-gradient Permutation Survival Analysis (MEMORY)" that they use to identify "Genes Steadily Associated with Prognosis (GEARs)" using RNA-seq data from the TCGA database. The contribution of this method is one of the key stated aims of the paper. The vast majority of the paper focuses on various downstream analyses that make use of the specific GEARs identified by MEMORY to derive biological insights, with a particular focus on lung adenocarcinoma (LUAD) and breast invasive carcinoma (BRCA) which are stated to be representative of other cancers and are observed to have enriched mitosis and immune signatures, respectively. Through the lens of these cancers, these signatures are the focus of significant investigation in the paper.

      Strengths:

      The approach for MEMORY is well-defined and clearly presented, albeit briefly. This affords statisticians and bioinformaticians the ability to effectively scrutinize the proposed methodology and may lead to further advancements in this field.

      The scientific aspects of the paper (e.g., the results based on the use of MEMORY and the downstream bioinformatics workflows) are conveyed effectively and in a way that is digestible to an individual who is not deeply steeped in the cancer biology field.

      Weaknesses:

      I was surprised that comparatively little of the paper is devoted to the justification of MEMORY (i.e., the authors' method) for the identification of genes that are important broadly for the understanding of cancer. The authors' approach is explained in the methods section of the paper, but no rationale is given for why certain aspects of the method are defined as they are. Moreover, no comparison or reference is made to any other methods that have been developed for similar purposes and no results are shown to illustrate the robustness of the proposed method (e.g., is it sensitive to subtle changes in how it is implemented).

      For example, in the first part of the MEMORY algorithm, gene expression values are dichotomized at the sample median and a log-rank test is performed. This would seemingly result in an unnecessary loss of information for detecting an association between gene expression and survival. Moreover, while dichotomizing at the median is optimal from an information theory perspective (i.e., it creates equally sized groups), there is no reason to believe that median-dichotomization is correct vis-à-vis the relationship between gene expression and survival. If a gene really matters and expression only differentiates survival more towards the tail of the empirical gene expression distribution, median-dichotomization could dramatically lower the power to detect group-wise differences.

      Thanks for these valuable comments! We understand the reviewer’s concern regarding the potential loss of information caused by median-based dichotomization. In this study, we adopted the median as the cut-off value to stratify gene expression levels primarily for the purpose of data balancing and computational simplicity. This approach ensures approximately equal group sizes, which is particularly beneficial in the context of limited sample sizes and repeated sampling. While we acknowledge that this method may discard certain expression nuances, it remains a widely used strategy in survival analysis. To further evaluate and potentially enhance sensitivity, alternative strategies such as percentile-based cutoffs or survival models using continuous expression values (e.g., Cox regression) may be explored in future optimization of the MEMORY pipeline. Nevertheless, we believe that this dichotomization approach offers a straightforward and effective solution for the initial screening of survival-associated genes. We have now included this explanation in the revised manuscript (Lines 391–393).
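
      The per-gene step discussed here, splitting samples at the median expression and comparing survival with a log-rank test, can be sketched as below. This is a generic textbook two-group log-rank test written for illustration, not the MEMORY source code; the function names and the input layout (parallel lists of times, event indicators, and expression values) are assumptions.

```python
import math
from statistics import median


def dichotomize_at_median(expression):
    """Label each sample 1 if its expression is above the sample median, else 0."""
    cut = median(expression)
    return [1 if v > cut else 0 for v in expression]


def logrank_p(times, events, groups):
    """Two-group log-rank test.

    times:  follow-up time per sample
    events: 1 if the event (death) was observed, 0 if censored
    groups: 0/1 group label per sample
    """
    o_minus_e = 0.0
    var = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e and g == 1)
        o_minus_e += d1 - d * n1 / n  # observed minus expected in group 1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = o_minus_e ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

      Replacing `dichotomize_at_median` with a percentile-based split is a one-line change, which is one way to probe the sensitivity to the cut-off that the reviewer raises.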

      Specifically, the authors' rationale for translating the Significant Probability Matrix into a set of GEARs warrants some discussion in the paper. If I understand correctly, for each cancer the authors propose to search for the smallest sample size (i.e., the smallest value of k_{j}) where there is at least one gene with a survival analysis p-value <0.05 for each of the 1000 sampled datasets. I base my understanding on the statement "We defined the sampling size k_{j} reached saturation when the max value of column j was equal to 1 in a significant-probability matrix. The least value of k_{j} was selected". Then, any gene with a p-value <0.05 in 80% of the 1000 sampled datasets would be called a GEAR for that cancer. The 80% value here seems arbitrary but that is a minor point. I acknowledge that something must be chosen. More importantly, do the authors believe this logic will work effectively in general? Presumably, the gene with the largest effect for a cancer will define the value of k_{j}, and, if the effect is large, this may result in other genes with smaller effects not being selected for that cancer by virtue of the 80% threshold. One could imagine that a gene that has a small-to-moderate effect consistently across many cancers may not show up as a GEAR broadly if there are genes with more substantive effects for most of the cancers investigated. I am taking the term "Steadily Associated" very literally here as I've constructed a hypothetical where the association is consistent across cancers but not extremely strong. If by "Steadily Associated" the authors really mean "Relatively Large Association", my argument would fall apart but then the definition of a GEAR would perhaps be suboptimal. In this latter case, the proposed approach seems like an indirect way to ensure there is a reasonable effect size for a gene's expression on survival.

      Thank you for the comment and we apologize for the confusion! 𝐴<sub>𝑖𝑗</sub> refers to the value of gene i under gradient j in the significant-probability matrix, primarily used to quantify the statistical probability of association with patient survival for ranking purposes. We believe that GEARs are among the top-ranked genes, but there is no established metric to define the optimal threshold. An 80% threshold has previously been employed as an empirical standard in studies related to survival estimates [1]. In addition, we acknowledge that the determination of the saturation point 𝑘<sub>𝑗</sub> is influenced by the earliest point at which any gene achieves consistent significance across 1000 permutations. We recognize that this may lead to the under-representation of genes with moderate but consistent effects, especially in the presence of highly significant genes that dominate the statistical landscape. We therefore empirically used 𝐴<sub>𝑖𝑗</sub> > 0.8 as the threshold to distinguish between GEARs and non-GEARs. Of course, varying this parameter may result in the loss of some GEARs or the inclusion of non-GEARs. We also agree that future studies could investigate alternative metrics and more refined thresholds to improve the application of GEARs.

      Regarding the term ‘Steadily Associated’, we define GEARs based on statistical robustness across subsampled survival analyses within individual cancer types, rather than cross-cancer consistency or pan-cancer moderate effects. Therefore, our operational definition of “steadiness” emphasizes within-cancer reproducibility across sampling gradients, which does not necessarily exclude high-effect-size genes. Nonetheless, we agree that future extensions of MEMORY could incorporate cross-cancer consistency metrics to capture genes with smaller but reproducible pan-cancer effects.
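
      The selection rule discussed in this exchange can be written compactly. The sketch below follows the reviewer's reading quoted above (take the first sampling gradient whose column maximum equals 1, then keep genes whose entry in that column exceeds 0.8); it is an illustration rather than the published implementation, and the function name and the toy matrix in the usage note are hypothetical.

```python
def select_gears(sig_prob, gene_names, threshold=0.8):
    """Pick the saturation gradient and the genes passing the threshold there.

    sig_prob[i][j] is the fraction of resamples at gradient j in which
    gene i had a survival p-value below 0.05. Columns are ordered from
    the smallest to the largest sampling size.
    Returns (column index, list of gene names), or (None, []) if no
    column reaches saturation.
    """
    n_cols = len(sig_prob[0])
    for j in range(n_cols):
        column = [row[j] for row in sig_prob]
        if max(column) == 1.0:  # saturation: some gene is always significant
            gears = [g for g, a in zip(gene_names, column) if a > threshold]
            return j, gears
    return None, []
```

      With a toy matrix such as `[[0.6, 1.0, 1.0], [0.3, 0.85, 0.95], [0.1, 0.5, 0.9]]` for genes `["g1", "g2", "g3"]`, saturation is reached at the second gradient and `g3` is excluded even though it passes the threshold at the third. This is the situation the reviewer describes, where the strongest gene fixes the gradient at which weaker genes are judged.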

      The paper contains numerous post-hoc hypothesis tests, statements regarding detected associations and correlations, and statements regarding statistically significant findings based on analyses that would naturally only be conducted in light of positive results from analyses upstream in the overall workflow. Due to the number of statistical tests performed and the fact that the tests are sometimes performed using data-driven subgroups (e.g., the mitosis subgroups), it is highly likely that some of the findings in the work will not be replicable. Of course, this is exploratory science, and it is to be expected that some findings won't replicate (the authors even call for further research into key findings). Nonetheless, I would encourage the authors to focus on the quantification of evidence regarding associations or claims (i.e., presenting effect estimates and uncertainty intervals), but to avoid the use of the term statistical significance owing to there being no clear plan to control type I error rates in any systematic way across the diverse analyses that were performed.

      Thank you for the comment! We agree that rigorous control of type-I error is essential once a definitive list of prognostic genes is declared. The current implementation of MEMORY, however, is deliberately positioned as an exploratory screening tool: each gene is evaluated across 10 sampling gradients and 1,000 resamples per gradient, and the only quantity carried forward is its reproducibility probability (𝐴<sub>𝑖𝑗</sub>).

      Because these probabilities are derived from aggregate “votes” rather than single-pass P-values, the influence of any one unadjusted test is inherently diluted. In other words, whether or not a per-iteration BH adjustment is applied does not materially affect the ranking of genes by reproducibility, which is the key output at this stage. However, we also recognize that a clinically actionable GEARs catalogue will require extensive, large-scale multiple-testing adjustments. Accordingly, future versions of MEMORY will embed a dedicated false-positive control framework tailored to the final GEARs list before any translational application. We have added this point in the ‘Discussion’ in the revised manuscript (Lines 350-359).
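
      For readers unfamiliar with the adjustment mentioned here, a per-iteration Benjamini-Hochberg (BH) step applied to the p-values of all genes within one resample would look like the sketch below. It is a standard BH implementation given for illustration; MEMORY as described does not apply it, and the function name is an assumption.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

      Counting how often each gene's adjusted value falls below 0.05 across resamples, rather than its raw p-value, would give an adjusted version of the reproducibility probability that could be compared against the unadjusted ranking.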

      A prespecified analysis plan with hypotheses to be tested (to the extent this was already produced) and a document that defines the complete scope of the scientific endeavor (beyond that which is included in the paper) would strengthen the contribution by providing further context on the totality of the substantial work that has been done. For example, the focus on LUAD and BRCA due to their representativeness could be supplemented by additional information on other cancers that may have been investigated similarly but where results were not presented due to lack of space.

      We thank the reviewer for requesting greater clarity on the analytic workflow. The MEMORY pipeline was fully specified before any results were examined and is described in ‘Methods’ (Lines 386–407). By contrast, the pathway-enrichment and downstream network/mutation analyses were deliberately exploratory: their exact content necessarily depended on which functional categories emerged from the unbiased GEAR screen.

      Our screen revealed a pronounced enrichment of mitotic signatures in LUAD and immune signatures in BRCA.

      We then chose these two cancer types for deeper “case-study” analysis because they contained the largest sample sizes among all cancers showing mitotic- or immune-dominated GEAR profiles, and thus provided the greatest statistical power for follow-up investigations. We have added this explanation to the revised manuscript (Lines 163, 219-220).

      Reviewer #2 (Public review):

      Summary:

      The authors are trying to come up with a list of genes (GEAR genes) that are consistently associated with cancer patient survival based on TCGA database. A method named "Multi-gradient Permutation Survival Analysis" was created based on bootstrapping and gradually increasing the sample size of the analysis. Only the genes with consistent performance in this analysis process are chosen as potential candidates for further analyses.

      Strengths:

      The authors describe in detail their proposed method and the list of the chosen genes from the analysis. The scientific meaning and potential values of their findings are discussed in the context of published results in this field.

      Weaknesses:

      Some steps of the proposed method (especially the definition of survival analysis similarity (SAS)) need further clarification or details since it would be difficult if anyone tries to reproduce the results. In addition, the multiplicity (a large number of p-values are generated) needs to be discussed and/or the potential inflation of false findings needs to be part of the manuscript.

      Thank you for the reviewer’s insightful comments. Accordingly, in the revised manuscript, we have provided a more detailed explanation of the definition and calculation of Survival-Analysis Similarity (SAS) to ensure methodological clarity and reproducibility (Lines 411-428), and the full code is now publicly available on GitHub (https://github.com/XinleiCai/MEMORY). We have also expanded the ‘Discussion’ to clarify our position on false-positive control: future releases of MEMORY will incorporate a dedicated framework to control false discoveries in the final GEARs catalogue, which will itself be subjected to rigorous, large-scale multiple-testing adjustment.

      If the authors can improve the clarity of the proposed method and there is no major mistake there, the proposed approach can be applied to other diseases (assuming TCGA type of data is available for them) to identify potential gene lists, based on which drug screening can be performed to identify potential target for development.

      Thank you for the suggestion. All source code has now been made publicly available on GitHub for reference and reuse. We agree that the GEAR lists produced by MEMORY hold considerable promise for drug-screening and target-validation efforts, and the framework could be applied to any disease with TCGA-type data. Of course, we also note that the current GEAR catalogue should first undergo rigorous, large-scale multiple-testing correction to further improve its precision before broader deployment.

      Reviewer #3 (Public review):

      Summary:

      The authors describe a valuable method to find gene sets that may correlate with a patient's survival. This method employs iterative tests of significance across randomised samples with a range of proportions of the original dataset. Those genes that show significance across a range of samples are chosen. Based on these gene sets, hub genes are determined from similarity scores.

      Strengths:

      MEMORY allows them to assess the correlation between a gene and patient prognosis using any available transcriptomic dataset. They present several follow-on analyses and compare the gene sets found to previous studies.

      Weaknesses:

      Unfortunately, the authors have not included sufficient details for others to reproduce this work or use the MEMORY algorithm to find future gene sets, nor to take the gene findings presented forward to be validated or used for future hypotheses.

      Thank you for the reviewer’s comments! We apologize for the inconvenience and the lack of details.

      Following the reviewer’s valuable suggestion, we have now made all source code and relevant scripts publicly available on GitHub to ensure full reproducibility and facilitate future use of the MEMORY algorithm for gene discovery and hypothesis generation.

      Reviewer #4 (Public review):

      The authors apply what I gather is a novel methodology titled "Multi-gradient Permutation Survival Analysis" to identify genes that are robustly associated with prognosis ("GEARs") using tumour expression data from 15 cancer types available in the TCGA. The resulting lists of GEARs are then interrogated for biological insights using a range of techniques including connectivity and gene enrichment analysis.

      I reviewed this paper primarily from a statistical perspective. Evidently, an impressive amount of work has been conducted, and concisely summarised, and great effort has been undertaken to add layers of insight to the findings. I am no stranger to what an undertaking this would have been. My primary concern, however, is that the novel statistical procedure proposed, and applied to identify the gene lists, as far as I can tell offers no statistical error control or quantification. Consequently, we have no sense of what proportion of the highlighted GEAR genes and networks are likely to just be noise.

      Major comments:

      (1) The main methodology used to identify the GEAR genes, "Multi-gradient Permutation Survival Analysis" does not formally account for multiple testing and offers no formal error control. Meaning we are left with no understanding of what the family-wise (aka type 1) error rate is among the GEAR lists, nor the false discovery rate. I would generally recommend against the use of any feature selection methodology that does not provide some form of error quantification and/or control because otherwise we do not know if we are encouraging our colleagues and/or readers to put resources into lists of genes that contain more noise than not. There are numerous statistical techniques available these days that offer error control, including for lists of p-values from arbitrary sets of tests (see expansion on this and some review references below).

      Thank you for your thoughtful and important comment! We fully agree that controlling type I error is critical when identifying gene sets for downstream interpretation or validation. As an exploratory study, our primary aim was to define and screen for GEARs by using the MEMORY framework; however, we acknowledge that the current implementation of MEMORY does not include a formal procedure for error control. Given that MEMORY relies on repeated sampling and counts the frequency of statistically significant p-values, applying standard p-value–based multiple-testing corrections at the individual test level would not meaningfully reduce the false-positive rate in this framework.

      We believe that error control should instead be applied at the level of the final GEAR catalogue. However, we also recognize that conventional correction methods are not directly applicable. In future versions of MEMORY, we plan to incorporate a dedicated and statistically appropriate false-positive control module tailored specifically to the aggregated outputs of the pipeline. We have clarified this point explicitly in the revised manuscript. (Lines 350-359)

      (2) Similarly, no formal significance measure was used to determine which of the strongest "SAS" connections to include as edges in the "Core Survival Network".

      We agree that the edges in the Core Survival Network (CSN) were selected based on the top-ranked SAS values rather than formal statistical thresholds. This was a deliberate design choice, as the CSN was intended as a heuristic similarity network to prioritize genes for downstream molecular classification and biological exploration, not for formal inference. To address potential concerns, we have clarified this intent in the revised manuscript, and we now explicitly state that the network construction was based on empirical ranking rather than statistical significance (Lines 422-425).

      (3) There is, as far as I could tell, no validation of any identified gene lists using an independent dataset external to the presently analysed TCGA data.

      Thank you for the comment. We acknowledge that no independent external dataset was used in the present study to validate the GEARs lists. However, the primary aim of this work was to systematically identify and characterize genes with robust prognostic associations across cancer types using the MEMORY framework. To assess the biological relevance of the resulting GEARs, we conducted extensive downstream analyses including functional enrichment, mutation profiling, immune infiltration comparison, and drug-response correlation. These analyses were performed across multiple cancer types and further supported by a wide range of published literature.

      We believe that this combination of functional characterization and literature validation provides strong initial support for the robustness and relevance of the GEARs lists. Nonetheless, we agree that validation in independent datasets is an important next step, and we plan to carry this out in future work to further strengthen the clinical application of MEMORY.

      (4) There are quite a few places in the methods section where descriptions were not clear (e.g. elements of matrices referred to without defining what the columns and rows are), and I think it would be quite challenging to re-produce some aspects of the procedures as currently described (more detailed notes below).

      We apologize for the confusion. In the revised manuscript, we have provided a clearer and more detailed description of the computational workflow of MEMORY to improve clarity and reproducibility.

      (5) There is a general lack of statistical inference offered. For example, throughout the gene enrichment section of the results, I never saw it stated whether the pathways highlighted are enriched to a significant degree or not.

      We apologize for not clearly stating this information in the original manuscript. In the revised manuscript, we have updated the figure legend to explicitly report the statistical significance of the enriched pathways (Lines 870, 877, 879-880).

      Reviewer #1 (Recommendations for the authors):

      Overall, the paper reads well but there are numerous small grammatical errors that at times cost me non-trivial amounts of time to understand the authors' key messages.

      We apologize for the grammatical errors that hindered clarity. In response, we have thoroughly revised the manuscript for grammar, spelling, and overall language quality.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      (1) Line 427: survival analysis similarity (SAS) definition. Any reference on this definition and why it is defined this way? Can the SAS value be negative? Based on line 429 definition, if A and B are exactly the same, SAS ~ 1; completely opposite, SAS =0; otherwise, SAS could be any value, positive or negative. So it is hard to tell what SAS is measuring. It is important to make sure SAS can measure the similarity in a systematic and consistent way since it is used as input in the following network analysis.

      We apologize for the confusion caused by the ambiguity in the original SAS formula. The SAS metric was inspired by the Jaccard index, but we modified the denominator to increase contrast between gene pairs. Specifically, the numerator counts the number of permutations in which both genes are simultaneously significant (i.e., both equal to 1), while the denominator is the sum of the total number of significant events for each gene minus twice the shared significant count. An additional +1 term was included in the denominator to avoid division by zero. This formulation ensures that SAS is always non-negative and bounded between 0 and 1, with higher values indicating greater similarity. We have clarified this definition and updated the formula in the revised manuscript (Lines 405-425). 
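
As a rough illustration of the verbal description above, the following Python sketch computes a similarity score from two 0/1 significance vectors. It is one literal reading of that description, not the formula from the revised manuscript: the function name `sas` and the toy vectors are hypothetical, and the denominator is taken to be the two per-gene totals minus twice the shared count, plus 1.

```python
def sas(sig_a, sig_b):
    """Similarity of two 0/1 significance vectors (one entry per permutation).

    Assumed reading of the description (not the manuscript's exact formula):
        shared / (total_a + total_b - 2 * shared + 1)
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("vectors must cover the same permutations")
    # Permutations in which both genes are significant at the same time.
    shared = sum(1 for a, b in zip(sig_a, sig_b) if a == 1 and b == 1)
    total_a = sum(sig_a)
    total_b = sum(sig_b)
    # The +1 keeps the denominator positive even for identical vectors.
    return shared / (total_a + total_b - 2 * shared + 1)


# Hypothetical significance patterns over five permutations.
gene_a = [1, 1, 0, 0, 1]
gene_b = [1, 0, 0, 1, 1]
print(sas(gene_a, gene_b))  # 2 shared events, denominator 3 + 3 - 4 + 1 = 3
```

Under this literal reading the score is not guaranteed to stay below 1 when two vectors overlap heavily, so the formula in the revised manuscript should be treated as authoritative.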

      (2) For the method with high dimensional data, multiplicity adjustment needs to be discussed, but it is missing in the manuscript. A 5% p-value cutoff was used across the paper, which seems to be too liberal in this type of analysis. The suggestion is to either use a lower cutoff value or use False Discovery Rate (FDR) control methods for such adjustment. This will reduce the length of the gene list and may help with a more focused discussion.

      We appreciate the reviewer’s suggestion regarding multiplicity. MEMORY is intentionally positioned as an exploratory screen: each gene is tested across 10 sampling gradients with 1,000 resamples each, and only its reproducibility probability (𝐴<sub>𝑖𝑗</sub>) is retained. Because this metric is an aggregate of 1,000 “votes”, the influence of any single unadjusted P-value is already strongly diluted; adding a per-iteration BH/FDR step therefore has negligible impact on the reproducibility ranking that drives all downstream analyses.

      That said, we recognize that a clinically actionable GEARs catalogue must undergo formal, large-scale multiple-testing correction. Future releases of MEMORY will incorporate an error control module applied to the consolidated GEAR list before any translational use. We have now added a statement to this effect in the revised manuscript (Lines 350-359).

      (3) To allow reproducibility from others, please include as many details as possible (software, parameters, modules etc.) for the analyses performed in different steps.

      All source code is now publicly available on GitHub. We have also added the GitHub address to the Online Content section.

      Minor comments or queries:

      (4) The manuscript needs to be polished to fix grammar, incomplete sentences, and missing figures.

      Thank you for the suggestion. We have thoroughly proofread the manuscript to correct grammar, complete any unfinished sentences, and restore or renumber all missing figure panels. All figures are now properly referenced in the text.

      (5) Line 131: "survival probability of certain genes" seems to be misleading. Are you talking about its probability of associating with survival (or prognosis)?

      Sorry for the oversight. What we mean is the probability that a gene is found to be significantly associated with survival across the 1,000 resamples. We have revised the statement to “significant probability of certain genes” (Line 102).

      (6) Lines 132, 133: "remained consistent": does the score just need to stay > 0.8 as the sample size increases, or does it need to be monotonically non-decreasing?

      We mean that the score stays above 0.8. We understand that “remained consistent” is confusing and have now revised it to “remained above 0.8”.

      (7) Lines 168-170 how can supplementary figure 5A-K show "a certain degree of correlation with cancer stages"?

      Sorry for the confusion! We have now revised Supplementary Figure 5A–K to support the visual impression with formal statistics. For each cancer type, we built a contingency table of AJCC stage (I–IV) versus hub-gene subgroup (Low, Mid, High) and applied Pearson’s χ<sup>2</sup> test (Monte Carlo approximation with 10⁵ replicates when any expected cell count was < 5). The χ<sup>2</sup> statistic and p-value are printed beneath every panel; eight of the eleven cancers show a significant association (p-value < 0.05), while LUSC, THCA and PAAD do not. We have replaced the vague phrase “a certain degree of correlation” with this explicit statistical statement in the revised manuscript (Lines 141-143).
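
The stage-versus-subgroup test described above can be sketched in plain Python as follows. This is an illustrative stand-in, not the authors' R code: the function names, the toy labels, and the default of 2,000 replicates (the response reports 10⁵) are assumptions. The Monte Carlo p-value is obtained by shuffling one set of labels, which keeps both table margins fixed.

```python
import random
from collections import Counter


def chi_square_stat(rows, cols):
    """Pearson chi-square statistic for two parallel lists of category labels."""
    n = len(rows)
    row_tot = Counter(rows)
    col_tot = Counter(cols)
    observed = Counter(zip(rows, cols))
    stat = 0.0
    for r, r_n in row_tot.items():
        for c, c_n in col_tot.items():
            expected = r_n * c_n / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def monte_carlo_p(rows, cols, replicates=2000, seed=0):
    """Monte Carlo p-value: shuffle one label list so both margins stay fixed."""
    rng = random.Random(seed)
    observed = chi_square_stat(rows, cols)
    shuffled = list(cols)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)
        if chi_square_stat(rows, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (replicates + 1)


# Hypothetical labels for ten patients: tumour stage and hub-gene subgroup.
stage = ["I", "I", "I", "II", "II", "III", "III", "IV", "IV", "IV"]
subgroup = ["Low", "Low", "Mid", "Low", "Mid", "Mid", "High", "High", "High", "Mid"]
print(chi_square_stat(stage, subgroup), monte_carlo_p(stage, subgroup))
```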

      (8) Lines 172-174: since the hub genes are a subset of GEAR genes through CSN construction, the consistency is not a surprise. Is there any explanation for why PAAD is shown only in GOEA with GEARs but not with hub genes?

      Thanks for raising this interesting point! In PAAD, the Core Survival Network is unusually diffuse: the top-ranked SAS edges are distributed broadly rather than converging on a single dense module. Because of this flat topology, the ten highest-degree nodes (our hub set) do not form a tightly interconnected cluster, nor are they collectively enriched in the mitosis-related pathway that dominates the full GEAR list. This might explain why the mitotic enrichment is evident when all PAAD GEARs are analyzed but not when the analysis is confined to the far smaller, and more functionally dispersed, hub-gene subset.

      (9) Line 191: how was the classification performed? Tool? Cutoff values, etc.?

      The hub-gene-based molecular classification was performed in R using hierarchical clustering. Briefly, we extracted the 𝑙𝑜𝑔<sub>2</sub>(𝑇𝑃𝑀 +1) expression matrix of hub genes, computed Euclidean distances between samples, and applied Ward’s minimum variance method (hclust, method = "ward.D2"). The resulting dendrogram was then divided into three groups (cutree, k = 3), corresponding to low, mid, and high expression classes. These parameters were selected based on visual inspection of clustering structure across cancer types. We have added this information to the revised ‘Methods’ section (Lines 439-443).
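
For readers without R at hand, the clustering step can be approximated with the pure-Python sketch below. It is not the authors' `hclust`/`cutree` code: it greedily merges the two clusters whose union adds the least within-cluster sum of squares (Ward's criterion) and stops when `k` clusters remain. The function name `ward_clusters` and the TPM values are hypothetical.

```python
import math


def ward_clusters(samples, k=3):
    """Greedy Ward agglomeration of expression vectors into k clusters."""
    clusters = [[i] for i in range(len(samples))]
    dim = len(samples[0])

    def centroid(members):
        return [sum(samples[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b were merged.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        _, i, j = min(
            ((merge_cost(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected

    labels = [0] * len(samples)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Hypothetical TPM values for two hub genes in six samples.
tpm = [[0.5, 1.0], [0.8, 1.2], [20.0, 30.0], [25.0, 28.0], [400.0, 500.0], [420.0, 480.0]]
log_expr = [[math.log2(value + 1) for value in row] for row in tpm]
print(ward_clusters(log_expr, k=3))
```

The returned labels are arbitrary cluster indices; mapping them to low, mid, and high expression classes would require an extra step, such as ordering clusters by mean expression.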

      (10) Lines 210-212: any statistics to support the conclusion? The bar chart of Figure 3B seems to support that all mutations favor ML & MM.

      We agree that formal statistical support is important for interpreting groupwise comparisons. In this case, however, several of the driver events, such as ROS1 and ERBB2, had very small subgroup counts, which violate the assumptions of Pearson’s χ<sup>2</sup> test. While we explored χ<sup>2</sup> and Fisher’s exact tests, the results were unstable due to sparse counts. Therefore, we chose to present these distributions descriptively to illustrate the observed subtype preferences across different driver mutations (Figure 3B). We have revised the manuscript text to clarify this point (Lines 182-188).

      (11) Line 216: should supplementary Figure 6H-J be "6H-I"?

      We apologize for the mistake. We have corrected it in the revised manuscript.

      (12) Line 224: incomplete sentence starting with "To further the functional... ".

      Thanks! We have revised the sentence, which now reads: “To further explore the functional implications of these mutations, we enriched them using a pathway system called Nested Systems in Tumors (NeST)”.

      (13) Lines 261-263: it is better to report the median instead of the mean. Use log scale data for analysis or use non-parametric methods due to the long tail of the data.

      Thank you for the very helpful suggestion. In the revised manuscript, we now report the median instead of the mean to better reflect the distribution of the data. In addition, we have applied log-scale transformation where appropriate and replaced the original statistical tests with non-parametric Wilcoxon rank-sum tests to account for the long-tailed distribution. These changes have been implemented in both the main text and figure legends (Lines 234–237, Figure 5F).
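
A minimal sketch of the median-plus-rank-sum approach is given below. It uses a tie-corrected normal approximation without continuity correction, so p-values may differ slightly from those of standard statistical packages, particularly for small samples. The function name and the toy values are hypothetical, and the function assumes the pooled values are not all identical.

```python
import math
from statistics import median


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (U statistic for x, p-value)."""
    pooled = sorted((value, group) for group, data in enumerate((x, y)) for value in data)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j shared by tied values
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_value


# Hypothetical long-tailed measurements in two groups.
group_a = [0.2, 0.4, 0.5, 0.9, 12.0]
group_b = [1.5, 2.2, 3.1, 8.0, 40.0]
print(median(group_a), median(group_b), rank_sum_test(group_a, group_b))
```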

      (14) Line 430: why based on the first sampling gradient, i.e. k_1 instead of the k_j selected? Or do you mean k_j here?

      Thanks for this question! We deliberately based SAS on the vectors from the first sampling gradient (𝑘<sub>1</sub>, ≈ 10% of the cohort). At this smallest sample size, the binary significance patterns still contain substantial variation, and many genes are not significant in every permutation. Based on this, we think the measure can meaningfully identify gene pairs that behave concordantly throughout the gradient permutation.

      We have now added a sentence to clarify this in the Methods section (Lines 398–403).

      (15) Need clarification on how the significant survival network was built.

      Thank you for pointing this out. We have now provided a more detailed clarification of how the Survival-Analysis Similarity (SAS) metric was defined and applied in constructing the core survival network (CSN), including the rationale for key parameter choices (Lines 409–430). Additionally, we have made full source code publicly available on GitHub to facilitate transparency and reproducibility (https://github.com/XinleiCai/MEMORY).

      (16) Line 433: what defines the "significant genes" here? Are they the same as GEAR genes? And what are total genes, all the genes?

      We apologize for the inconsistency in terminology, which may have caused confusion. In this context, “significant genes” refers specifically to the GEARs (Genes Steadily Associated with Prognosis). The SAS values were calculated between each GEAR and all genes. We have revised the manuscript to clarify this by consistently using the term “GEARs” throughout.

      (17) Line 433: more detail on how SAS values were used will be helpful. For example, were pairwise SAS values fed into Cytoscape as an additional data attribute (on top of what is available in TCGA) or as the only data attribute for network building?

      The SAS values were used as the sole metric for defining connections (edges) between genes in the construction of the core survival network (CSN). Specifically, we calculated pairwise SAS values between each GEAR and all other genes, then selected the top 1,000 gene pairs with the highest SAS scores to construct the network. No additional data attributes from TCGA (such as expression levels or clinical features) were used in this step. These selected pairs were imported into Cytoscape solely based on their SAS values to visualize the CSN.

      (18) Line 434: what is "ranking" here, by degree? Is it the same as "nodes with top 10 degrees" at line 436?

      The “ranking” refers specifically to the SAS values between gene pairs. The top 1,000 ranked SAS values were selected to define the edges used in constructing the Core Survival Network (CSN).

      Once the CSN was built, we calculated the degree (number of connections) for each node (i.e., each gene). The “top 10 degrees” mentioned on Line 421 refers to the 10 genes with the highest node degrees in the CSN. These were designated as hub genes for downstream analyses.

      We have clarified this distinction in the revised manuscript (Lines 398-403).

      (19) Line 435: was the network built in Cytoscape? Or built with other tool first and then visualized in Cytoscape?

      The network was constructed in R by selecting the top 1,000 gene pairs with the highest SAS values to define the edges. This edge list was then imported into Cytoscape solely for visualization purposes. No network construction or filtering was performed within Cytoscape itself. We have clarified this in the revised ‘Methods’ section (Lines 424-425).

      (20) Line 436: the degree of each node was calculated, what does it mean by "degree" here and is it the same as the number of edges? How does it link to the "higher ranked edges" in Line 165?

      The “degree” of a node refers to the number of edges connected to that node—a standard metric in graph theory used to quantify a node’s centrality or connectivity in the network. It is equivalent to the number of edges a gene shares with others in the CSN.

      The “higher-ranked edges” refer to the top 1,000 gene pairs with the highest SAS values, which we used to construct the Core Survival Network (CSN). The degree for each node was computed within this fixed network, and the top 10 nodes with the highest degree were selected as hub genes. Therefore, the node degree is largely determined by this pre-defined edge set.
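
The edge-selection and hub-gene steps described in these responses can be sketched as follows. This is an illustration in Python rather than the authors' R code; the function name `build_csn`, the gene labels, and the scores are hypothetical, while the defaults of 1,000 edges and 10 hubs follow the numbers stated above.

```python
from collections import Counter


def build_csn(pair_scores, n_edges=1000, n_hubs=10):
    """Keep the highest-scoring gene pairs as edges and rank nodes by degree.

    pair_scores maps (gene_a, gene_b) tuples to a similarity score such as SAS.
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    edges = [pair for pair, _ in ranked[:n_edges]]
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    hubs = [gene for gene, _ in degree.most_common(n_hubs)]
    return edges, degree, hubs


# Hypothetical scores for four gene pairs; keep the top three as edges.
scores = {("G1", "G2"): 0.9, ("G1", "G3"): 0.8, ("G2", "G3"): 0.1, ("G1", "G4"): 0.7}
edges, degree, hubs = build_csn(scores, n_edges=3, n_hubs=1)
print(edges, dict(degree), hubs)
```

Because the edge set is fixed before degrees are counted, a gene's degree depends only on how often it appears among the retained pairs, which matches the point made in the response.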

      (21) Line 439: does it mean only 1000 SAS values were used or SAS values from 1000 genes, which should come up with 1000 choose 2 pairs (~ half million SAS values).

      We computed the SAS values between each GEAR gene and all other genes, resulting in a large number of pairwise similarity scores. Among these, we selected the top 1,000 gene pairs with the highest SAS values, regardless of how many unique genes were involved, to define the edges in the Core Survival Network (CSN). In other words, the network is constructed from the top 1,000 SAS-ranked gene pairs, not from all possible combinations among 1,000 genes (which would result in nearly half a million pairs). This approach yields a sparse network focused on the strongest co-prognostic relationships.

      We have clarified this in the revised ‘Methods’ section (Lines 409–430).

      (22) Line 496: what tool is used and what are the parameters set for hierarchical clustering if someone would like to reproduce the result?

      The hierarchical clustering was performed in R using the hclust function with Ward's minimum variance method (method = "ward.D2"), based on Euclidean distance computed from the log-transformed expression matrix (𝑙𝑜𝑔<sub>2</sub>(𝑇𝑃𝑀 +1)). Cluster assignment was done using the cutree function with k = 3 to define low, mid, and high expression subgroups. These settings have now been explicitly stated in the revised ‘Methods’ section (Lines 439–443) to facilitate reproducibility.

      (23) Lines 901-909: Figure 4 missing panel C. Current panel C seems to be the panel D in the description.

      We apologize for the oversight and have now made the correction (Line 893).

      (24) Lines 920-928: Figure 6C: consider a higher bar to define "significant".

      We agree that applying a more stringent cutoff (e.g., p < 0.01) may reduce potential false positives. However, given the exploratory nature of this study, we believe the current threshold remains appropriate for the purpose of hypothesis generation.

      Reviewer #3 (Recommendations for the authors):

      (1) The title says the genes that are "steadily" associated are identified, but what you mean by the word "steadily" is not defined in the manuscript. Perhaps this could mean that they are consistently associated in different analyses, but multiple analyses are not compared.

      In our manuscript, “steadily associated” refers to genes that consistently show significant associations with patient prognosis across multiple sample sizes and repeated resampling within the MEMORY framework (Lines 65–66). Specifically, each gene is evaluated across 10 sampling gradients (from ~10% to 100% of the cohort) with 1,000 permutations at each level. A gene is defined as a GEAR if its probability of being significantly associated with survival remains ≥ 0.8 throughout the whole permutation process. This stability in signal under extensive resampling is what we refer to as “steadily associated.”

      (2) I think the word "gradient" is not appropriately used as it usually indicates a slope or a rate of change. It seems to indicate a step in the algorithm associated with a sampling proportion.

      Thank you for pointing out the potential ambiguity in our use of the term “gradient.” In our study, we used “gradient” to refer to stepwise increases in the sample proportion used for resampling and analysis. We have now revised it to “progressive”.

      (3) Make it clear that the name "GEARs" is introduced in this publication.

      Done.

      (4) Sometimes the document is hard to understand, for example, the sentence, "As the number of samples increases, the survival probability of certain genes gradually approaches 1." It does not appear to be calculating "gene survival probability" but rather a gene's association with patient survival. Or is it that as the algorithm progresses genes are discarded and therefore do have a survival probability? It is not clear.

      What we intended to describe is the probability that a gene is judged significant in the 1,000 resamples at a given sample-size step, that is, its reproducibility probability in the MEMORY framework. We have now revised the description (Lines 101-104).

      (5) The article lacks significant details, like the type of test used to generate p-values. I assume it is the log-rank test from the R survival package. This should be explicitly stated. It is not clear why the survminer R package is required or what function it has. Are the p-values corrected for multiple hypothesis testing at each sampling?

      We apologize for the lack of details. In each sampling iteration, we used the log-rank test (implemented via the survdiff function in the R survival package) to evaluate the prognostic association of individual genes. This information has now been explicitly added to the revised manuscript.
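
For orientation, the standard two-group log-rank statistic can be written in a few lines of plain Python. This sketch is a stand-in for, not a copy of, the `survdiff` call mentioned above; the function name, the argument layout, and the toy data are assumptions. How patients are split into two groups per gene is not specified in this response, so the grouping is left to the caller.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times  : follow-up time per patient
    events : 1 if the death was observed, 0 if censored
    groups : 0 or 1 per patient (for example low vs high expression of a gene)
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [(tt, e, g) for tt, e, g in zip(times, events, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for tt, e, _ in at_risk if tt == t and e == 1)
        d1 = sum(1 for tt, e, g in at_risk if tt == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 death count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Hypothetical cohort: group 1 dies early, group 0 later; no censoring.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(logrank_test(times, events, groups))
```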

      The survminer package was originally included for visualization purposes, such as plotting illustrative Kaplan–Meier curves. However, since it did not contribute to the core statistical analysis, we have now removed this package from the Methods section to avoid confusion (Lines 386-407).

      As for multiple-testing correction, we did not adjust p-values in each iteration, because the final selection of GEARs is based on the frequency with which a gene is found significant across 1,000 resamples (i.e., its reproducibility probability). Classical FDR corrections at the per-sample level do not meaningfully affect this aggregate metric. That said, we fully acknowledge the importance of multiple-testing control for the final GEARs catalogue. Future versions of the MEMORY framework will incorporate appropriate adjustment procedures at that stage.

      (6) It is not clear what the survival metric is. Is it overall survival (OS) or progression-free survival (PFS), which would be common choices?

      It’s overall survival (OS).

      (7) The treatment of the patients is never considered, nor whether the sequencing was performed pre- or post-treatment. The patient's survival will be impacted by the treatment that they receive, and many other factors like comorbidities, not just the genomics.

      We initially thought that no genes would be significantly associated with patient survival (GEARs) unless the many different influential factors were taken into account. This is exactly what motivated us to develop MEMORY. However, this work shows that we were wrong, and it demonstrates the real power of GEARs in determining patient survival. Of course, we fully agree with the reviewer that incorporating therapy variables and other clinical covariates will further improve the power of MEMORY analyses.

      (8) As a paper that introduces a new analysis method, it should contain some comparison with existing state of the art, or perhaps randomised data.

      Our understanding is that MEMORY is an exploratory, proof-of-concept framework, so a direct comparison with conventional survival analyses does not seem appropriate. We have added some discussion of this in the revised manuscript (Lines 350-359).

      (9) In the discussion it reads, "it remains uncertain whether there exists a set of genes steadily associated with cancer prognosis, regardless of sample size and other factors." Of course, there are many other factors that may alter the consistency of important cancer genes, but sample size is not one of them. Sample size merely determines whether your study has sufficient power to detect certain gene effects, it does not affect whether genes are steadily associated with cancer prognosis in different analyses. (Of course, this does depend on what you mean by "steadily".)

      We fully agree with the reviewer that sample size itself does not alter a gene’s biological association with prognosis; it only affects the statistical power to detect that association. Because this study is exploratory and we were initially uncertain whether GEARs existed, we first examined the impact of sample-size variation (a dominant yet experimentally tractable source of heterogeneity) before considering other, less controllable factors.

      Reviewer #4 (Recommendations for the authors):

      Other more detailed comments:

      (1) Introduction

      L93: When listing reasons why genes do not replicate across different cohorts / datasets, there is also the simple fact that some could be false positives

      We totally agree that some genes may simply represent false-positive findings apart from biological heterogeneity and technical differences between cohorts. Although the MEMORY framework reduces this risk by requiring high reproducibility across 1,000 resamples and multiple sample-size tiers, it cannot eliminate false positives completely. We have added some discussion and explicitly note that external validation in independent datasets is essential for confirming any GEAR before clinical application.

      (2) Results Section

      L143: Language like "We also identified the most significant GEARs in individual cancer types" I think is potentially misleading since the "GEAR" lists do not have formal statistical significance attached.

      We removed “significant” and revised it to “top 1” (Line 115).

      L153 onward: The pathway analysis results reported do not include any measures of how statistically significant the enrichment was.

      We have now updated the figure legends to clearly indicate that the displayed pathways represent the top significantly enriched results based on adjusted p-values from GO enrichment analyses (Lines 876-878).

      L168: "A certain degree of correlation with cancer stages (TNM stages) is observed in most cancer types except for COAD, LUSC and PRAD". For statements like this statistical significance should be mentioned in the same sentence or, if these correlations failed to reach significance, that should be explicitly stated.

      In the revised Supplementary Figure 5A–K, we now accompany the visual trends with formal statistical testing. Specifically, for each cancer type, we constructed a contingency table of AJCC stage (I–IV) versus hub-gene subgroup (Low, Mid, High) and applied Pearson’s χ<sup>2</sup> test (using Monte Carlo approximation with 10⁵ replicates if any expected cell count was < 5). The resulting χ<sup>2</sup> statistic and p-value are printed beneath each panel. Of the eleven cancer types analyzed, eight showed statistically significant associations (p < 0.05), while COAD, LUSC, and PRAD did not. Accordingly, we have made the revision in the manuscript (Lines 137-139).

      L171-176: When mentioning which pathways are enriched among the gene lists, please clarify whether these levels of enrichment are statistically significant or not. If the enrichment is significant, please indicate to what degree, and if not I would not mention.

      We agree that the statistical significance of pathway enrichment should be clearly stated and have made the revision throughout the manuscript (Lines 869, 875, 877).

      (3) Methods Section

      L406 - 418: I did not really understand, nor see it explained, what is the motivation and value of cycling through 10%, 20% bootstrapped proportions of patients in the "gradient" approach? I did not see this justified, or motivated by any pre-existing statistical methodology/results. I do not follow the benefit compared to just doing one analysis of all available samples, and using the statistical inference we get "for free" from the survival analysis p-values to quantify sampling uncertainty.

      The ten step-wise sample fractions (10 % to 100 %) allow us to transform each gene’s single log-rank P-value into a reproducibility probability: at every fraction we repeat the test 1,000 times and record the proportion of permutations in which the gene is significant. This learning-curve-style resampling not only quantifies how consistently a gene associates with survival under different power conditions but also produces the 0/1 vectors required to compute Survival-Analysis Similarity (SAS) and build the Core Survival Network. A single one-off analysis on the full cohort would yield only one P-value per gene, providing no binary vectors at all—hence no basis for calculating SAS or constructing the network. 
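
The resampling scheme described above can be sketched generically as follows. Several details are assumptions rather than statements about MEMORY: patients are drawn without replacement at each fraction (`random.sample`; a bootstrap would use `random.choices` instead), and the per-gene test is passed in as a hypothetical callback `is_significant` that returns `True` when the test on a subset gives p < 0.05.

```python
import random


def reproducibility_profile(patients, is_significant, fractions=None,
                            n_resamples=1000, seed=0):
    """Share of resamples in which a gene is called significant, per sample-size step.

    Returns (profile, votes_by_step): one probability and one 0/1 vote vector
    for each sample fraction.
    """
    if fractions is None:
        fractions = [step / 10 for step in range(1, 11)]  # 10%, 20%, ..., 100%
    rng = random.Random(seed)
    profile = []
    votes_by_step = []
    for fraction in fractions:
        size = max(1, round(fraction * len(patients)))
        votes = [int(is_significant(rng.sample(patients, size)))
                 for _ in range(n_resamples)]
        votes_by_step.append(votes)
        profile.append(sum(votes) / n_resamples)
    return profile, votes_by_step


# Toy usage: a stand-in "test" that calls the gene significant once the
# subset holds at least five patients (purely illustrative).
patients = list(range(10))
profile, votes = reproducibility_profile(patients, lambda subset: len(subset) >= 5,
                                         n_resamples=20)
print(profile)
```

The 0/1 vote vectors returned for a given step are the kind of input a pairwise similarity measure such as SAS would consume.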

      L417: I assume p < 0.05 in the survival analysis means the nominal p-value, unadjusted for multiple testing. Since we are in the context of many tests please explicitly state if so.

      Yes, p < 0.05 refers to the nominal, unadjusted p-value from each log-rank test within a single permutation. In MEMORY these raw p-values are converted immediately into 0/1 “votes” and aggregated over 1,000 permutations and ten sample-size tiers; only the resulting reproducibility probability (𝐴<sub>𝑖𝑗</sub>) is carried forward. No multiple-testing adjustment is applied at the individual-test level, because a per-iteration FDR or BH step would not materially affect the final 𝐴<sub>𝑖𝑗</sub> ranking. We have revised the manuscript accordingly (Line 396).

      L419-426: I did not see defined what the rows are and what the columns are in the "significant-probability matrix". Are rows genes, columns cancer types? Consequently I was not really sure what actually makes a "GEAR". Is it achieving a significance probability of 0.8 across all 15 cancer subtypes? Or in just one of the tumour datasets?

      In the significant-probability matrix, each row represents a gene, and each column corresponds to a sampling gradient (i.e., increasing sample-size tiers from ~10% to 100%) within a single cancer type. The matrix is constructed independently for each cancer.

      A GEAR is defined as a gene that achieves a significance probability of 0.8 within a single tumor type; it does not need to reach this threshold across all 15 cancer subtypes.

      L426: The significance probability threshold of 0.8 across 1,000 bootstrapped nominal tests --- used to define the GEAR lists --- has, as far as I can tell, no formal justification. Conceptually, the "significance probability" reflects uncertainty in the patients being used (if I follow their procedure correctly), but as mentioned above, a classical p-value is also designed to reflect sampling uncertainty. So why use the bootstrapping at all?

      Moreover, the 0.8 threshold is applied on a per-gene basis, so there is no apparent procedure "built in" to adapt to (and account for) different total numbers of genes being tested. Can the authors quantify the false discovery rate associated with this GEAR selection procedure e.g. by running for data with permuted outcome labels? And why do the gradient / bootstrapping at all --- why not just run the nominal survival p-values through a simple Benjamini-Hochberg procedure, and then apply an FDR threshold to define the GEAR lists? Then you would have both multiplicity and error control for the final lists. As it stands, with no form of error control or quantification of noise rates in the GEAR lists I would not recommend promoting their use. There is a long history of variable selection techniques, and various options the authors could have used that would have provided formal error rates for the final GEAR lists (see seminal reviews by eg Heinze et al 2018 Biometrical Journal, or O'Hara and Sillanpaa, 2009, Bayesian Analysis), including, as I say, simple application of a Benjamini-Hochberg to achieve multiplicity adjusted FDR control.

      Thank you. We chose the 10 × 1,000 resampling scheme to ask a different question from a single Benjamini–Hochberg scan: does a gene keep re-appearing as significant when cohort composition and statistical power vary from 10% to 100% of the data? Converting the 1,000 nominal p-values at each sample fraction into a reproducibility probability $A_{ij}$ allows us to screen for signals that are stable across wide sampling uncertainty rather than relying on one pass through the full cohort. The 0.8 cut-off is an intentionally strict, empirically accepted robustness threshold (analogous to stability-selection); under the global null the chance of exceeding it in 1,000 draws is effectively zero, so the procedure is already highly conservative even before any gene-wise multiplicity correction [1]. Once MEMORY moves beyond this exploratory stage and a final, clinically actionable GEAR catalogue is required, we will add a formal FDR layer after the robustness screen, but for the present proof-of-concept study, we retain the resampling step specifically to capture stability rather than to serve as definitive error control.

      L427-433: I gathered that SAS reflects, for a particular pair of genes, how likely they are to be jointly significant across bootstraps. If so, perhaps this description or similar could be added since I found a "conceptual" description lacking which would have helped when reading through the maths. Does it make sense to also reflect joint significance across multiple cancer types in the SAS? Or did I miss it and this is already reflected?

      SAS is indeed meant to quantify, within a single cancer type, how consistently two genes are jointly significant across the 1,000 bootstrap resamples performed at a given sample-size tier. In other words, SAS is the empirical probability that the two genes “co-light-up” in the same permutation, providing a measure of shared prognostic behavior beyond what either gene shows alone. We have added this plain language description to the ‘Methods’ (Lines 405-418).
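
      The plain-language description above can be sketched as a joint-significance frequency. This is only an illustration of the stated idea: the function name, the 0.05 nominal level, and the input layout (one p-value per resample for each gene, in matching order) are assumptions, and the exact SAS formula in the Methods may differ.

```python
from typing import Sequence


def joint_significance(p_gene_a: Sequence[float],
                       p_gene_b: Sequence[float],
                       alpha: float = 0.05) -> float:
    """Fraction of resamples in which both genes are nominally significant.

    p_gene_a[k] and p_gene_b[k] must come from the same resample k of the
    same cancer type and sampling tier.
    """
    if len(p_gene_a) != len(p_gene_b):
        raise ValueError("both genes need one p-value per shared resample")
    if not p_gene_a:
        raise ValueError("at least one resample is required")
    both = sum(1 for a, b in zip(p_gene_a, p_gene_b) if a < alpha and b < alpha)
    return both / len(p_gene_a)


# Hypothetical p-values for two genes over four shared resamples.
score = joint_significance([0.01, 0.20, 0.03, 0.04],
                           [0.02, 0.01, 0.50, 0.03])
```

      In this toy case the two genes are jointly significant in two of the four resamples, so the score is 0.5.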

      In the current implementation SAS is calculated separately for each cancer type; it does not aggregate cosignificance across different cancers. Extending SAS to capture joint reproducibility across multiple tumor types is an interesting idea, especially for identifying pan-cancer gene pairs, and we note this as a potential future enhancement of the MEMORY pipeline.

      L432: "The SAS of significant genes with total genes was calculated, and the significant survival network was constructed" Are the "significant genes" the "GEAR" list extracted above according to the 0.8 threshold? If so, and this is a bit pedantic, I do not think they should be referred to as "significant genes" and that this phrase should be reserved for formal statistical significance.

      We have replaced “significant genes” with “GEAR genes” to avoid any confusion (Lines 421-422).

      L434: "some SAS values at the top of the rankings were extracted, and the SAS was visualized to a network by Cytoscape. The network was named core survival network (CSN)". I did not see it explicitly stated which nodes actually go into the CSN. The entire GEAR list? What threshold is applied to SAS values in order to determine which edges to include? How was that threshold chosen? Was it data driven? For readers not familiar with what Cytoscape is and how it works could you offer more of an explanation in-text please? I gather it is simply a piece of network visualisation/wrangling software and does not annotate additional information (e.g. external experimental data), which I think is an important point to clarify in the article without needing to look up the reference.

      We have now clarified these points in the revised ‘Methods’ section, including how the SAS threshold was selected and which nodes were included in the Core Survival Network (CSN). Specifically, the CSN was constructed using the top 1,000 gene pairs with the highest SAS values. This threshold was not determined by a fixed numerical cutoff, but rather chosen empirically after comparing networks built with varying numbers of edges (250, 500, 1,000, 2,000, 6,000, and 8,000; see Reviewer-only Figure 1). We observed that, while increasing the number of edges led to denser networks, the set of hub genes remained largely stable. Therefore, we selected 1,000 edges as a balanced compromise between capturing sufficient biological information and maintaining computational efficiency and interpretability.

      The resulting node list (i.e., the genes present in those top-ranked pairs) is provided in Supplementary Table 4. Cytoscape was used solely as a network visualization platform, and no external annotations or experimental data were added at this stage. We have added a brief clarification in the main text to help readers understand.
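
      The edge-selection step can be sketched as follows, assuming the pairwise scores are held in a dictionary keyed by gene pair. The gene labels are placeholders, and counting edges per node is used here only as one simple way to look at hub candidates; the response does not state how hubs were scored.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def top_edges(sas: Dict[Pair, float], k: int) -> List[Tuple[Pair, float]]:
    """Return the k gene pairs with the highest scores, highest first."""
    return heapq.nlargest(k, sas.items(), key=lambda item: item[1])


def node_degrees(edges: List[Tuple[Pair, float]]) -> Counter:
    """Count how many retained edges touch each gene."""
    degrees: Counter = Counter()
    for (gene_1, gene_2), _score in edges:
        degrees[gene_1] += 1
        degrees[gene_2] += 1
    return degrees


# Placeholder gene labels and scores, for illustration only.
toy_sas = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.1, ("A", "D"): 0.7}
edges = top_edges(toy_sas, k=3)
degrees = node_degrees(edges)
```

      Repeating the same call with different values of `k` (for example 250 through 8,000) and comparing the highest-degree genes is one way to check how stable the hub set is as the edge count changes.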

      L437: "The effect of molecular classification by hub genes is indicated that 1000 to 2000 was a range that the result of molecular classification was best." Can you clarify how "best" is assessed here, i.e. by what metric and with which data?

      We apologize for the confusion. Upon constructing the network, we observed that the number of edges affected both the selection of hub genes and the computational complexity. We analyzed the networks with 250, 500, 1,000, 2,000, 6,000 and 8,000 edges, and found that the differences in selected hub genes were small (Author response image 1). Networks with fewer edges had lower computational complexity, whereas networks with more edges retained more biological information. We therefore chose the network with 1,000 edges as a practical compromise between computational efficiency and the biological relevance of the hub genes.

      Author response image 1.

      The intersection of the networks constructed with various numbers of edges.

      References

      (1) Gebski, V., Garès, V., Gibbs, E. & Byth, K. Data maturity and follow-up in time-to-event analyses. International Journal of Epidemiology 47, 850–859 (2018).

    1. Author response:

      Reviewer #1 (Public review):

      Weaknesses:

      The technical approach is strong and the conceptual framing is compelling, but several aspects of the evidence remain incomplete. In particular, it is unclear whether the reported changes in connectivity truly capture causal influences, as the rank metrics remain correlational and show discrepancies with the manipulation results.

      We agree that our functional connectivity ranking analyses cannot establish causal influences. As discussed in the manuscript, besides learning-related activity changes, the functional connectivity may also be influenced by neuromodulatory systems and internal state fluctuations. In addition, the spatial scope of our recordings is still limited compared to the full network implicated in visual discrimination learning, which may bias the ranking estimates. In future, we aim to achieve broader region coverage and integrate multiple complementary analyses to address the causal contribution of each region.

      The absolute response onset latencies also appear slow for sensory-guided behavior in mice, and it is not clear whether this reflects the method used to define onset timing or factors such as task structure or internal state.

      We believe this may be primarily due to our conservative definition of onset timing. Specifically, we required the firing rate to exceed baseline (t-test, p < 0.05) for at least 3 consecutive 25-ms time windows. This might lead to later estimates than other studies, such as using the latency to the first spike after visual stimulus onset (~50-60 ms, Siegle et al., Nature, 2023) or the time to half-max response (~65 ms, Goldbach et al., eLife, 2021).
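
      A minimal sketch of this onset criterion is given below. It assumes that a per-window p-value against baseline has already been computed for each consecutive 25-ms window, that the test is directional (firing rate above baseline), and that onset is reported as the start of the first window in the qualifying run; the function and parameter names are illustrative.

```python
from typing import Optional, Sequence


def onset_latency_ms(p_values: Sequence[float],
                     bin_ms: float = 25.0,
                     alpha: float = 0.05,
                     min_run: int = 3) -> Optional[float]:
    """Return the start time (ms after stimulus onset) of the first run of
    at least `min_run` consecutive significant windows, or None if no such
    run exists. p_values[i] belongs to the window starting at i * bin_ms.
    """
    run_length = 0
    for index, p in enumerate(p_values):
        run_length = run_length + 1 if p < alpha else 0
        if run_length == min_run:
            first_window = index - min_run + 1
            return first_window * bin_ms
    return None


# Hypothetical per-window p-values: one isolated significant window at 25 ms,
# then a sustained run starting at 75 ms.
latency = onset_latency_ms([0.50, 0.01, 0.30, 0.01, 0.02, 0.03, 0.50])
```

      The isolated significant window at 25 ms is ignored because it is not sustained, which illustrates why this criterion yields later onsets than first-spike or half-max measures.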

      Furthermore, the small number of animals, combined with extensive repeated measures, raises questions about statistical independence and how multiple comparisons were controlled.

      We agree that a larger sample size would strengthen the robustness of the findings. However, as noted above, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve sufficient unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. This will allow us to both increase the number of animals and extract more precise insights into mesoscale dynamics during learning.

      The optogenetic experiments, while intended to test the functional relevance of rank increasing regions, leave it unclear how effectively the targeted circuits were silenced. Without direct evidence of reliable local inhibition, the behavioral effects or lack thereof are difficult to interpret.

      We appreciate this important point. Due to the design of the flexible electrodes and the implantation procedure, bilateral co-implantation of both electrodes and optical fibers was challenging, which prevented us from directly validating the inhibition effect in the same animals used for behavior. In hindsight, we could have conducted parallel validations using conventional electrodes, and we will incorporate such controls in future work to provide direct evidence of manipulation efficacy.

      Details on spike sorting are limited.

      We will provide more details on spike sorting, including the exact parameters used in the automated sorting algorithm and the subsequent manual curation criteria.

      Reviewer #2 (Public review):

      Weaknesses:

      I had several major concerns:

      (1) The number of mice was small for the ephys recordings. Although the authors start with 7 mice in Figure 1, they then reduce to 5 in panel F. And in their main analysis, they minimize their analysis to 6/7 sessions from 3 mice only. I couldn't find a rationale for this reduction, but in the methods they do mention that 2 mice were used for fruitless training, which I found no mention in the results. Moreover, in the early case, all of the analysis is from 118 CR trials taken from 3 mice. In general, this is a rather low number of mice and trial numbers. I think it is quite essential to add more mice.

      We apologize for the confusion. As described in the Methods section, 7 mice (Figure 1B) were used for behavioral training without electrode array or optical fiber implants to establish learning curves, and an additional 5 mice underwent electrophysiological recordings (3 for visual-based decision-making learning and 2 for fruitless learning).

      As we noted in our response to Reviewer #1, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve high-quality unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. These improvements will enable us to collect data from a larger sample size and extract more precise insights into mesoscale dynamics during learning.

      (2) Movement analysis was not sufficient. Mice learning a go/no-go task establish a movement strategy that is developed throughout learning and is also biased towards Hit trials. There is an analysis of movement in Figure S4, but this is rather superficial. I was not even sure that the 3 mice in Figure S4 are the same 3 mice in the main figure. There should be also an analysis of movement as a function of time to see differences. Also for Hits and FAs. I give some more details below. In general, most of the results can be explained by the fact that as mice gain expertise, they move more (also in CR during specific times) which leads to more activation in frontal cortex and more coordination with visual areas. More needs to be done in terms of analysis, or at least a mention of this in the text.

      Due to limitations in the experimental design and implementation, movement tracking was not performed during the electrophysiological recordings, and the 3 mice shown in Figure S4 were from a separate group. We have carefully examined the temporal profiles of mouse movements and found that they did not fully match the rank dynamics, and we will add these results and related discussion in the revised manuscript. However, we acknowledge that without synchronized movement recordings in the main dataset, we cannot fully disentangle movement-related neural activity from task-related signals. We will make this limitation explicit in the revised manuscript and discuss it as a potential confound, along with possible approaches to address it in future work.

      (3) Most of the figures are over-detailed, and it is hard to understand the take-home message. Although the text is written succinctly and rather short, the figures are mostly overwhelming, especially Figures 4-7. For example, Figure 4 presents 24 brain plots! For rank input and output rank during early and late stim and response periods, for early and expert and their difference. All in the same colormap. No significance shown at all. The Δrank maps for all cases look essentially identical across conditions. The division into early and late time periods is not properly justified. But the main take home message is positive Δrank in OFC, V2M, V1 and negative Δrank in ThalMD and Str. In my opinion, one trio map is enough, and the rest could be bumped to the Supplementary section, if at all. In general, the figure in several cases do not convey the main take home messages. See more details below.

      We thank the reviewer for this valuable critique. The statistical significance corresponding to the brain plots (Figure 4 and Figure 5) was presented in Figure S3 and S5, but we agree that the figure can be simplified to focus on the key results. In the revised manuscript, we will condense these figures to focus on the most important comparisons and relocate secondary plots to the Supplementary section. This will make the visual presentation more concise and the take-home message clearer.

      (4) The analysis is sometimes not intuitive enough. For example, the rank analysis of input and output rank seemed a bit over complex. Figure 3 was hard to follow (although a lot of effort was made by the authors to make it clearer). Was there any difference between the output and input analysis? Also, the time period seems redundant sometimes. Also, there are other network analyses that can be done that are a bit more intuitive. The use of rank within the 10 areas was not the most intuitive. Even a dimensionality reduction along with clustering can be used as an alternative. In my opinion, I don't think the authors should completely redo their analysis, but maybe mention the fact that other analyses exist.

      We appreciate the reviewer’s comment. In brief, the input- and output-rank analyses yielded largely similar patterns across regions in CR trials, although some differences were observed in certain areas (e.g., striatum in Hit trials) where the magnitude of rank change was not identical between input and output measures. We agree that the division into multiple time periods sometimes led to redundant results; we will combine overlapping results in the revision to improve clarity.

      We did explore dimensionality reduction applied to the ranking data. However, the results were not intuitive and required additional interpretation, which did not bring more insights. Still, we acknowledge that other analysis approaches might provide complementary insights. While we do not plan to completely reanalyze the dataset at this stage, we will include a discussion of these alternative methods and their potential advantages in the revised manuscript.

      Reviewer #3 (Public review):

      Weaknesses:

      The weakness is also related to the strength provided by the method. It is demonstrated in the original method that this approach in principle can track individual units for four months (Luan et al, 2017). The authors have not shown chronically tracked neurons across learning. Without demonstrating that and taking advantage of analyzing chronically tracked neurons, this approach is not different from acute recording across multiple days during learning. Many studies have achieved acute recording across learning using similar tasks. These studies have recorded units from a few brain areas or even across brain-wide areas.

      We appreciate the reviewer’s important point. We did attempt to track the same neurons across learning in this project. However, due to the limited number of electrodes implanted in each brain region, the number of chronically tracked neurons in each region was insufficient to support statistically robust analyses. Concentrating probes in fewer regions would allow us to obtain enough units tracked across learning in future studies to fully exploit the advantages of this method.

      Another weakness is that major results are based on analyses of functional connectivity that is calculated using the cross-correlation score of spiking activity (TSPE algorithm). Functional connection strength across areas is then ranked 1-10 based on relative strength. Without ground truth data, it is hard to judge the underlying caveats. I'd strongly advise the authors to use complementary methods to verify the functional connectivity and to evaluate the mesoscale change in subnetworks. Perhaps the authors can use one key information of anatomy, i.e. the cortex projects to the striatum, while the striatum does not directly affect other brain structures recorded in this manuscript.

      We agree that the functional connectivity measured in this study relies on statistical correlations rather than direct anatomical connections. We plan to test the functional connection data with shorter cross-correlation delay criteria to see whether the results are consistent with anatomical connections and whether the original findings still hold.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1

      (...) The study describes meticulously conducted and controlled experiments, showing the impressive biochemistry work consistently produced by this group. The statistical analysis and data presentation are appropriate, with the following major comments noted:

      Response: We thank the reviewer for their thoughtful and constructive review of our manuscript. We appreciate the positive comments on our experimentation.

      Major comments

      1. Please clarify why K8ac/K12ac, K5ac/K16ac, K5ac/K12ac are not quantified (Figure 3). If undetected, state explicitly and annotate figures with "n.d." rather than leaving gaps. If detected but excluded, justify the exclusion.

      Response: We restricted ourselves to mapping those diacetylated motifs that can be readily identified by MS2. The characteristic ions of the d3-labeled and endogenous acetylated peptides in the MS2 spectra could not differentiate the diacetylated forms mentioned by the reviewer. Rather than expanding the figure with non-informative rows we amended the legend of figure 3 accordingly "Diacetylated forms K8-K12, K5-K16, K5-K12 could not be distinguished from each other by MS2 and were thus not included in the analysis".

      The statement "Nevertheless, combinations of di- and triacetylation were much more frequent if K12ac was included, suggesting that K12 is the primary target." is under-supported because only two non-K12ac combinations are shown, and only one is lower than K12ac-containing combinations. Either soften the claim ("trend toward ... in our dataset") or expand the analysis to all observed di/tri combinations with effect sizes, n, and statistical tests.

      Response: The reviewer is right: our statement does not properly reflect the data. It rather seems that combinations lacking K12ac are considerably less frequent (K5K8K16 tri-ac, K5K8 di-ac). We now modified the sentence as follows: "Peptides lacking K12ac were less frequent, suggesting that K12 is a primary target".

      Please provide a more detailed discussion about the known nature of NU9056 inhibition and how it fits or doesn't fit with your data. Are there any structural studies on this?

      Response: Unfortunately, NU9056 is very poorly described, neither the mode of interaction with Tip60 nor the mechanism of inhibition are known. The specificity of the chemical has not really been shown, but nevertheless it is used as a selective Tip60 inhibitor in several papers which is why we picked it in the first place. Our conclusions on the inhibitor are in the last paragraph of the discussion: "The fact that acetylation of individual lysines is inhibited with different kinetics argues against a mechanism involving competition with acetyl-CoA, but for an allosteric distortion of the catalytic center." We think that any further interpretation would likely be considered an overstatement.

      Why was the inhibitor experiment MS only performed for H2A.V and not H2A? Given the clear H2A vs H2A.V differences reported in Fig. 2, it would be useful to have the matched data for H2A.

      Response: In these costly mass spec experiments we strive to balance limited resources and most informative output. Because H2A.V and H4 are the major functional targets of Tip60, we considered that documenting the effect of the inhibitor on these substrates would be most appropriate. In hindsight, including H2A would have been nice to have, but would not change our conclusions about the inhibitor.

      The inhibitor observations are very interesting as they can highlight systems to study the loss of specific acetyl residues: can the authors perform WB/IF validation in treated cells? I understand it will not be possible with the H2A antibodies, but the difference in H4K5ac vs H4K12ac should be possible to validate in cells

      Response: We attempted to monitor changes of histone modifications upon treatment of cells with NU9056 by immunoblotting. Probing H4K5 and K12, the results were variable. We also observed occasionally that acetylation of H4K5 and H4K12 was slightly diminished in whole cell extracts, but not in nuclear extracts. This reminded us that diacetylation of H4 at K5 and K12 is a feature of cytoplasmic H4 in complex with chaperones, a mark that is placed by HAT1 (Aguldo Garcia et al., DOI: 10.1021/acs.jproteome.9b00843; Varga et al., DOI: 10.1038/s41598-019-54497-0). The observed proliferation arrest by NU9056 may thus affect chromatin assembly and indirectly K5K12 acetylation. H4K12 is also acetylated by chameau (Chm).

      We observed a reduction of acetylated H4K16 and H2A.V. H4K16 is not a preferred target of Tip60, but Tip60 acetylates MSL1 and MBDR2, two subunits of the NSL1 complex (Apostolou et al. DOI: 10.1101/2025.07.15.664872). We, therefore, consider that the effects of NU9056 treatment on H4 acetylation may at least partially be indirect. Because we are not confident about the data and because our manuscript emphasizes the direct, intrinsic specificity of Tip60, we refrain from showing the corresponding Western blots.

      You highlight that H2AK10 (a major TIP60 site here) is not conserved in human canonical H2A. Please expand the discussion of the potential function and physiological relevance. Maybe in relation to H2A.V being a fusion of different human variants?

      Response: The reviewer noted an interesting aspect of the evolution of the histone H2A variants. It turns out that H2A.Z is the more ancient variant, from which H2A derived by mutation. H2A.Z/H2A.V sequences are more conserved than H2A sequences. We summarized these evolutionary notions in Baldi and Becker (DOI: 10.1007/s00412-013-0409-x). In the context of the question, this means that mammalian H2A.Z, Drosophila H2A.V and mammalian H2A still contain the ancient sequence (lacking K10), and Drosophila H2A acquired K10 by mutation. The evolutionary advantage associated with this mutation is unclear. We now added a small paragraph summarizing these ideas on page 13 of the manuscript (changes tracked in red).

      To enable direct comparisons between variants and residues, please match y-axis scales where the biology invites comparison (e.g., H2A vs H2A.V; Figs. 2-3).

      Response: We adjusted the Y-axes in Figure 2 and 3 to facilitate direct comparisons, where such comparison is informative.

      Minor comments

      1. Add 1-2 sentences in the abstract on the gap in the field being addressed by the study.

      Response: We are grateful for this suggestion and have expanded the abstract accordingly (changes tracked in red).

      Either in the introduction or discussion, comment on your prior Tip60 three-subunit data (Kiss et al.). The three-subunit complex was significantly less active on H4, as indicated in that publication, which is likely due to the absence of Eaf6.

      Response: We thank the reviewer for the opportunity to emphasize this point. Motivated by findings in the yeast and mammalian systems that Eaf6 was important for acetylation, we added this subunit to our previously reconstituted 3-subunit 'piccolo' complex. As can be seen by the comparison of the older data (Kiss et al.) and the new data, the 4-subunit TIP60 core complex is a much more potent HAT. We amended the introduction (see marked text) accordingly. We also added a paragraph on what is known about the properties and function of Eaf6 to the discussion.

      3a. Text references Fig.1E before Fig.1C, please reorder

      Response: We deleted the premature mentioning of Figure 1E and added the following explanation to the relevant panels in Figure 1: "The blot was reprobed with an antibody detecting H3 as an internal standard for nucleosome input."

      3b. Fig.1B/C legend labels appear swapped.

      Response: We thank the reviewer for spotting the swap. We corrected the figure legend.

      3c. Fig.1E, 4A, 4B: add quantification

      Response: We quantified each acetylation level, and added to the relevant panel of Figure 1 and 4 the following phrase: "The quantified levels of each acetylation mark over H3 are shown below each plot." Notably, the difference in acetylation signal strength between the two antibodies highlights the inherent variability of antibody-based detection.

      3d. Fig.2A: Note explicitly that K5-K10 and K8-K10 are unresolvable pairs to explain the shading scheme used.

      Response: The legend of Figure 2A now includes the following sentence. "Peptides that are diacetylated at either K5/K10 or K8/K10 cannot be resolved by MS2. The last row reminds of this fact by the patterning of boxes and displays the combined values."

      Ensure consistent KAT5/TIP60 naming.

      Response: Our naming follows this logic: We use 'Tip60' for the Drosophila protein and 'TIP60' for the Drosophila 'piccolo' or 'core' complexes. The mammalian protein is referred to by the capital acronym TIP60, as is established in the literature. We use KAT5/TIP60 according to the unified nomenclature in the introduction and parts of the discussion, when we refer to the enzymes in more general terms, independent of species. We scrutinized the manuscript again and made a few changes to adhere to the above scheme.

      Consider moving the first two Discussion paragraphs (field context and challenges in antibody-based detection) into the Introduction to better frame the significance.

      Response: We thank the reviewer for this suggestion that improved the manuscript a lot. We incorporated the first two paragraphs of the discussion into the introduction.

      Significance

      This is a valuable and timely study for the histone acetylation field. The substrate specificity of many individual HATs remains incompletely understood owing to (i) cross-reactivity and limited selectivity of many anti-acetyl-lysine antibodies, (ii) functional redundancy among KATs, (iii) variability across in-vitro assays (HAT domain vs full-length/complex; free histones vs oligonucleosomes), and (iv) incomplete translation of in-vitro specificity to in-vivo settings. These factors have produced conflicting reports in the literature. By combining quantitative mass spectrometry with carefully engineered oligonucleosomal arrays, the authors make a principal step toward deconvoluting TIP60 biology in a controlled yet close-to-physiologically relevant system. Conceptually, the work delineates intrinsic, site-specific preferences of the TIP60 core on variant versus canonical nucleosomes, consistent with largely distributive behaviour and site-dependent inhibitor sensitivity. The inhibitor-dependent shifts in acetylation patterns are particularly intriguing and could enable dissection of residue-specific functions, with potential translational implications for preclinical cancer research and biomarker development. Overall, this manuscript will be of interest to the chromatin community, and I am supportive of publication pending satisfactory resolution of the points raised above.

      Response: Once more we thank the reviewer for their time and efforts devoted to help us improve the manuscript.


      Reviewer #2

      Major comments

      (...) A central limitation of the study, noted by the authors, is the uncertainty regarding the biological relevance of the findings. While the in vitro system provides a controlled framework for analyzing residue specificity and kinetics, it does not address the functional significance of these results in a cellular or organismal context. This limitation is outside the scope of the current work but indicates potential directions for follow-up studies. Within its defined objectives, the study presents a methodological framework and dataset that contribute to understanding TIP60 activity in a biochemical setting.

      Response: We agree with the referee.

      Minor comments

      While the manuscript is clearly presented overall, there are two minor issues that could be addressed:

      1. In Figure 1, the panels are not ordered according to their appearance in the Results section. In addition, the legends for Figures 1B and 1C appear to be swapped.

      Response: We thank the reviewer for spotting these oversights. We deleted the premature mentioning of Figure 1E and added the following explanation to the relevant panels in Figure 1: "The blot was reprobed with an antibody detecting H3 as an internal standard for nucleosome input." We also swapped the legends.

      For the quantitative MS data (N = 2 biological replicates), the phrasing "Error bars represent the two replicate values" could be refined. With N = 2, showing individual data points or the range may convey the information more transparently than conventional error bars, which are typically associated with statistical measures (e.g., SEM) from larger sample sizes. Alternatively, a brief note explaining the choice to use two replicates and represent them with error bars could be added.

      Response: We appreciate the reviewer's comment and have revised the figure to display individual data points for the two biological replicates instead of error bars, providing a clearer representation of the data distribution. We changed the phrasing 'Error bars represent...' to "Bars represent the mean of two biological replicates (each consisting of two TIP60 core complexes and two nucleosome arrays - each analyzed with two technical replicates), with individual replicate values shown as open circles." and hope that this describes the data better.

      Significance

      Krause and colleagues, using a clean in vitro system, define the substrate specificity of the Drosophila TIP60 core complex. They identify the main acetylation sites and their kinetic dynamics on H2A, H2A.V, and H4 tails, and further characterize the inhibitory activity of NU9056. This work addresses a longstanding question in the field and provides compelling evidence to support its conclusions. Future studies will be needed to establish the biological relevance of these findings.

      Response: We thank the reviewer for a thoughtful and constructive review of our manuscript. We appreciate the suggestions that helped to improve the manuscript.


      Reviewer #3

      (...) However, the authors should revisit some additional points:

      Major comments:

      1. The Tip60 core complex is usually described as containing three subunits: Tip60, Ing3 and E(Pc). The authors also included Eaf6 in their analysis, however, their motivation to include Eaf6 specifically remains unclear. They should explain in the manuscript why Eaf6 was included and how this could affect the observed acetylation pattern.

      Response: We thank the reviewer for the opportunity to emphasize this point. Motivated by findings in the yeast and mammalian systems that Eaf6 was important for acetylation, we added this subunit to our previously reconstituted 3-subunit piccolo complex. As can be seen by the comparison of the older data (ref Kiss) and the new data, the 4-subunit Tip60 core complex is a much more potent HAT. We amended the introduction accordingly. We also added a paragraph on what is known about the properties and function of Eaf6 to the discussion. Please see the amended text marked in red.

      The authors investigated the effectiveness of two Tip60 inhibitors by testing their effects on H4K12ac using an antibody. They state that "TH1834 had no detectable effect on either complex [Tip60 or Msl], even at very high concentrations." However, the initial publication describing TH1834 also stated that this inhibitor particularly affected H2AX, with no direct effect on H4 acetylation. The authors should revisit TH1834 and specifically investigate its effect on H2A and, in particular, on H2Av, as H2Av is the corresponding ortholog of H2AX.

      Response: The case of TH1834 is not very strong in the literature, which is why we discontinued the line of experimentation when we did not see any effect of TH1834 (2 different batches) on the preferred substrate. The reviewer's suggestion is very good, but given our limited resources we decided to remove the data and discussion of TH1834 from the manuscript (old Figure 4A). The deletion of these very minor data does not diminish the overall conclusion and significance of the manuscript.

      The authors performed a detailed analysis of NU9056 effects. However, they did not include effects on H2A. H2A is distinct from H4 and H2Av as it is the only one containing K10, and this lysine also showed high levels of acetylation by Tip60. Therefore, a comprehensive analysis of NU9056 effects should include its effects on H2A acetylation.

      Response: In these costly mass spec experiments, we strive to balance limited resources and most informative output. Because H2A.V and H4 are the major functional targets of Tip60, we considered that documenting the effect of the inhibitor on these substrates would be most appropriate. In hindsight, including H2A would have been nice to have, but would not change our conclusions about the inhibitor.

      The authors have previously reported non-histone substrates of Tip60. It would be interesting to test whether the two investigated Tip60 inhibitors affect acetylation of non-histone substrates of Tip60. This analysis would greatly increase the understanding of how selective these inhibitors are. (OPTIONAL)

      Response: We agree with the reviewer that the proposed experiments may be an interesting extension of our current work. However, the Becker lab will be closed down by the end of this year due to retirement, precluding major follow-up studies at this point.

      Minor comments:

      1. Fig. 1 a: instead of "blue residues", would it be more accurate to refer to "blue arrows"?

      Response: Yes of course - the text has been revised accordingly.

      Fig.1 b-c: it would be helpful to include which staining (silver/Ponceau?) was performed here.

      Response: The legends now contain the relevant information.

      Fig. 2a: I did not understand the shading for the K5/K8-K10ac panel from the figure legend. The explanation is present in the main text but would be helpful in the figure legend to allow easy access for readers.

      Response: We agree and revised the text accordingly.

      Fig. 4 c: bar graphs on the top: the X-values are missing.

      Response: The figure has been revised accordingly.

      This sentence in the discussion seems to require revision: "Whereas the replication-dependent H2A resides in most nucleosomes in the genome, H2A.V, the only H2A variant histone in Drosophila, is incorporated by exchange of H2A, independent of replication."

      Response: We revised the sentence as follows to improve clarity. "While the replication-dependent H2A is present in most nucleosomes across the genome, H2A.V, the only H2A variant in Drosophila, is incorporated through replication-independent exchange of H2A."

      In this sentence: "A comparison with the TIP60 core complex is instructive since both enzymes are MYST acetyltransferases and bear significant similarity in their catalytic center." do the authors mean "informative" rather than "instructive"?

      Response: We replaced 'instructive' by 'informative'.

      Significance

      The findings are novel and expand our knowledge of Tip60 histone tail acetylation dynamics and specificity. The manuscript does not address the biological relevance of distinct acetylation marks, which is clearly beyond the scope of the study, but discusses their relevance where possible. The analysis of NU9056 is informative and relevant in a broad context. Optionally, the authors could expand their analysis of NU9056 to its effects on non-histone Tip60 targets to increase impact further. Their analysis of TH1834, however, is currently insufficient as they focused on H4 acetylation alone, which has already been reported not to be affected by TH1834. The authors should include an analysis of TH1834 effects on H2A and H2A.V acetylation. The manuscript is well written, easy to follow and of appropriate length. The methods are elegant and the findings of the study are novel. The manuscript targets researchers specifically interested in chromatin remodeling as well as a broader audience using the Tip60 inhibitor NU9056.

      Response: We thank the reviewer for their thorough assessment and the general appreciation of our work. We agree that the analysis of TH1834 is not satisfactory at this point and have removed the corresponding data and description from Figure 4. The deletion of these very minor data does not diminish the overall conclusion and significance of the manuscript.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, Xiong and colleagues investigate the mechanisms operating downstream of TRIM32 and controlling myogenic progression from proliferation to differentiation. Overall, the bulk of the data presented is robust. Although further investigation of specific aspects would make the conclusions more definitive (see below), it is an interesting contribution for scientists studying the molecular basis of muscle diseases.

      We thank the Reviewer for appreciating our work and for their valuable suggestions to improve our manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental efforts, will be addressed as detailed in the Revision Plan.

      In my opinion, a few aspects would improve the manuscript. Firstly, the conclusion that Trim32 regulates c-Myc mRNA stability could be expanded and corroborated by further mechanistic studies:

      1. Studies investigating whether Trim32 binds directly to c-Myc RNA. Moreover, although possibly beyond the scope of this study, an unbiased screening of RNA species binding to Trim32 would be informative.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      If possible, studies in which the overexpression of different mutants presenting specific altered functional domains (NHL domain known to bind RNAs and Ring domain reportedly involved in protein ubiquitination) would be used to test if they are capable or incapable of rescuing the reported alteration of Trim32 KO cell lines in c-Myc expression and muscle maturation.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      An optional aspect that might be interesting to explore is whether the alterations in c-Myc expression observed in C2C12 might be replicated with primary myoblasts or satellite cells devoid of Trim32.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      I also have a few minor points to highlight:

        • It is unclear if the differences highlighted in graphs 5G, EV5D, and EV5E are statistically significant.

      Authors’ response. We thank the Reviewer for raising this point. We have now indicated the statistical analyses performed on the data presented in the mentioned figures (also in response to a point raised by Reviewer #3). Consistent with the conclusion that Trim32 is necessary for proper regulation of c-Myc transcript stability, a 2-way ANOVA on the data now reported as Figure 5G shows a statistically significant effect of the genotype at 6h (right-hand graph) but not at D0 (left-hand graph). In the graphs of Fig. EV5 D and E, no significant changes are observed at D0, whereas at 6h the data show a significant difference at the 40 min time point. We included this information in the graphs and in the corresponding legends.

      - On page 10, it is stated that c-Myc down-regulation cannot rescue KO myotube morphology fully nor increase the differentiation index significantly, but the corresponding data is not shown. Could the authors include those quantifications in the manuscript?

      Authors’ response. As suggested, we included the graph showing the differentiation index upon c-Myc silencing in the Trim32 KO clones and in the WT clones, as a new panel in Figure 6 (Fig. 6D). As already reported in the text, a partial recovery of the differentiation index is observed but the increase is not statistically significant. In contrast, no changes are observed when the same silencing is applied in the WT cells. Legend and text were modified accordingly.

      Reviewer #1 (Significance (Required)):

      The manuscript offers several strengths. It provides novel mechanistic insight by identifying a previously unrecognized role for Trim32 in regulating c-Myc mRNA stability during the onset of myogenic differentiation. The study is supported by a robust methodology that integrates CRISPR/Cas9 gene editing, transcriptomic profiling, flow cytometry, biochemical assays, and rescue experiments using siRNA knockdown. Furthermore, the work has disease relevance, as it uncovers a mechanistic link between Trim32 deficiency and impaired myogenesis, with implications for the pathogenesis of LGMDR8.

      At the same time, the study has some limitations. The findings rely exclusively on the C2C12 myoblast cell line, which may not fully represent primary satellite cell or in vivo biology. The functional rescue achieved through c-Myc knockdown is only partial, restoring Myogenin expression but not the full differentiation index or morphology, indicating that additional mechanisms are likely involved. Although evidence supports a role for Trim32 in mRNA destabilization, the precise molecular partners (such as RNA-binding activity, microRNA involvement, or ligase function) remain undefined. Some discrepancies with previous studies, including Trim32-mediated protein degradation of c-Myc, are acknowledged but not experimentally resolved. Moreover, functional validation in animal models or patient-derived cells is currently lacking. Despite these limitations, the study represents an advancement for the field. It shifts the conceptual framework from Trim32's canonical role in protein ubiquitination to a novel function in RNA regulation during myogenesis. It also raises potential clinical implications by suggesting that targeting the Trim32-c-Myc axis, or modulating c-Myc stability, may represent a therapeutic strategy for LGMDR8.

      This work will be of particular interest to muscle biology researchers studying myogenesis and the molecular basis of muscle disease, RNA biology specialists investigating post-transcriptional regulation and mRNA stability, and neuromuscular disease researchers and clinicians seeking to identify new molecular targets for therapeutic intervention in LGMDR8.

      The Reviewer expressing this opinion is an expert in muscle stem cells, muscle regeneration, and muscle development.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      In this study, the authors sought to investigate the molecular role of Trim32, a tripartite motif-containing E3 ubiquitin ligase whose dysregulation is associated with Limb-Girdle Muscular Dystrophy Recessive 8 (LGMDR8), and its role in the dynamics of skeletal muscle differentiation. Using a CRISPR-Cas9 model of Trim32 knockout in C2C12 murine myoblasts, the authors demonstrate that loss of Trim32 alters the myogenic process, particularly by impairing the transition from proliferation to differentiation. The authors provide evidence, by way of transcriptomic profiling, of altered myogenic signaling in the Trim32 KO cells, leading to a disruption of myotube formation in vitro. Interestingly, while previous studies have focused on Trim32's role in protein ubiquitination and degradation of c-Myc, the authors provide evidence that Trim32 regulation of c-Myc occurs at the level of mRNA stability. The authors show that the sustained c-Myc expression in Trim32 knockout cells disrupts the timely expression of key myogenic factors and interferes with the critical withdrawal of myoblasts from the cell cycle required for myotube formation. Overall, the study offers new insight into how Trim32 regulates early myogenic progression and highlights a potential therapeutic target for addressing the defects in muscular regeneration observed in LGMDR8.

      We thank the Reviewer for valuing our work and for their appreciated suggestions to improve our manuscript. We have carefully addressed some of the concerns raised as detailed here, while others, which require more laborious experimental efforts, will be addressed as reported in the Revision Plan.

      Major Comments:

      The work is a bit incremental based on this:

      https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030445

      And this:

      https://www.nature.com/articles/s41418-018-0129-0

      To their credit, the authors do cite the above papers.

      Authors’ response. We thank the Reviewer for this careful evaluation of our work against the current literature and for recognising the contribution of our findings to the complex picture of myogenesis, in which Trim32 and c-Myc, and the Trim32-c-Myc axis, can act at several stages and likely within narrow time windows along the process, which may explain some inconsistencies among published reports.

      The authors do provide compelling evidence that Trim32 deficiency disrupts C2C12 myogenic differentiation and sustained c-Myc expression contributes to this defective process. However, while knockdown of c-Myc does restore Myogenin levels, it was not sufficient to normalize myotube morphology or differentiation index, suggesting an incomplete picture of the Trim32-dependent pathways involved. The authors should qualify their claim by emphasizing that c-Myc regulation is a major, but not exclusive, mechanism underlying the observed defects. This will prevent an overgeneralization and better align the conclusions with the authors' data.

      Authors’ response. We agree with the Reviewer and have modified the phrasing that implied the Trim32-c-Myc axis was the exclusive mechanism, explicitly indicating in the Abstract and in the Discussion that other pathways contribute to proper myogenesis.

      The Abstract now reads: “… suggesting that the Trim32–c-Myc axis may represent an essential hub, although likely not the exclusive molecular mechanism, in muscle regeneration within LGMDR8 pathogenesis.”

      The Discussion now reads: “Functionally, we demonstrated that c-Myc contributes to the impaired myogenesis observed in Trim32 KO clones, although this is clearly not the only factor involved in the Trim32-mediated myogenic network; realistically other molecular mechanisms can participate in this process as also suggested by our transcriptomic results.”

      The authors provide a thorough and well-executed interrogation of cell cycle dynamics in Trim32 KO clones, combining phospho-histone H3, flow cytometry of DNA content, and CFSE proliferation assays. These complementary approaches convincingly show that, while proliferation states remain similar in WT and KO cells, Trim32-deficient myoblasts fail in their normal withdrawal from the cell cycle during exposure to differentiation-inducing conditions. This work adds clarity to a previously inconsistent literature and greatly strengthens the study.

      Authors’ response. We thank the Reviewer for appreciating our thorough analyses on cell cycle dynamics in proliferation conditions and at the onset of the differentiation process.

      The transcriptomic analysis (detailed in the "Transcriptomic analysis of Trim32 WT and KO clones along early differentiation" section of Results) is central to the manuscript and provides strong evidence that Trim32 deficiency disrupts normal differentiation processes. However, the description of the pathway enrichment results is highly detailed and somewhat compressed, which may make it challenging for readers to follow the key biological 'take-homes'. The narrative quickly moves across their multiple analyses like MDS, clustering, heatmaps, and bubble plots without pausing to guide the reader through what each analysis contributes to the overall biological interpretation. As a result, the key findings (reduced muscle development pathways in KO cells and enrichment of cell cycle-related pathways) can feel somewhat muted. The authors may consider reorganizing this section, so the primary biological insights are highlighted and supported by each of their analyses. This would allow the biological implications to be more accessible to a broader readership.

      Authors’ response. We thank the Reviewer for raising this point and apologise for being too brief in describing the data, indeed leaving some points excessively implicit. As suggested, we have now reorganised this section and added the lists of enriched canonical pathways relative to WT vs KO comparisons at D0 and D3 (Fig. EV3B) as well as those relative to the comparison between D0 and D3 for both WT and Trim32 KO samples (Fig. EV3C), with their relative scores. We changed the Results section “Transcriptomic analysis of Trim32 WT and Trim32 KO clones along early differentiation” as reported below and modified the legends accordingly.

      The paragraph now reads: Based on our initial observations, the absence of Trim32 already exerts a significant impact by day 3 (D3) of C2C12 myogenic differentiation. To investigate how Trim32 influences early global transcriptional changes during the proliferative phase (D0) and early differentiation (D3), we performed an unbiased transcriptomic profiling of WT and Trim32 KO clones (Fig. 2A). Multidimensional Scaling (MDS) analysis revealed clear segregation of gene expression profiles based on both time of differentiation (Dim1, 44% variance) and Trim32 genotype (Dim2, 16% variance) (Fig. 2A). Likewise, hierarchical clustering grouped WT and Trim32 KO clones into distinct clusters at both timepoints, indicating consistent genotype-specific transcriptional differences (Fig. EV3A). Differentially Expressed Genes (DEGs) were detected in the Trim32 KO transcriptome relative to WT, at both D0 and D3. In proliferating conditions, 72 genes were upregulated and 189 were downregulated whereas at D3 of differentiation, 72 genes were upregulated and 212 were downregulated. Ingenuity Pathway Analysis of the DEGs revealed the top 10 Canonical Pathways displayed in Fig. EV3B as enriched at either D0 or D3 (Fig. EV3B). Several of these pathways can underscore relevant Trim32-mediated functions though most of them represent generic functions not immediately attributable to the observed myogenesis defects.

      Notably, the transcriptional divergence between WT and Trim32 KO cells is more pronounced at D3, as evidenced by a greater separation along the MDS Dim2 axis, suggesting that Trim32-dependent transcriptional regulation intensifies during early differentiation (Fig. 2A). Given our interest in the differentiation process, we therefore focused our analyses comparing the changes occurring from D0 to D3 in WT (WT D3 vs. D0) and in Trim32 KO (KO D3 vs. D0) RNAseq data.

      Pathway enrichment analysis of D3 vs. D0 DEGs allowed the selection of the top-scored pathways for both WT and Trim32 KO data. We obtained 18 top-scored pathways enriched in each genotype (-log(p-value) ≥ 9 cut-off): 14 are shared while 4 are top-ranked only in WT and 4 only in Trim32 KO (Fig. EV3C). For the following analyses, we employed thus a total of 22 distinct pathways and to better mine those relevant in the passage from the proliferation stage to the early differentiation one and that are affected by the lack of Trim32, we built a bubble plot comparing side-by-side the scores and enrichment of the 22 selected top-scored pathways above in WT and Trim32 KO (Fig. 2B). A heatmap of DEGs included within these selected pathways confirms the clustering of the samples considering both the genotypes and the timepoints highlighting gene expression differences (Fig. 2C). These pathways are mainly related to muscle development, cell cycle regulation, genome stability maintenance and few other metabolic cascades.

      As expected given the results related to Figure 1, moving from D0 to D3 WT clones showed robust upregulation of key transcripts associated with the Inactive Sarcomere Protein Complex, a category encompassing most genes in the “Striated Muscle Contraction” pathway, while in Trim32 KO clones this pathway was not among those enriched in the transition from D0 to D3 (Fig. EV3C). Detailed analyses of transcripts enclosed within this pathway revealed that on the transition from proliferation to differentiation, WT clones show upregulation of several Myosin Heavy Chain isoforms (e.g., MYH3, MYH6, MYH8), α-Actin 1 (ACTA1), α-Actinin 2 (ACTN2), Desmin (DES), Tropomodulin 1 (TMOD1), and Titin (TTN), a pattern consistent with previous reports, while these same transcripts were either non-detected or only modestly upregulated in Trim32 KO clones at D3 (Fig. 2D). This genotype-specific disparity was further confirmed by gene set enrichment barcode plots, which demonstrated significant enrichment of these muscle-related transcripts in WT cells (FDR_UP = 0.0062), but not in Trim32 KO cells (FDR_UP = 0.24) (Fig. EV3D). These findings support an early transcriptional basis for the impaired myogenesis previously observed in Trim32 KO cells.

      In addition to differences in muscle-specific gene expression, we observed that also several pathways related to cell proliferation and cell cycle regulation were more enriched in Trim32 KO cells compared to WT. This suggests that altered cell proliferation may contribute to the distinct differentiation behavior observed in Trim32 KO versus WT (Fig. 2B). Given that cell cycle exit is a critical prerequisite for the onset of myogenic differentiation and considering that previous studies on Trim32 role in cell cycle regulation have reported inconsistent findings, we further examined cell cycle dynamics under our experimental conditions to clarify Trim32 contribution to this process.

      The work would be greatly strengthened by the inclusion of LGMDR8 primary cells, and rescue experiments of TRIM32 to explore myogenesis.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      Also, EU (5-ethynyl uridine) pulse-chase experiments to label nascent and stable RNA coupled with MYC pulldowns and qPCR (or RNA-sequencing of both pools) would further enhance the claim that MYC stability is being affected.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      "On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025)." Also address and discuss the following, as what is currently written is not entirely accurate: https://www.embopress.org/doi/full/10.1038/s44319-024-00299-z and https://journals.physiology.org/doi/prev/20250724-aop/abs/10.1152/ajpcell.00528.2025

      Authors’ response. We thank the Reviewer for bringing these two publications to our attention; they indeed add important data on the in vivo complexity of the role of c-Myc in myogenesis. We included this point in our Discussion.

      The Discussion now reads: “On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025). Other reports, instead, demonstrated the implication of c-Myc periodic pulses, mimicking resistance-exercise, in muscle growth, a role that cannot though be observed in our experimental model (Edman et al., 2024; Jones et al., 2025).”

      Minor Comments:

      Z-score scale used in the pathway bubble plot (Figure 2C) could benefit from alternative color choices. Current gradient is a bit muddy and clarity for the reader could be improved by more distinct color options, particularly in the transition from positive to negative Z-score.

      Authors’ response. As suggested, we modified the z-score-representing colors using a more distinct gradient especially in the positive to negative transition in Figure 2B.

      Clarification on the rationale for selecting the "top 18" pathways would be helpful, as it is not clear if this cutoff was chosen arbitrarily or reflects a specific statistical or biological threshold.

      Authors’ response. As now better explained (see comment regarding Major point: Transcriptomics), we used a cut-off of -log(p-value) above or equal to 9 for pathways enriched in DEGs of the D0 vs D3 comparison for both WT and Trim32 KO. The threshold is now included in the Results section and the pathways (shared between WT and Trim32 KO and unique) are listed as Fig. EV3C.
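
      The selection logic described in this response can be illustrated with a minimal sketch. The pathway names and scores below are hypothetical placeholders, not values from the manuscript; only the cut-off (-log(p-value) of at least 9) and the resulting counts (18 per genotype, 14 shared, 22 distinct) follow the text, and the function name `top_scored` is an assumption made for illustration.

```python
CUTOFF = 9.0  # -log(p-value) threshold quoted in the response


def top_scored(scores, cutoff=CUTOFF):
    """Return the names of pathways whose enrichment score meets the cutoff."""
    return {name for name, score in scores.items() if score >= cutoff}


# Hypothetical scores: 14 pathways pass in both genotypes, 4 pass only in WT,
# and 4 pass only in Trim32 KO.
shared = {f"shared_{i}": 12.0 for i in range(14)}
wt_scores = {
    **shared,
    **{f"wt_only_{i}": 9.0 for i in range(4)},
    **{f"ko_only_{i}": 6.0 for i in range(4)},
}
ko_scores = {
    **shared,
    **{f"wt_only_{i}": 6.0 for i in range(4)},
    **{f"ko_only_{i}": 9.0 for i in range(4)},
}

wt_top = top_scored(wt_scores)
ko_top = top_scored(ko_scores)

counts = {
    "wt": len(wt_top),                 # pathways passing the cutoff in WT
    "ko": len(ko_top),                 # pathways passing the cutoff in KO
    "shared": len(wt_top & ko_top),    # pathways passing in both
    "distinct": len(wt_top | ko_top),  # union used for the bubble plot
}
print(counts)
```

      With these placeholder inputs, the union of the two top-scored sets contains 18 + 18 - 14 = 22 pathways, which is how a "top 18" list per genotype yields 22 distinct pathways overall.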

      The authors alternate between using "Trim 32 KO clones" and "KO clones" throughout the manuscript. Consistent terminology across figures and text would improve readability.

      Authors’ response. We thank the Reviewer for this remark, and we apologise for having overlooked it. We amended this throughout the manuscript by always using for clarity “Trim32 KO clones/cells”.

      Cell culture methodology does not specify passage number or culture duration (only "At confluence") before differentiation. This is important, as C2C12 differentiation potential can drift with extended passaging.

      Authors’ response. We agree with the Reviewer that C2C12 passaging can reduce the differentiation potential of this myoblast cell line; this is indeed the main reason why we decided to employ WT clones, which underwent the same editing process as those that resulted mutated in the Trim32 gene, as reference controls throughout our study. We apologise for not indicating the passages in the first version of the manuscript, which is now amended in the Methods section as shown below:

      The C2C12 parental cells used in this study were maintained within passages 3–8. All clonal cell lines (see below) were utilized within 10 passages following gene editing. In all experiments, WT and Trim32 KO clones of comparable passage numbers were used to ensure consistency and minimize passage-related variability.

      Reviewer #2 (Significance (Required)):

      General Assessment:

      This study provides a thorough investigation of Trim32's role in the processes related to skeletal muscle differentiation using a CRISPR-Cas9 knockout C2C12 model. The strengths of this study lie in the multi-layered experimental approach, as the authors incorporated transcriptomics, cell cycle profiling, and stability assays, which collectively build a strong case for their hypothesis that Trim32 is a key factor in the normal regulation of myogenesis. The work is also strengthened by the use of multiple biological and technical replicates, particularly the independent KO clones, which help address potential clonal variation issues that could occur. The largest limitation of this study is that, while the c-Myc mechanism is well explored, the other Trim32-dependent pathways associated with the disruption (implicated by the incomplete rescue by c-Myc knockdown) are not as well addressed. Overall, however, the study convincingly identifies a critical function for Trim32 during skeletal muscle differentiation.

      Advance:

      To my knowledge, this is the first study to demonstrate that c-Myc regulation by Trim32 occurs at the level of mRNA stability, rather than through ubiquitin-mediated protein degradation. This work will advance the current understanding and provide a more complete picture of Trim32's role in c-Myc regulation. Beyond c-Myc, this work highlights the idea that TRIM family proteins can influence RNA stability, which could implicate a broader role in RNA biology and has potential for future therapeutic targeting.

      Audience:

      This research will be of interest to an audience that focuses on broad skeletal muscle biology, but primarily to readers with more focused research interests such as myogenesis and neuromuscular disease (LGMDR8 in particular), where the defined Trim32 governance over early differentiation checkpoints will be of interest. It will also provide mechanistic insights to those outside of skeletal muscle who study TRIM family proteins, ubiquitin biology, and RNA regulation. For translational/clinical researchers, it identifies the Trim32/c-Myc axis as a potential therapeutic target for LGMDR8 and related muscular dystrophies.

      Expertise:

      My expertise lies in skeletal muscle biology, gene editing, transgenic mouse models, and bioinformatics. I feel confident evaluating the data and conclusions as presented.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      • In this paper, the authors examine the role of TRIM32, implicated in limb girdle muscular dystrophy recessive 8 (LGMDR8), in the differentiation of C2C12 mouse myoblasts. Using CRISPR, they generate mutant and wild-type clones and compare their differentiation capacity in vitro. They report that Trim32-deficient clones exhibit delayed and defective myogenic differentiation. RNA-seq analysis reveals widespread changes in gene expression, although few are validated by independent methods. Notably, Trim32 mutant cells maintain residual proliferation under differentiation conditions, apparently due to a failure to downregulate c-Myc. Translation inhibition experiments suggest that TRIM32 promotes c-Myc mRNA destabilization, but this conclusion is insufficiently substantiated. The authors also perform rescue experiments, showing that c-Myc knockdown in Trim32-deficient cells alleviates some differentiation defects. However, this rescue is not quantified, was conducted in only two of the three knockout lines, and is supported by inappropriate statistical analysis of gene expression. Overall, the manuscript in its current form has substantial weaknesses that preclude publication. Beyond statistical issues, the major concerns are: (1) exclusive reliance on the immortalized C2C12 line, with no validation in primary/satellite cells or in vivo, (2) insufficient mechanistic evidence that TRIM32 acts directly on c-Myc mRNA, and (3) overinterpretation of disease relevance in the absence of supporting patient or in vivo data. Please find more details below:

      We thank the Reviewer for the in-depth assessment of our work and valuable suggestions to improve the manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental efforts, will be addressed as detailed in the Revision Plan.

      - TRIM32 complementation / rescue experiments to exclude clonal or off-target CRISPR effects and show specificity are lacking.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      - The authors link their in vitro findings to LGMDR8 pathogenesis and propose that the Trim32-c-Myc axis may serve as a central regulator of muscle regeneration in the disease. However, LGMDR8 is a complex disorder, and connecting muscle wasting in patients to differentiation assays in C2C12 cells is difficult to justify. No direct evidence is provided that the proposed mRNA mechanism operates in patient-derived samples or in mouse satellite cells. Moreover, the partial rescue achieved by c-Myc knockdown (which does not fully restore myotube morphology or differentiation index) further suggests that the disease connection is not straightforward. Validation of the TRIM32-c-Myc axis in a physiologically relevant system, such as LGMD patient myoblasts or Trim32 mutant mouse cells, would greatly strengthen the claim.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      -Some gene expression changes from the RNA-seq study in Figure 2 should be validated by qPCR

      Authors’ response. We thank the reviewer for this suggestion. This point will be addressed as detailed in the Revision Plan. We have selected several transcripts that will be evaluated in independent samples in order to validate the RNAseq results.

      - The paper shows siRNA knockdown of c-Myc in KO restores Myogenin RNA/protein but does not fully rescue myotube morphology or differentiation index. This suggests that Trim32 controls additional effectors beyond c-Myc; yet the authors do not pursue other candidate mediators identified in the RNA-seq. The manuscript would be strengthened by systematically testing whether other deregulated transcripts contribute to the phenotype.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      - There are concerns with experimental/statistical issues and insufficient replicate reporting. The authors use unpaired two-tailed Student's t-test across many comparisons; multiple testing corrections or ANOVA where appropriate should be used. In Figure EV5B and Figure 6B, the authors perform statistical analyses with control values set to 1. This method masks the inherent variability between experiments and artificially augments p values. Control sample values need to be normalized to one another to have reliable statistical analysis. Myotube morphology and differentiation index quantifications need clear description of fields counted, blind analysis, and number of biological replicates.

      Authors’ response. We thank the Reviewer for raising this point.

      Regarding the replicates, we clarified in the Methods and Legends that the Trim32 KO experiments have been performed on 3 biological replicates (independent clones) and the same for the reference control (3 independent WT clones), except for the Fig. 6 experiments that were performed on 2 Trim32 KO and 2 WT clones. All the Western Blots, immunofluorescence, qPCR data are representative of the results of at least 3 independent experiments unless otherwise stated. We reported the number and type of replicates as well as the microscope fields analyzed.

      We repeated the statistical analyses of the data in Figure 5G, EV5D and EV5E using the more appropriate two-way ANOVA test, as suggested, and we now report this information in the graphs and legends.

      We thank the Reviewer for raising this point. We agree and have replaced the graphs in Fig. EV5B and 6B with versions in which the control values are normalised as suggested. The statistical analyses now reflect this change.
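
      The normalisation issue raised by the Reviewer can be illustrated with a minimal sketch. The numbers below are hypothetical and are not taken from the manuscript; the names `control` and `treated` are illustrative only. Dividing each experiment by its own control forces every control value to exactly 1 and removes its spread, whereas dividing by the mean of the controls keeps the between-experiment variability visible to the statistical test.

```python
from statistics import mean, stdev

# Hypothetical raw expression values (arbitrary units), one per experiment.
# These are illustrative numbers, NOT data from the manuscript.
control = [0.8, 1.0, 1.3]
treated = [1.6, 2.1, 2.4]

# Per-experiment normalisation: every control becomes exactly 1,
# so the control group has no variance left.
per_exp_control = [c / c for c in control]
per_exp_treated = [t / c for t, c in zip(treated, control)]

# Normalisation to the mean of the controls: the control group
# averages 1 but retains its experiment-to-experiment spread.
ctrl_mean = mean(control)
mean_norm_control = [c / ctrl_mean for c in control]
mean_norm_treated = [t / ctrl_mean for t in treated]

print("SD of controls, per-experiment normalisation:", stdev(per_exp_control))
print("SD of controls, mean normalisation:", stdev(mean_norm_control))
```

      With the second scheme the control group contributes its own variance to any subsequent test, which is the behaviour the Reviewer asked for.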

      -Some English mistakes require additional read-throughs. For example: "Indeed, Trim32 has no effect on the stability of c-Myc mRNA in proliferating conditions, but upon induction of differentiation the stability of c-Myc mRNA resulted enhanced in Trim32 KO clones (Fig. 5G, Fig. EV5D and 5E)."

      Authors’ response. We re-edited this revised version of the manuscript as suggested.

      -Results in Figure 5A should be quantified

      Authors’ response. We addressed this point by quantifying the results shown in Fig. 5A and adding to the Figure a graph of the quantification of 3 experimental replicates. Quantification confirms that no statistically significant difference is observed. The Figure and its legend have been modified accordingly.

      -Based on the nuclear marker p84, the separation of cytoplasmic and nuclear fractions is not ideal in Figure 5D

      Authors’ response. We agree with the Reviewer that the presence of p84 in the cytoplasmic fraction is not ideal. Regrettably, we observed this faint p84 band in all the experiments performed. We think, however, that this does not affect the result, which clearly shows that c-Myc and Trim32 are never detected in the same compartment.

      -In Figure 6, it is not appropriate to perform statistical analyses on only two data points per condition.

      Authors’ response. We agree with the Reviewer and we now show the graph of the results of the 3 technical replicates for 2 biological replicates and do not indicate any statistics (Fig. 6B). The graph was also modified according to a previous point raised.

      -The nuclear MYOG phenotype is very interesting; could this be related to requirements of TRIM32 in fusion?

      Authors’ response. We agree with the Reviewer that Trim32 might also be necessary for myoblast fusion. This point is however beyond the scope of the present study and will be addressed in future work.

      - The hypothesis that TRIM32 destabilizes c-Myc mRNA is intriguing but requires stronger mechanistic support. This would be more convincing with RNA immunoprecipitation to test direct association with c-Myc mRNA, and/or co-immunoprecipitation to identify interactions between TRIM32 and proteins involved in mRNA stability. The study would also be strengthened by reporter assays, such as c-Myc 3′UTR luciferase constructs in WT and KO cells, to directly demonstrate 3′UTR-dependent regulation of mRNA stability.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      Reviewer #3 (Significance (Required)):

      The manuscript presents a minor conceptual advance in understanding TRIM32 function in myogenic differentiation. Its main limitation is that all experiments were performed in C2C12 cells. While C2C12 are a classical system to study muscle differentiation, they are an immortalized, long-cultured, and genetically unstable line that represents a committed myoblast stage rather than bona fide satellite cells. They therefore do not fully model the biology of early regenerative responses. Several TRIM32 phenotypes reported in the literature differ between primary satellite cells and cell lines, and the authors themselves note such discrepancies. Extrapolating these findings to LGMDR8 pathogenesis without validation in primary human myoblasts, satellite cell assays, or in vivo regeneration models is therefore not justified. Previous work has already established clear roles for TRIM32 in mouse satellite cells in vivo and in patient myoblasts in vitro, whereas this study introduces a novel link to c-Myc regulation during differentiation. In addition, without mechanistic evidence, the central claim that TRIM32 regulates c-Myc mRNA stability remains descriptive and incomplete. Nevertheless, the results will be of interest to researchers studying LGMD and to those exploring TRIM32 biology in broader contexts. I review this manuscript as a muscle biologist with expertise in satellite cell biology and transcriptional regulation.

      Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Reply to the Reviewers

      I thank the Referees for their...

      Referee #1

      1. The authors should provide more information when...

      Responses + The typical domed appearance of a hydrocephalus-harboring skull is apparent as early as P4, as shown in a new side-by-side comparison of pups at that age (Fig. 1A). + Though this is not stated in the MS 2. Figure 6: Why has only...

      Response: We expanded the comparison

      Minor comments:

      1. The text contains several...

      Response: We added...

      Referee #2

    1. it may over- or

      You mean for periodic populations? Usually it is assumed to overestimate, as we generally assume a trend in the population. I think this should be made clear, the proof for this should be in Matern's paper from 1960.

    1. The San people of the Kalahari Desert in southern Africa are one remaining such group. What thoughts come to mind when you see a picture of hunter-gatherers? Most Westerners see such groups as primitive, backward, or underdeveloped. We may think of hunter-gatherers as “less developed” than city dwellers in New York or London. Whether we are conscious of it or not, we likely place people on a continuum of development, a scale typically linked to indicators of material well-being. What criteria do we use to measure development in our mind, and why do we use these criteria? Development implies progress, but progress in what? Does development mean amassing wealth? Does development mean access to clean water and a steady food supply? Can people be poor and developed at the same time? While we may perceive hunter-gatherers as primitive or underdeveloped, are hunter-gatherers necessarily worse off than we are? Studies suggest that one group of San spent 12 to 19 hours per week working to obtain food as compared to the 40-some-hour workweek of most people in the so-called developed world.

      The discussion about the San people really challenges the idea of them being “underdeveloped.” Honestly, I feel like they actually use their time really well and live in a sustainable way that works for them. In some ways, that makes them more economically balanced than people might assume when they call them “underdeveloped.”

  4. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. Evolution of cetaceans. November 2023. Page Version ID: 1186568602. URL: https://en.wikipedia.org/w/index.php?title=Evolution_of_cetaceans&oldid=1186568602 (visited on 2023-12-08). [l2] Nobu Tamura. Spinops. 2023. URL: http://spinops.blogspot.com/ (visited on 2023-12-13). [l3] The Selfish Gene. December 2023. Page Version ID: 1188207750. URL: https://en.wikipedia.org/w/index.php?title=The_Selfish_Gene&oldid=1188207750 (visited on 2023-12-08). [l4] Meme. December 2023. Page Version ID: 1187840093. URL: https://en.wikipedia.org/w/index.php?title=Meme&oldid=1187840093#Etymology (visited on 2023-12-08). [l5] Oliver Tearle. Who Said, ‘A Lie Is Halfway Round the World Before the Truth Has Got Its Boots On’? June 2021. URL: https://interestingliterature.com/2021/06/lie-halfway-round-world-before-truth-boots-on-quote-origin-meaning/ (visited on 2023-12-08). [l6] Tom Standage. Writing on the Wall: Social Media - The First 2,000 Years. Bloomsbury USA, New York, 1st edition edition, October 2013. ISBN 978-1-62040-283-2. [l7] Chain letter. December 2023. Page Version ID: 1188532303. URL: https://en.wikipedia.org/w/index.php?title=Chain_letter&oldid=1188532303 (visited on 2023-12-08). [l8] Pyramid scheme. December 2023. Page Version ID: 1188350070. URL: https://en.wikipedia.org/w/index.php?title=Pyramid_scheme&oldid=1188350070 (visited on 2023-12-08). [l9] Chain Letters. November 1999. URL: https://cs.uwaterloo.ca/~mli/chain.html (visited on 2023-12-08). [l10] Janus Sandsgaard. Sourdough starter. April 2014. URL: https://commons.wikimedia.org/wiki/File:Sourdough.jpg (visited on 2023-12-08). [l11] Nutrition Health, Food Safety &. Dutch Oven sourdough bread. September 2020. URL: https://commons.wikimedia.org/wiki/File:Dutch_Oven_Sourdough_Bread_2.jpg (visited on 2023-12-08). [l12] Carl Griffith's sourdough starter. November 2022. Page Version ID: 1120864146. 
URL: https://en.wikipedia.org/w/index.php?title=Carl_Griffith%27s_sourdough_starter&oldid=1120864146 (visited on 2023-12-08). [l13] Monica Lewinsky. December 2023. Page Version ID: 1187944516. URL: https://en.wikipedia.org/w/index.php?title=Monica_Lewinsky&oldid=1187944516 (visited on 2023-12-08). [l14] Monica Lewinsky (she/her) [@MonicaLewinsky]. 👀. May 2021. URL: https://twitter.com/MonicaLewinsky/status/1395734868407984136 (visited on 2023-12-08). [l15] Clinton–Lewinsky scandal. November 2023. Page Version ID: 1187645037. URL: https://en.wikipedia.org/w/index.php?title=Clinton%E2%80%93Lewinsky_scandal&oldid=1187645037 (visited on 2023-12-08). [l16] Matt Stopera. Monica Lewinsky Has Been Making Jokes About The Clinton Impeachment For Years, And It Really Is Funny Every Single Time. BuzzFeed, September 2021. URL: https://www.buzzfeed.com/mjs538/monica-lewinsky-twitter-comebacks (visited on 2023-12-08). [l17] Aja Romano. This is why there are jokes about plums all over your Twitter feed. Vox, December 2017. URL: https://www.vox.com/2017/12/1/16723210/this-is-just-to-say-plums-twitter-baby-shoes (visited on 2023-12-08). [l18] Ecological niche. October 2023. Page Version ID: 1182139023. URL: https://en.wikipedia.org/w/index.php?title=Ecological_niche&oldid=1182139023 (visited on 2023-12-08). [l19] Tanya Chen. A 27-Year-Old Composer Has Inspired One Of The Most Epic And Delightful Duet Chains On TikTok. BuzzFeed News, October 2020. URL: https://www.buzzfeednews.com/article/tanyachen/epic-tiktok-chain-musical-fighting-in-a-grocery-store (visited on 2023-12-08). [l20] Natalie [@historyadjunct]. Without downloading any new pics, what’s your energy going into 2022? January 2022. URL: https://twitter.com/historyadjunct/status/1477282737430147073 (visited on 2023-12-09). [l21] Star Wars Kid. December 2008. URL: https://knowyourmeme.com/memes/star-wars-kid (visited on 2023-12-08). [l22] Rebecca Black - Friday. March 2011. 
URL: https://knowyourmeme.com/memes/rebecca-black-friday (visited on 2023-12-08). [l23] Bean Dad. January 2021. URL: https://knowyourmeme.com/memes/events/bean-dad (visited on 2023-12-08). [l24] Twitter's Main Character. September 2020. URL: https://knowyourmeme.com/memes/twitters-main-character (visited on 2023-12-08). [l25] Dennis Lee. I made that viral Spaghettio pie that everyone is crapping themselves over. January 2021. URL: https://foodisstupid.substack.com/p/i-made-that-viral-spaghettio-pie (visited on 2023-12-08). [l26] Gina Vaynshteyn. I Made The Viral SpaghettiO And Milk Pie So That You Don’t Have To. February 2021. URL: https://www.scarymommy.com/spotted/spaghettio-pie (visited on 2023-12-08). [l27] Ryan Broderick. Your Least Favorite Gross Viral Food Videos Are All Connected to This Guy. Eater, May 2021. URL: https://www.eater.com/2021/5/11/22430383/why-are-gross-viral-food-videos-popular-rick-lax-facebook-watch (visited on 2023-12-08). [l28] Rowland Manthorpe. It's the attention economy, stupid: why Trump represents the future whether we like it or not. Wired UK, 2016. URL: https://www.wired.co.uk/article/us-president-donald-trump-attention-economy (visited on 2023-12-08). [l29] Nat King Cole. Nature Boy. March 1948. URL: https://genius.com/Nat-king-cole-nature-boy-lyrics (visited on 2023-12-08). [l30] This Looks Like A Cavalcade Of Beggars Sin And Wine Lyrics. November 2021. URL: https://thegeniuslyrics.com/this-looks-like-a-cavalcade-of-beggars-sin-and-wine-lyrics/ (visited on 2023-12-08). [l31] Morgan Sung. Their children went viral. Now they wish they could wipe them from the internet. NBC News, November 2022. URL: https://www.nbcnews.com/pop-culture/influencers-parents-posting-kids-online-privacy-security-concerns-rcna55318 (visited on 2023-12-08). [l32] The Onion. ‘Do You Mind If I Put You In My TikTok?’ Asks Younger Cousin About To Ruin Your Life. The Onion, November 2019. 
URL: https://www.theonion.com/do-you-mind-if-i-put-you-in-my-tiktok-asks-younger-c-1840052744 (visited on 2023-12-08). [l33] Central Park birdwatching incident. December 2023. Page Version ID: 1188867291. URL: https://en.wikipedia.org/w/index.php?title=Central_Park_birdwatching_incident&oldid=1188867291 (visited on 2023-12-08). [l34] Murder of George Floyd. December 2023. Page Version ID: 1188546892. URL: https://en.wikipedia.org/w/index.php?title=Murder_of_George_Floyd&oldid=1188546892 (visited on 2023-12-08). [l35] Taylor Lorenz. Elon Musk: Memelord or Meme Lifter? The New York Times, May 2021. URL: https://www.nytimes.com/2021/05/07/style/elon-musk-memes.html (visited on 2023-12-08). [l36] Miles Klee. Tesla CEO Elon Musk stole my meme. SFGATE, April 2021. URL: https://www.sfgate.com/tech/article/2021-04-elon-musk-twitter-covid-19-meme-tesla-ceo-16118139.php (visited on 2023-12-08). [l37] Matt Novak. 18 Jokes Elon Musk Stole From His Fans On Twitter. URL: https://www.forbes.co

      I looked at [l48] “We Need to Talk About Digital Blackface in GIFs” from Teen Vogue (2017). This article really stood out to me because it explains how using GIFs of Black people to express exaggerated emotions can unintentionally repeat old stereotypes — similar to how blackface mocked Black expression in the past. What I found powerful was how it connected something as casual as sending a reaction GIF to deeper issues of race and representation online.

      This source made me think about how easy it is to participate in cultural appropriation without realizing it. It also connects to the chapter’s point about “copying” — that not all copying is harmless or funny; sometimes it carries history and meaning that needs to be respected. I think this article pushes readers to be more self-aware and ethical about what we share, even in small everyday actions on social media.

    1. One challenge of designing good A/B tests is ensuring that the results can be trusted. Industry is also still learning how to design good experiments (Riche, Y. (2016). A/B testing vs. User Experience Research. LinkedIn); most A/B tests fail to meet even minimum standards of the kinds of randomized controlled experiments used in science.

      I agree that while A/B testing can help provide evidence of causality, there may be issues in verifying if the results can be trusted or not. This makes me think about concepts such as validity. How do we know that the results are because of the specified variable, and not other extraneous variables that may have influenced the results?

    1. The information you share online can last a long time and may be seen by thousands of people all around the world.

      This is one of the scariest parts about the internet. I don't think we ever truly understand this. I have always tried to be very careful online, as I'm afraid that posts or shares connected to my name could be misunderstood. There are so many stories of people losing their jobs over posts from years before. I wonder how this will continue in the future. With AI, has it become more common for companies to quickly scan our identity on the web?

  5. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Summary

      The authors develop a set of biophysical models to investigate whether a constant area hypothesis or a constant curvature hypothesis explains the mechanics of membrane vesiculation during clathrin-mediated endocytosis.

      Strengths

      The models that the authors choose are fairly well-described in the field and the manuscript is well-written.

      Thank you for your positive comments on our work.

      Weaknesses

      One thing that is unclear is what is new with this work. If the main finding is that the differences are in the early stages of endocytosis, then one wonders if that should be tested experimentally. Also, the role of clathrin assembly and adhesion are treated as mechanical equilibrium but perhaps the process should not be described as equilibria but rather a time-dependent process. Ultimately, there are so many models that address this question that without direct experimental comparison, it's hard to place value on the model prediction.

      Thank you for your insightful questions. We fully agree that distinguishing between the two models should ultimately be guided by experimental tests. This is precisely the motivation for including Fig. 5 in our manuscript, where we compare our theoretical predictions with experimental data. In the middle panel of Fig. 5, we observe that the predicted tip radius as a function of 𝜓<sub>𝑚𝑎𝑥</sub> from the constant curvature model (magenta curve) deviates significantly from both the experimental data points and the rolling median, highlighting the inconsistency of this model with the data.

      Regarding our treatment of clathrin assembly and membrane adhesion as mechanical equilibrium processes, our reasoning is based on a timescale separation argument. Clathrin assembly typically occurs over approximately 1 minute. In contrast, the characteristic relaxation time for a lipid membrane to reach mechanical equilibrium is given by 𝜏 = 𝜇𝑅<sub>0</sub><sup>2</sup>/𝜅, where 𝜇 ∼ 5 × 10<sup>-9</sup> 𝑁𝑠𝑚<sup>-1</sup> is the membrane viscosity, 𝑅<sub>0</sub> = 50 𝑛𝑚 is the vesicle size, and 𝜅 = 20 𝑘<sub>𝐵</sub>𝑇 is the bending rigidity. This yields a relaxation time of 𝜏 ≈ 1.5 × 10<sup>−4</sup> 𝑠, which is several orders of magnitude shorter than the timescale of clathrin assembly. Therefore, it is reasonable to treat the membrane shape as being in mechanical equilibrium throughout the assembly process.
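
      The timescale estimate above can be checked with a short calculation. This is a minimal sketch under two assumptions that are not stated in the response: that the relaxation time scales as 𝜏 = 𝜇𝑅<sub>0</sub><sup>2</sup>/𝜅 (the combination of the quoted parameters that reproduces the quoted value), and that 𝑘<sub>𝐵</sub>𝑇 is evaluated at 300 K.

```python
# Order-of-magnitude check of the membrane relaxation time.
# Assumptions (not from the response): tau = mu * R0**2 / kappa, and T = 300 K.
K_B = 1.380649e-23       # Boltzmann constant, J/K
T = 300.0                # assumed temperature, K

mu = 5e-9                # membrane viscosity, N s / m (quoted value)
R0 = 50e-9               # vesicle size, m (quoted value: 50 nm)
kappa = 20 * K_B * T     # bending rigidity, J (quoted value: 20 k_B T)

tau = mu * R0**2 / kappa          # relaxation time, s
assembly_time = 60.0              # clathrin assembly, s ("approximately 1 minute")

print(f"relaxation time tau = {tau:.2e} s")
print(f"assembly time / tau = {assembly_time / tau:.1e}")
```

      Under these assumptions the relaxation time comes out at roughly 1.5 × 10<sup>−4</sup> s, more than five orders of magnitude below the one-minute assembly time, which is the separation the equilibrium treatment relies on.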

      We believe the value of our model lies in the following key novelties:

      (1) Model novelty: We introduce an energy term associated with curvature generation, a contribution that is typically neglected in previous models.

      (2) Methodological novelty: We perform a quantitative comparison between theoretical predictions and experimental data, whereas most earlier studies rely on qualitative comparisons.

      (3) Results novelty: Our quantitative analysis enables us to unambiguously exclude the constant curvature hypothesis based on time-independent electron microscopy data.

      In the revised manuscript (line 141), we have added a statement about why we treat the clathrin assembly as in mechanical equilibrium.

      While an attempt is made to do so with prior published EM images, there is excessive uncertainty in both the data itself as is usually the case but also in the methods that are used to symmetrize the data. This reviewer wonders about any goodness of fit when such uncertainty is taken into account.

      Author response: We thank the reviewer for raising this important point. We agree that there is uncertainty in the experimental data. Our decision to symmetrize the data is based on the following considerations:

      (1) The experimental data provide a one-dimensional membrane profile corresponding to a cross-sectional view. To reconstruct the full two-dimensional membrane surface, we must assume rotational symmetry.

      (2) In addition to symmetrization, we also average membrane profiles within a certain range of 𝜓<sub>𝑚𝑎𝑥</sub> values (see Fig. 5d). This averaging helps reduce the uncertainty (due to biological and experimental variability) inherent to individual measurements.

      (3) To further address the noise in the experimental data, we compare our theoretical predictions not only with individual data points but also with a rolling median, which provides a smoothed representation of the experimental trends.

      These steps are taken to ensure a more robust and meaningful comparison between theory and experiments.
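
      The rolling median mentioned in point (3) can be sketched as follows. This is an illustrative implementation only: the function name `rolling_median`, the centred window, and the window size are assumptions, and the response does not specify how the median in Fig. 5 was actually computed.

```python
from statistics import median

def rolling_median(xs, ys, window):
    """Median of y over a centred window of up to `window` points, ordered by x.

    Hypothetical helper for illustration; windows are truncated at the ends.
    """
    pairs = sorted(zip(xs, ys))          # order the measurements by x
    half = window // 2
    smoothed = []
    for i, (x, _) in enumerate(pairs):
        lo = max(0, i - half)
        hi = min(len(pairs), i + half + 1)
        smoothed.append((x, median(y for _, y in pairs[lo:hi])))
    return smoothed

# Made-up values with one outlier, standing in for tip radius vs. psi_max.
psi = [1, 2, 3, 4, 5]
radius = [10, 100, 12, 11, 13]
for x, y in rolling_median(psi, radius, window=3):
    print(x, y)
```

      Unlike a rolling mean, the median is insensitive to isolated outliers such as the value 100 in the made-up data, which is why it is a common choice for noisy measurements.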

      In the revised manuscript (line 338), we have explained why we have to symmetrize the data:

      “To facilitate comparison between the axisymmetric membrane shapes predicted by the model and the non-axisymmetric profiles obtained from electron microscopy, we apply a symmetrization procedure to the experimental data, which consist of one-dimensional membrane profiles extracted from cross-sectional views, as detailed in Appendix 3 (see also Appendix 3--Fig. 1).”

      Reviewer #2:

      Summary

      In this manuscript, the authors employ theoretical analysis of an elastic membrane model to explore membrane vesiculation pathways in clathrin-mediated endocytosis. A complete understanding of clathrin-mediated endocytosis requires detailed insight into the process of membrane remodeling, as the underlying mechanisms of membrane shape transformation remain controversial, particularly regarding membrane curvature generation. The authors compare constant area and constant membrane curvature as key scenarios by which clathrins induce membrane wrapping around the cargo to accomplish endocytosis. First, they characterize the geometrical aspects of the two scenarios and highlight their differences by imposing coating area and membrane spontaneous curvature. They then examine the energetics of the process to understand the driving mechanisms behind membrane shape transformations in each model. In the latter part, they introduce two energy terms: clathrin assembly or binding energy, and curvature generation energy, with two distinct approaches for the latter. Finally, they identify the energetically favorable pathway in the combined scenario and compare their results with experiments, showing that the constant-area pathway better fits the experimental data.

      Thank you for your clear and comprehensive summary of our work.

      Strengths

      The manuscript is well-written, well-organized, and presents the details of the theoretical analysis with sufficient clarity. The calculations are valid, and the elastic membrane model is an appropriate choice for addressing the differences between the constant curvature and constant area models.

      The authors' approach of distinguishing two distinct free energy terms-clathrin assembly and curvature generation-and then combining them to identify the favorable pathway is both innovative and effective in addressing the problem.

      Notably, their identification of the energetically favorable pathways, and how these pathways either lead to full endocytosis or fail to proceed due to insufficient energetic drives, is particularly insightful.

      Thank you for your positive remarks regarding the innovative aspects of our work.

      Weaknesses and Recommendations

      Weakness: Membrane remodeling in cellular processes is typically studied in either a constant area or constant tension ensemble. While total membrane area is preserved in the constant area ensemble, membrane area varies in the constant tension ensemble. In this manuscript, the authors use the constant tension ensemble with a fixed membrane tension, σe. However, they also use a constant area scenario, where 'area' refers to the surface area of the clathrin-coated membrane segment. This distinction between the constant membrane area ensemble and the constant area of the coated membrane segment may cause confusion.

      Recommendation: I suggest the authors clarify this by clearly distinguishing between the two concepts by discussing the constant tension ensemble employed in their theoretical analysis.

      Thank you for raising this question.

      In the revised manuscript (line 136), we have added a sentence, emphasizing the implication of the term “constant area model”:

      “We emphasize that the constant area model refers to the assumption that the clathrin-coated area 𝑎<sub>0</sub> remains fixed. Meanwhile, the membrane tension 𝜎<sub>𝑒</sub> at the base is held constant, allowing the total membrane area 𝐴 to vary in response to deformations induced by the clathrin coat.”

      Weakness: As mentioned earlier, the theoretical analysis is performed in the constant membrane tension ensemble at a fixed membrane tension. The total free energy E_tot of the system consists of membrane bending energy E_b and tensile energy E_t, which depends on membrane tension, σe. Although the authors mention the importance of both E_b and E_t, they do not present their individual contributions to the total energy changes. Comparing these contributions would enable readers to cross-check the results with existing literature, which primarily focuses on the role of membrane bending rigidity and membrane tension.

      Recommendation: While a detailed discussion of how membrane tension affects their results may fall outside the scope of this manuscript, I suggest the authors at least discuss the total membrane area variation and the contribution of tensile energy E_t for the singular value of membrane tension used in their analysis.

      Thank you for the insightful suggestion. In the revised manuscript (line 916), we have added Appendix 6 and a supplementary figure to compare the bending energy 𝐸<sub>𝑏</sub> and the tension energy 𝐸<sub>𝑡</sub>. Our analysis shows that both energy components exhibit an energy barrier between the flat and vesiculated membrane states, with the tension energy contributing more significantly than the bending energy.

      In the revised manuscript (line 151), we have also added one paragraph explaining why we set the dimensionless tension . This choice is motivated by our use of the characteristic length as the length scale, and as the energy scale. In this way, the dimensionless tension energy is written as

      Where is the dimensionless area.

      Weakness: The authors introduce two different models, (1,1) and (1,2), for generating membrane curvature. Model 1 assumes a constant curvature growth, corresponding to linear curvature growth, while Model 2 relates curvature growth to its current value, resembling exponential curvature growth. Although both models make physical sense in general, I am concerned that Model 2 may lead to artificial membrane bending at high curvatures. Normally, for intermediate bending, ψ > 90, the bending process is energetically downhill and thus proceeds rapidly. However, Model 2's assumption would accelerate curvature growth even further. This is reflected in the endocytic pathways represented by the green curves in the two rightmost panels of Fig. 4a, where the energy steeply increases at large ψ. I believe a more realistic version of Model 2 would require a saturation mechanism to limit curvature growth at high curvatures.

      Recommendation 1: I suggest the authors discuss this point and highlight the pros and cons of Model 2. Specifically, addressing the potential issue of artificial membrane bending at high curvatures and considering the need for a saturation mechanism to limit excessive curvature growth. A discussion on how Model 2 compares to Model 1 in terms of physical relevance, especially in the context of high curvature scenarios, would provide valuable insights for the reader.

      Thank you for raising the question of excessive curvature growth in our models and the constructive suggestion of introducing a saturation mechanism. In the revised manuscript (line 405), following your recommendation, we have added a subsection “Saturation effect at high membrane curvatures” in the discussion to clarify the excessive curvature issue and a possible way to introduce a saturation mechanism:

      “Note that our model involves two distinct concepts of curvature growth. The first is the growth of imposed curvature — referred to here as intrinsic curvature and denoted by the parameter 𝑐<sub>0</sub> — which is driven by the reorganization of bonds between clathrin molecules within the coat. The second is the growth of the actual membrane curvature, reflected by the increasing value of 𝜓<sub>𝑚𝑎𝑥</sub>.

      The latter process is driven by the former.

      Models (1,1) and (1,2) incorporate energy terms (Equation 6) that promote the increase of intrinsic curvature 𝑐<sub>0</sub>, which in turn drives the membrane to adopt a more curved shape (increasing 𝜓<sub>𝑚𝑎𝑥</sub>). In the absence of these energy contributions, the system faces an energy barrier separating a weakly curved membrane state (low 𝜓<sub>𝑚𝑎𝑥</sub>) from a highly curved state (high 𝜓<sub>𝑚𝑎𝑥</sub>). This barrier can be observed, for example, in the red curves of Figure 3(a–c) and in Appendix 6—Figure 1. As a result, membrane bending cannot proceed spontaneously and requires additional energy input from clathrin assembly.

      The energy terms described in Equation 6 serve to eliminate this energy barrier by lowering the energy difference between the uphill and downhill regions of the energy landscape. However, these same terms also steepen the downhill slope, which may lead to overly aggressive curvature growth.

      To mitigate this effect, one could introduce a saturation-like energy term of the form:

      where 𝑐<sub>𝑠</sub> represents a saturation curvature. Importantly, adding such a term would not alter the conclusions of our study, since the energy landscape already favors high membrane curvature (i.e., it is downward sloping) even without the additional energy terms.”

      Recommendation 2: Referring to the previous point, the green curves in the two rightmost panels of Fig. 4a seem to reflect a comparison between slow and fast bending regimes. The initial slow vesiculation (with small curvature growth) in the left half of the green curves is followed by much more rapid curvature growth beyond a certain threshold. A similar behavior is observed in Model 1, as shown by the green curves in the two rightmost panels of Fig. 4b. I believe this transition between slow and fast bending warrants a brief discussion in the manuscript, as it could provide further insight into the dynamic nature of vesiculation.

      Thank you for your constructive suggestion regarding the transition between slow and fast membrane bending. As you pointed out, in both Fig. 4a (model (1,2)) and Fig. 4b (model (1,1)), the green curves tend to extend vertically at the late stage. This suggests a significant increase in 𝑐<sub>0</sub> on the free energy landscape. However, we remain cautious about directly interpreting this vertical trend as indicative of fast endocytic dynamics, since our model is purely energetic and does not explicitly incorporate kinetic details. Meanwhile, we agree with your observation that the steep decrease in free energy along the green curve could correspond to an acceleration in dynamics. To address this point, we have added a paragraph in the revised manuscript (in Subsection “Cooperativity in the curvature generation process”) discussing this potential transition and its consistency with experimental observations (line 395):

      “Furthermore, although our model is purely energetic and does not explicitly incorporate dynamics, we observe in Figure 3(a) that along the green curve, which represents the trajectory predicted by model (1,2), the total free energy (𝐸<sub>𝑡𝑜𝑡</sub>) exhibits a much sharper decrease at the late stage (near the vesiculation line) compared to the early stage (near the origin). This suggests a transition from slow to fast dynamics during endocytosis. Such a transition is consistent with experimental observations, where significantly fewer images with large 𝜓<sub>𝑚𝑎𝑥</sub> are captured compared to those with small 𝜓<sub>𝑚𝑎𝑥</sub> (Mund et al., 2023).”

      The geometrical properties of both the constant-area and constant-curvature scenarios, as depicted in Fig. 1, are somewhat straightforward. I wonder what additional value is presented in Fig. 2. Specifically, the authors solve differential shape equations to show how Rt and Rcoat vary with the angle ψ, but this behavior seems predictable from the simple schematics in Fig. 1. Using a more complex model for an intuitively understandable process may introduce counter-intuitive results and unnecessary complications, as seen with the constant-curvature model where Rt varies (the tip radius is not constant, as noted in the text) despite being assumed constant. One could easily assume a constant-curvature model and plot Rt versus ψ. What is the added value of solving shape equations to measure geometrical properties, compared to a simpler schematic approach (without solving shape equations) similar to what they do in App. 5 for the ratio of the Rt at ψ=30 and 150?

      Thank you for raising this important question. While simple and intuitive theoretical models are indeed convenient to use, their validity must be carefully assessed. The approximate model becomes inaccurate when the clathrin shell significantly deviates from its intrinsic shape, namely a spherical cap characterized by intrinsic curvature 𝑐<sub>0</sub>. As shown in the insets of Fig. 2b and 2c (red line and black points), our comparison between the simplified model and the full model demonstrates that the simple model provides a good approximation under the constant-area constraint. However, it performs poorly under the constant-curvature constraint, and the deviation between the full model and the simplified model becomes more pronounced as 𝑐<sub>0</sub> increases.

      In the revised manuscript, we have added a sentence emphasizing the discrepancy between the exact calculation and the idealized picture for the constant-curvature model (line 181):

      “For the constant-curvature model, the ratio remains close to 1 only at small values of 𝑐<sub>0</sub>, as expected from the schematic representation of the model in Figure 1. However, as 𝑐<sub>0</sub> increases, the deviation from this idealized picture becomes increasingly pronounced.”
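
      The idealized picture referred to here can be sketched numerically. The following is a minimal illustration of the simple spherical-cap geometry only, not of the full shape equations solved in the manuscript. It assumes the coat is an exact spherical cap whose opening angle plays the role of ψ; the function names and the numerical values of the area and curvature are hypothetical.

```python
import math

def cap_area(radius, psi_deg):
    """Area of a spherical cap with the given radius and opening angle psi (degrees)."""
    psi = math.radians(psi_deg)
    return 2.0 * math.pi * radius**2 * (1.0 - math.cos(psi))

def radius_constant_area(area, psi_deg):
    """Cap radius when the coat area is held fixed while psi grows."""
    psi = math.radians(psi_deg)
    return math.sqrt(area / (2.0 * math.pi * (1.0 - math.cos(psi))))

def radius_constant_curvature(c0):
    """Cap radius when the curvature c0 is held fixed: independent of psi."""
    return 1.0 / c0

area = 1.0  # arbitrary units (assumed value for illustration)
c0 = 2.0    # arbitrary units (assumed value for illustration)

# Ratio of the cap radius at psi = 30 degrees to that at psi = 150 degrees
ratio_area = radius_constant_area(area, 30.0) / radius_constant_area(area, 150.0)
ratio_curv = radius_constant_curvature(c0) / radius_constant_curvature(c0)
print(ratio_area, ratio_curv)
```

      In this idealized geometry the constant-area ratio equals 2 + √3 (about 3.73), while the constant-curvature ratio is exactly 1 by construction. The full model is needed precisely because, as stated above, the computed ratio departs from 1 as 𝑐<sub>0</sub> increases.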

      Recommendation: The clathrin-mediated endocytosis aims at wrapping cellular cargos such as viruses which are typically spherical objects which perfectly match the constant-curvature scenario. In this context, wrapping nanoparticles by vesicles resembles constant-curvature membrane bending in endocytosis. In particular analogous shape transitions and energy barriers have been reported (similar to Fig.3 of the manuscript) using similar theoretical frameworks by varying membrane particle binding energy acting against membrane bending:

      DOI: 10.1021/la063522m

      DOI: 10.1039/C5SM01793A

      I think a short comparison to particle wrapping by vesicles is warranted.

      Thank you for your constructive suggestion to compare our model with particle wrapping. In the revised manuscript (line 475), we have added a subsection “Comparison with particle wrapping” in the discussion:

      “The purpose of the clathrin-mediated endocytosis studied in our work is the recycling of membrane and membrane-protein, and the cellular uptake of small molecules from the environment — molecules that are sufficiently small to bind to the membrane or be encapsulated within a vesicle. In contrast, the uptake of larger particles typically involves membrane wrapping driven by adhesion between the membrane and the particle, a process that has also been studied previously (Góźdź, 2007; Bahrami et al., 2016). In our model, membrane bending is driven by clathrin assembly, which induces curvature. In particle wrapping, by comparison, the driving force is the adhesion between the membrane and a rigid particle. In the absence of adhesion, wrapping increases both bending and tension energies, creating an energy barrier that separates the flat membrane state from the fully wrapped state. This barrier can hinder complete wrapping, resulting in partial or no engulfment of the particle. Only when the adhesion energy is sufficiently strong can the process proceed to full wrapping. In this context, adhesion plays a role analogous to curvature generation in our model, as both serve to overcome the energy barrier. If the particle is spherical, it imposes a constant-curvature pathway during wrapping. However, the role of clathrin molecules in this process remains unclear and will be the subject of future investigation.”

      Minor points:

      Line 20, abstract, "....a continuum spectrum ..." reads better.

      Line 46 "...clathrin results in the formation of pentagons ...." seems to be grammatically correct.

      Line 106, proper citation of the relevant literature is warranted here.

      Line 111, the authors compare features (plural) between experiments and calculations. I would write "....compare geometric features calculated by theory with those ....".

      Line 124, "Here, we choose a ..." (with comma after Here).

      Line 134, "The membrane tension \sigma_e and bending rigidity \kappa define a ...."

      Line 295, "....tip radius, and invagination ...." (with comma before and).

      Line 337, "abortive tips, and ..." (with comma before and).

      We thank you for your thorough review of our manuscript and have corrected all the issues raised.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations for the Authors:

      (1) Clarify Mechanistic Interpretations

      (a) Provide stronger evidence or a more cautious interpretation regarding whether intracellular BK-CaV1.3 ensembles are precursors to plasma membrane complexes.

      This is an important point. We adjusted the interpretation regarding intracellular BK-Ca<sub>V</sub>1.3 hetero-clusters as precursors to plasma membrane complexes to reflect a more cautious stance, acknowledging the limitations of available data. We added the following to the manuscript.

      “Our findings suggest that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, shaping their spatial organization and potentially facilitating functional coupling. While this suggests a coordinated process that may contribute to functional coupling, further investigation is needed to determine the extent to which these hetero-clusters persist upon membrane insertion.”

      (b) Discuss the limitations of current data in establishing the proportion of intracellular complexes that persist on the cell surface.

      We appreciate the suggestion. We expanded the discussion to address the limitations of current data in determining the proportion of intracellular complexes that persist on the cell surface. We added the following to the manuscript.

      “Our findings highlight the intracellular assembly of BK-Ca<sub>V</sub>1.3 hetero-clusters, though limitations in resolution and organelle-specific analysis prevent precise quantification of the proportion of intracellular complexes that ultimately persist on the cell surface. While our data confirms that hetero-clusters form before reaching the plasma membrane, it remains unclear whether all intracellular hetero-clusters transition intact to the membrane or undergo rearrangement or disassembly upon insertion. Future studies utilizing live cell tracking and high resolution imaging will be valuable in elucidating the fate and stability of these complexes after membrane insertion.”

      (2) Refine mRNA Co-localization Analysis

      (a) Include appropriate controls using additional transmembrane mRNAs to better assess the specificity of BK and CaV1.3 mRNA co-localization.

      We agree with the reviewers that these controls are essential. We now explain more clearly the controls used to address this concern. We added the following to the manuscript.

      “To explore the origins of the initial association, we hypothesized that the two proteins are translated near each other, which could be detected as the colocalization of their mRNAs (Figure 5A and B). The experiment was designed to detect single mRNA molecules from INS-1 cells in culture. We performed multiplex in situ hybridization experiments using an RNAScope fluorescence detection kit to be able to image three mRNAs simultaneously in the same cell and acquired the images in a confocal microscope with high resolution. To rigorously assess the specificity of this potential mRNA-level organization, we used multiple internal controls. GAPDH mRNA, a highly expressed housekeeping gene with no known spatial coordination with channel mRNAs, served as a baseline control for nonspecific colocalization due to transcript abundance. To evaluate whether the spatial proximity between BK mRNA (KCNMA1) and Ca<sub>V</sub>1.3 mRNA (CACNA1D) was unique to functionally coupled channels, we also tested for Na<sub>V</sub>1.7 mRNA (SCN9A), a transmembrane sodium channel expressed in INS-1 cells but not functionally associated with BK. This allowed us to determine whether the observed colocalization reflected a specific biological relationship rather than shared expression context. Finally, to test whether this proximity might extend to other calcium sources relevant to BK activation, we probed the mRNA of ryanodine receptor 2 (RyR2), another Ca<sup>2+</sup> channel known to interact structurally with BK channels [32]. Together, these controls were chosen to distinguish specific mRNA colocalization patterns from random spatial proximity, shared subcellular distribution, or gene expression level artifacts.”

      (b) Quantify mRNA co-localization in both directions (e.g., BK with CaV1.3 and vice versa) and account for differences in expression levels.

      We thank the reviewer for this suggestion. We chose to quantify mRNA co-localization in the direction most relevant to the formation of functionally coupled hetero-clusters, namely, the proximity of BK (KCNMA1) mRNA to Ca<sub>V</sub>1.3 (CACNA1D) mRNA. Since BK channel activation depends on calcium influx provided by nearby Ca<sub>V</sub>1.3 channels, this directional analysis more directly informs the hypothesis of spatially coordinated translation and channel assembly. To address potential confounding effects of transcript abundance, we implemented a scrambled control approach in which the spatial coordinates of KCNMA1 mRNAs were randomized while preserving transcript count. This control resulted in significantly lower colocalization with CACNA1D mRNA, indicating that the observed proximity reflects a specific spatial association rather than expression-driven overlap. We also assessed colocalization of CACNA1D with KCNMA1, GAPDH, and SCN9A (Na<sub>V</sub>1.7) mRNAs; as the graph below shows, these data support the same conclusion but were not included in the manuscript.

      Author response image 1.
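
      The scrambled control described above can be illustrated with a short sketch. This is a simplified stand-in under stated assumptions, not the analysis pipeline used in the study: mRNA puncta are treated as 2D points, "colocalized" means lying within a chosen distance of a partner punctum, and scrambling redraws positions uniformly over a rectangular field. The function names, distance threshold, and toy coordinates are all hypothetical.

```python
import math
import random

def colocalized_fraction(spots_a, spots_b, max_dist):
    """Fraction of spots in A with at least one spot of B within max_dist."""
    if not spots_a or not spots_b:
        return 0.0
    hits = sum(
        1 for (ax, ay) in spots_a
        if any(math.hypot(ax - bx, ay - by) <= max_dist for (bx, by) in spots_b)
    )
    return hits / len(spots_a)

def scrambled_fraction(spots_a, spots_b, max_dist, width, height, n_iter=200, seed=0):
    """Mean colocalized fraction after redrawing A uniformly in the field.

    The number of A spots is preserved in every iteration, so transcript
    abundance is held constant while spatial organization is destroyed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iter):
        shuffled = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in spots_a]
        total += colocalized_fraction(shuffled, spots_b, max_dist)
    return total / n_iter

# Toy data (hypothetical): every A spot sits 0.5 units from a B spot.
toy_rng = random.Random(1)
spots_b = [(toy_rng.uniform(0, 100), toy_rng.uniform(0, 100)) for _ in range(20)]
spots_a = [(x + 0.5, y) for (x, y) in spots_b]

observed = colocalized_fraction(spots_a, spots_b, max_dist=1.0)
scrambled = scrambled_fraction(spots_a, spots_b, max_dist=1.0, width=100, height=100)
print(observed, scrambled)
```

      In real data the scrambling region would presumably follow the cell outline rather than a rectangle; the rectangle is used here only to keep the sketch self-contained. An observed fraction well above the scrambled one is what would argue against abundance-driven overlap.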

      (c) Consider using ER labeling as a spatial reference when analyzing mRNA localization

      We thank the reviewers for this suggestion. Rather than using ER labeling as a spatial reference, we assessed BK and CaV1.3 mRNA localization using single-molecule fluorescence in situ hybridization (smFISH) alongside BK protein immunostaining. This approach directly identifies BK-associated translation sites, ensuring that observed mRNA localization corresponds to active BK synthesis rather than general ER association. By evaluating BK protein alongside its mRNA, we provide a more functionally relevant measure of spatial organization, allowing us to assess whether BK is synthesized in proximity to CaV1.3 mRNA within micro-translational complexes. The text added to the manuscript is as follows.

      “To further investigate whether KCNMA1 and CACNA1D are localized in regions of active translation (Figure 7A), we performed RNAScope targeting KCNMA1 and CACNA1D alongside immunostaining for BK protein. This strategy enabled us to visualize transcript-protein colocalization in INS-1 cells with subcellular resolution. By directly evaluating sites of active BK translation, we aimed to determine whether newly synthesized BK protein colocalized with CACNA1D mRNA signals (Figure 7A). Confocal imaging revealed distinct micro-translational complexes where KCNMA1 mRNA puncta overlapped with BK protein signals and were located adjacent to CACNA1D mRNA (Figure 7B). Quantitative analysis showed that 71 ± 3% of all KCNMA1 mRNA colocalized with BK protein signal, indicating that these transcripts are in active translation. Interestingly, 69 ± 3% of the KCNMA1 mRNA in active translation colocalized with CACNA1D (Figure 7C), supporting the existence of functional micro-translational complexes between BK and Ca<sub>V</sub>1.3 channels.”

      (3) Improve Terminology and Definitions

      (a) Clarify and consistently use terms like "ensemble," "cluster," and "complex," especially in quantitative analyses.

      We agree with the reviewers, and we clarified terminology such as 'ensemble,' 'cluster,' and 'complex' and used them consistently throughout the manuscript, particularly in quantitative analyses, to enhance precision and avoid ambiguity.  

      (b) Consider adopting standard nomenclature (e.g., "hetero-clusters") to avoid ambiguity.

      We agree with the reviewers, and we adopted standard nomenclature, such as 'hetero-clusters,' in the manuscript to improve clarity and reduce ambiguity.

      (4) Enhance Quantitative and Image Analysis

      (a) Clearly describe how colocalization and clustering were measured in super-resolution data.

      We thank the reviewers for this suggestion. We have modified the Methods section to provide a clearer description of how colocalization and clustering were measured in our super-resolution data. Specifically, we now detail the image processing steps, including binary conversion, channel multiplication for colocalization assessment, and density-based segmentation for clustering analysis. These updates ensure transparency in our approach and improve accessibility for readers, and we added the following to the manuscript.

      “Super-resolution imaging: 

      Direct stochastic optical reconstruction microscopy (dSTORM) images of BK and Ca<sub>V</sub>1.3 overexpressed in tsA-201 cells were acquired using an ONI Nanoimager microscope equipped with a 100X oil immersion objective (1.4 NA), an XYZ closed-loop piezo stage, and triple emission channels split at 488, 555, and 640 nm. Samples were imaged at 35°C. For single-molecule localization microscopy, fixed and stained cells were imaged in GLOX imaging buffer containing 10 mM β-mercaptoethylamine (MEA), 0.56 mg/ml glucose oxidase, 34 μg/ml catalase, and 10% w/v glucose in Tris-HCl buffer. Single-molecule localizations were filtered using NImOS software (v.1.18.3, ONI). Localization maps were exported as TIFF images with a pixel size of 5 nm. Maps were further processed in ImageJ (NIH) by thresholding and binarization to isolate labeled structures. To assess colocalization between the signal from two proteins, binary images were multiplied. Particles smaller than 400 nm<sup>2</sup> were excluded from the analysis to reflect the spatial resolution limit of STORM imaging (20 nm) and the average size of BK channels. To examine spatial localization preference, binary images of BK were progressively dilated to 20 nm, 40 nm, 60 nm, 80 nm, 100 nm, and 200 nm to expand their spatial representation. These modified images were then multiplied with the Ca<sub>V</sub>1.3 channel to quantify colocalization and determine BK occupancy at increasing distances from Ca<sub>V</sub>1.3. To ensure consistent comparisons across distance thresholds, data were normalized using the 200 nm measurement as the highest reference value, set to 1.”
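
      The dilation-and-multiplication steps in this Methods text can be sketched as follows. This is an illustrative simplification, not the ImageJ workflow used in the study: binary masks are stored as sets of "on" pixel coordinates, so multiplying two binary images becomes a set intersection, and a disk-shaped structuring element is assumed for the dilation. The 400 nm<sup>2</sup> particle-size filter is omitted for brevity. All function names and the toy masks are hypothetical; only the 5 nm pixel size and the list of distances come from the text.

```python
PIXEL_NM = 5  # pixel size of the exported localization maps (from the Methods text)

def disk_offsets(radius_px):
    """Pixel offsets inside a disk of the given radius (the structuring element)."""
    r = int(radius_px)
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= r * r]

def dilate(mask, radius_nm):
    """Grow a binary mask, stored as a set of (row, col) pixels, by radius_nm."""
    offsets = disk_offsets(radius_nm / PIXEL_NM)
    return {(y + dy, x + dx) for (y, x) in mask for (dy, dx) in offsets}

def overlap_area_nm2(mask_a, mask_b):
    """Area where both masks are 'on'; equivalent to multiplying two binary images."""
    return len(mask_a & mask_b) * PIXEL_NM ** 2

def occupancy_profile(bk, cav, distances_nm=(20, 40, 60, 80, 100, 200)):
    """Overlap of dilated BK with Cav at each distance, normalized to the largest one."""
    raw = [overlap_area_nm2(dilate(bk, d), cav) for d in distances_nm]
    reference = raw[-1]
    return [value / reference if reference else 0.0 for value in raw]

# Toy masks (hypothetical): two 20 nm x 20 nm particles separated by a 40 nm gap.
bk = {(y, x) for y in range(4) for x in range(4)}
cav = {(y, x) for y in range(4) for x in range(12, 16)}
profile = occupancy_profile(bk, cav)
print(profile)
```

      For these toy masks the profile is zero at 20 nm and 40 nm and reaches 1 from 60 nm onward, because the nearest edges of the two particles are 45 nm apart centre-to-centre of the bordering pixels. With real data the profile would rise gradually, and its shape indicates how closely BK signal sits to Ca<sub>V</sub>1.3 signal.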

      (b) Where appropriate, quantify the proportion of total channels involved in ensembles within each compartment.

      We thank the reviewers for this comment. However, our method does not allow for direct quantification of the total number of BK and Ca<sub>V</sub>1.3 channels expressed within the ER or ER exit sites, as we rely on proximity-based detection rather than absolute fluorescence intensity measurements of individual channels. Traditional methods for counting total channel populations, such as immunostaining or single-molecule tracking, are not applicable to our approach due to the hetero-cluster formation process. Instead, we focused on the relative proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters within these compartments, as this provides meaningful insights into trafficking dynamics and spatial organization. By assessing where hetero-clusters preferentially localize rather than attempting to count total channel numbers, we can infer whether their assembly occurs before plasma membrane insertion. While this approach does not yield absolute quantification of ER-localized BK and Ca<sub>V</sub>1.3 channels, it remains a robust method for investigating hetero-cluster formation and intracellular trafficking pathways. To reflect this limitation, we added the following to the manuscript.

      “Finally, a key limitation of this approach is that we cannot quantify the proportion of total BK or Ca<sub>V</sub>1.3 channels engaged in hetero-clusters within each compartment. The PLA method provides proximity-based detection, which reflects relative localization rather than absolute channel abundance within individual organelles”.

      (5) Temper Overstated Claims

      (a) Revise language that suggests the findings introduce a "new paradigm," instead emphasizing how this study extends existing models.

      We agree with the reviewers, and we have revised the language to avoid implying a 'new paradigm.' The following is the significance statement.

      “This work examines the proximity between BK and Ca<sub>V</sub>1.3 molecules at the level of their mRNAs and newly synthesized proteins to reveal that these channels interact early in their biogenesis. Two cell models were used: a heterologous expression system to investigate the steps of protein trafficking and a pancreatic beta cell line to study the localization of endogenous channel mRNAs. Our findings show that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, revealing new aspects of their spatial organization. This intracellular assembly suggests a coordinated process that contributes to functional coupling.”

      (b) Moderate conclusions where the supporting data are preliminary or correlative.

      We agree with the reviewers, and we have moderated conclusions in instances where the supporting data are preliminary or correlative, ensuring a balanced interpretation. We added the following to the manuscript. 

      “This study provides novel insights into the organization of BK and Ca<sub>V</sub>1.3 channels in hetero-clusters, emphasizing their assembly within the ER, at ER exit sites, and within the Golgi. Our findings suggest that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, shaping their spatial organization, and potentially facilitating functional coupling. While this suggests a coordinated process that may contribute to functional coupling, further investigation is needed to determine the extent to which these hetero-clusters persist upon membrane insertion. While our study advances the understanding of BK and Ca<sub>V</sub>1.3 hetero-cluster assembly, several key questions remain unanswered. What molecular machinery drives this colocalization at the mRNA and protein level? How do disruptions to complex assembly contribute to channelopathies and related diseases? Additionally, a deeper investigation into the role of RNA binding proteins in facilitating transcript association and localized translation is warranted”.

      (6) Address Additional Technical and Presentation Issues

      (a) Include clearer figure annotations, especially for identifying PLA puncta localization (e.g., membrane vs. intracellular).

      We agree with the reviewers, and we have updated the figures to include clearer annotations that distinguish PLA puncta localized at the membrane versus those within intracellular compartments.

      (b) Reconsider the scale and arrangement of image panels to better showcase the data.

      We agree with the reviewers, and we have adjusted the scale and layout of the image panels to enhance data visualization and readability. Enlarged key regions now provide better clarity of critical features.

      (c) Provide precise clone/variant information for BK and CaV1.3 channels used.

      We thank the reviewers for their suggestion, and we now provide precise information regarding the BK and Ca<sub>V</sub>1.3 channel constructs used in our experiments, including their Addgene plasmid numbers and relevant variant details. These have been incorporated into the Methods section to ensure reproducibility and transparency. We added the following to the manuscript. 

      “The Ca<sub>V</sub>1.3 α subunit construct used in our study corresponds to the rat Ca<sub>V</sub>1.3e splice variant containing exons 8a, 11, 31b, and 42a, with a deletion of exon 32. The BK channel construct used in this study corresponds to the VYR splice variant of the mouse BKα subunit (KCNMA1)”.

      (d) Correct typographical errors and ensure proper figure/supplementary labeling throughout.

      Typographical errors have been corrected, and figure/supplementary labeling has been reviewed for accuracy throughout the manuscript.

      (7) Expand the Discussion

      (a) Include a brief discussion of findings such as BK surface expression in the absence of CaV1.3.

      We thank the reviewers for their suggestion. We expanded the Discussion to include a brief analysis of BK surface expression in the absence of Ca<sub>V</sub>1.3. We included the following in the manuscript. 

      “BK Surface Expression and Independent Trafficking Pathways

      BK surface expression in the absence of Ca<sub>V</sub>1.3 indicates that its trafficking does not strictly rely on Ca<sub>V</sub>1.3-mediated interactions. Since BK channels can be activated by multiple calcium sources, their presence in intracellular compartments suggests that their surface expression is governed by intrinsic trafficking mechanisms rather than direct calcium-dependent regulation. While some BK and Ca<sub>V</sub>1.3 hetero-clusters assemble into signaling complexes intracellularly, other BK channels follow independent trafficking pathways, demonstrating that complex formation is not obligatory for all BK channels. Differences in their transport kinetics further reinforce the idea that their intracellular trafficking is regulated through distinct mechanisms. Studies have shown that BK channels can traffic independently of Ca<sub>V</sub>1.3, relying on alternative calcium sources for activation [13, 41]. Additionally, Ca<sub>V</sub>1.3 exhibits slower synthesis and trafficking kinetics than BK, emphasizing that their intracellular transport may not always be coordinated. These findings suggest that BK and Ca<sub>V</sub>1.3 exhibit both independent and coordinated trafficking behaviors, influencing their spatial organization and functional interactions”.

      (b) Clarify why certain colocalization comparisons (e.g., ER vs. ER exit sites) are not directly interpretable.

      We thank the reviewer for their suggestion. A clarification has been added to the result section and discussion of the manuscript explaining why colocalization comparisons, such as ER versus ER exit sites, are not directly interpretable. We included the following in the manuscript.

      “Result:

      ER was not simply due to the extensive spatial coverage of ER labeling, we labeled ER exit sites using Sec16-GFP and probed for hetero-clusters with PLA. This approach enabled us to test whether the hetero-clusters were preferentially localized to ER exit sites, which are specialized trafficking hubs that mediate cargo selection and direct proteins from the ER into the secretory pathway. In contrast to the more expansive ER network, which supports protein synthesis and folding, ER exit sites ensure efficient and selective export of proteins to their target destinations”.

      “By quantifying the proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters relative to total channel expression at ER exit sites, we found 28 ± 3% colocalization in tsA-201 cells and 11 ± 2% in INS-1 cells (Figure 3F). While the percentage of colocalization between hetero-clusters and the ER or ER exit sites alone cannot be directly compared to infer trafficking dynamics, these findings reinforce the conclusion that hetero-clusters reside within the ER and suggest that BK and Ca<sub>V</sub>1.3 channels traffic together through the ER and exit in coordination”.

      “Colocalization and Trafficking Dynamics

      The colocalization of BK and Ca<sub>V</sub>1.3 channels in the ER and at ER exit sites before reaching the Golgi suggests a coordinated trafficking mechanism that facilitates the formation of multi-channel complexes crucial for calcium signaling and membrane excitability [37, 38]. Given the distinct roles of these compartments, colocalization at the ER and ER exit sites may reflect transient proximity rather than stable interactions. Their presence in the Golgi further suggests that posttranslational modifications and additional assembly steps occur before plasma membrane transport, providing further insight into hetero-cluster maturation and sorting events. By examining BK-Ca<sub>V</sub>1.3 hetero-cluster distribution across these trafficking compartments, we ensure that observed colocalization patterns are considered within a broader framework of intracellular transport mechanisms [39]. Previous studies indicate that ER exit sites exhibit variability in cargo retention and sorting efficiency [40], emphasizing the need for careful evaluation of colocalization data. Accounting for these complexities allows for a robust assessment of signaling complex formation and trafficking pathways”.

      Reviewer #1 (Recommendations for the authors):

      In addition to the general aspects described in the public review, I list below a few points with the hope that they will help to improve the manuscript: 

      (1) Page 3: "they bind calcium delimited to the point of entry at calcium channels", better use "sources" 

      We agree with the reviewer. The phrasing on Page 3 has been updated to use 'sources' instead of 'the point of entry at calcium channels' for clarity.

      (2) Page 3 "localized supplies of intracellular calcium", I do not like this term, but maybe this is just silly.

      We agree with the reviewer. The term 'localized supplies of intracellular calcium' on Page 3 has been revised to “localized calcium sources.”

      (3) Regarding the definitions stated by the authors: How do you distinguish between "ensembles" corresponding to "coordinated collection of BK and Cav channels" and "assembly of BK clusters with Cav clusters"? I believe that hetero-clusters is more adequate. The nomenclature does not respond to any consensus in the protein biology field, and I find that it introduces bias more than it helps. I would stick to heteroclusters nomenclature that has been used previously in the field. Moreover, in some discussion sections, the term "ensemble" is used in ways that border on vague, especially when talking about "functional signaling complexes" or "ensembles forming early." It's still acceptable within context but could benefit from clearer language to distinguish ensemble (structural proximity) from complex (functional consequence).

      We agree with the reviewer, and we recognize the importance of precise nomenclature and have adopted hetero-clusters instead of ensembles to align with established conventions in the field. This term specifically refers to the spatial organization of BK and Ca<sub>V</sub>1.3 channels, while functional complexes denote mechanistic interactions. We have revised sections where ensemble was used ambiguously to ensure clear distinction between structure and function.

      The definition of "cluster" is clearly stated early but less emphasized in later quantitative analyses (e.g., particle size discussions in Figure 7). Figure 8 is equally confusing, graphs D and E referring to "BK ensembles" and "Cav ensembles", but "ensembles" should refer to combinations of both channels, whereas these seem to be "clusters". In fact, the Figure legend mentions "clusters".

      We agree with the reviewer. Terminology has been revised throughout the manuscript to ensure consistency, with 'clusters' used appropriately in quantitative analyses and figure descriptions.

      (4) Methods: how are clusters ("ensembles") analysed from the STORM data? What is the logarithm used for? More info about this is required. Equally, more information and discussion about how colocalization is measured and interpreted in superresolution microscopy are required.

      We thank the reviewer for their suggestion. Additional details have been incorporated into the Methods section to clarify how clusters ('ensembles') are analyzed from STORM data, including the role of the logarithm in processing. Furthermore, we have expanded the discussion to provide more information on how colocalization is measured and interpreted in super-resolution microscopy. We include the following in the manuscript.

      “Direct stochastic optical reconstruction microscopy (dSTORM) images of BK and Ca<sub>V</sub>1.3 overexpressed in tsA-201 cells were acquired using an ONI Nanoimager microscope equipped with a 100X oil immersion objective (1.4 NA), an XYZ closed-loop piezo stage, and triple emission channels split at 488, 555, and 640 nm. Samples were imaged at 35°C. For single-molecule localization microscopy, fixed and stained cells were imaged in GLOX imaging buffer containing 10 mM β-mercaptoethylamine (MEA), 0.56 mg/ml glucose oxidase, 34 μg/ml catalase, and 10% w/v glucose in Tris-HCl buffer. Single-molecule localizations were filtered using NImOS software (v.1.18.3, ONI). Localization maps were exported as TIFF images with a pixel size of 5 nm. Maps were further processed in ImageJ (NIH) by thresholding and binarization to isolate labeled structures. To assess colocalization between the signal from two proteins, binary images were multiplied. Particles smaller than 400 nm<sup>2</sup> were excluded from the analysis to reflect the spatial resolution limit of STORM imaging (20 nm) and the average size of BK channels. To examine spatial localization preference, binary images of BK were progressively dilated to 20 nm, 40 nm, 60 nm, 80 nm, 100 nm, and 200 nm to expand their spatial representation. These modified images were then multiplied with the binary image of the Ca<sub>V</sub>1.3 channel to quantify colocalization and determine BK occupancy at increasing distances from Ca<sub>V</sub>1.3. To ensure consistent comparisons across distance thresholds, data were normalized using the 200 nm measurement as the highest reference value, set to 1”.
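
      The binary-image steps quoted above (multiply two binarized maps, dilate the BK map in steps, normalize to the 200 nm value) can be sketched in a few lines of Python. This is only an illustration, not the authors' ImageJ macro: it assumes binary images stored as sets of (row, col) pixels, a 4-neighbour dilation, and toy data, and the function names `dilate`, `overlap`, and `occupancy_curve` are hypothetical.

```python
PIXEL_NM = 5  # pixel size stated in the quoted Methods


def dilate(pixels, steps):
    """Grow a set of (row, col) pixels by `steps` 4-neighbour dilation passes."""
    grown = set(pixels)
    for _ in range(steps):
        grown |= {(r + dr, c + dc)
                  for r, c in grown
                  for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))}
    return grown


def overlap(mask_a, mask_b):
    """Equivalent of multiplying two binary images: pixels set in both."""
    return mask_a & mask_b


def occupancy_curve(bk, cav, distances_nm=(20, 40, 60, 80, 100, 200),
                    pixel_nm=PIXEL_NM):
    """Overlap of the dilated BK mask with CaV1.3, normalized to the last distance."""
    counts = [len(overlap(dilate(bk, round(d / pixel_nm)), cav))
              for d in distances_nm]
    reference = counts[-1]
    return [c / reference if reference else 0.0 for c in counts]


# Toy data (assumption): one BK pixel sitting on a short line of CaV1.3 signal.
bk = {(100, 100)}
cav = {(100, c) for c in range(90, 111)}
curve = occupancy_curve(bk, cav)
```

      In this toy case the normalized overlap rises with the dilation distance and reaches 1 once the dilated BK mask covers the whole CaV1.3 line. The 400 nm<sup>2</sup> particle-size filter described in the Methods is omitted here for brevity.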

      (5) Related to Figure 2:

      (a) Why use an antibody to label GFP when PH-PLCdelta should be a membrane marker? Where is the GFP in PH-PLCdelta (intracellular or extracellular)? The images in Figure 2E are confusing: there is a green intracellular signal.

      We thank the reviewer for their feedback. To clarify, GFP is fused to the N-terminus of PH-PLCδ and primarily localizes to the inner leaflet of the plasma membrane via PIP2 binding. Residual intracellular GFP signal may reflect non-membrane-bound fractions or background from anti-GFP immunostaining. We added a paragraph explaining the use of the anti-GFP antibody in the Proximity ligation assay subsection of the Methods.

      (b) The images in Figure 2 do not help to understand how the authors select the PLA puncta located at the plasma membrane. How do the authors do this? A useful solution would be to indicate in Figure 2 an example of the PLA signals that are considered "membrane signals" compared to another example with "intracellular signals". Perhaps this was intended with the current Figure, but it is not clear.

      We agree with the reviewer. We have added a sentence to explain how the number of PLA puncta at the plasma membrane was calculated. 

      “We visualized the plasma membrane with a biological sensor tagged with GFP (PH-PLCδ-GFP) and then probed it with an antibody against GFP (Figure 2E). By analyzing the GFP signal, we created a mask that represented the plasma membrane. The mask served to distinguish between the PLA puncta located inside the cell and those at the plasma membrane, allowing us to calculate the number of PLA puncta at the plasma membrane”.

      (c) Figure 2C: What is the negative control? Apologies if it is described somewhere, but I seem not to find it in the manuscript.

      We thank the reviewer for their suggestion. For the negative control in Figure 2C, BK was probed using the primary antibody without co-staining for Ca<sub>V</sub>1.3 or other proteins, which rules out non-specific antibody binding or background fluorescence as the source of the signal. A sentence clarifying this negative control has been added to the Results section.

      “To confirm specificity, a negative control was performed by probing only for BK using the primary antibody, ensuring that detected signals were not due to non-specific binding or background fluorescence”.

      (d) What is the resolution in z of the images shown in Figure 2? This is relevant for the interpretation of signal localization.

      The z-resolution of the images shown in Figure 2 was approximately 270–300 nm, based on the Zeiss Airyscan system’s axial resolution capabilities. Imaging was performed with a step size of 300 nm, ensuring adequate sampling for signal localization while maintaining optimal axial resolution.

      “In a different experiment, we analyzed the puncta density for each focal plane of the cell (step size of 300 nm) and compared the puncta at the plasma membrane to the rest of the cell”.

      (e) % of total puncta in PM vs inside cell are shown for transfected cells, what is this proportion in INS-1 cells?

      This quantification was performed for transfected cells; however, we have not conducted the same analysis in INS-1 cells. Future experiments could address this to determine potential differences in puncta distribution between endogenous and overexpressed conditions.

      (6) Related to Figure 3:

      (a) Figure 3B: is this antibody labelling or GFP fluorescence? Why do they use GFP antibody labelling, if the marker already has its own fluorescence? This should at least be commented on in the manuscript.

      We thank the reviewer for their concern. In Figure 3B, GFP was labeled using an antibody rather than relying on its intrinsic fluorescence. This approach was necessary because GFP fluorescence does not withstand the PLA protocol, resulting in significant fading. Antibody labeling provided stronger signal intensity and improved resolution, ensuring optimal signal-to-noise ratio for accurate analysis.

      A clarification regarding the use of GFP antibody labeling in Figure 3B has been added to the Methods section, explaining that intrinsic GFP fluorescence does not endure the PLA protocol, necessitating antibody-based detection for improved signal and resolution. We added the following to the manuscript.

      “For PLA combined with immunostaining, PLA was followed by a secondary antibody incubation with Alexa Fluor-488 at 2 μg/ml for 1 hour at 21˚C. Since GFP fluorescence fades significantly during the PLA protocol, resulting in reduced signal intensity and poor image resolution, GFP was labeled using an antibody rather than relying on its intrinsic fluorescence”.

      (b) Why is it relevant to study the ER exit sites? Some explanation should be included in the main text (page 11) for clarification to non-specialized readers. Again, the quantification should be performed on the proportion of clusters/ensembles out of the total number of channels expressed at the ER (or ER exit sites).

      We thank the reviewer for their feedback. We have modified this section to include a more detailed explanation of the relevance of ER exit sites to protein trafficking. ER exit sites serve as specialized sorting hubs that regulate the transition of proteins from the ER to the secretory pathway, distinguishing them from the broader ER network, which primarily facilitates protein synthesis and folding. This additional context clarifies why studying ER exit sites provides valuable insights into ensemble trafficking dynamics.

      Regarding quantification, our method does not allow for direct measurement of the total number of BK and Ca<sub>V</sub>1.3 channels expressed at the ER or ER exit sites. Instead, we focused on the proportion of hetero-clusters localized within these compartments, which provides insight into trafficking pathways despite the limitation in absolute channel quantification. We included the following in the manuscript in the Results section. 

      “To determine whether the observed colocalization between BK–Ca<sub>V</sub>1.3 hetero-clusters and the ER was not simply due to the extensive spatial coverage of ER labeling, we labeled ER exit sites using Sec16-GFP and probed for hetero-clusters with PLA. This approach enabled us to test whether the hetero-clusters were preferentially localized to ER exit sites, which are specialized trafficking hubs that mediate cargo selection and direct proteins from the ER into the secretory pathway. In contrast to the more expansive ER network, which supports protein synthesis and folding, ER exit sites ensure efficient and selective export of proteins to their target destinations”.

      “By quantifying the proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters relative to total channel expression at ER exit sites, we found 28 ± 3% colocalization in tsA-201 cells and 11 ± 2% in INS-1 cells (Figure 3F). While the percentage of colocalization between hetero-clusters and the ER or ER exit sites alone cannot be directly compared to infer trafficking dynamics, these findings reinforce the conclusion that hetero-clusters reside within the ER and suggest that BK and Ca<sub>V</sub>1.3 channels traffic together through the ER and exit in coordination”.

      (7) Related to Figure 4:

      A control is included to confirm that the formation of BK-Cav1.3 ensembles is not unspecific. Association with a protein from the Golgi (58K) is tested. Why is this control only done for Golgi? No similar experiment has been performed in the ER. This aspect should be commented on.

      We thank the reviewer for their suggestion. We selected the Golgi as a control because it represents the final stage of protein trafficking before proteins reach their functional destinations. If BK and Ca<sub>V</sub>1.3 hetero-cluster formation is specific at the Golgi, this suggests that their interaction is maintained throughout earlier trafficking steps, including within the ER. While we did not perform an equivalent control experiment in the ER, the Golgi serves as an effective checkpoint for evaluating specificity within the broader protein transport pathway. We included the following in the manuscript.

      “We selected the Golgi as a control because it represents the final stage of protein trafficking, ensuring that hetero-cluster interactions observed at this point reflect specificity maintained throughout earlier trafficking steps, including within the ER”.

      (8) How is colocalization measured, eg, in Figure 6? Are the images shown in Figure 6 representative? This aspect would benefit from a clearer description.

      We thank the reviewer for their suggestion. A section clarifying colocalization measurement and the representativeness of Figure 6 images has been added to the Methods under Data Analysis. We included the following in the manuscript.

      For PLA and RNAscope experiments, we used custom-made macros written in ImageJ. Processing of PLA data included background subtraction. To assess colocalization, fluorescent signals were converted into binary images, and channels were multiplied to identify spatial overlap.

      (9) The text should be revised for typographical errors, for example:

      (a) Summary "evidence of" (CHECK THIS ONE)

      We agree with the reviewer, and we corrected the typographical errors.

      (b) Table 1, row 3: "enriches" should be "enrich"

      We agree with the reviewer. The term 'enriches' in Table 1, row 3 has been corrected to 'enrich'.

      (c) Figure 2B "priximity"

      We agree with the reviewer. The typographical error in Figure 2B has been corrected from 'priximity' to 'proximity'.

      (d) Legend of Figure 7 (C) "size of BK and Cav1.3 channels". Does this correspond to individual channels or clusters?

      We agree with the reviewer. The legend of Figure 7C has been clarified to indicate that 'size of BK and Cav1.3 channels' refers to clusters rather than individual channels.

      (e) Methods: In the RNASCOPE section, "Fig.4-supp1" should be "Fig. 5-supp1"

      (f) Page 15, Figure 5B is cited, should be Figure 6B

      We agree with the reviewer. The reference in the RNASCOPE section has been updated from 'Fig.4-supp1' to 'Fig. 5-supp1,' and the citation on Page 15 has been corrected from Figure 5B to Figure 6B.

      Reviewer #2 (Recommendations for the authors):

      (1) The abstract could be more accessible for a wider readership with improved flow.

      We thank the reviewer for their suggestion. We modified the summary as follows to provide a more coherent flow for a wider readership. 

      “Calcium binding to BK channels lowers BK activation threshold, substantiating functional coupling with calcium-permeable channels. This coupling requires close proximity between different channel types, and the formation of BK–Ca<sub>V</sub>1.3 hetero-clusters at nanometer distances exemplifies this unique organization. To investigate the structural basis of this interaction, we tested the hypothesis that BK and Ca<sub>V</sub>1.3 channels assemble before their insertion into the plasma membrane. Our approach incorporated four strategies: (1) detecting interactions between BK and Ca<sub>V</sub>1.3 proteins inside the cell, (2) identifying membrane compartments where intracellular hetero-clusters reside, (3) measuring the proximity of their mRNAs, and (4) assessing protein interactions at the plasma membrane during early translation. These analyses revealed that a subset of BK and Ca<sub>V</sub>1.3 transcripts are spatially close in micro-translational complexes, and their newly synthesized proteins associate within the endoplasmic reticulum (ER) and Golgi. Comparisons with other proteins, transcripts, and randomized localization models support the conclusion that BK and Ca<sub>V</sub>1.3 hetero-clusters form before their insertion at the plasma membrane”.

      (2) Figure 2B - spelling of proximity.

      We agree with the reviewer. The typographical error in Figure 2B has been corrected from 'priximity' to 'proximity'.

      Reviewer #3 (Recommendations for the authors):

      Minor issues to improve the manuscript:

      (1) For completeness, the authors should include a few sentences and appropriate references in the Introduction to mention that BK channels are regulated by auxiliary subunits.

      We agree with the reviewer. We have revised the Introduction to include a brief discussion of how BK channel function is modulated by auxiliary subunits and provided appropriate references to ensure completeness. These additions highlight the broader regulatory mechanisms governing BK channel activity, complementing the focus of our study. We included the following in the manuscript. 

      “Additionally, BK channels are modulated by auxiliary subunits, which fine-tune BK channel gating properties to adapt to different physiological conditions. β and γ subunits regulate BK channel kinetics, altering voltage sensitivity and calcium responsiveness [18]. These interactions ensure precise control over channel activity, allowing BK channels to integrate voltage and calcium signals dynamically in various cell types. Here, we focus on the selective assembly of BK channels with Ca<sub>V</sub>1.3 and do not evaluate the contributions of auxiliary subunits to BK channel organization.”

      (2) Insert a space between 'homeostasis' and the square bracket at the end of the Introduction's second paragraph.

      We agree with the reviewer. A space has been inserted between 'homeostasis' and the square bracket in the second paragraph of the Introduction for clarity.

      (3) The images presented in Figures 2-5 should be increased in size (if permitted by the Journal) to allow the reader to clearly see the puncta in the fluorescent images. This would necessitate reconfiguring the figures into perhaps a full A4 page per figure, but I think the quality of the images presented really do deserve to "be seen". For example, Panels A & B could be at the top of Figure 2, with C & D presented below them. However, I'll leave it up to the authors to decide on the most aesthetically pleasing way to show these.

      We agree with the reviewer. We have increased the size of Figures 2–8 to enhance the visibility of fluorescent puncta, as suggested. To accommodate this, we reorganized the panel layout for each figure—for example, in Figure 2, Panels A and B are now placed above Panels C and D to support a more intuitive and aesthetically coherent presentation. We believe this revised configuration highlights the image quality and improves readability while conforming to journal layout constraints.

      (4) I think that some of the sentences could be "toned down"

      (a) e.g., in the first paragraph below Figure 2, the authors state "that 46 ± 3% of the puncta were localised on intracellular membranes" when, at that stage, no data had been presented to confirm this. I think changing it to "that 46 ± 3% of the puncta were localised intracellularly" would be more precise.

      (b) Similarly, please consider replacing the wording of "get together at membranes inside the cell" to "co-localise intracellularly".

      (c) In the paragraph just before Figure 5, the authors mention that "the abundance of KCNMA1 correlated more with the abundance of CACNA1D than ... with GAPDH." Although this is technically correct, the R² value was 0.22, which is exceptionally poor. I don't think that the paper is strengthened by sentences such as this, and perhaps the authors might tone this down to reflect this.

      (d) The authors clearly demonstrate in Figure 8 that a significant number of BK channels can traffic to the membrane in the absence of Cav1.3. Irrespective of the differences in transcription/trafficking time between the two channel types, the authors should insert a few lines into their discussion to take this finding into account.

      We appreciate the reviewer’s feedback regarding the clarity and precision of our phrasing.

      Our responses for each point are below.

      (a) We have modified the statement in the first paragraph below Figure 2, changing '46 ± 3% of the puncta were localized on intracellular membranes' to '46 ± 3% of the puncta were localized intracellularly' to ensure accuracy in the absence of explicit data confirming membrane association.

      (b) Similarly, we have replaced 'get together at membranes inside the cell' with 'colocalize intracellularly' to maintain clarity and avoid unintended implications. 

      (c) Regarding the correlation between KCNMA1 and CACNA1D abundance, we recognize that the R² value of 0.22 is relatively low. To reflect this appropriately, we have revised the phrasing to indicate that while a correlation exists, it is modest. We added the following to the manuscript. 

      “Interestingly, the abundance of KCNMA1 transcripts correlated more with the abundance of CACNA1D transcripts than with the abundance of GAPDH, a standard housekeeping gene, though with a modest R² value.”

      (d) To incorporate the findings from Figure 8, we have added discussion acknowledging that a substantial number of BK channels traffic to the membrane independently of Ca<sub>V</sub>1.3. This addition provides context for potential trafficking mechanisms that operate separately from ensemble formation.

      (5) For clarity, please insert the word "total" in the paragraph after Figure 3: "... 63 ± 3% versus 50% ± 6% of total PLA puncta were localised at the ER". I know this is explicitly stated later in the manuscript, but I think it needs to be clarified earlier.

      We agree with the reviewer. The word 'total' has been inserted in the paragraph following Figure 3 to clarify the percentage of PLA puncta localized at the ER earlier in the manuscript.

      (6) In the discussion, I think an additional (short) paragraph needs to be included to clarify to the reader why the % "colocalization between ensembles and the ER or the ER exit sites can't be compared or used to understand the dynamics of the ensembles". This may permit the authors to remove the last sentence of the paragraph just before the results section, "BK and Cav1.3 ensembles go through the Golgi."

      We thank the reviewer for their suggestion. We have added a short paragraph in the discussion to clarify why colocalization percentages between ensembles and the ER or ER exit sites cannot be compared to infer ensemble dynamics. This allowed us to remove the final sentence of the paragraph preceding the results section ('BK and Cav1.3 ensembles go through the Golgi').

      (7) In the paragraph after Figure 6, Figure 5B is inadvertently referred to. Please correct this to Figure 6B.

      We agree with the reviewer. The reference to Figure 5B in the paragraph after Figure 6 has been corrected to Figure 6B.

      (8) In the discussion under "mRNA co-localisation and Protein Trafficking", please insert a relevant reference illustrating that "disruption in mRNA localization... can lead to ion channel mislocalization".

      We agree with the reviewer. We have inserted a relevant reference under 'mRNA Colocalization and Protein Trafficking' to illustrate that disruption in mRNA localization can lead to ion channel mislocalization.

      (9) The supplementary Figures appear to be incorrectly numbered. Please correct and also ensure that they are correctly referred to in the text.

      We agree with the reviewer. The numbering of the supplementary figures has been corrected, and all references to them in the text have been updated accordingly.

      (10) The final panels of the currently labelled Figure 5-Supplementary 2 need to have labels A-F included on the image.

      We agree with the reviewer. Labels A-F have been added to the final panels of Figure 5-Supplementary 2.

      References

      (1) Shah, K.R., X. Guan, and J. Yan, Structural and Functional Coupling of Calcium-Activated BK Channels and Calcium-Permeable Channels Within Nanodomain Signaling Complexes. Frontiers in Physiology, 2022. Volume 12 - 2021.

      (2) Chen, A.L., et al., Calcium-Activated Big-Conductance (BK) Potassium Channels Traffic through Nuclear Envelopes into Kinocilia in Ray Electrosensory Cells. Cells, 2023. 12(17): p. 2125.

      (3) Berkefeld, H., B. Fakler, and U. Schulte, Ca2+-activated K+ channels: from protein complexes to function. Physiol Rev, 2010. 90(4): p. 1437-59.

      (4) Loane, D.J., P.A. Lima, and N.V. Marrion, Co-assembly of N-type Ca2+ and BK channels underlies functional coupling in rat brain. J Cell Sci, 2007. 120(Pt 6): p. 985-95.

      (5) Boncompain, G. and F. Perez, The many routes of Golgi-dependent trafficking. Histochemistry and Cell Biology, 2013. 140(3): p. 251-260.

      (6) Kurokawa, K. and A. Nakano, The ER exit sites are specialized ER zones for the transport of cargo proteins from the ER to the Golgi apparatus. The Journal of Biochemistry, 2019. 165(2): p. 109-114.

      (7) Chen, G., et al., BK channel modulation by positively charged peptides and auxiliary γ subunits mediated by the Ca2+-bowl site. Journal of General Physiology, 2023. 155(6).

    1. Sleep-time Compute: Beyond Inference Scaling at Test-time

      Core Concept

      Sleep-time compute allows models to "think" offline about contexts before queries are presented, reducing test-time compute requirements by ~5× on benchmark tasks

      "by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time"

      • The approach works by processing context c during idle time to create an enhanced representation c', which is then used at test-time: S(c) → c', followed by Tb(q, c') → a

      "In practice, this is achieved by prompting the model to generate a new context consisting of inferences about the existing context, which may be potentially useful for answering test-time queries"
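
      The two-phase flow S(c) → c' followed by Tb(q, c') → a can be sketched as below. This is a minimal illustration rather than the paper's implementation: `Model` stands for any hypothetical prompt-in, completion-out callable, the prompt wording is invented for the example, and `toy_model` is a stub that only exists so the sketch runs.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def sleep_time(model: Model, context: str) -> str:
    """S(c) -> c': offline, ask the model for inferences about the context."""
    notes = model(
        f"Context:\n{context}\n\nList inferences that may help answer future questions."
    )
    return f"{context}\n\nPrecomputed notes:\n{notes}"


def answer_at_test_time(model: Model, query: str, enriched_context: str) -> str:
    """T_b(q, c') -> a: answer the query using the enriched context c'."""
    return model(f"{enriched_context}\n\nQuestion: {query}\nAnswer:")


def toy_model(prompt: str) -> str:
    """Stub model (assumption): returns a canned note offline, then reuses it."""
    if "List inferences" in prompt:
        return "total apples = 3 + 4 = 7"
    return "7" if "total apples = 3 + 4 = 7" in prompt else "unknown"


c_prime = sleep_time(toy_model, "Ann has 3 apples. Bob has 4 apples.")
answer = answer_at_test_time(toy_model, "How many apples are there in total?", c_prime)
```

      The point of the split is that `sleep_time` runs once per context while idle, so the latency-sensitive call only has to read the precomputed notes.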

      Key Results

      Performance improvements: Sleep-time compute reduces test-time compute needed to achieve same accuracy by ~5× on Stateful GSM-Symbolic and Stateful AIME

      "Sleep-time compute produces a pareto improvement in the test-time compute vs. accuracy curve, reducing the test-time compute needed to achieve the same accuracy by ∼ 5×"

      Scaling benefits: By scaling up sleep-time compute, accuracy increases by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME

      Cost amortization: When multiple queries share the same context, average cost per query decreases by 2.5×

      "By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5×"
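
      The amortization argument can be made concrete with a small calculation. The sketch below assumes a simple linear cost model in which a sleep-time token costs 1 unit and a test-time token costs t units (these notes report t = 10 for the paper's cost modeling); the function name and the token counts are hypothetical and are not results from the paper.

```python
def average_cost_per_query(sleep_tokens, test_tokens_per_query, t=10.0):
    """Average weighted cost per query when one sleep-time pass is shared.

    sleep_tokens: tokens spent once, offline, on the shared context.
    test_tokens_per_query: test-time tokens spent on each query.
    t: relative price of a test-time token (assumed linear weighting).
    """
    n_queries = len(test_tokens_per_query)
    if n_queries == 0:
        raise ValueError("need at least one query")
    total = sleep_tokens + t * sum(test_tokens_per_query)
    return total / n_queries


# Toy numbers (assumption): 2,000 sleep-time tokens, 100 test-time tokens per query.
one_query = average_cost_per_query(2000, [100])
ten_queries = average_cost_per_query(2000, [100] * 10)
```

      With these toy numbers the average cost falls from 3,000 units for a single query to 1,200 units when ten queries share the same context, because the fixed sleep-time cost is spread over more queries.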

      Datasets Introduced

      Stateful GSM-Symbolic: Modified from GSM-Symbolic (P1: 5000 examples, P2: 2500 examples) by splitting problems into context and question

      "We introduce two datasets to study applying sleep-time compute in stateful settings, Stateful GSM-Symbolic, and Stateful AIME – by splitting the existing problems in these datasets into a context and a question"

      Stateful AIME: Contains 60 questions from AIME 2024 and 2025, split into context and query components

      Multi-Query GSM-Symbolic: Extends GSM-Symbolic with multiple related queries per context (P1: 12,043 questions, 1,095 contexts; P2: 5,497 questions, 500 contexts)

      SWE-Features: Software engineering benchmark for multi-file feature implementation tasks (33 examples from Aider-AI/aider and ComfyUI repositories)

      Models Evaluated

      Non-reasoning models: GPT-4o-mini and GPT-4o on GSM-Symbolic tasks

      Reasoning models: OpenAI's o1, o3-mini, Anthropic's Claude Sonnet 3.7 Extended Thinking, and DeepSeek-R1 on AIME tasks

      • Test-time compute scaled both sequentially (varying verbosity/reasoning effort) and in parallel (pass@k sampling)

      Effectiveness Analysis

      Query predictability correlation: Sleep-time compute is most effective when queries are predictable from context

      "sleep-time compute is more effective in settings where the query is more easily predictable from the context"

      • Predictability measured using log-probability of question given context under Llama2-70B base model

      • Accuracy gap between sleep-time and test-time compute widens for more predictable questions (binned analysis across 5 quantiles)
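
      A binned analysis of this kind can be sketched as follows. The code is an assumption-laden illustration, not the paper's analysis script: it takes hypothetical records of (log-probability of the question given the context, sleep-time correctness, baseline correctness), splits them into five quantile bins with the standard library, and reports the mean accuracy gap per bin.

```python
from bisect import bisect_right
from statistics import mean, quantiles


def accuracy_gap_by_quantile(records, n_bins=5):
    """Mean (sleep-time minus baseline) correctness per predictability bin.

    records: iterable of (log_prob, sleep_correct, baseline_correct),
    with correctness given as 0 or 1.
    """
    records = list(records)
    cuts = quantiles([r[0] for r in records], n=n_bins)  # n_bins - 1 cut points
    bins = [[] for _ in range(n_bins)]
    for log_prob, sleep_ok, base_ok in records:
        bins[bisect_right(cuts, log_prob)].append(sleep_ok - base_ok)
    return [mean(b) if b else None for b in bins]


# Toy data (assumption): sleep-time compute only helps on the more predictable questions.
records = [(float(lp), 1 if lp >= -6 else 0, 0) for lp in range(-10, 0)]
gaps = accuracy_gap_by_quantile(records)
```

      Bins are ordered from least to most predictable, so a widening gap shows up as larger values toward the end of the returned list.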

      Implementation Details

      • Sleep-time compute implemented via function calling with two functions:

        - rethink_memory: takes a new string input and replaces the current context

        - finish_rethinking: terminates the sleep-time compute process

      • Models allowed up to 10 calls to rethink_memory function

      • Cost modeling assumes test-time tokens are 10× more expensive than sleep-time tokens (t=10) due to latency optimization

      "Since at test-time, there are strict latency constraints, and latency optimized inference can be roughly 10× more expensive, we model the total cost of inference between both sleep-time and test-time, by up-weighing the cost of test-time tokens"
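
      The function-calling loop described in this section can be sketched as below. Only the two tool names and the limit of 10 rethink_memory calls come from these notes; the `agent_step` interface, its return format, and the toy agent are hypothetical stand-ins for a real tool-calling model.

```python
MAX_RETHINK_CALLS = 10  # limit on rethink_memory calls reported in the notes


def run_sleep_time(agent_step, context, max_calls=MAX_RETHINK_CALLS):
    """Drive the tool loop until finish_rethinking or the call budget is used up.

    agent_step is a hypothetical callable that receives the current context and
    returns ("rethink_memory", new_context) or ("finish_rethinking", None).
    """
    calls = 0
    while calls < max_calls:
        name, argument = agent_step(context)
        if name == "finish_rethinking":
            break
        if name == "rethink_memory":
            context = argument  # the new string replaces the current context
            calls += 1
        else:
            raise ValueError(f"unknown tool: {name}")
    return context, calls


def toy_agent(context):
    """Stub agent (assumption): adds three notes, then stops."""
    if context.count("note") >= 3:
        return ("finish_rethinking", None)
    return ("rethink_memory", context + " note")


final_context, used_calls = run_sleep_time(toy_agent, "facts")
```

      An agent that never calls finish_rethinking is still stopped after `max_calls` rewrites, which keeps the offline phase bounded.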

      Comparison to Baselines

      Pass@k parallel scaling: Sleep-time compute consistently outperforms pass@k at same test-time token budget

      "sleep-time compute consistently outperforms pass@k parallel scaling at the same test-time token budget, demonstrating that sleep-time compute can be a more effective way to scale inference-time compute than standard parallel test-time scaling"

      Context-only baseline: Sleep-time compute significantly outperforms models that only receive context and must guess the question, demonstrating questions are not trivially predictable
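
      For readers unfamiliar with the pass@k baseline, the sketch below computes the widely used unbiased pass@k estimate from n sampled answers of which c are correct. Whether the paper uses this exact estimator is an assumption; the formula itself, 1 − C(n − c, k) / C(n, k), is the standard one from the code-generation evaluation literature.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated, c: number of correct samples, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy numbers (assumption): 10 samples, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

      Parallel scaling raises k, and therefore test-time tokens, whereas sleep-time compute moves work to the offline phase; comparing the two at an equal test-time token budget is what the quoted claim refers to.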

      SWE-Features Case Study

      • At lower test-time budgets, sleep-time compute achieves ~1.5× reduction in test-time tokens with higher F1 scores

      • At higher budgets, standard test-time compute performs better, with higher precision but comparable recall

      • Hypothesis: sleep-time compute explores more files, leading to editing more files and slightly lower precision

      Related Work & Context

      • Builds on recent test-time scaling approaches: sequential (OpenAI o1, DeepSeek-R1) and parallel (pass@k, best-of-N)

      • Connection to speculative decoding (Leviathan et al., 2023): Both speculate on user queries, but sleep-time compute uses generated tokens as input regardless of actual query

      • Connection to pre-computation in systems: Similar to memory caches (Smith, 1982) and data cubes for OLAP workloads (Gray et al., 1997)

      • Resembles representation learning but operates in natural language space rather than parameter/activation space

      Limitations & Future Directions

      • Sleep-time compute less effective when queries are unpredictable or unrelated to context

      • Current approach assumes simple two-phase interaction (sleep-time and test-time), but real-world scenarios involve multiple interaction rounds

      • Future work: Optimal allocation of compute between sleep-time and test-time based on query predictability

      • Potential application to synthetic data generation at scale for pretraining

      Authors & Affiliation

      Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez (Letta & UC Berkeley)

      Code and data: https://github.com/letta-ai/sleep-time-compute

    1. The Prompt Report: A Systematic Survey of Prompting Techniques

      Overview & Scope

      • Comprehensive taxonomy: "We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities."

      • Scope limitation: "We limit our study to focus on prefix prompts rather than cloze prompts, because modern LLM transformer architectures widely employ prefix prompts"

      • Focus on hard prompts: "Additionally, we refined our focus to hard (discrete) prompts rather than soft (continuous) prompts and leave out papers that make use of techniques using gradient-based updates (i.e. fine-tuning). Hard prompts contain only tokens (vectors) that correspond to words in the model's vocabulary"

      Key Definitions

      Prompt & Prompting

      • Prompt definition: "A prompt is an input to a Generative AI model, that is used to guide its output"

      • Prompt template: "A prompt template is a function that contains one or more variables which will be replaced by some media (usually text) to create a prompt"

      • Prompting: "Prompting is the process of providing a prompt to a GenAI, which then generates a response"
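
      A minimal sketch of a prompt template as a function with replaceable variables, assuming Python's standard `string.Template`. The template wording and the variable name `tweet` are illustrative assumptions, not taken from the paper:

```python
from string import Template

# Hypothetical template; `$tweet` is the single variable to be replaced.
SENTIMENT_TEMPLATE = Template(
    "Classify the tweet as positive or negative.\nTweet: $tweet\nLabel:"
)


def render_prompt(template: Template, **variables: str) -> str:
    """Replace every template variable with text to create a concrete prompt."""
    # substitute() raises KeyError when a variable is missing, which catches
    # half-filled templates before they reach a model.
    return template.substitute(**variables)


prompt = render_prompt(SENTIMENT_TEMPLATE, tweet="I love this phone")
print(prompt)
```

      The rendered string is the prompt; the template itself stays reusable across inputs.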

      Prompt Engineering

      • Consolidated definition: "Prompt engineering is the iterative process of developing a prompt by modifying or changing the prompting technique that you are using"

      • Process description: "The Prompt Engineering Process consists of three repeated steps 1) performing inference on a dataset 2) evaluating performance and 3) modifying the prompt template"
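
      The three repeated steps can be sketched as a small loop. This is an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for an LLM call, the dataset is made up, and "modifying the prompt template" is reduced to trying the next candidate from a hand-written list:

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, gold answer)


def accuracy(model: Callable[[str], str], template: str, data: Sequence[Example]) -> float:
    """Steps 1 and 2: run inference on the dataset, then evaluate performance."""
    correct = sum(model(template.format(input=x)).strip() == y for x, y in data)
    return correct / len(data)


def engineer_prompt(
    model: Callable[[str], str], candidates: Sequence[str], data: Sequence[Example]
) -> tuple[str, float]:
    """Step 3, modelled here as moving on to the next candidate template."""
    best_template, best_score = candidates[0], accuracy(model, candidates[0], data)
    for template in candidates[1:]:
        score = accuracy(model, template, data)
        if score > best_score:
            best_template, best_score = template, score
    return best_template, best_score


def toy_model(prompt: str) -> str:
    """Hypothetical model: it only adds correctly when asked for a number."""
    if "number" not in prompt:
        return "I am not sure."
    question = prompt.rsplit("Q:", 1)[1].replace("?", "")
    a, b = (int(tok) for tok in question.split() if tok.isdigit())
    return str(a + b)


dataset = [("What is 2 + 3?", "5"), ("What is 10 + 4?", "14")]
candidates = ["Q: {input}\nA:", "Answer with a number only.\nQ: {input}\nA:"]
best, score = engineer_prompt(toy_model, candidates, dataset)
print(best, score)
```

      In practice the modification step is a human or automated rewrite rather than a fixed candidate list.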

      Core Prompt Components

      Essential Elements

      • Directive: "Many prompts issue a directive in the form of an instruction or question. This is the core intent of the prompt"

      • Examples/Exemplars: "Examples, also known as exemplars or shots, act as demonstrations that guide the GenAI to accomplish a task"

      • Output formatting: "It is often desirable for the GenAI to output information in certain formats, for example, CSV, Markdown, XML, or even custom formats"

      • Style instructions: "Style instructions are a type of output formatting used to modify the output stylistically rather than structurally"

      • Role/Persona: "A Role, also known as a persona, is a frequently discussed component that can improve writing and style text"

      Systematic Review Methodology

      PRISMA Process

      • Approach: "We conducted a machine-assisted systematic review grounded in the PRISMA process to identify 58 different text-based prompting techniques"

      • Data sources: "Our main data sources were arXiv, Semantic Scholar, and ACL. We query these databases with a list of 44 keywords narrowly related to prompting and prompt engineering"

      • Pipeline: "We retrieve papers from arXiv based on a simple set of keywords and boolean rules. Then, human annotators label a sample of 1,661 articles"

      • Inter-rater reliability: "A set of 300 articles are reviewed independently by two annotators, with 92% agreement (Krippendorff's α = Cohen's κ = 81%)"

      • Final dataset: "The combined human and LLM annotations generate a final set of 1,565 papers"
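
      To make the two reliability figures concrete, here is a hedged sketch of how raw agreement and Cohen's κ are computed for two annotators. The label lists are toy data invented for the example; they are not the paper's 300 articles:

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / n**2
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy annotations (1 = relevant, 0 = not relevant).
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(agreement, kappa)
```

      κ is lower than raw agreement because part of the observed agreement would occur by chance alone.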

      Major Technique Categories

      In-Context Learning (ICL)

      • Definition: "ICL refers to the ability of GenAIs to learn skills and tasks by providing them with exemplars and or relevant instructions within the prompt, without the need for weight updates/retraining"

      • Few-Shot Prompting: "Brown et al. (2020) is the paradigm seen in Figure 2.4, where the GenAI learns to complete a task with only a few examples (exemplars)"

      Design Decisions for Few-Shot Prompting

      • Exemplar quantity: "Increasing the quantity of exemplars in the prompt generally improves model performance, particularly in larger models. However, in some cases, the benefits may diminish beyond 20 exemplars"

      • Exemplar ordering: "The order of exemplars affects model behavior. On some tasks, exemplar order can cause accuracy to vary from sub-50% to 90%+"

      • Label distribution impact: "As in traditional supervised machine learning, the distribution of exemplar labels in the prompt affects behavior"

      • Label quality: "Despite the general benefit of multiple exemplars, the necessity of strictly valid demonstrations is unclear. Some work suggests that the accuracy of labels is irrelevant—providing models with exemplars with incorrect labels may not negatively diminish performance"

      • Exemplar format: "The formatting of exemplars also affects performance. One of the most common formats is 'Q: {input}, A: {label}', but the optimal format may vary across tasks"

      • Exemplar similarity: "Selecting exemplars that are similar to the test sample is generally beneficial for performance. However, in some cases, selecting more diverse exemplars can improve performance"
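
      A small sketch of assembling a few-shot prompt with the 'Q: {input}, A: {label}' format quoted above. The function name and the arithmetic exemplars are assumptions made for illustration:

```python
from typing import Sequence


def build_few_shot_prompt(
    exemplars: Sequence[tuple[str, str]],
    query: str,
    fmt: str = "Q: {input}, A: {label}",
) -> str:
    """Render each exemplar with one shared format, then append the test sample."""
    lines = [fmt.format(input=x, label=y) for x, y in exemplars]
    # The test sample reuses the format with an empty label for the model to fill in.
    lines.append(fmt.format(input=query, label="").rstrip())
    return "\n".join(lines)


few_shot = build_few_shot_prompt([("2+2", "4"), ("3+5", "8")], "7+1")
print(few_shot)
```

      Because ordering, quantity, and format are all parameters here, each design decision listed above can be varied and measured separately.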

      Few-Shot Techniques

      • K-Nearest Neighbor (KNN): "Liu et al. (2021) is part of a family of algorithms that selects exemplars similar to test samples to boost performance"

      • Vote-K: "Su et al. (2022) is another method to select similar exemplars to the test sample... Vote-K also ensures that newly added exemplars are sufficiently different than existing ones to increase diversity"

      • Self-Generated In-Context Learning (SG-ICL): "Kim et al. (2022) leverages a GenAI to automatically generate exemplars. While better than zero-shot scenarios when training data is unavailable, the generated samples are not as effective as actual data"

      • Prompt Mining: "Jiang et al. (2020) is the process of discovering optimal 'middle words' in prompts through large corpus analysis"
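
      The idea behind similarity-based exemplar selection can be sketched as follows. This is not the method of Liu et al. (2021): it swaps in bag-of-words cosine similarity as a simple, dependency-free stand-in for a learned embedding, and the exemplar pool is invented:

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_knn_exemplars(pool: list[tuple[str, str]], query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k exemplars whose inputs are most similar to the test sample."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(bag_of_words(ex[0]), q), reverse=True)
    return ranked[:k]


pool = [
    ("the movie was great", "positive"),
    ("the food was cold", "negative"),
    ("stock prices fell sharply", "negative"),
    ("a great movie with great acting", "positive"),
]
selected = select_knn_exemplars(pool, "was the movie great", k=2)
print(selected)
```

      A diversity-aware selector in the spirit of Vote-K would additionally penalize candidates that are too close to exemplars already chosen.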

      Zero-Shot Techniques

      • Role Prompting: "Wang et al. (2023j); Zheng et al. (2023d), also known as persona prompting, assigns a specific role to the GenAI in the prompt"

      • Style Prompting: "Lu et al. (2023a) involves specifying the desired style, tone, or genre in the prompt to shape the output"

      • Emotion Prompting: "Li et al. (2023a) incorporates phrases of psychological relevance to humans (e.g., 'This is important to my career') into the prompt, which may lead to improved LLM performance"

      • System 2 Attention (S2A): "Weston and Sukhbaatar (2023) first asks an LLM to rewrite the prompt and remove any information unrelated to the question therein"

      • Rephrase and Respond (RaR): "Deng et al. (2023) instructs the LLM to rephrase and expand the question before generating the final answer"

      • Re-reading (RE2): "Xu et al. (2023) adds the phrase 'Read the question again:' to the prompt in addition to repeating the question"

      • Self-Ask: "Press et al. (2022) prompts LLMs to first decide if they need to ask follow up questions for a given prompt"

      Thought Generation

      • Chain-of-Thought (CoT): "Wei et al. (2022b) leverages few-shot prompting to encourage the LLM to express its thought process before delivering its final answer"

      • Zero-Shot-CoT: "The most straightforward version of CoT contains zero exemplars. It involves appending a thought inducing phrase like 'Let's think step by step.' to the prompt"

      • Step-Back Prompting: "Zheng et al. (2023c) is a modification of CoT where the LLM is first asked a generic, high-level question about relevant concepts or facts before delving into reasoning"

      • Thread-of-Thought (ThoT): "Zhou et al. (2023) consists of an improved thought inducer for CoT reasoning. Instead of 'Let's think step by step,' it uses 'Walk me through this context in manageable parts step by step, summarizing and analyzing as we go.'"

      • Tabular Chain-of-Thought (Tab-CoT): "Jin and Lu (2023) consists of a Zero-Shot CoT prompt that makes the LLM output reasoning as a markdown table"
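
      As a sketch, the zero-exemplar variants above differ only in the thought-inducing phrase appended to the prompt. The two phrases are the ones quoted in the bullets; the dictionary keys and function name are assumptions:

```python
THOUGHT_INDUCERS = {
    "zero_shot_cot": "Let's think step by step.",
    "thread_of_thought": (
        "Walk me through this context in manageable parts step by step, "
        "summarizing and analyzing as we go."
    ),
}


def add_thought_inducer(question: str, technique: str = "zero_shot_cot") -> str:
    """Append the chosen thought inducer to the question."""
    return f"{question.strip()}\n{THOUGHT_INDUCERS[technique]}"


cot_prompt = add_thought_inducer("What is 17 * 3?")
print(cot_prompt)
```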

      Few-Shot CoT Variants

      • Contrastive CoT: "Chia et al. (2023) adds both exemplars with incorrect and correct explanations to the CoT prompt in order to show the LLM how not to reason"

      • Complexity-based Prompting: "Fu et al. (2023b) involves two major modifications to CoT. First, it selects complex examples for annotation and inclusion in the prompt... Second, during inference, it samples multiple reasoning chains"

      • Active Prompting: "Diao et al. (2023) starts with some training questions/exemplars, asks the LLM to solve them, then calculates uncertainty (disagreement in this case) and asks human annotators to rewrite the exemplars with highest uncertainty"

      • Memory-of-Thought: "Li and Qiu (2023b) leverage unlabeled training exemplars to build Few-Shot CoT prompts at test time"

      • Automatic Chain-of-Thought (Auto-CoT): "Zhang et al. (2022b) uses Wei et al. (2022b)'s Zero-Shot prompt to automatically generate chains of thought. These are then used to build a Few-Shot CoT prompt"

      Decomposition

      • Least-to-Most Prompting: "Zhou et al. (2022a) starts by prompting a LLM to break a given problem into sub-problems without solving them. Then, it solves them sequentially, appending model responses to the prompt each time"

      • Decomposed Prompting (DECOMP): "Khot et al. (2022) Few-Shot prompts a LLM to show it how to use certain functions. These might include things like string splitting or internet searching"

      • Plan-and-Solve Prompting: "Wang et al. (2023f) consists of an improved Zero-Shot CoT prompt, 'Let's first understand the problem and devise a plan to solve it. Then, let's carry out the plan and solve the problem step by step'"

      • Tree-of-Thought (ToT): "Yao et al. (2023b), also known as Tree of Thoughts, creates a tree-like search problem by starting with an initial problem then generating multiple possible steps in the form of thoughts"

      • Program-of-Thoughts: "Chen et al. (2023d) uses LLMs like Codex to generate programming code as reasoning steps. A code interpreter executes these steps to obtain the final answer"

      • Skeleton-of-Thought: "Ning et al. (2023) focuses on accelerating answer speed through parallelization. Given a problem, it prompts an LLM to create a skeleton of the answer"
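
      The two stages of Least-to-Most Prompting can be sketched like this. The prompt wording is an assumption, and `scripted_llm` is a hypothetical stub that returns canned text so the example runs without a model:

```python
from typing import Callable


def least_to_most(problem: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Decompose first, then solve sub-problems in order, growing the prompt each time."""
    # Stage 1: ask only for the decomposition, one sub-problem per line.
    decomposition = llm(
        f"Break this problem into sub-problems, one per line, without solving them:\n{problem}"
    )
    sub_problems = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve sequentially, appending each model response to the prompt.
    context, answer = problem, ""
    for sub in sub_problems:
        context += f"\nQ: {sub}\nA:"
        answer = llm(context)
        context += f" {answer}"
    return answer, context


def scripted_llm(prompt: str) -> str:
    """Canned responses standing in for a real model."""
    if prompt.startswith("Break this problem"):
        return "How many apples after buying?\nHow many apples after eating?"
    if prompt.endswith("How many apples after buying?\nA:"):
        return "5"
    return "3"


final, transcript = least_to_most(
    "Ann has 2 apples, buys 3, then eats 2. How many remain?", scripted_llm
)
print(transcript)
```

      Each later sub-problem sees the answers to the earlier ones, which is what distinguishes this from solving the sub-problems independently.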

      Ensembling

      • Demonstration Ensembling (DENSE): "Khalifa et al. (2023) creates multiple few-shot prompts, each containing a distinct subset of exemplars from the training set. Next, it aggregates over their outputs"

      • Self-Consistency: "Wang et al. (2022) is based on the intuition that multiple different reasoning paths can lead to the same answer. This method first prompts the LLM multiple times to perform CoT, crucially with a non-zero temperature"

      • Universal Self-Consistency: "Chen et al. (2023e) is similar to Self-Consistency except that rather than selecting the majority response by programmatically counting how often it occurs, it inserts all outputs into a prompt template"

      • DiVeRSe: "Li et al. (2023i) creates multiple prompts for a given problem then performs Self-Consistency for each, generating multiple reasoning paths"

      • Prompt Paraphrasing: "Jiang et al. (2020) transforms an original prompt by changing some of the wording, while still maintaining the overall meaning"
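
      A minimal sketch of the Self-Consistency vote: sample several reasoning chains, extract each final answer, and keep the majority. The canned chains replace real samples drawn at non-zero temperature, and taking the last number in a chain as its answer is an assumption made for this example:

```python
import re
from collections import Counter
from typing import Callable, Optional


def final_answer(chain: str) -> Optional[str]:
    """Treat the last integer in a reasoning chain as its answer."""
    numbers = re.findall(r"-?\d+", chain)
    return numbers[-1] if numbers else None


def self_consistency(sample_chain: Callable[[], str], n_samples: int = 5) -> str:
    """Majority vote over the answers of independently sampled chains."""
    answers = [final_answer(sample_chain()) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        raise ValueError("no sampled chain contained an answer")
    return votes.most_common(1)[0][0]


# Canned chains standing in for five samples from a model.
canned = iter([
    "3 + 4 = 7, and 7 * 6 = 42",
    "3 + 4 = 8, and 8 * 6 = 48",
    "6 * 7 = 42",
    "7 groups of 6 make 42",
    "roughly 36",
])
majority = self_consistency(lambda: next(canned), n_samples=5)
print(majority)
```

      Universal Self-Consistency would replace the programmatic `Counter` vote with a further prompt that asks a model to pick the majority answer.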

      Self-Criticism

      • Self-Calibration: "Kadavath et al. (2022) first prompts an LLM to answer a question. Then, it builds a new prompt that includes the question, the LLM's answer, and an additional instruction asking whether the answer is correct"

      • Self-Refine: "Madaan et al. (2023) is an iterative framework where, given an initial answer from the LLM, it prompts the same LLM to provide feedback on the answer, and then prompts the LLM to improve the answer based on the feedback"

      • Self-Verification: "Weng et al. (2022) generates multiple candidate solutions with Chain-of-Thought (CoT). It then scores each solution by masking certain parts of the original question"

      • Chain-of-Verification (COVE): "Dhuliawala et al. (2023) first uses an LLM to generate an answer to a given question. Then, it creates a list of related questions that would help verify the correctness of the answer"
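
      The answer, feedback, improve cycle of Self-Refine can be sketched as a loop over one model. The prompt wording, the `NO ISSUES` stop marker, and the round limit are assumptions for this illustration, and `scripted_llm` is a hypothetical stub:

```python
from typing import Callable


def self_refine(
    question: str,
    llm: Callable[[str], str],
    max_rounds: int = 3,
    stop_marker: str = "NO ISSUES",
) -> str:
    """Ask the same model for an answer, feedback on it, and an improved answer."""
    answer = llm(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        feedback = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Give feedback on this answer, or reply {stop_marker}."
        )
        if stop_marker in feedback:
            break  # the model has nothing further to criticize
        answer = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Feedback: {feedback}\nImproved answer:"
        )
    return answer


def scripted_llm(prompt: str) -> str:
    """Canned responses standing in for a real model."""
    if prompt.endswith("Improved answer:"):
        return "Paris is the capital of France."
    if "Give feedback" in prompt:
        return "NO ISSUES" if "Paris is" in prompt else "Use a full sentence."
    return "Paris"


refined = self_refine("What is the capital of France?", scripted_llm)
print(refined)
```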

      Prompt Engineering Automation

      Meta Prompting

      • Definition: "Meta Prompting is the process of prompting a LLM to generate or improve a prompt or prompt template"

      Automated Techniques

      • AutoPrompt: "Shin et al. (2020b) uses a frozen LLM as well as a prompt template that includes some 'trigger tokens', whose values are updated via backpropagation at training time"

      • Automatic Prompt Engineer (APE): "Zhou et al. (2022b) uses a set of exemplars to generate a Zero-Shot instruction prompt. It generates multiple possible prompts, scores them, then creates variations of the best ones"

      • Gradient-free Instructional Prompt Search (GrIPS): "Prasad et al. (2023) is similar to APE, but uses a more complex set of operations including deletion, addition, swapping, and paraphrasing"

      • RLPrompt: "Deng et al. (2022) uses a frozen LLM with an unfrozen module added. It uses this LLM to generate prompt templates, scores the templates on a dataset, and updates the unfrozen module using Soft Q-Learning"

      Answer Engineering

      Core Concept

      • Definition: "Answer engineering is the iterative process of developing or selecting among algorithms that extract precise answers from LLM outputs"

      Three Design Decisions

      • Answer Shape: "The shape of an answer is its physical format. For example, it could be a token, span of tokens, or even an image or video"

      • Answer Space: "The space of an answer is the domain of values that its structure may contain. This may simply be the space of all tokens, or in a binary labeling task, could just be two possible tokens"

      • Answer Extractor: "In cases where it is impossible to entirely control the answer space... a rule can be defined to extract the final answer. This rule is often a simple function (e.g. a regular expression)"

      Extraction Methods

      • Verbalizer: "Often used in labeling tasks, a verbalizer maps a token, span, or other type of output to a label and vice-versa (injective)"

      • Regex: "Regexes are often used to extract answers. They are usually used to search for the first instance of a label"

      • Separate LLM: "Sometimes outputs are so complicated that regexes won't work consistently. In this case, it can be useful to have a separate LLM evaluate the output and extract an answer"
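
      A hedged sketch of the first two extraction methods: a verbalizer as a one-to-one mapping, and a regex that returns the first label it finds. The '+'/'-' tokens follow the paper's Tweet example; the function name and label set are assumptions:

```python
import re
from typing import Optional, Sequence

# Verbalizer: maps output tokens to labels. It is injective, so it can be inverted.
VERBALIZER = {"+": "positive", "-": "negative"}
INVERSE_VERBALIZER = {label: token for token, label in VERBALIZER.items()}


def extract_first_label(
    output: str, labels: Sequence[str] = ("positive", "negative")
) -> Optional[str]:
    """Return the first label mentioned in the model output, if any."""
    pattern = re.compile("|".join(re.escape(label) for label in labels), re.IGNORECASE)
    match = pattern.search(output)
    return match.group(0).lower() if match else None


first_label = extract_first_label("Negative. It is not positive at all.")
print(VERBALIZER["+"], first_label)
```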

      Multilingual Prompting

      Core Challenges

      • Performance disparity: "State-of-the-art GenAIs have often been predominately trained with English dataset, leading to a notable disparity in the output quality in languages other than English, particularly low-resource languages"

      Key Techniques

      • Translate First Prompting: "Shi et al. (2022) is perhaps the simplest strategy and first translates non-English input examples into English"

      • Cross-Lingual Thought (XLT): "Huang et al. (2023a) utilizes a prompt template composed of six separate instructions, including role assignment, cross-lingual thinking, and CoT"

      • Cross-Lingual Self Consistent Prompting (CLSP): "Qin et al. (2023a) introduces an ensemble technique that constructs reasoning paths in different languages to answer the same question"

      Prompt Language Selection

      • English advantage: "Constructing the prompt template in English is often more effective than in the task language for multilingual tasks. This is likely due to the predominance of English data during LLM pre-training"

      • Native language rationale: "In contrast, many multilingual prompting benchmarks such as BUFFET or LongBench use task language prompts for language-specific use cases"

      Machine Translation Techniques

      • Multi-Aspect Prompting and Selection (MAPS): "He et al. (2023b) mimics the human translation process, which involves multiple preparatory steps to ensure high-quality output"

      • Chain-of-Dictionary (CoD): "Lu et al. (2023b) first extracts words from the source phrase, then makes a list of their meanings in multiple languages, automatically via retrieval from a dictionary"

      • Interactive-Chain-Prompting (ICP): "Pilault et al. (2023) deals with potential ambiguities in translation by first asking the GenAI to generate sub-questions about any ambiguities in the phrase to be translated"

      Multimodal Prompting

      Image Prompting

      • Prompt Modifiers: "are simply words appended to a prompt to change the resultant image. Components such as Medium (e.g. 'on canvas') or Lighting (e.g. 'a well lit scene') are often used"

      • Negative Prompting: "allows users to numerically weight certain terms in the prompt so that the model considers them more/less heavily than others"

      Multimodal ICL

      • Paired-Image Prompting: "shows the model two images: one before and one after some transformation. Then, present the model with a new image for which it will perform the demonstrated conversion"

      • Image-as-Text Prompting: "Hakimov and Schlangen (2023) generates a textual description of an image. This allows for the easy inclusion of the image (or multiple images) in a text-based prompt"

      Multimodal CoT

      • Duty Distinct Chain-of-Thought (DDCoT): "Zheng et al. (2023b) extends Least-to-Most prompting to the multimodal setting, creating subquestions, then solving them and combining the answers"

      • Chain-of-Images (CoI): "Meng et al. (2023) is a multimodal extension of Chain-of-Thought prompting, that generates images as part of its thought process"

      Other Modalities

      • Audio: "Experiments with audio ICL have generated mixed results, with some open source audio models failing to perform ICL. However, other results do show an ICL ability in audio models"

      • Video: "Prompting has also been extended to the video modality, for use in text-to-video generation, video editing, and video-to-text generation"

      • 3D: "Prompting can also be used in 3D modalities, for example in 3D object synthesis, 3D surface texturing, and 4D scene generation"

      Agents

      Definition

      • Agent concept: "In the context of GenAI, we define agents to be GenAI systems that serve a user's goals via actions that engage with systems outside the GenAI itself"

      Tool Use Agents

      • Modular Reasoning, Knowledge, and Language (MRKL) System: "Karpas et al. (2022) is one of the simplest formulations of an agent. It contains a LLM router providing access to multiple tools"

      • Self-Correcting with Tool-Interactive Critiquing (CRITIC): "Gou et al. (2024a) first generates a response to the prompt, with no external calls. Then, the same LLM criticizes this response for possible errors"
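
      The router-plus-tools shape of an MRKL system can be sketched as below. The two tools and the rule-based router are hypothetical; in the formulation quoted above the routing decision would come from an LLM:

```python
import operator
from typing import Callable

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Toy tool: evaluates 'number operator number', e.g. '12 * 4'."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def capital_lookup(country: str) -> str:
    """Toy tool backed by a tiny hard-coded table."""
    return {"france": "Paris", "japan": "Tokyo"}.get(country.strip().lower(), "unknown")


TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": calculator,
    "capital_lookup": capital_lookup,
}


def rule_based_router(query: str) -> tuple[str, str]:
    """Stand-in for the LLM router: choose a tool name and the input to give it."""
    if any(op in query for op in OPS):
        return "calculator", query
    return "capital_lookup", query


def mrkl_answer(query: str, router: Callable[[str], tuple[str, str]]) -> str:
    tool_name, tool_input = router(query)
    return TOOLS[tool_name](tool_input)


print(mrkl_answer("12 * 4", rule_based_router))
print(mrkl_answer("France", rule_based_router))
```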

      Code-Generation Agents

      • Program-aided Language Model (PAL): "Gao et al. (2023b) translates a problem directly into code, which is sent to a Python interpreter to generate an answer"

      • Tool-Integrated Reasoning Agent (ToRA): "Gou et al. (2024b) is similar to PAL, but instead of a single code generation step, it interleaves code and reasoning steps for as long as necessary"

      Observation-Based Agents

      • Reasoning and Acting (ReAct): "Yao et al. (2022) generates a thought, takes an action, and receives an observation (and repeats this process) when given a problem to solve"

      • Reflexion: "Shinn et al. (2023) builds on ReAct, adding a layer of introspection. It obtains a trajectory of actions and observations, then is given an evaluation of success/failure"
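
      The thought, action, observation cycle of ReAct can be sketched as a loop over a growing transcript. The `Action: tool[input]` and `Final Answer:` conventions, the step limit, and the scripted model are assumptions chosen for this example:

```python
import re
from typing import Callable, Optional


def react(
    question: str,
    llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> tuple[Optional[str], str]:
    """Alternate model steps and tool observations until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action is None:
            break  # the model produced neither an action nor an answer
        observation = tools[action.group(1)](action.group(2))
        transcript += f"\nObservation: {observation}"
    return None, transcript


def scripted_llm(transcript: str) -> str:
    """Canned responses standing in for a real model."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    return "Thought: I have what I need.\nFinal Answer: Paris"


answer, trace = react(
    "What is the capital of France?", scripted_llm, {"lookup": lambda query: "Paris"}
)
print(trace)
```

      Reflexion would wrap this loop, feeding an evaluation of the finished trajectory back into the next attempt.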

      Lifelong Learning

      • Voyager: "Wang et al. (2023a) is composed of three parts. First, it proposes tasks for itself to complete in order to learn more about the world. Second, it generates code to execute these actions. Finally, it saves these actions to be retrieved later"

      • Ghost in the Minecraft (GITM): "Zhu et al. (2023) starts with an arbitrary goal, breaks it down into subgoals recursively, then iteratively plans and executes actions by producing structured text"

      Retrieval Augmented Generation (RAG)

      • Core concept: "RAG is a paradigm in which information is retrieved from an external source and inserted into the prompt. This can enhance performance in knowledge intensive tasks"

      • Verify-and-Edit: "Zhao et al. (2023a) improves on self-consistency by generating multiple chains-of-thought, then selecting some to be edited. They do this by retrieving relevant (external) information"

      • Interleaved Retrieval guided by Chain-of-Thought (IRCoT): "Trivedi et al. (2023) is a technique for multi-hop question answering that interleaves CoT and retrieval"
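
      A minimal sketch of the core RAG step: retrieve from an external source, then insert the result into the prompt. Word-overlap scoring is a deliberately simple stand-in for a real retriever, and the documents and prompt wording are invented:

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many query words they share and keep the top k."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Insert the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Use the retrieved information to answer.\n{context}\nQuestion: {query}\nAnswer:"


documents = [
    "The Eiffel Tower is in Paris.",
    "Mount Fuji is in Japan.",
    "The Nile flows through Egypt.",
]
rag_prompt = build_rag_prompt("Where is Mount Fuji?", documents)
print(rag_prompt)
```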

      Evaluation

      Prompting Techniques for Evaluation

      • In-Context Learning: "is frequently used in evaluation prompts, much in the same way it is used in other applications"

      • Role-based Evaluation: "is a useful technique for improving and diversifying evaluations. By creating prompts with the same instructions for evaluation, but different roles, it is possible to effectively generate diverse evaluations"

      • Chain-of-Thought: "prompting can further improve evaluation performance"

      • Model-Generated Guidelines: "Liu et al. (2023d, h) prompt an LLM to generate guidelines for evaluation. This reduces the insufficient prompting problem arising from ill-defined scoring guidelines"

      Output Formats

      • Styling: "Formatting the LLM's response using XML or JSON styling has also been shown to improve the accuracy of the judgment generated by the evaluator"

      • Linear Scale: "A very simple output format is a linear scale (e.g. 1-5). Many works use ratings of 1-10, 1-5, or even 0-1"

      • Binary Score: "Prompting the model to generate binary responses like Yes or No and True or False is another frequently used output format"

      • Likert Scale: "Prompting the GenAI to make use of a Likert Scale can give it a better understanding of the meaning of the scale"

      Evaluation Frameworks

      • LLM-EVAL: "Lin and Chen (2023) is one of the simplest evaluation frameworks. It uses a single prompt that contains a schema of variables to evaluate"

      • G-EVAL: "Liu et al. (2023d) is similar to LLM-EVAL, but includes AutoCoT steps in the prompt itself"

      • ChatEval: "Chan et al. (2024) uses a multi-agent debate framework with each agent having a separate role"

      Other Methodologies

      • Batch Prompting: "For improving compute and cost efficiency, some works employ batch prompting for evaluation where multiple instances are evaluated at once"

      • Pairwise Evaluation: "Chen et al. (2023g) find that directly comparing the quality of two texts may lead to suboptimal results and that explicitly asking LLM to generate a score for individual summaries is the most effective"

      Security & Safety

      Prompt Hacking

      • Definition: "Prompt hacking refers to a class of attacks which manipulate the prompt in order to attack a GenAI"

      • Prompt Injection: "is the process of overriding original developer instructions in the prompt with user input"

      • Jailbreaking: "is the process of getting a GenAI model to do or say unintended things through prompting"

      Security Risks

      • Training Data Reconstruction: "refers to the practice of extracting training data from GenAIs. A straightforward example of this is Nasr et al. (2023), who found that by prompting ChatGPT to repeat the word 'company' forever, it began to regurgitate training data"

      • Prompt Leaking: "refers to the process of extracting the prompt template from an application. Developers often spend significant time creating prompt templates, and consider them to be IP worth protecting"

      • Package Hallucination: "occurs when LLM-generated code attempts to import packages that do not exist. After discovering what package names are frequently hallucinated by LLMs, hackers could create those packages, but with malicious code"
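
      One conceivable precaution against package hallucination, offered as a sketch and not as something the survey prescribes: list the imports in generated code that do not resolve in the local environment, so a person can review those names before installing anything. The function name and the sample snippet are assumptions:

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that are not installed locally."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None for a top-level module that cannot be found.
    return sorted(name for name in names if importlib.util.find_spec(name) is None)


generated_code = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
flagged = unresolved_imports(generated_code)
print(flagged)
```

      A name that fails to resolve is not proof of hallucination, and a name that resolves is not proof of safety; the check only narrows what needs manual review.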

      Defense Mechanisms

      • Prompt-based Defenses: "Multiple prompt-based defenses have been proposed, in which instructions are included in the prompt to avoid prompt injection. However, Schulhoff et al. (2023) ran a study with hundreds of thousands of malicious prompts and found that no prompt-based defense is fully secure"

      • Detectors: "are tools designed to detect malicious inputs and prevent prompt hacking. Many companies have built such detectors, which are often built using fine-tuned models trained on malicious prompts"

      • Guardrails: "are rules and frameworks for guiding GenAI outputs. Guardrails often make use of detectors, but not always. Guardrails are more concerned with the general dialogue flow in an application"

      Alignment Issues

      Prompt Sensitivity

      • Small changes impact: "Several works show that LLMs are highly sensitive to the input prompt, i.e., even subtle changes to a prompt such as exemplar order can result in vastly different outputs"

      • Task format variation: "describes different ways to prompt an LLM to execute the same task... Zhao et al. (2021b) show that these minor changes can alter the accuracy of GPT-3 by up to 30%"

      • Prompt Drift: "Chen et al. (2023b) occurs when the model behind an API changes over time, so the same prompt may produce different results on the updated model"

      Calibration Issues

      • Overconfidence: "LLMs are often overconfident in their answers, especially when prompted to express their own confidence in words, which may lead to user overreliance on model outputs"

      • Sycophancy: "refers to the concept that LLMs will often express agreement with the user, even when that view contradicts the model's own initial output"

      Bias & Fairness

      • Vanilla Prompting: "Si et al. (2023b) simply consists of an instruction in the prompt that tells the LLM to be unbiased. This technique has also been referred to as moral self-correction"

      • Cultural Awareness: "Yao et al. (2023a) can be injected into prompts to help LLMs with cultural adaptation"

      • AttrPrompt: "Yu et al. (2023) is a prompting technique designed to avoid producing text biased towards certain attributes when generating synthetic data"

      Ambiguity Handling

      • Ambiguous Demonstrations: "Gao et al. (2023a) are examples that have an ambiguous label set. Including them in a prompt can increase ICL performance"

      • Question Clarification: "Rao and Daumé III (2019) allows the LLM to identify ambiguous questions and generate clarifying questions to pose to the user"

      Benchmarking Results

      MMLU Evaluation

      • Performance trends: "Performance generally improved as techniques grew more complex. However, Zero-Shot-CoT dropped precipitously from Zero-Shot. Although it had a wide spread, for all variants, Zero-Shot performed better"

      • Best performer: "Few-Shot CoT performs the best, and unexplained performance drops from certain techniques need further research"

      • Self-Consistency impact: "Both cases of Self-Consistency, naturally had lower spread since they repeated a single technique, but it only improved accuracy for Zero-Shot prompts"

      Case Study: Suicide Crisis Detection

      • Problem domain: "Our illustrative problem involves detection of signal that is predictive of crisis-level suicide risk in text written by a potentially suicidal individual"

      • Target construct: "We focus here on the most important predictive factor in Suicide Crisis Syndrome assessments, referred to in the literature as either frantic hopelessness or entrapment"

      • Dataset: "Two coders trained on the recognition of the factors in Suicide Crisis Syndrome coded a set of 221 posts for presence or absence of entrapment, achieving solid inter-coder reliability (Krippendorff's alpha = 0.72)"

      Prompt Engineering Process

      • Development effort: "The exercise proceeded through 47 recorded development steps, cumulatively about 20 hours of work. From a cold start with 0% performance, performance was boosted to an F1 of 0.53"

      • Best DSPy prompt: "The best resulting prompt... includes 15 exemplars (without CoT reasoning) and one bootstrapped reasoning demonstration"

      • DSPy comparison: "The best resulting prompt... achieves 0.548 F1 (and 0.385 / 0.952 precision / recall) on the test set, without making any use of the professor's email nor the incorrect instruction about the explicitness of entrapment"

      Key Takeaways

      • Sensitivity to details: "prompt engineering is fundamentally different from other ways of getting a computer to behave the way you want it to: these systems are being cajoled, not programmed, and... can be incredibly sensitive to specific details in prompts without there being any obvious reason those details should matter"

      • Domain expertise crucial: "the third and most important take-away is that prompt engineering should involve engagement between the prompt engineer, who has expertise in how to coax LLMs to behave in desired ways, and domain experts, who understand what those desired ways are and why"

      • Automation value: "Ultimately we found that there was significant promise in an automated method for exploring the prompting space, but also that combining that automation with human prompt engineering/revision was the most successful approach"

      Most-Used Techniques & Models

      Popular Techniques (by citations)

      • Top techniques: "The prevalence of citations for Few-Shot and Chain-of-Thought prompting is unsurprising and helps to establish a baseline for understanding the prevalence of other techniques"

      Popular Models (by citations in dataset)

      • Top models cited include: GPT-3, GPT-4, ChatGPT, PaLM, LLaMA families

      Popular Benchmarks

      • Top datasets: MMLU, GSM8K, various arithmetic and commonsense reasoning benchmarks

      Future Directions & Recommendations

      For Beginners

      • Start simple: "To those just beginning in prompt engineering, our recommendations resemble what one would recommend in any machine learning setting: understand the problem you are trying to solve (rather than just focusing on input/output and benchmark scores)"

      • Stay skeptical: "It is better to start with simpler approaches first, and to remain skeptical of claims about method performance"

      For Practitioners

      • Contextual understanding: "To those already engaged in prompt engineering, we hope that our taxonomy will shed light on the relationships between existing techniques"

      For Researchers

      • Situate new work: "To those developing new techniques, we encourage situating new methods within our taxonomy, as well as including ecologically valid case studies and illustrations of those techniques"

      Dataset & Methodology Details

      Dataset Composition

      • Final corpus: "The dataset contains 1,565 research papers in PDF format. Any duplicate papers were removed automatically, though some could exist"

      • Time frame: "The dataset was curated [over] the duration of the research paper, primarily in February of 2024"

      • Source distribution: "We wrote scripts to automatically query the APIs of Arxiv and Semantic Scholar"

      Quality Control

      • Human validation: "After collecting data from different sources, we removed duplicate papers and did a manual and semi-automated review of papers to ensure they were all relevant"

      • LLM-assisted review: "We develop a prompt using gpt-4-1106-preview to classify the remaining articles. We validate the prompt against 100 ground-truth annotations, achieving 89% precision and 75% recall (for an F1 of 81%)"
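
      The reported F1 figures follow from precision and recall as their harmonic mean. A short check, using only numbers quoted in these notes (89% / 75% for the classification prompt, 0.385 / 0.952 for the DSPy prompt):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


review_f1 = round(f1_score(0.89, 0.75), 2)   # LLM-assisted review prompt
dspy_f1 = round(f1_score(0.385, 0.952), 3)   # DSPy prompt in the case study
print(review_f1, dspy_f1)
```

      Both values match the figures the paper reports (81% and 0.548), which confirms they are F1 scores over the stated precision and recall.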

      Search Keywords (Selected Examples)

      • Core terms: "jailbreak prompt", "prompt engineering", "few-shot learning", "in-context learning"
      • Technique-specific: "chain-of-thought", "zero-shot prompting", "prompt optimization"
      • Domain-specific: "llm prompting", "transformer model prompts", "multimodal prompting"

      Critical Insights & Limitations

      Nature of Prompting

      • Black art acknowledgment: "This can be interpreted both optimistically and pessimistically. Optimistically, it demonstrates how improvements can arise through exploration and fortuitous discovery. On the pessimistic side, the value of duplicating the email in the prompt highlights the extent to which prompting remains a difficult to explain black art"

      • Emergent vs discovered: "Many of the techniques described here have been called 'emergent', but it is perhaps more appropriate to say that they were discovered—the result of thorough experimentation, analogies from human reasoning, or pure serendipity"

      Validation Challenges

      • Lack of standardization: "The field is new, and evaluation is variable and unstandardized—even the most meticulous experimentation may suffer from unanticipated shortcomings, and model outputs themselves are sensitive to meaning-preserving changes in inputs"

      • Transfer uncertainty: "As a result, we encourage the reader to avoid taking any claims at face value and to recognize that techniques may not transfer to other models, problems, or datasets"

      Scope Limitations

      • Focus restrictions: "To keep the work approachable to less technical readers and maintain a manageable scope... we only study task-agnostic techniques"

      • Exclusions: "These decisions keep the work approachable to less technical readers and maintain a manageable scope"

      Practical Implementation Notes

      Prompt Template Best Practices

      • Variable replacement: "A prompt template is a function that contains one or more variables which will be replaced by some media (usually text) to create a prompt"

      • Context preservation: "It is often necessary to include additional information in the prompt... Additional Information is sometimes called 'context', though we discourage the use of this term as it is overloaded with other meanings in the prompting space"

      Answer Extraction Strategies

      • Verbalizer design: "For example, if we wish for a model to predict whether a Tweet is positive or negative, we could prompt it to output either '+' or '-' and a verbalizer would map these token sequences to the appropriate labels"

      • Regex patterns: "Regexes are often used to extract answers. They are usually used to search for the first instance of a label. However, depending on the output format and whether CoTs are generated, it may be better to search for the last instance"

      • Cascading approaches: "Sometimes outputs are so complicated that regexes won't work consistently. In this case, it can be useful to have a separate LLM evaluate the output and extract an answer"
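
      The verbalizer and regex strategies above can be sketched as follows. This is a minimal illustration, assuming a two-label sentiment task; the label set, the pattern, and the function names are hypothetical and not taken from the paper.

```python
import re

# Hypothetical verbalizer: maps the model's output tokens to task labels.
VERBALIZER = {"+": "positive", "-": "negative"}

# Hypothetical pattern matching either label as a whole word.
LABEL_PATTERN = re.compile(r"\b(positive|negative)\b", re.IGNORECASE)


def verbalize(token: str) -> str:
    """Map a raw output token such as '+' to its label."""
    return VERBALIZER[token.strip()]


def extract_label(output: str, use_last: bool = False):
    """Return the first (or last) label mentioned in the output, or None."""
    matches = LABEL_PATTERN.findall(output)
    if not matches:
        return None
    chosen = matches[-1] if use_last else matches[0]
    return chosen.lower()


# With a chain of thought, the reasoning may mention a label before the final answer.
cot_output = "It might look negative at first, but the tone is warm. Answer: positive"
first = extract_label(cot_output)                 # 'negative' (misleading)
last = extract_label(cot_output, use_last=True)   # 'positive' (the final answer)
```

      Searching for the last match is the safer default here because the reasoning text precedes the answer; for outputs without a chain of thought, the first match is usually sufficient.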

      Model Selection Considerations

      • Guardrails interference: "A take-away from this initial phase is that the 'guard rails' associated with some large language models may interfere with the ability to make progress on a prompting task, and this could influence the choice of model for reasons other than the LLM's potential quality"

      • Temperature settings: "For the two Self-Consistency results, we set temperature to 0.5, following Wang et al. (2022)'s guidelines. For all other prompts, a temperature of 0 was used"

      Terminology Disambiguation

      Conflicting Usages

      • In-Context Learning ambiguity: "Note that the word 'learn' is misleading. ICL can simply be task specification–the skills are not necessarily new, and can have already been included in the training data"

      • Brown et al. definitions: "Brown et al. (2020) seemingly offer two different definitions for ICL... However, they explicitly state that ICL does not necessarily involve learning new tasks"

      • Prompt vs Prompt Template: "Brown et al. (2020) consider the word 'llama' to be the prompt, while 'Translate English to French:' is the 'task description'. More recent papers, including this one, refer to the entire string passed to the LLM as the prompt"

      Hard vs Soft Prompts

      • Hard (discrete): "These prompts only contain tokens that directly correspond to words in the LLM vocabulary"

      • Soft (continuous): "These prompts contain tokens that may not correspond to any word in the vocabulary... Soft prompts can be used when fine-tuning is desired, but modifying the weights of the full model is prohibitively expensive"

      Prefix vs Cloze

      • Prefix prompts: "In Prefix prompts, the token to be predicted is at the end of the prompt. This is usually the case with modern GPT-style models"

      • Cloze prompts: "In Cloze prompts, the token(s) to be predicted are presented as 'slots to fill', usually somewhere in the middle of the prompt. This is usually the case for earlier transformer models such as BERT"

      Advanced Technique Details

      AutoDiCoT (Novel Contribution)

      • Algorithm description: "We call the algorithm in Figure 6.12 Automatic Directed CoT (AutoDiCoT), since it automatically directs the CoT process to reason in a particular way"

      • Process: "For each pair (qi, ai) in training data: Label qi as entrapment or not using the model. If correct, prompt with 'Why?' to generate reasoning. If incorrect, prompt 'It is actually [is/is not] entrapment, please explain why.'"

      • Generalizability: "This technique can be generalized to any labeling task. It combines the automatic generation of CoTs with showing the LLM examples of bad reasoning, as in the case of Contrastive CoT"
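
      The process quoted above can be sketched as a short loop. This is a rough sketch under stated assumptions, not the paper's implementation: the `llm` callable, the yes/no prompt wording, and the shape of the returned tuples are all hypothetical, and `fake_llm` is a stand-in so the example runs without a model.

```python
from typing import Callable, List, Tuple


def auto_dicot(
    pairs: List[Tuple[str, bool]],
    llm: Callable[[str], str],
) -> List[Tuple[str, str, bool]]:
    """Collect (question, reasoning, label) exemplars from labeled training pairs."""
    exemplars = []
    for question, is_entrapment in pairs:
        # Step 1: ask the model to label the item (prompt wording is an assumption).
        answer = llm(f"{question}\nIs this entrapment? Answer yes or no.")
        predicted = answer.strip().lower().startswith("yes")
        # Step 2: ask for reasoning, correcting the model when it was wrong.
        if predicted == is_entrapment:
            follow_up = "Why?"
        else:
            verdict = "is" if is_entrapment else "is not"
            follow_up = f"It actually {verdict} entrapment, please explain why."
        reasoning = llm(f"{question}\n{answer}\n{follow_up}")
        exemplars.append((question, reasoning, is_entrapment))
    return exemplars


def fake_llm(prompt: str) -> str:
    """Stand-in model: always answers 'yes', then echoes the follow-up it received."""
    if prompt.endswith("Answer yes or no."):
        return "yes"
    return "reasoning for: " + prompt.splitlines()[-1]


demo = auto_dicot([("q1", True), ("q2", False)], fake_llm)
```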

      Design Decision Framework

      • Six critical factors: "We highlight six separate design decisions, including the selection and order of exemplars that critically influence the output quality"

      • Tradeoffs: "Although effective, employing KNN during prompt generation may be time and resource intensive"

      Iterative Retrieval

      • FLARE approach: "Forward-Looking Active REtrieval augmented generation (FLARE) and Imitate, Retrieve, Paraphrase (IRP) perform retrieval multiple times during long-form generation"

      • Three-step process: "1) generating a temporary sentence to serve as a content plan; 2) retrieving external knowledge using the temporary sentence as a query; 3) injecting the retrieved knowledge into the temporary sentence"

      • Query quality: "These temporary sentences have been shown to be better search queries compared to the document titles provided in long-form generation tasks"
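
      The three-step loop described above can be sketched generically. This is a minimal sketch assuming three caller-supplied functions (draft, retrieve, inject); all names are hypothetical, the toy stand-ins only show the control flow, and this is not the FLARE or IRP implementation.

```python
from typing import Callable, List


def generate_with_iterative_retrieval(
    topic: str,
    draft_sentence: Callable[[str, List[str]], str],
    retrieve: Callable[[str], str],
    inject: Callable[[str, str], str],
    num_sentences: int,
) -> List[str]:
    """Retrieve once per generated sentence instead of once per document."""
    written: List[str] = []
    for _ in range(num_sentences):
        temp = draft_sentence(topic, written)    # 1) temporary sentence as content plan
        knowledge = retrieve(temp)               # 2) use it as the search query
        written.append(inject(temp, knowledge))  # 3) fold the knowledge back in
    return written


# Toy stand-ins so the sketch runs end to end.
def toy_draft(topic: str, written: List[str]) -> str:
    return f"{topic} sentence {len(written) + 1}"


def toy_retrieve(query: str) -> str:
    return f"[fact for '{query}']"


def toy_inject(temp: str, knowledge: str) -> str:
    return f"{temp} {knowledge}"


result = generate_with_iterative_retrieval("topic", toy_draft, toy_retrieve, toy_inject, 2)
```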

      Meta-Analysis Statistics

      Citation Patterns

      • Most cited techniques: "The prevalence of citations for Few-Shot and Chain-of-Thought prompting is unsurprising and helps to establish a baseline for understanding the prevalence of other techniques"

      • Model usage: Citation analysis shows GPT family dominates research, followed by PaLM and open-source alternatives

      • Dataset popularity: MMLU, GSM8K, and arithmetic reasoning benchmarks most frequently used

      Research Trends

      • Paper growth: 1,565 relevant papers identified from broader corpus of 4,247 unique records

      • Quality metrics: Inter-annotator agreement of 92% (Krippendorff's α = Cohen's κ = 81%) for relevance labeling

      • LLM assistance: "We validate the prompt against 100 ground-truth annotations, achieving 89% precision and 75% recall (for an F1 of 81%)" for automated paper screening

      Formal Definitions

      Mathematical Formulation

      • Basic prompt conditioning: "p(A|T,Q) = ∏(i=1 to |A|) p_LM(ai|T,Q,a1:i-1)" where T is prompt template, Q is question, A is answer

      • Few-shot extension: "p(A|T(X,x)) = ∏(i=1 to |A|) p_LM(ai|T(X,x),a1:i-1)" where X is set of training exemplars

      • Optimization objective: "T* = argmax_T E_{xi,yi~D}[S(p_LM(A|T(xi)),yi)]" maximizing scoring function S over dataset D

      • Answer engineering: "A ~ p_LM(A|T(xi),yi); T* = argmax_{T,E} E_{xi,yi~D}[S(E(A),yi)]" where E is extraction function

      Storage & Implementation Constraints

      Browser Environment

      • Critical restriction: "NEVER use localStorage, sessionStorage, or ANY browser storage APIs in artifacts. These APIs are NOT supported and will cause artifacts to fail in the Claude.ai environment"

      • Alternatives: "Instead, you MUST: Use React state (useState, useReducer) for React components; Use JavaScript variables or objects for HTML artifacts; Store all data in memory during the session"

      Library Availability (React Artifacts)

      • Available libraries include: lucide-react, recharts, MathJS, lodash, d3, Plotly, Three.js (r128), Papaparse, SheetJS, shadcn/ui, Chart.js, Tone, mammoth, tensorflow
      • Important limitation: "NO OTHER LIBRARIES ARE INSTALLED OR ABLE TO BE IMPORTED"
      • Three.js caveat: "IMPORTANT: Do NOT use THREE.CapsuleGeometry as it was introduced in r142. Use alternatives like CylinderGeometry, SphereGeometry, or create custom geometries instead"

      Contributions & Authorship

      Team Structure

      • Lead authors: Sander Schulhoff (lead), Michael Ilie (co-lead)
      • Principal investigator: Philip Resnik
      • Total contributors: 58 authors from 13 institutions

      Major Section Leads

      • Benchmarking: Konstantine Kahadze
      • Agents: Ashay Srivastava
      • Alignment: Nishant Balepur
      • Security: Sevien Schulhoff
      • Multilingual: Dayeon Ki
      • Evaluation: Sweta Agrawal

      Domain Expertise

      • SCS labeling: Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker provided clinical expertise
      • Multilingual guidance: Marine Carpuat framed and reviewed multilingual section

      Additional Resources

      Maintained Resources

      • Live terminology: "We maintain an up-to-date list of terms and techniques at LearnPrompting.org"
      • Dataset access: Available on HuggingFace with full datasheet
      • Code repository: GitHub with systematic review pipeline

      Future Updates

      • Iterative taxonomy: "We expect this to be the first iteration of terminologies that will develop over time"
      • Community contribution: "If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? Yes, anyone is free to use/modify the data"

      Citation Information

      • Preferred citation: Schulhoff et al. (2024), "The Prompt Report: A Systematic Survey of Prompting Techniques"
      • Contact: sanderschulhoff@gmail.com for dataset inquiries
      • Funding acknowledgment: "$10,000 in API credits given by OpenAI"
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors analyze electrophysiological data recorded bilaterally from the rat hippocampus to investigate the coupling of ripple oscillations across the hemispheres. Commensurate with the majority of previous research, the authors report that ripples tend to co-occur across both hemispheres. Specifically, the amplitude of ripples across hemispheres is correlated but their phase is not. These data corroborate existing models of ripple generation suggesting that CA3 inputs (coordinated across hemispheres via the commissural fibers) drive the sharp-wave component while the individual ripple waves are the result of local interactions between pyramidal cells and interneurons in CA1.

      Strengths:

      The manuscript is well-written, the analyses well-executed and the claims are supported by the data.

      Weaknesses:

      One question left unanswered by this study is whether information encoded by the right and left hippocampi is correlated.

      Thank you for raising this important point. While our study demonstrates ripple co-occurrence across hemispheres, we did not directly assess whether the information encoded in each hippocampus is correlated. Addressing this question would require analyses of coordinated activity patterns, such as neuronal assemblies formed during novelty exposure, which fall beyond the scope of the present study. However, we agree this is an important avenue for future work, and we now acknowledge this limitation and outline it as a future direction in the Conclusion section (lines 796–802).

      Reviewer #2 (Public review):

      Summary:

      The authors completed a statistically rigorous analysis of the synchronization of sharp-wave ripples in the hippocampal CA1 across and within hemispheres. They used a publicly available dataset (collected in the Buzsaki lab) from 4 rats (8 sessions) recorded with silicon probes in both hemispheres. Each session contained approximately 8 hours of activity recorded during rest. The authors found that the characteristics of ripples did not differ between hemispheres, and that most ripples occurred almost simultaneously on all probe shanks within a hemisphere as well as across hemispheres. The differences in amplitude and exact timing of ripples between recording sites increased slightly with the distance between recording sites. However, the phase coupling of ripples (in the 100-250 Hz range) changed dramatically with the distance between recording sites. Ripples in opposite hemispheres were about 90% less coupled than ripples on nearby tetrodes in the same hemisphere. Phase coupling also decreased with distance within the hemisphere. Finally, pyramidal cell and interneuron spikes were coupled to the local ripple phase and less so to ripples at distant sites or the opposite hemisphere.

      Strengths:

      The analysis was well-designed and rigorous. The authors used statistical tests well suited to the hypotheses being tested, and clearly explained these tests. The paper is very clearly written, making it easy to understand and reproduce the analysis. The authors included an excellent review of the literature to explain the motivation for their study.

      Weaknesses:

      The authors state that their findings (highly coincident ripples between hemispheres) contradict other findings in the literature (in particular the study by Villalobos, Maldonado, and Valdes, 2017), but fail to explain why this large difference exists. They seem to imply that the previous study was flawed, without examining the differences between the studies.

      The paper fails to mention the context in which the data was collected (the behavior the animals performed before and after the analyzed data), which may in fact have a large impact on the results and explain the differences between the current study and that by Villalobos et al. The Buzsaki lab data includes rats running laps in a novel environment in the middle of two rest sessions. Given that ripple occurrence is influenced by behavior, and that the neurons spiking during ripples are highly related to the prior behavioral task, it is likely that exposure to novelty changed the statistics of ripples. Thus, the authors should analyze the pre-behavior rest and post-behavior rest sessions separately. The Villalobos et al. data, in contrast, was collected without any intervening behavioral task or novelty (to my knowledge). Therefore, I predict that the opposing results are a result of the difference in recent experiences of the studied rats, and can actually give us insight into the memory function of ripples.

      We appreciate this thoughtful hypothesis and have now addressed it explicitly. Our main analysis was conducted on 1-hour concatenated SWS epochs recorded before any novel environment exposure (baseline sleep). This was not clearly stated in the original manuscript, so we have now added a clarifying paragraph (lines 131–143). The main findings therefore remain unchanged.

      To directly test the reviewer’s hypothesis, we performed the suggested comparison between pre- and post-maze rest sessions, including maze type as a factor. These new analyses are now presented in a dedicated Results subsection (lines 475–493) and in Supplementary Figure 5.1. While we observed a modest increase in ripple abundance after the maze sessions, consistent with known experience-dependent changes in ripple occurrence, the key findings of interhemispheric synchrony remained unchanged. Both pre- and post-maze sleep sessions showed robust bilateral time-locking of ripple events and similar dissociations between phase and amplitude coupling across hemispheres.

      In one figure (5), the authors show data separated by session, rather than pooled. They should do this for other figures as well. There is a wide spread between sessions, which further suggests that the results are not as widely applicable as the authors seem to think. Do the sessions with small differences between phase coupling and amplitude coupling have low inter-hemispheric amplitude coupling, or high phase coupling? What is the difference between the sessions with low and high differences in phase vs. amplitude coupling? I noticed that the Buzsaki dataset contains data from rats running either on linear tracks (back and forth), or on circular tracks (unidirectionally). This could create a difference in inter-hemisphere coupling, because rats running on linear tracks would have the same sensory inputs to both hemispheres (when running in opposite directions), while rats running on a circular track would have different sensory inputs coming from the right and left (one side would include stimuli in the middle of the track, and the other would include closer views of the walls of the room). The synchronization between hemispheres might be impacted by how much overlap there was in sensory stimuli processed during the behavior epoch.

      Thank you for this insightful suggestion. In our new analyses comparing pre- and post-maze sessions, we have also addressed this question. Supplementary Figures 4.1 and 5.1 (E-F) present coupling metrics averaged per session and include coding for maze type. Additionally, we have incorporated the reviewer’s hypothesis regarding sensory input differences and their potential impact on inter-hemispheric synchronization into a new Results subsection (lines 475–493).

      The paper would be a lot stronger if the authors analyzed some of the differences between datasets, sessions, and epochs based on the task design, and wrote more about these issues. There may be more publicly available bi-hemispheric datasets to validate their results.

      To further validate our findings, we have analyzed another publicly available dataset that includes bilateral CA1 recordings (https://crcns.org/data-sets/hc/hc-18). We have added a description of this dataset and our analysis approach in the Methods section (lines 119–125 and 144-145), and present the corresponding results in a new Supplementary Figure (Supplementary Figure 4.2). These new analyses replicated our main findings, confirming robust interhemispheric time-locking of ripple events and a greater dissociation between phase and amplitude coupling in ipsilateral versus contralateral recordings.

      Reviewer #1 (Recommendations for the authors):

      My only suggestion is that the introduction can be shortened. The authors discuss in great length literature linking ripples and memory, although the findings in the paper are not linked to memory. In addition, ripples have been implicated in non-mnemonic functions such as sleep and metabolic homeostasis.

      The reviewer’s suggestion is valid and aligns with the main message of our paper. However, we believe that the relationship between ripples and memory has been extensively discussed in the literature, sometimes overshadowing other important functional roles (based on the reviewer’s comment, we now also refer to non-mnemonic functions of ripples in the revised introduction [lines 87–89]). Thus, we find it important to retain this context because highlighting the publication bias towards mnemonic interpretations helps frame the need for studies like ours that revisit still incompletely understood basic ripple mechanisms.

      We also note that, based on a suggestion from reviewer 2, we have supplemented our manuscript with a new figure demonstrating ripple abundance increases during SWS following novel environment exposure (Supplementary Figure 5.1), linking it to memory and replicating the findings of Eschenko et al. (2008), though we present this result as a covariate, aimed at controlling for potential sources of variation in ripple synchronization.

      Reviewer #2 (Recommendations for the authors):

      It would be useful to include more information about the analyzed dataset in the methods section, e.g. how long were the recordings, how many datasets per rat, did the authors analyze the entire recording epoch or sub-divide it in any way, how many ripples were detected per recording (approximately).

      We have now included more detailed information in the Methods section (lines 104–145).

      A few of the references to sub-figures are mislabeled (e.g. lines 327-328).

      Thank you for noticing these inconsistencies. We have carefully reviewed and corrected all figure sub-panel labels and references throughout the manuscript.

      In Figure 7 C&D, are the neurons on the left sorted by contralateral ripple phase? It doesn't look like it. It would be easier to compare to ipsilateral if they were.

      In Figures 7C and 7D, neurons are sorted by their ipsilateral peak ripple phase, with the contralateral data plotted using the same ordering to facilitate comparison. To avoid confusion, we have clarified this explicitly in the figure legend and corresponding main text (lines 544–550).

      In Figure 6, using both bin sizes 50 and 100 doesn't contribute much.

      We used both 50 ms and 100 ms bin sizes to directly compare with previous studies (Villalobos et al. 2017 used 5 ms and 100 ms; Csicsvari et al. 2000 used 5–50 ms). Because the proportion of coincident ripples is a non-decreasing function of the window size, larger bins can inflate coincidence measures. Including a mid-range bin of 50 ms allowed us to show that high coincidence levels are reached well before the 100 ms upper bound, supporting that the 100 ms window is not an overshoot. We have added clarification on this point in the Methods section on ripple coincidence (lines 204–212).
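
      The monotonicity argument above can be illustrated with a toy calculation. This is a minimal sketch, assuming coincidence is defined as the fraction of ripples in one hemisphere that have at least one ripple in the other hemisphere within plus or minus the window; the function name and the timestamps are hypothetical and are not the authors' code or data.

```python
from bisect import bisect_left


def coincidence_fraction(ref_times, other_times, window_s):
    """Fraction of reference events with a partner event within +/- window_s seconds."""
    other = sorted(other_times)
    if not ref_times or not other:
        return 0.0
    hits = 0
    for t in ref_times:
        i = bisect_left(other, t)
        # Only the events just before and just after t can be the nearest partner.
        neighbors = other[max(i - 1, 0): i + 1]
        if any(abs(t - u) <= window_s for u in neighbors):
            hits += 1
    return hits / len(ref_times)


# Hypothetical ripple peak times (seconds) in the left and right hemisphere.
left = [1.000, 2.000, 3.000, 4.000]
right = [1.003, 2.040, 3.080, 9.000]

# Widening the window can only add coincident events, never remove them.
fractions = [coincidence_fraction(left, right, w) for w in (0.005, 0.050, 0.100)]
```

      Because every event counted at a narrow window is also counted at a wider one, the resulting fractions never decrease as the window grows, which is why an intermediate window helps show where the curve saturates.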

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Lu & Golomb combined EEG, artificial neural networks, and multivariate pattern analyses to examine how different visual variables are processed in the brain. The conclusions of the paper are mostly well supported, but some aspects of methods and data analysis would benefit from clarification and potential extensions.

      The authors find that not only real-world size is represented in the brain (which was known), but both retinal size and real-world depth are represented, at different time points or latencies, which may reflect different stages of processing. Prior work has not been able to answer the question of real-world depth due to the stimuli used. The authors made this possible by assessing real-world depth and testing it with appropriate methodology, accounting for retinal and real-world size. The methodological approach combining behavior, RSA, and ANNs is creative and well thought out to appropriately assess the research questions, and the findings may be very compelling if backed up with some clarifications and further analyses.

      The work will be of interest to experimental and computational vision scientists, as well as the broader computational cognitive neuroscience community as the methodology is of interest and the code is or will be made available. The work is important as it is currently not clear what the correspondence between many deep neural network models and the brain is, and this work pushes our knowledge forward on this front. Furthermore, the availability of methods and data will be useful for the scientific community.

      Reviewer #2 (Public Review):

      Summary:

      This paper aims to test if neural representations of images of objects in the human brain contain a 'pure' dimension of real-world size that is independent of retinal size or perceived depth. To this end, they apply representational similarity analysis on EEG responses in 10 human subjects to a set of 200 images from a publicly available database (THINGS-EEG2), correlating pairwise distinctions in evoked activity between images with pairwise differences in human ratings of real-world size (from THINGS+). By partialling out correlations with metrics of retinal size and perceived depth from the resulting EEG correlation time courses, the paper claims to identify an independent representation of real-world size starting at 170 ms in the EEG signal. Further comparisons with artificial neural networks and language embeddings lead the authors to claim this correlation reflects a relatively 'high-level' and 'stable' neural representation.

      Strengths:

      The paper features insightful illustrations and clear figures.

      The limitations of prior work motivating the current study are clearly explained and seem reasonable (although the rationale for why using 'ecological' stimuli with backgrounds matters when studying real-world size could be made clearer; one could also argue the opposite, that to get a 'pure' representation of the real-world size of an 'object concept', one should actually show objects in isolation).

      The partial correlation analysis convincingly demonstrates how correlations between feature spaces can affect their correlations with EEG responses (and how taking into account these correlations can disentangle them better).

      The RSA analysis and associated statistical methods appear solid.

      Weaknesses:

      The claim of methodological novelty is overblown. Comparing image metrics, behavioral measurements, and ANN activations against EEG using RSA is a commonly used approach to study neural object representations. The dataset size (200 test images from THINGS) is not particularly large, and neither comparing pre-trained DNNs and language models nor using partial correlations is new.

      Thanks for your feedback. We agree that the methods used in our study – such as RSA, partial correlations, and the use of pretrained ANN and language models – are indeed well-established in the literature. We therefore revised the manuscript to more carefully frame our contribution: rather than emphasizing methodological novelty in isolation, we now highlight the combination of techniques, the application to human EEG data with naturalistic images, and the explicit dissociation of real-world size, retinal size, and depth representations as the primary strengths of our approach. Corresponding language in the Abstract, Introduction, and Discussion has been adjusted to reflect this more precise positioning:

      (Abstract, line 34 to 37) “our study combines human EEG and representational similarity analysis to disentangle neural representations of object real-world size from retinal size and perceived depth, leveraging recent datasets and modeling approaches to address challenges not fully resolved in previous work.”

      (Introduction, line 104 to 106) “we overcome these challenges by combining human EEG recordings, naturalistic stimulus images, artificial neural networks, and computational modeling approaches including representational similarity analysis (RSA) and partial correlation analysis …”

      (Introduction, line 108) “We applied our integrated computational approach to an open EEG dataset…”

      (Introduction, line 142 to 143) “The integrated computational approach by cross-modal representational comparisons we take with the current study…”

      (Discussion, line 550 to 552) “our study goes beyond the contributions of prior studies in several key ways, offering both theoretical and methodological advances: …”

      The claims also seem too broad given the fairly small set of RDMs that are used here (3 size metrics, 4 ANN layers, 1 Word2Vec RDM): there are many aspects of object processing not studied here, so it's not correct to say this study provides a 'detailed and clear characterization of the object processing process'.

      Thanks for pointing this out. We softened language in our manuscript to reflect that our findings provide a temporally resolved characterization of selected object features, rather than a comprehensive account of object processing:

      (line 34 to 37) “our study combines human EEG and representational similarity analysis to disentangle neural representations of object real-world size from retinal size and perceived depth, leveraging recent datasets and modeling approaches to address challenges not fully resolved in previous work.”

      (line 46 to 48) “Our research provides a temporally resolved characterization of how certain key object properties – such as object real-world size, depth, and retinal size – are represented in the brain, …”

      The paper lacks an analysis demonstrating the validity of the real-world depth measure, which is here computed from the other two metrics by simply dividing them. The rationale and logic of this metric are not clearly explained. Is it intended to reflect the hypothesized egocentric distance to the object in the image if the person had in fact been 'inside' the image? How do we know this is valid? It would be helpful if the authors provided a validation of this metric.

      We appreciate the comment regarding the real-world depth metric. Specifically, this metric was computed as the ratio of real-world size (obtained via behavioral ratings) to measured retinal size. The rationale behind this computation is grounded in the basic principles of perspective projection: for two objects subtending the same retinal size, the physically larger object is presumed to be farther away. This ratio thus serves as a proxy for perceived egocentric depth under the simplifying assumption of consistent viewing geometry across images.
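
      The geometric reasoning behind this ratio can be written out under a simple pinhole-camera assumption. The symbols are introduced here for illustration only: S is the real-world size, s the projected (retinal) size, D the egocentric distance, and f a fixed focal length standing in for the "consistent viewing geometry" assumption.

```latex
\[
  s = f\,\frac{S}{D}
  \quad\Longrightarrow\quad
  D = f\,\frac{S}{s} \;\propto\; \frac{S}{s}
\]
```

      With f held constant across images, the ratio S/s orders objects by distance; because S comes from behavioral ratings rather than physical measurements, the result is best read as a relative proxy rather than a metric distance.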

      We acknowledge that this is a derived estimate and not a direct measurement of perceived depth. While it provides a useful approximation that allows us to analytically dissociate the contributions of real-world size and depth in our RSA framework, we agree that future work would benefit from independent perceptual depth ratings to validate or refine this metric. We added more discussions about this to our revised manuscript:

      (line 652 to 657) “Additionally, we acknowledge that our metric for real-world depth was derived indirectly as the ratio of perceived real-world size to retinal size. While this formulation is grounded in geometric principles of perspective projection and served the purpose of analytically dissociating depth from size in our RSA framework, it remains a proxy rather than a direct measure of perceived egocentric distance. Future work incorporating behavioral or psychophysical depth ratings would be valuable for validating and refining this metric.”

      Given that there is only 1 image/concept here, the factor of real-world size may be confounded with other things, such as semantic category (e.g. buildings vs. tools). While the comparison of the real-world size metric appears to be effectively disentangled from retinal size and (the authors' metric of) depth here, there are still many other object properties that are likely correlated with real-world size and therefore will confound identifying a 'pure' representation of real-world size in EEG. This could be addressed by adding more hypothesis RDMs reflecting different aspects of the images that may correlate with real-world size.

      We thank the reviewer for this thoughtful and important point. We agree that semantic category and real-world size may be correlated, and that semantic structure is one of the plausible sources of variance contributing to real-world size representations. However, we would like to clarify that our original goal was to isolate real-world size from two key physical image features — retinal size and inferred real-world depth — which have been major confounds in prior work on this topic. We acknowledge that although our analysis disentangled real-world size from depth and retinal size, this does not imply a fully “pure” representation; therefore, we now refer to the real-world size representations as “partially disentangled” throughout the manuscript to reflect this nuance.

      Interestingly, after controlling for these physical features, we still found a robust and statistically isolated representation of real-world size in the EEG signal. This motivated the idea that real-world size may be more than a purely perceptual or image-based property; it may be at least partially semantic. Supporting this interpretation, both the late layers of ANN models and the non-visual semantic model (Word2Vec) also captured real-world size structure. Rather than treating semantic information as an unwanted confound, we propose that semantic structure may be an inherent component of how the brain encodes real-world size.

      To directly address your concern, we conducted an additional variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by four RDMs: real-world depth, retinal size, real-world size, and semantic information (from Word2Vec). Specifically, for each EEG timepoint, we quantified (1) the unique variance of real-world size, after controlling for semantic similarity, depth, and retinal size; (2) the unique variance of semantic information, after controlling for real-world size, depth, and retinal size; and (3) the shared variance jointly explained by real-world size and semantic similarity, controlling for depth and retinal size. This analysis revealed that real-world size explained unique variance in EEG even after accounting for semantic similarity. There was also substantial shared variance, indicating partial overlap between semantic structure and size. Semantic information also contributed unique explanatory power, as expected. These results suggest that real-world size is indeed partially semantic in nature, but also has an independent neural representation not fully explained by general semantic similarity. This strengthens our conclusion that real-world size functions as a meaningful, higher-level dimension in object representation space.

      We now include this new analysis and a corresponding figure (Figure S9) in the revised manuscript:

      (line 532 to 539) “Second, we conducted a variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by three hypothesis-based RDMs and the semantic RDM (Word2Vec RDM), and we still found that real-world size explained unique variance in EEG even after accounting for semantic similarity (Figure S9). And we also observed a substantial shared variance jointly explained by real-world size and semantic similarity and a unique variance of semantic information. These results suggest that real-world size is indeed partially semantic in nature, but also has independent neural representation not fully explained by general semantic similarity.”
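
      The variance partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each RDM has been vectorized (lower triangle) into a 1-D array, that explained variance is measured as the R² of an ordinary least-squares fit, and that the shared term is the size-and-semantics commonality beyond depth and retinal size. The function names `r_squared` and `partition_size_semantics` are hypothetical.

```python
import numpy as np


def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()


def partition_size_semantics(eeg, size, semantic, depth, retinal):
    """Split EEG RDM variance (one time point) into unique and shared parts.

    All inputs are vectorized RDMs of equal length (an assumption of this sketch).
    """
    full = r_squared(eeg, [size, semantic, depth, retinal])
    no_size = r_squared(eeg, [semantic, depth, retinal])
    no_semantic = r_squared(eeg, [size, depth, retinal])
    base = r_squared(eeg, [depth, retinal])  # control variables only

    unique_size = full - no_size          # size beyond semantics, depth, retinal size
    unique_semantic = full - no_semantic  # semantics beyond size, depth, retinal size
    # Variance explained jointly by size and semantics, beyond the controls.
    shared = (full - base) - unique_size - unique_semantic
    return {
        "unique_size": unique_size,
        "unique_semantic": unique_semantic,
        "shared": shared,
    }
```

      Calling `partition_size_semantics` once per EEG time point gives three time courses. Note that a shared term computed this way can be negative when the predictors act as suppressors, so it should be read as a commonality coefficient rather than a strict proportion.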

      The choice of ANNs lacks a clear motivation. Why these two particular networks? Why pick only 2 somewhat arbitrary layers? If the goal is to identify more semantic representations using CLIP, the comparison between CLIP and vision-only ResNet should be done with models trained on the same training datasets (to exclude the effect of training dataset size & quality; cf Wang et al., 2023). This is necessary to substantiate the claims on page 19 which attributed the differences between models in terms of their EEG correlations to one of them being a 'visual model' vs. 'visual-semantic model'.

      We agree that the choice and comparison of models should be better contextualized.

      First, our motivation for selecting ResNet-50 and CLIP ResNet-50 was not to make a definitive comparison between model classes, but rather to include two widely used representatives of their respective categories—one trained purely on visual information (ResNet-50 on ImageNet) and one trained with joint visual and linguistic supervision (CLIP ResNet-50 on image–text pairs). These models are both highly influential and commonly used in computational and cognitive neuroscience, allowing for relevant comparisons with existing work (line 181-187).

      Second, we recognize that limiting the EEG × ANN correlation analyses to only early and late layers may be viewed as insufficiently comprehensive. To address this point, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figure S4, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages. We chose to highlight early and late layers in the main text to simplify interpretation.

      Third, we appreciate the reviewer’s point that differences in training datasets (ImageNet vs. CLIP's dataset) may confound any attribution of differences in brain alignment to the models' architectural or learning differences. We agree that the comparisons between models trained on matched datasets (e.g., vision-only vs. multimodal models trained on the same image–text corpus) would allow for more rigorous conclusions. Thus, we explicitly acknowledged this limitation in the text:

      (line 443 to 445) “However, it is also possible that these differences between ResNet and CLIP reflect differences in training data scale and domain.”

      The first part of the claim on page 22 based on Figure 4 'The above results reveal that real-world size emerges with later peak neural latencies and in the later layers of ANNs, regardless of image background information' is not valid since no EEG results for images without backgrounds are shown (only ANNs).

      We revised the sentence to clarify that this is a hypothesis based on the ANN results, not an empirical EEG finding:

      (line 491 to 495) “These results show that real-world size emerges in the later layers of ANNs regardless of image background information, and – based on our prior EEG results – although we could not test object-only images in the EEG data, we hypothesize that a similar temporal profile would be observed in the brain, even for object-only images.”

      While we only had the EEG data of human subjects viewing naturalistic images, the ANN results suggest that real-world size representations may still emerge at later processing stages even in the absence of background, consistent with what we observed in EEG under with-background conditions.

      The paper is likely to impact the field by showcasing how using partial correlations in RSA is useful, rather than providing conclusive evidence regarding neural representations of objects and their sizes.

      Additional context important to consider when interpreting this work:

      Page 20, the authors point out similarities of peak correlations between models ('Interestingly, the peaks of significant time windows for the EEG × HYP RSA also correspond with the peaks of the EEG × ANN RSA timecourse (Figure 3D,F)'). Although not explicitly stated, this seems to imply that they infer from this that the ANN-EEG correlation might be driven by their representation of the hypothesized feature spaces. However, this does not follow: in EEG-image metric model comparisons it is very typical to see multiple peaks for any type of model. This simply reflects specific time points in EEG at which visual inputs (images) yield distinctive EEG amplitudes (perhaps due to stereotypical waves of neural processing?), but one cannot infer the information being processed is the same. To investigate this, one could for example conduct variance partitioning or commonality analysis to see if there is variance at these specific timepoints that is shared by a specific combination of the hypothesis and ANN feature spaces.

      Thanks for your thoughtful observation! Upon reflection, we agree that the sentence – "Interestingly, the peaks of significant time windows for the EEG × HYP RSA also correspond with the peaks of the EEG × ANN RSA timecourse" – was speculative and risked implying a causal link that our data do not warrant. As you rightly point out, observing coincident peak latencies across different models does not necessarily imply shared representational content, given the stereotypical dynamics of evoked EEG responses. We also think that even a variance partitioning analysis would not suffice to infer that ANN-EEG correlations are driven specifically by the hypothesized feature spaces. Accordingly, we have removed this sentence from the manuscript to avoid overinterpretation.

      Page 22 mentions 'The significant time-window (90-300ms) of similarity between Word2Vec RDM and EEG RDMs (Figure 5B) contained the significant time-window of EEG x real-world size representational similarity (Figure 3B)'. This is not particularly meaningful given that the Word2Vec correlation is significant for the entire EEG epoch (from the time-point of the signal 'arriving' in visual cortex around ~90 ms) and is thus much less temporally specific than the real-world size EEG correlation. Again a stronger test of whether Word2Vec indeed captures neural representations of real-world size could be to identify EEG time-points at which there are unique Word2Vec correlations that are not explained by either ResNet or CLIP, and see if those timepoints share variance with the real-world size hypothesized RDM.

      We appreciate your insightful comment. Upon reflection, we agree that the sentence – "The significant time-window (90-300ms) of similarity between Word2Vec RDM and EEG RDMs (Figure 5B) contained the significant time-window of EEG x real-world size representational similarity (Figure 3B)" – was speculative, and we have removed it from the manuscript to avoid overinterpretation.

      Additionally, we conducted the two analyses you suggested and report them in the supplement. First, we calculated the partial correlation between EEG RDMs and the Word2Vec RDM while controlling for four ANN RDMs (ResNet early/late and CLIP early/late) (Figure S8). Even after regressing out these ANN-derived features, we observed significant correlations between Word2Vec and EEG RDMs in the 100–190 ms and 250–300 ms time windows. This result suggests that Word2Vec captures semantic structure in the neural signal that is not accounted for by ResNet or CLIP. Second, we conducted an additional variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by four RDMs: real-world depth, retinal size, real-world size, and semantic information (from Word2Vec) (Figure S9). We found significant shared variance between Word2Vec and real-world size at 130–150 ms and 180–250 ms. These results indicate a partially overlapping representational structure between semantic content and real-world size in the brain.

      We also added these in our revised manuscript:

      (line 525 to 539) “To further probe the relationship between real-world size and semantic information, and to examine whether Word2Vec captures variances in EEG signals beyond that explained by visual models, we conducted two additional analyses. First, we performed a partial correlation between EEG RDMs and the Word2Vec RDM, while regressing out four ANN RDMs (early and late layers of both ResNet and CLIP) (Figure S8). We found that semantic similarity remained significantly correlated with EEG signals across sustained time windows (100-190ms and 250-300ms), indicating that Word2Vec captures neural variance not fully explained by visual or visual-language models. Second, we conducted a variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by three hypothesis-based RDMs and the semantic RDM (Word2Vec RDM), and we still found that real-world size explained unique variance in EEG even after accounting for semantic similarity (Figure S9). And we also observed a substantial shared variance jointly explained by real-world size and semantic similarity and a unique variance of semantic information. These results suggest that real-world size is indeed partially semantic in nature, but also has independent neural representation not fully explained by general semantic similarity.”

      Reviewer #3 (Public Review):

      The authors used an open EEG dataset of observers viewing real-world objects. Each object had a real-world size value (from human rankings), a retinal size value (measured from each image), and a scene depth value (inferred from the above). The authors combined the EEG and object measurements with extant, pre-trained models (a deep convolutional neural network, a multimodal ANN, and Word2vec) to assess the time course of processing object size (retinal and real-world) and depth. They found that depth was processed first, followed by retinal size, and then real-world size. The depth time course roughly corresponded to the visual ANNs, while the real-world size time course roughly corresponded to the more semantic models.

      The time course result for the three object attributes is very clear and a novel contribution to the literature. However, the motivations for the ANNs could be better developed, the manuscript could better link to existing theories and literature, and the ANN analysis could be modernized. I have some suggestions for improving specific methods.

      (1) Manuscript motivations

      The authors motivate the paper in several places by asking "whether biological and artificial systems represent object real-world size". This seems odd for a couple of reasons. Firstly, the brain must represent real-world size somehow, given that we can reason about this question. Second, given the large behavioral and fMRI literature on the topic, combined with the growing ANN literature, this seems like a foregone conclusion and undermines the novelty of this contribution.

      Thanks for your helpful comment. We agree that asking whether the brain represents real-world size is not a novel question, given the existing behavioral and neuroimaging evidence supporting this. Our intended focus was not on the existence of real-world size representations per se, but on the nature of these representations, particularly the relationship between the temporal dynamics and potential mechanisms of representations of real-world size versus other related perceptual properties (e.g., retinal size and real-world depth). We revised the relevant sentence to better reflect our focus, shifting from a binary framing (“whether or not size is represented”) to a more mechanistic and time-resolved inquiry (“how and when such representations emerge”):

      (line 144 to 149) “Unraveling the internal representations of object size and depth features in both human brains and ANNs enables us to investigate how distinct spatial properties—retinal size, real-world depth, and real-world size—are encoded across systems, and to uncover the representational mechanisms and temporal dynamics through which real-world size emerges as a potentially higher-level, semantically grounded feature.”

      While the introduction further promises to "also investigate possible mechanisms of object real-world size representations.", I was left wishing for more in this department. The authors report correlations between neural activity and object attributes, as well as between neural activity and ANNs. It would be nice to link the results to theories of object processing (e.g., a feedforward sweep, such as DiCarlo and colleagues have suggested, versus a reverse hierarchy, such as suggested by Hochstein, among others). What is semantic about real-world size, and where might this information come from? (Although you may have to expand beyond the posterior electrodes to do this analysis).

      We thank the reviewer for this insightful comment. We agree that understanding the mechanisms underlying real-world size representations is a critical question. While our current study does not directly test specific theoretical frameworks such as the feedforward sweep model or the reverse hierarchy theory, our results do offer several relevant insights: The temporal dynamics revealed by EEG—where real-world size emerges later than retinal size and depth—suggest that such representations likely arise beyond early visual feedforward stages, potentially involving higher-level semantic processing. This interpretation is further supported by the fact that real-world size is strongly captured by late layers of ANNs and by a purely semantic model (Word2Vec), suggesting its dependence on learned conceptual knowledge.

      While we acknowledge that our analyses were limited to posterior electrodes and thus cannot directly localize the cortical sources of these effects, we view this work as a first step toward bridging low-level perceptual features and higher-level semantic representations. We hope future work combining broader spatial sampling (e.g., anterior EEG sensors or source localization) and multimodal recordings (e.g., MEG, fMRI) can build on these findings to directly test competing models of object processing and representation hierarchy.

      We also added these to the Discussion section:

      (line 619 to 638) “Although our study does not directly test specific models of visual object processing, the observed temporal dynamics provide important constraints for theoretical interpretations. In particular, we find that real-world size representations emerge significantly later than low-level visual features such as retinal size and depth. This temporal profile is difficult to reconcile with a purely feedforward account of visual processing (e.g., DiCarlo et al., 2012), which posits that object properties are rapidly computed in a sequential hierarchy of increasingly complex visual features. Instead, our results are more consistent with frameworks that emphasize recurrent or top-down processing, such as the reverse hierarchy theory (Hochstein & Ahissar, 2002), which suggests that high-level conceptual information may emerge later and involve feedback to earlier visual areas. This interpretation is further supported by representational similarities with late-stage artificial neural network layers and with a semantic word embedding model (Word2Vec), both of which reflect learned, abstract knowledge rather than low-level visual features. Taken together, these findings suggest that real-world size is not merely a perceptual attribute, but one that draws on conceptual or semantic-level representations acquired through experience. While our EEG analyses focused on posterior electrodes and thus cannot definitively localize cortical sources, we see this study as a step toward linking low-level visual input with higher-level semantic knowledge. Future work incorporating broader spatial coverage (e.g., anterior sensors), source localization, or complementary modalities such as MEG and fMRI will be critical to adjudicate between alternative models of object representation and to more precisely trace the origin and flow of real-world size information in the brain.”

      Finally, several places in the manuscript tout the "novel computational approach". This seems odd because the computational framework and pipeline have been the most common approach in cognitive computational neuroscience in the past 5-10 years.

      We have revised relevant statements throughout the manuscript to avoid overstating novelty and to better reflect the contribution of our study.

      (2) Suggestion: modernize the approach

      I was surprised that the computational models used in this manuscript were all 8-10 years old. Specifically, because there are now deep nets that more explicitly model the human brain (e.g., Cornet) as well as more sophisticated models of semantics (e.g., LLMs), I was left hoping that the authors had used more state-of-the-art models in the work. Moreover, the use of a single dCNN, a single multi-modal model, and a single word embedding model makes it difficult to generalize about visual, multimodal, and semantic features in general.

      Thanks for your suggestion. Indeed, our choice of ResNet and CLIP was motivated by their widespread use in cognitive and computational neuroscience. These models have served as standard benchmarks in many studies exploring correspondence between ANNs and human brain activity. To address your concern, we have now added additional results from the more biologically inspired model, CORnet, in the supplementary materials (Figure S10). The results for CORnet show similar patterns to those observed for ResNet and CLIP, providing converging evidence across models.

      Regarding semantic modeling, we intentionally chose Word2Vec rather than large language models (LLMs), because our goal was to examine concept-level, context-free semantic representations. Word2Vec remains the most widely adopted approach for obtaining non-contextualized embeddings that reflect core conceptual similarity, as opposed to the context-dependent embeddings produced by LLMs, which are less directly suited for capturing stable concept-level structure across stimuli.

      (3) Methodological considerations

      (a) Validity of the real-world size measurement

      I was concerned about a few aspects of the real-world size rankings. First, I am trying to understand why the scale goes from 100-519. This seems very arbitrary; please clarify. Second, are we to assume that this scale is linear? Is this appropriate when real-world object size is best expressed on a log scale? Third, the authors provide "sand" as an example of the smallest real-world object. This is tricky because sand is more "stuff" than "thing", so I imagine it leaves observers wondering whether the experimenter intends a grain of sand or a sandy scene region. What is the variability in real-world size ratings? Might the variability also provide additional insights in this experiment?

      We now clarify the origin, scaling, and interpretation of the real-world size values obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      Regarding the term “sand”: the THINGS+ dataset distinguished between object meanings when ambiguity was present. For “sand,” participants were instructed to treat it as “a grain of sand”— consistent with the intended meaning of a discrete, minimal-size reference object. 

      Finally, we acknowledge that real-world size ratings may carry some degree of variability across individuals. However, the dataset includes ratings from 2010 participants across 1854 object concepts, with each object receiving at least 50 independent ratings. Given this large and diverse sample, the mean size estimates are expected to be stable and robust across subjects. While we did not include variability metrics in our main analysis, we believe the aggregated ratings provide a reliable estimate of perceived real-world size.

      We added these details in the Materials and Methods section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      (b) This work has no noise ceiling to establish how strong the model fits are, relative to the intrinsic noise of the data. I strongly suggest that these are included.

      We have now computed noise ceiling estimates for the EEG RDMs across time. The noise ceiling was calculated by correlating each participant’s EEG RDM with the average EEG RDM across the remaining participants (leave-one-subject-out), at each time point. This provides an upper-bound estimate of the explainable variance, reflecting the maximum similarity that any model—no matter how complex—could potentially achieve, given the intrinsic variability in the EEG data.
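
      As an illustration of the leave-one-subject-out procedure just described, a minimal sketch follows. It assumes the subject RDMs are stored as an array of shape (subjects, time points, stimulus pairs) and uses Spearman correlation; the response does not state which correlation measure was used for the noise ceiling, so that choice, like the function names, is an assumption.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, assigning tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]


def loso_noise_ceiling(rdms):
    """Leave-one-subject-out noise ceiling.

    rdms: array of shape (n_subjects, n_times, n_pairs) holding vectorized RDMs.
    Returns an array of shape (n_times,) with the subject-averaged ceiling.
    """
    n_subjects, n_times, _ = rdms.shape
    total = rdms.sum(axis=0)
    per_subject = np.zeros((n_subjects, n_times))
    for s in range(n_subjects):
        # Average RDM of all remaining subjects.
        others = (total - rdms[s]) / (n_subjects - 1)
        for t in range(n_times):
            per_subject[s, t] = spearman(rdms[s, t], others[t])
    return per_subject.mean(axis=0)
```

      Because the held-out subject is excluded from the group average, this estimate is conservative; some pipelines also report a second bound that includes the held-out subject in the average.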

      Importantly, the observed EEG–model similarity values are substantially below this upper bound. This outcome is fully expected: Each of our model RDMs (e.g., real-world size, ANN layers) captures only a specific aspect of the neural representational structure, rather than attempting to account for the totality of the EEG signal. Our goal is not to optimize model performance or maximize fit, but to probe which components of object information are reflected in the spatiotemporal dynamics of the brain’s responses.

      For clarity and accessibility of the main findings, we present the noise ceiling time courses separately in the supplementary materials (Figure S7). Including them directly in the EEG × HYP or EEG × ANN plots would conflate distinct interpretive goals: the model RDMs are hypothesis-driven probes of specific representational content, whereas the noise ceiling offers a normative upper bound for total explainable variance. Keeping these separate ensures each visualization remains focused and interpretable. 

      Reviewer #1 (Recommendations For The Authors):

      Some analyses are incomplete, which would be improved if the authors showed analyses with other layers of the networks and various additional partial correlation analyses.

      Clarity

      (1) Partial correlations methods incomplete - it is not clear what is being partialled out in each analysis. It is possible to guess sometimes, but it is not entirely clear for each analysis. This is important as it is difficult to assess if the partial correlations are sensible/correct in each case. Also, the Figure 1 caption is short and unclear.

      For example, ANN-EEG partial correlations - "Finally, we directly compared the timepoint-by-timepoint EEG neural RDMs and the ANN RDMs (Figure 3F). The early layer representations of both ResNet and CLIP were significantly correlated with early representations in the human brain" What is being partialled out? Figure 3F says partial correlation.

      We apologize for the confusion. We made several key clarifications and corrections in the revised version.

      First, we identified and corrected a labeling error in both Figure 1 and Figure 3F. Specifically, our EEG × ANN analysis used Spearman correlation, not partial correlation as mistakenly indicated in the original figure label and text. We conducted partial correlations for EEG × HYP and ANN × HYP. For EEG × ANN, however, we directly calculated the correlation between the EEG RDMs and the ANN RDM of each layer. We corrected these errors: (1) In Figure 1, we removed the erroneous “partial” label from the EEG × ANN path and updated the caption to clearly outline which comparisons used partial correlation. (2) In Figure 3F, we corrected the Y-axis label to “(correlation)”.

      Second, to improve clarity, we have now revised the Materials and Methods section to explicitly describe what is partialled out in each partial correlation analysis:

      (line 284 to 286) “In EEG × HYP partial correlation (Figure 3D), we correlated EEG RDMs with one hypothesis-based RDM (e.g., real-world size), while controlling for the other two (retinal size and real-world depth).”

      (line 303 to 305) “In ANN (or W2V) × HYP partial correlation (Figure 3E and Figure 5A), we correlated ANN (or W2V) RDMs with one hypothesis-based RDM (e.g., real-world size), while partialling out the other two.”

      Finally, the caption of Figure 1 has been expanded to clarify the full analysis pipeline and explicitly specify the partial correlation or correlation in each comparison.

      (line 327 to 332) “Figure 1 Overview of our analysis pipeline including constructing three types of RDMs and conducting comparisons between them. We computed RDMs from three sources: neural data (EEG), hypothesized object features (real-world size, retinal size, and real-world depth), and artificial models (ResNet, CLIP, and Word2Vec). Then we conducted cross-modal representational similarity analyses between: EEG × HYP (partial correlation, controlling for other two HYP features), ANN (or W2V) × HYP (partial correlation, controlling for other two HYP features), and EEG × ANN (correlation).”

      We believe these revisions now make all analytic comparisons and correlation types fully clear and interpretable.
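
      To make the "what is partialled out" point concrete, the sketch below shows one common way to compute a partial Spearman correlation between an EEG RDM and one hypothesis RDM while controlling for the other two. It is an illustrative sketch under stated assumptions, not the authors' code: all RDMs are assumed to be vectorized, the rank-then-residualize approach is one standard definition of partial rank correlation, and `partial_spearman` is a hypothetical name.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, assigning tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y after removing the covariates.

    All variables are rank-transformed; the ranked covariates (plus an
    intercept) are regressed out of both x and y, and the residuals are
    correlated. With an empty covariate list this is ordinary Spearman.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    Z = np.column_stack([np.ones(rx.size)] + [average_ranks(c) for c in covariates])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


# Example call for one EEG time point (array names are placeholders):
# r_size = partial_spearman(eeg_rdm_t, size_rdm, [retinal_rdm, depth_rdm])
```

      For EEG × ANN, which used plain Spearman correlation, the same function can be called with an empty covariate list.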

      Issues / open questions

      (2) Semantic representations vs hypothesized (hyp) RDMs (real-world size, etc) - are the representations explained by variables in hyp RDMs or are there semantic representations over and above these? E.g., For ANN correlation with the brain, you could partial out hyp RDMs - and assess whether there is still semantic information left over, or is the variance explained by the hyp RDMs?

      Thank you for this suggestion. As you suggested, we conducted a partial correlation analysis between EEG RDMs and ANN RDMs, controlling for the three hypothesis-based RDMs. The results (Figure S6) revealed that the EEG×ANN representational similarity remained largely unchanged, indicating that ANN representations capture considerable additional representational structure not accounted for by the current hypothesized features. This is also consistent with the observation that EEG×HYP partial correlations were themselves small, whereas EEG×ANN correlations were much greater.

      We also added this statement to the main text:

      (line 446 to 451) “To contextualize how much of the shared variance between EEG and ANN representations is driven by the specific visual object features we tested above, we conducted a partial correlation analysis between EEG RDMs and ANN RDMs controlling for the three hypothesis-based RDMs (Figure S6). The EEG×ANN similarity results remained largely unchanged, suggesting that ANN representations capture much more additional rich representational structure beyond these features. ”

      (3) Why only early and late layers? I can see how it's clearer to present the EEG results. However, the many layers in these networks are an opportunity - we can see how simple/complex linear/non-linear the transformation is over layers in these models. It would be very interesting and informative to see if the correlations do in fact linearly increase from early to later layers, or if the story is a bit more complex. If not in the main text, then at least in the supplement.

      Thank you for the thoughtful suggestion. To address this point, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figure S4 and S5, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages. We chose to highlight early and late layers in the main text to simplify interpretation, but now provide the full layerwise profile for completeness.

      (4) Peak latency analysis - Estimating peaks per ppt is presumably noisy, so it seems important to show how reliable this is. One option is to find the bootstrapped mean latencies per subject.

      Thanks for your suggestion. To estimate the robustness of peak latency values, we implemented a bootstrap procedure by resampling the pairwise entries of the EEG RDM with replacement. For each bootstrap sample, we computed a new EEG RDM and recalculated the partial correlation time course with the hypothesis RDMs. We then extracted the peak latency within the predefined significant time window. Repeating this process 1000 times allowed us to obtain the bootstrapped mean latency per subject as a more stable peak latency estimate. Notably, the bootstrapped results showed minimal deviation from the original latency estimates, confirming the robustness of our findings. Accordingly, we updated Figure 3D and added the following to the Materials and Methods section:

      (line 289 to 298) “To assess the stability of peak latency estimates for each subject, we performed a bootstrap procedure across stimulus pairs. At each time point, the EEG RDM was vectorized by extracting the lower triangle (excluding the diagonal), resulting in 19,900 unique pairwise values. For each bootstrap sample, we resampled these 19,900 pairwise entries with replacement to generate a new pseudo-RDM of the same size. We then computed the partial correlation between the EEG pseudo-RDM and a given hypothesis RDM (e.g., real-world size), controlling for other feature RDMs, and obtained a time course of partial correlations. Repeating this procedure 1000 times and extracting the peak latency within the significant time window yielded a distribution of bootstrapped latencies, from which we got the bootstrapped mean latencies per subject.”
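
      The bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each EEG RDM is assumed to be already vectorized (one list of pairwise values per time point), and a plain Pearson correlation stands in for the partial Spearman correlation used in the manuscript (any other measure can be passed via `similarity`).

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def bootstrap_peak_latency(eeg_rdms, model_rdm, times_ms, window_ms,
                           n_boot=1000, seed=0, similarity=pearson):
    """Bootstrap the peak latency of an EEG-by-model similarity time course.

    eeg_rdms  : list over time points of vectorized EEG RDMs
    model_rdm : vectorized hypothesis RDM (same pair order as eeg_rdms)
    times_ms  : time (ms) of each entry in eeg_rdms
    window_ms : (start, end) of the window in which the peak is searched
    """
    rng = random.Random(seed)
    n_pairs = len(model_rdm)
    in_window = [i for i, t in enumerate(times_ms)
                 if window_ms[0] <= t <= window_ms[1]]
    peaks = []
    for _ in range(n_boot):
        # Resample stimulus pairs with replacement; reuse the same indices
        # for every time point and for the model RDM.
        idx = [rng.randrange(n_pairs) for _ in range(n_pairs)]
        model_bs = [model_rdm[k] for k in idx]
        scores = [similarity([eeg_rdms[i][k] for k in idx], model_bs)
                  for i in in_window]
        best = max(range(len(in_window)), key=lambda j: scores[j])
        peaks.append(times_ms[in_window[best]])
    return sum(peaks) / len(peaks), peaks
```

      The function returns both the bootstrapped mean latency and the full list of per-sample latencies, so the spread of the distribution can be inspected as well as its mean.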

      (5) "Due to our calculations being at the object level, if there were more than one of the same objects in an image, we cropped the most complete one to get a more accurate retinal size. " Did EEG experimenters make sure everyone sat the same distance from the screen? and remain the same distance? This would also affect real-world depth measures.

      Yes, the EEG dataset we used (THINGS EEG2; Gifford et al., 2022) was collected under carefully controlled experimental conditions. We have confirmed that all participants were seated at a fixed distance of 0.6 meters from the screen throughout the experiment. We also added this information to the Methods (line 156 to 157).

      Minor issues/questions - note that these are not raised in the Public Review

      (6) Title - less about rigor/quality of the work but I feel like the title could be improved/extended. The work tells us not only about real object size, but also retinal size and depth. In fact, isn't the most novel part of this the real-world depth aspect? Furthermore, it feels like the current title restricts its relevance and impact... Also doesn't touch on the temporal aspect, or processing stages, which is also very interesting. There may be something better, but simply adding something like"...disentangled features of real-world size, depth, and retinal size over time OR processing stages".

      Thanks for your suggestion! We changed our title – “Human EEG and artificial neural networks reveal disentangled representations and processing timelines of object real-world size and depth in natural images”.

      (7) "Each subject viewed 16740 images of objects on a natural background for 1854 object concepts from the THINGS dataset (Hebart et al., 2019). For the current study, we used the 'test' dataset portion, which includes 16000 trials per subject corresponding to 200 images." Why test images? Worth explaining.

      We chose to use the “test set” of the THINGS EEG2 dataset for the following two reasons:

      (1) Higher trial count per condition: In the test set, each of the 200 object images was presented 80 times per subject, whereas in the training set, each image was shown only 4 times. This much higher trial count per condition in the test set allows for a substantially higher signal-to-noise ratio in the EEG data.

      (2) Improved decoding reliability: Our analysis relies on constructing EEG RDMs based on pairwise decoding accuracy using linear SVM classifiers. Reliable decoding estimates require a sufficient number of trials per condition. The test set design is thus better suited to support high-fidelity decoding and robust representational similarity analysis.

      We also added these explanations to our revised manuscript (line 161 to 164).

      (8) "For Real-World Size RDM, we obtained human behavioral real-world size ratings of each object concept from the THINGS+ dataset (Stoinski et al., 2022).... The range of possible size ratings was from 0 to 519 in their online size rating task..." How were the ratings made? What is this scale - do people know the numbers? Was it on a continuous slider?

      We should clarify how the real-world size values were obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      We added these details to the Materials and Methods section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      (9) "For Retinal Size RDM, we applied Adobe Photoshop (Adobe Inc., 2019) to crop objects corresponding to object labels from images manually... " Was this by one person? Worth noting, and worth sharing these values per image if not already for other researchers as it could be a valuable resource (and increase citations).

      Yes, all object cropping was performed consistently by one of the authors to ensure uniformity across images. We agree that this dataset could be a useful resource to the community. We have now made the cropped object images publicly available at https://github.com/ZitongLu1996/RWsize.

      We also updated the manuscript accordingly to note this (line 236 to 239).

      (10) "Neural RDMs. From the EEG signal, we constructed timepoint-by-timepoint neural RDMs for each subject with decoding accuracy as the dissimilarity index " Decoding accuracy is presumably a similarity index. Maybe 1-accuracy (proportion correct) for dissimilarity?

      Decoding accuracy is a dissimilarity index instead of a similarity index, as higher decoding accuracy between two conditions indicates that they are more distinguishable – i.e., less similar – in the neural response space. This approach aligns with prior work using classification-based representational dissimilarity measures (Grootswagers et al., 2017; Xie et al., 2020), where better decoding implies greater dissimilarity between conditions. Therefore, there is no need to invert the decoding accuracy values (e.g., using 1 - accuracy).

      Grootswagers, T., Wardle, S. G., & Carlson, T. A. (2017). Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data. Journal of Cognitive Neuroscience, 29(4), 677-697.

      Xie, S., Kaiser, D., & Cichy, R. M. (2020). Visual imagery and perception share neural representations in the alpha frequency band. Current Biology, 30(13), 2621-2627.
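
      To make the direction of the measure concrete, the sketch below builds an RDM from pairwise decoding accuracies, so that better-separated conditions receive larger entries. It is an illustration under assumptions, not the authors' pipeline: a leave-one-out nearest-centroid classifier is swapped in for the linear SVM described in the manuscript, the function names are hypothetical, and each condition needs at least two trials.

```python
def centroid(trials):
    """Mean pattern across trials (each trial is a list of feature values)."""
    n = len(trials)
    return [sum(t[c] for t in trials) / n for c in range(len(trials[0]))]


def sq_dist(u, v):
    """Squared Euclidean distance between two patterns."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loo_pairwise_accuracy(trials_a, trials_b):
    """Leave-one-out nearest-centroid accuracy for one pair of conditions."""
    correct = 0
    for own, other in ((trials_a, trials_b), (trials_b, trials_a)):
        other_centroid = centroid(other)
        for i, held_out in enumerate(own):
            # The held-out trial is excluded from its own class centroid.
            own_centroid = centroid(own[:i] + own[i + 1:])
            if sq_dist(held_out, own_centroid) < sq_dist(held_out, other_centroid):
                correct += 1
    return correct / (len(trials_a) + len(trials_b))


def decoding_rdm(trials_by_condition):
    """Symmetric RDM whose entries are pairwise decoding accuracies."""
    n = len(trials_by_condition)
    rdm = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            acc = loo_pairwise_accuracy(trials_by_condition[a],
                                        trials_by_condition[b])
            rdm[a][b] = rdm[b][a] = acc
    return rdm
```

      With two conditions, chance accuracy is 0.5, so entries near 0.5 (or below, given leave-one-out bias) mark indistinguishable conditions and entries near 1.0 mark highly dissimilar ones, which is why no `1 - accuracy` inversion is required.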

      (11) Figure 1 caption is very short - Could do with a more complete caption. Unclear what the partial correlations are (what is being partialled out in each case), what are the comparisons "between them" - both in the figure and the caption. Details should at least be in the main text.

      This point relates to your comment (1). We have revised the caption and the corresponding text.

      Reviewer #2 (Recommendations For The Authors):

      (1) Intro:

      Quek et al., (2023) is referred to as a behavioral study, but it has EEG analyses.

      We corrected this – “…, one recent study (Quek et al., 2023) …”

      The phrase 'high temporal resolution EEG' is a bit strange - isn't all EEG high temporal resolution? Especially when down-sampling to 100 Hz (40 time points/epoch) this does not qualify as particularly high-res.

      We removed this phrasing in our manuscript.

      (2) Methods:

      It would be good to provide more details on the EEG preprocessing. Were the data low-pass filtered, for example?

      We added more details to the manuscript:

      (line 167 to 174) “The EEG data were originally sampled at 1000 Hz and online-filtered between 0.1 Hz and 100 Hz during acquisition, with recordings referenced to the Fz electrode. For preprocessing, no additional filtering was applied. Baseline correction was performed by subtracting the mean signal during the 100 ms pre-stimulus interval from each trial and channel separately. We used already preprocessed data from 17 channels with labels beginning with “O” or “P” (O1, Oz, O2, PO7, PO3, POz, PO4, PO8, P7, P5, P3, P1, Pz, P2) ensuring full coverage of posterior regions typically involved in visual object processing. The epoched data were then down-sampled to 100 Hz.”
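
      The two preprocessing steps named above (baseline correction and down-sampling from 1000 Hz to 100 Hz) can be sketched as below. This is a simplified illustration, not the dataset's actual pipeline: it assumes each epoch begins exactly 100 ms before stimulus onset, it down-samples by keeping every tenth sample without an extra anti-aliasing filter, and all function names are hypothetical.

```python
def baseline_correct(epoch, sfreq_hz, baseline_ms=100):
    """Subtract the mean of the pre-stimulus interval from one channel's epoch."""
    n_baseline = int(round(baseline_ms * sfreq_hz / 1000))
    baseline = sum(epoch[:n_baseline]) / n_baseline
    return [v - baseline for v in epoch]


def decimate(epoch, factor):
    """Keep every `factor`-th sample (no anti-aliasing filter is applied here)."""
    return epoch[::factor]


def preprocess(epochs, sfreq_hz=1000, target_hz=100, baseline_ms=100):
    """Apply both steps to data shaped [trial][channel][sample]."""
    factor = sfreq_hz // target_hz
    return [[decimate(baseline_correct(channel, sfreq_hz, baseline_ms), factor)
             for channel in trial]
            for trial in epochs]
```

      Under these assumptions, a 400 ms epoch recorded at 1000 Hz (400 samples) becomes 40 samples at 100 Hz.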

      It is important to provide more motivation about the specific ANN layers chosen. Were these layers cherry-picked, or did they truly represent a gradual shift over the course of layers?

      We appreciate the reviewer’s concern and fully agree that it is important to ensure transparency in how ANN layers were selected. The early and late layers reported in the main text were not cherry-picked to maximize effects, but rather intended to serve as illustrative examples representing the lower and higher ends of the network hierarchy. To address this point directly, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figure S4, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages.

      It is important to provide more specific information about the specific ANN layers chosen. 'Second convolutional layer': is this block 2, the ReLu layer, the maxpool layer? What is the 'last visual layer'?

      We apologize for the confusion. We added more details about the layers chosen:

      (line 255 to 257) “The early layer in ResNet refers to ResNet.maxpool layer, and the late layer in ResNet refers to ResNet.avgpool layer. The early layer in CLIP refers to CLIP.visual.avgpool layer, and the late layer in CLIP refers to CLIP.visual.attnpool layer.”

      Again the claim 'novel' is a bit overblown here since the real-world size ratings were also already collected as part of THINGS+, so all data used here is available.

      We removed this phrasing in our manuscript.

      Real-world size ratings ranged 'from 0 - 519'; it seems unlikely this was the actual scale presented to subjects, I assume it was some sort of slider?

      You are correct. We should clarify how the real-world size values were obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      We added these details to the Materials and Methods section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      Why is conducting a one-tailed (p<0.05) test valid for EEG-ANN comparisons? Shouldn't this be two-tailed?

      Our use of one-tailed tests was based on the directional hypothesis that representational similarity between EEG and ANN RDMs would be positive, as supported by prior literature showing correspondence between hierarchical neural networks and human brain representations (e.g., Cichy et al., 2016; Kuzovkin et al., 2018). This is consistent with a large number of RSA studies that conduct one-tailed tests (i.e., testing the hypothesis that coefficients are greater than zero: e.g., Kuzovkin et al., 2018; Nili et al., 2014; Hebart et al., 2018; Kaiser et al., 2019; Kaiser et al., 2020; Kaiser et al., 2022). Thus, we specifically tested whether the similarity was significantly greater than zero.

      Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6(1), 27755.

      Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J. P., Baciu, M., Kahane, P., ... & Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology, 1(1), 107.

      Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553.

      Hebart, M. N., Bankson, B. B., Harel, A., Baker, C. I., & Cichy, R. M. (2018). The representational dynamics of task and object processing in humans. eLife, 7, e32816.

      Kaiser, D., Turini, J., & Cichy, R. M. (2019). A neural mechanism for contextualizing fragmented inputs during naturalistic vision. eLife, 8, e48182.

      Kaiser, D., Inciuraite, G., & Cichy, R. M. (2020). Rapid contextualization of fragmented scene information in the human visual system. NeuroImage, 219, 117045.

      Kaiser, D., Jacobs, A. M., & Cichy, R. M. (2022). Modelling brain representations of abstract concepts. PLoS Computational Biology, 18(2), e1009837.

      Importantly, we note that using a two-tailed test instead would not change the significance of our results. However, we believe the one-tailed test remains more appropriate given our theoretical prediction of positive similarity between ANN and brain representations.

      The sentence on the partial correlation description (page 11 'we calculated partial correlations with one-tailed test against the alternative hypothesis that the partial correlation was positive (greater than zero)') didn't make sense to me; are you referring to the null hypothesis here?

      We revised this sentence to clarify that we tested against the null hypothesis that the partial correlation was less than or equal to zero, using a one-tailed test to assess whether the correlation was significantly greater than zero.

      (line 281 to 284) “…, we calculated partial correlations and used a one-tailed test against the null hypothesis that the partial correlation was less than or equal to zero, testing whether the partial correlation was significantly greater than zero.”
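
      As a minimal sketch of what a one-tailed test of "greater than zero" looks like in practice, the example below uses an exact sign-flip permutation test across subjects. This is a swapped-in, assumption-laden illustration: the text above does not specify this particular test, and the function name is hypothetical. Under the null hypothesis that the correlation is zero or negative, flipping the sign of each subject's value should not systematically lower the group sum.

```python
from itertools import product


def sign_flip_p_one_tailed(values):
    """Exact one-tailed sign-flip p-value for H1: the mean is greater than zero.

    values : one correlation (or partial correlation) per subject.
    Enumerates all 2**n sign patterns, so it suits small n (e.g. 10 subjects).
    """
    observed = sum(values)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(values)):
        stat = sum(s * v for s, v in zip(signs, values))
        # Count sign patterns whose statistic is at least as large as observed.
        if stat >= observed - 1e-12:
            count += 1
        total += 1
    return count / total
```

      With 10 subjects there are 1024 sign patterns, so the smallest attainable p-value is 1/1024; only large positive group effects reach significance, which is the one-tailed behaviour described above.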

      (3) Results:

      I would prevent the use of the word 'pure', your measurement is one specific operationalization of this concept of real-world size that is not guaranteed to result in unconfounded representations. This is in fact impossible whenever one is using a finite set of natural stimuli and calculating metrics on those - there can always be a factor or metric that was not considered that could explain some of the variance in your measurement. It is overconfident to claim to have achieved some form of Platonic ideal here and to have taken into account all confounds.

      Your point is well taken. Our original use of the term “pure” was intended to reflect statistical control for known confounding factors, but we recognize that this wording may imply a stronger claim than warranted. In response, we revised all relevant language in the manuscript to instead describe the statistically isolated or relatively unconfounded representation of real-world size, clarifying that our findings pertain to the unique contribution of real-world size after accounting for retinal size and real-world depth.

      Figure 2C: It's not clear why peak latencies are computed on the 'full' correlations rather than the partial ones.

      No. The peak latency results in Figure 2C were computed on the partial correlation results – we mentioned this in the figure caption – “Temporal latencies for peak similarity (partial Spearman correlations) between EEG and the 3 types of object information.”

      SEM = SEM across the 10 subjects?

      Yes. We added this in the figure caption.

      Figure 3F y-axis says it's partial correlations but not clear what is partialled out here.

      We identified and corrected a labeling error in both Figure 1 and Figure 3F. Specifically, our EEG × ANN analysis used Spearman correlation, not partial correlation as mistakenly indicated in the original figure label and text. We conducted partial correlations for EEG × HYP and ANN × HYP, but for EEG × ANN we directly calculated the correlation between the EEG RDMs and the ANN RDM of each layer. We corrected these errors: (1) In Figure 1, we removed the erroneous “partial” label from the EEG × ANN path and updated the caption to clearly outline which comparisons used partial correlation. (2) In Figure 3F, we corrected the Y-axis label to “(correlation)”.

      Reviewer #3 (Recommendations For The Authors):

      (1) Several methodologies should be clarified:

      (a) It's stated that EEG was sampled at 100 Hz. I assume this was downsampled? From what original frequency?

      Yes. We added more details about the EEG data:

      (line 167 to 174) “The EEG data were originally sampled at 1000 Hz and online-filtered between 0.1 Hz and 100 Hz during acquisition, with recordings referenced to the Fz electrode. For preprocessing, no additional filtering was applied. Baseline correction was performed by subtracting the mean signal during the 100 ms pre-stimulus interval from each trial and channel separately. We used already preprocessed data from 17 channels with labels beginning with “O” or “P” (O1, Oz, O2, PO7, PO3, POz, PO4, PO8, P7, P5, P3, P1, Pz, P2) ensuring full coverage of posterior regions typically involved in visual object processing. The epoched data were then down-sampled to 100 Hz.”

      (b) Why was decoding accuracy used as the human RDM method rather than the EEG data themselves?

      Thanks for your question! We would like to explain why we used decoding accuracy for EEG RDMs rather than correlation. While fMRI RDMs are typically calculated using 1 minus the correlation coefficient, decoding accuracy is more commonly used for EEG RDMs (Grootswagers et al., 2017; Xie et al., 2020). The primary reason is that EEG signals are more susceptible to noise than fMRI data. Correlation-based methods are particularly sensitive to noise and may not reliably capture the functional differences between EEG patterns for different conditions. Decoding accuracy, by training classifiers to focus on task-relevant features, can effectively mitigate the impact of noisy signals and capture the representational difference between two conditions.

      Grootswagers, T., Wardle, S. G., & Carlson, T. A. (2017). Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data. Journal of Cognitive Neuroscience, 29(4), 677-697.

      Xie, S., Kaiser, D., & Cichy, R. M. (2020). Visual imagery and perception share neural representations in the alpha frequency band. Current Biology, 30(13), 2621-2627.

      We added this explanation to the manuscript:

      (line 204 to 209) “Since EEG has a low SNR and includes rapid transient artifacts, Pearson correlations computed over very short time windows yield unstable dissimilarity estimates (Kappenman & Luck, 2010; Luck, 2014) and may thus fail to reliably detect differences between images. In contrast, decoding accuracy - by training classifiers to focus on task-relevant features - better mitigates noise and highlights representational differences.”

      (c) How were the specific posterior electrodes selected?

      The 17 posterior electrodes used in our analyses were pre-selected and provided in the THINGS EEG2 dataset, and correspond to standard occipital and parietal sites in the 10-10 EEG system. Specifically, we included all 17 electrodes with labels beginning with “O” or “P”, ensuring full coverage of posterior regions typically involved in visual object processing (Page 7).

      (d) The specific layers should be named rather than the vague ("last visual")

      We apologize for the confusion. We added more details about the layer information:

      (line 255 to 257) “The early layer in ResNet refers to ResNet.maxpool layer, and the late layer in ResNet refers to ResNet.avgpool layer. The early layer in CLIP refers to CLIP.visual.avgpool layer, and the late layer in CLIP refers to CLIP.visual.attnpool layer.”

      (line 420 to 434) “As shown in Figure 3F, the early layer representations of both ResNet and CLIP (ResNet.maxpool layer and CLIP.visual.avgpool) showed significant correlations with early EEG time windows (early layer of ResNet: 40-280ms, early layer of CLIP: 50-130ms and 160-260ms), while the late layers (ResNet.avgpool layer and CLIP.visual.attnpool layer) showed correlations extending into later time windows (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms). Although there is substantial temporal overlap between early and late model layers, the overall pattern suggests a rough correspondence between model hierarchy and neural processing stages.

      We further extended this analysis across intermediate layers of both ResNet and CLIP models (from early to late, ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; from early to late, CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool).”

      (e) p19: please change the reporting of t-statistics to standard APA format.

      Thanks for the suggestion. We changed the reporting format accordingly:

      (line 392 to 394) “The representation of real-world size had a significantly later peak latency than that of both retinal size, t(9)=4.30, p=.002, and real-world depth, t(9)=18.58, p<.001. And retinal size representation had a significantly later peak latency than real-world depth, t(9)=3.72, p=.005.”
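
      For readers unfamiliar with the notation, the sketch below shows how a paired t statistic and its degrees of freedom arise from per-subject latencies (10 subjects give df = 9) and how the result is formatted in the style used above. It is illustrative only: the function names are hypothetical, and the p-value is assumed to come from a statistics package rather than being computed here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom for a versus b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def format_apa(t, df, p):
    """Format a t-test result, dropping the leading zero of the p-value."""
    if p < 0.001:
        p_text = "p<.001"
    else:
        p_text = "p=" + f"{p:.3f}".lstrip("0")
    return f"t({df})={t:.2f}, {p_text}"
```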

      (2) "early layer of CLIP: 50-130ms and 160-260ms), while the late layer representations of two ANNs were significantly correlated with later representations in the human brain (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms)."

      This seems a little strong, given the large amount of overlap between these models.

      We agree that our original wording may have overstated the distinction between early and late layers, given the substantial temporal overlap in their EEG correlations. We revised this sentence to soften the language to reflect the graded nature of the correspondence, and now describe the pattern as a general trend rather than a strict dissociation:

      (line 420 to 427) “As shown in Figure 3F, the early layer representations of both ResNet and CLIP (ResNet.maxpool layer and CLIP.visual.avgpool) showed significant correlations with early EEG time windows (early layer of ResNet: 40-280ms, early layer of CLIP: 50-130ms and 160-260ms), while the late layers (ResNet.avgpool layer and CLIP.visual.attnpool layer) showed correlations extending into later time windows (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms). Although there is substantial temporal overlap between early and late model layers, the overall pattern suggests a rough correspondence between model hierarchy and neural processing stages.”

      (3) "Also, human brain representations showed a higher similarity to the early layer representation of the visual model (ResNet) than to the visual-semantic model (CLIP) at an early stage. "

      This has been previously reported by Greene & Hansen, 2020 J Neuro.

      Thanks! We added this reference.

      (4) "ANN (and Word2Vec) model RDMs"

      Why not just "model RDMs"? Might provide more clarity.

      We chose to use the phrasing “ANN (and Word2Vec) model RDMs” to maintain clarity and avoid ambiguity. In the literature, the term “model RDMs” is sometimes used more broadly to include hypothesis-based feature spaces or conceptual models, and we wanted to clearly distinguish our use of RDMs derived from artificial neural networks and language models. Additionally, explicitly referring to ANN or Word2Vec RDMs improves clarity by specifying the model source of each RDM. We hope this clarification justifies our choice to retain the original phrasing for clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This study presents cryoEM-derived structures of the Trypanosome aquaporin AQP2, in complex with its natural ligand, glycerol, as well as two trypanocidal drugs, pentamidine and melarsoprol, which use AQP2 as an uptake route. The structures are high quality, and the density for the drug molecules is convincing, showing a binding site in the centre of the AQP2 pore. 

      The authors then continue to study this system using molecular dynamics simulations. Their simulations indicate that the drugs can pass through the pore and identify a weak binding site in the centre of the pore, which corresponds with that identified through cryoEM analysis. They also simulate the effect of drug resistance mutations, which suggests that the mutations reduce the affinity for drugs and therefore might reduce the likelihood that the drugs enter into the centre of the pore, reducing the likelihood that they progress through into the cell. 

      While the cryoEM and MD studies are well conducted, it is a shame that the drug transport hypothesis was not tested experimentally. For example, did they do cryoEM with AQP2 with drug resistance mutations and see if they could see the drugs in these maps? They might not bind, but another possibility is that the binding site shifts, as seen in Chen et al. 

      TbAQP2 from the drug-resistant mutants does not transport either melarsoprol or pentamidine and there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. Moreover, there is not a single mutation that is characteristic for drug resistance in TbAQP2: references 12–15 show a plethora of chimeric AQP2/3 constructs in addition to various point mutations in laboratory strains and field isolates. In reference 17 we describe a substantial number of SNPs that reduced pentamidine and melarsoprol efficacy to levels that would constitute clinical resistance to acceptable dosage regimens. It thus appears that there are many and diverse mutations that are able to modify the protein sufficiently to induce resistance, and likely in multiple different ways, including the narrowing of the pore, changes to interacting amino acids, access to the pore etc. We therefore did not attempt to determine the structures of the mutant channels because we did not think that in most cases we would see any density for the drugs in the channel, and we would be unable to define ‘the’ resistance mechanism if we did in the case of one individual mutant TbAQP2. Our MD data suggest that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2s selected for that test (I110W and L258Y/L264R), i.e. >1000-fold higher than TbAQP2WT. Thus these structures will be exceedingly challenging to determine with pentamidine in the pore but, of course, until the experiment has been tried we will not know for sure.
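
      To put the >1000-fold figure in energetic terms, the short calculation below converts a ratio of dissociation constants into a binding free-energy difference using the standard relation ΔΔG = RT ln(Kd,mutant / Kd,wild type). It is a back-of-the-envelope sketch: the 50 nM wild-type value is a hypothetical placeholder chosen only to illustrate a 1000-fold ratio, the temperature of 298.15 K is an assumption, and the function names are not from the manuscript.

```python
from math import log

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def fold_change(kd_mutant_molar, kd_wildtype_molar):
    """Ratio of dissociation constants (values above 1 mean weaker binding)."""
    return kd_mutant_molar / kd_wildtype_molar


def delta_delta_g_kcal(kd_mutant_molar, kd_wildtype_molar, temperature_k=298.15):
    """Loss in binding free energy (kcal/mol) implied by a change in Kd."""
    ratio = fold_change(kd_mutant_molar, kd_wildtype_molar)
    return GAS_CONSTANT_KCAL * temperature_k * log(ratio)


if __name__ == "__main__":
    # Hypothetical example: 50 uM (mutant) versus 50 nM (wild type).
    print(round(delta_delta_g_kcal(50e-6, 50e-9), 2), "kcal/mol")
```

      Under these assumptions, a 1000-fold weaker affinity corresponds to a loss of roughly 4 kcal/mol of binding free energy.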

      Do they have an assay for measuring drug binding? 

      We tried many years ago to develop a ³H-pentamidine binding assay to purified wild type TbAQP2 but we never got satisfactory results even though the binding should be in the double-digit nanomolar range. This may be for any number of technical reasons and could also be partly because flexible di-benzamidines bind non-specifically to proteins at µM concentrations giving rise to high background. Measuring binding to the mutants was not tested given that they would be binding pentamidine in the µM range. If we were to pursue this further, then isothermal titration calorimetry (ITC) may be one way forward as this can measure µM affinity binding using unlabelled compounds, although it uses a lot of protein and background binding would need to be carefully assessed; see for example our work on measuring tetracycline binding to the tetracycline antiporter TetAB (https://doi.org/10.1016/j.bbamem.2015.06.026). Membrane proteins are also particularly tricky for this technique as the chemical activity of the protein solution must be identical to the chemical activity of the substrate solution which titrates in the molecule binding to the protein; this can be exceedingly problematic if any free detergent remains in the purified membrane protein. Another possibility may be fluorescence polarisation spectroscopy, although this would require fluorescently labelling the drugs which would very likely affect their affinity for TbAQP2 and how they interact with the wild type and mutant proteins – see the detailed SAR analysis in Alghamdi et al. 2020 (ref. 17). As you will appreciate, it would take considerable time and effort to set up an assay for measuring drug binding to mutants and is beyond the scope of the current work.

      I think that some experimental validation of the drug binding hypothesis would strengthen this paper. Without this, I would recommend the authors to soften the statement of their hypothesis (i.e, lines 65-68) as this has not been experimentally validated.

      We agree with the referee that direct binding of drugs to the mutants would be very nice to have, but we have neither the time nor resources to do this. We have therefore softened the statement on lines 65-68 to read ‘Drug-resistant TbAQP2 mutants are still predicted to bind pentamidine, but the much weaker binding in the centre of the channel observed in the MD simulations would be insufficient to compensate for the high energy processes of ingress and egress, hence impairing transport at pharmacologically relevant concentrations.’ 

      Reviewer #2 (Public review): 

      Summary: 

      The authors present 3.2-3.7 Å cryo-EM structures of Trypanosoma brucei aquaglyceroporin-2 (TbAQP2) bound to glycerol, pentamidine, or melarsoprol and combine them with extensive all-atom MD simulations to explain drug recognition and resistance mutations. The work provides a persuasive structural rationale for (i) why positively selected pore substitutions enable diamidine uptake, and (ii) how clinical resistance mutations weaken the high-affinity energy minimum that drives permeation. These insights are valuable for chemotherapeutic re-engineering of diamidines and aquaglyceroporin-mediated drug delivery.

      My comments are on the MD part. 

      Strengths: 

      The study 

      (1) Integrates complementary cryo-EM, equilibrium, applied voltage MD simulations, and umbrella-sampling PMFs, yielding a coherent molecular-level picture of drug permeation. 

      (2) Offers direct structural rationalisation of long-standing resistance mutations in trypanosomes, addressing an important medical problem. 

      Weaknesses: 

      Unphysiological membrane potential. A field of 0.1 V nm<sup>-1</sup> (~1 V across the bilayer) was applied to accelerate translocation. From the traces (Figure 1c), it can be seen that the translocation occurred really quickly through the channel, suggesting that the field might have introduced some large changes in the protein. The authors state that they checked visually for this, but some additional analysis, especially of the residues next to the drug, would be welcome.

      This is a good point from the referee, and we thank them for raising it. It is common to use membrane potentials in simulations that are higher than the physiological value, although these are typically lower than used here. The reason we used the higher value was to speed sampling and it still took 1,400 ns for transport in the physiologically correct direction, and even then, only in 1/3 repeats. Hence this choice of voltage was probably necessary to see the effect. The exceedingly slow rate of pentamidine permeation seen in the MD simulation was consistent with the experimental observations, as discussed in Alghamdi et al (2020) [ref. 17] where we estimated that TbAQP2-mediated pentamidine uptake in T. brucei bloodstream forms proceeds at just 9.5×10<sup>5</sup> molecules/cell/h; the number of functional TbAQP2 units in the plasma membrane is not known but their location is limited to the small flagellar pocket (Quintana et al. PLoS Negl Trop Dis 14, e0008458 (2020)). 

      The referee is correct that it is important to make sure that the applied voltage is not causing issues for the protein, especially for residues in contact with the drug. We have carried out RMSF analysis to better test this. The data show that comparing our simulations with the voltage applied to the monomeric MD simulations + PNTM with no voltage reveals little difference in the dynamics of the drug-contacting residues. 

      We have added these new data as Supplementary Fig. 12b with a new legend (lines 1134-1138):

      ‘b, RMSF calculations were run on monomeric TbAQP2 with either no membrane voltage or a 0.1 V nm<sup>-1</sup> voltage applied (in the physiological direction). Shown are residues in contact with the pentamidine molecule, coloured by RMSF value. RMSF values are shown for residues Leu122, Phe226, Ile241, and Leu264. The data suggest the voltage has little impact on the flexibility or stability of the pore lining residues.’
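
      For readers unfamiliar with the metric, the following is a minimal sketch of a per-atom RMSF calculation: the square root of the time-averaged squared deviation of each atom from its mean position. It assumes the frames have already been superposed on a common reference and uses made-up coordinates; it is not the analysis pipeline used for Fig. S12b, and the function name is hypothetical.

```python
import math


def rmsf(frames):
    """Per-atom root-mean-square fluctuation.

    frames: list of frames, each a list of (x, y, z) tuples, one per atom.
    Frames are assumed to be pre-aligned to a common reference structure.
    Returns one RMSF value per atom, in the same length unit as the input.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for atom in range(n_atoms):
        # Mean position of this atom over the trajectory.
        mean = [sum(f[atom][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((f[atom][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy trajectory (nm): atom 0 is static, atom 1 oscillates by +/-0.1 along x.
toy_frames = [
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0)],
]
print(rmsf(toy_frames))
```

      Comparing such per-residue values between the voltage and no-voltage trajectories is one way to express the visual check as a number.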

      We have also added the following text to the manuscript (lines 524-530):

      ‘Membrane potential simulations were run using the computational electrophysiology protocol. An electric field of 0.1 V/nm was applied in the z-axis dimension only, to create a membrane potential of about 1 V (see Fig. S10a). Note that this is higher than the physiological value of 87.1 ± 2.1 mV at pH 7.3 in bloodstream T. brucei, and was chosen to improve the sampling efficiency of the simulations. The protein and lipid molecules were visually confirmed to be unaffected by this voltage, which we quantify using RMSF analysis on pentamidine-contacting residues (Fig. S12b).’ 
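
      The relationship between the applied field and the resulting potential can be sketched as V = E × L<sub>z</sub>, where L<sub>z</sub> is the length of the simulation box along the field direction. The box length of 10 nm below is an assumption inferred from the quoted values (0.1 V/nm giving about 1 V); the actual box dimensions are not stated in this response.

```python
def potential_from_field(field_v_per_nm, box_z_nm):
    """Potential difference (V) across a periodic box for a uniform applied field."""
    return field_v_per_nm * box_z_nm


APPLIED_FIELD_V_PER_NM = 0.1   # value quoted in the response
ASSUMED_BOX_Z_NM = 10.0        # assumption: inferred, not stated in the source
PHYSIOLOGICAL_V = 0.0871       # 87.1 mV, value quoted in the response

applied_v = potential_from_field(APPLIED_FIELD_V_PER_NM, ASSUMED_BOX_Z_NM)
print(applied_v)                               # potential in volts
print(round(applied_v / PHYSIOLOGICAL_V, 1))   # fold above the physiological value
```

      With these numbers the simulated potential is roughly an order of magnitude above the physiological one, which is why the RMSF check on drug-contacting residues is a useful safeguard.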

      Based on applied voltage simulations, the authors argue that the membrane potential would help get the drug into the cell, and that a high value of the potential was applied merely to speed up the simulation. At the same time, the barrier for translocation from PMF calculations is ~40 kJ/mol for WT. Is the physiological membrane voltage enough to overcome this barrier in a realistic time? In this context, I do not see how much value the applied voltage simulations have, as one can estimate the work needed to translocate the substrate on PMF profiles alone. The authors might want to tone down their conclusions about the role of membrane voltage in the drug translocation.

      We agree that the PMF barriers are considerable; however, we highlight that other studies have seen similar landscapes, e.g. PMID 38734677, which saw a barrier of ca. 10-15 kcal/mol (ca. 40-60 kJ/mol) for PNTM traversing the channel. This was reduced by ca. 4 kcal/mol when a 0.4 V nm<sup>-1</sup> membrane potential was applied, so we expect a similar effect to be seen here.

      We have updated the Results to more clearly highlight this point and added the following text (lines 274-275):

      ‘We note that previous studies using these approaches saw energy barriers of a similar size, and that these are reduced in the presence of a membrane voltage[17,31].’
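
      To put the quoted energies on a common scale, the sketch below converts the kcal/mol values to kJ/mol and gives a back-of-envelope upper bound, zFV, on the electrical work available to a +2 ion crossing the full membrane potential. This is an illustrative estimate under simple assumptions (the ion experiences the entire potential drop, with no other coupling); it is not a calculation from the manuscript, and the function names are hypothetical.

```python
KJ_PER_KCAL = 4.184             # thermochemical calorie
FARADAY_KJ_PER_MOL_V = 96.485   # Faraday constant in kJ/(mol*V)


def kcal_to_kj(energy_kcal_per_mol):
    """Convert an energy from kcal/mol to kJ/mol."""
    return energy_kcal_per_mol * KJ_PER_KCAL


def max_electrical_work_kj(charge, voltage_v):
    """Upper bound (kJ/mol) on the work done on an ion crossing a potential drop."""
    return charge * FARADAY_KJ_PER_MOL_V * voltage_v


# Barrier range and voltage-induced reduction quoted from the earlier study.
print(kcal_to_kj(10.0), kcal_to_kj(15.0), kcal_to_kj(4.0))

# +2 pentamidine crossing the quoted physiological potential of 87.1 mV.
print(round(max_electrical_work_kj(2, 0.0871), 1))
```

      On this estimate the physiological potential could contribute at most about 17 kJ/mol, a sizeable fraction of a ~40 kJ/mol barrier but not enough to remove it, which fits with the slow permeation described above.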

      Pentamidine charge state and protonation. The ligand was modeled as +2, yet pKa values might change with the micro-environment. Some justification of this choice would be welcome. 

      Pentamidine contains two amidine groups, each of which is expected to have a pKa above 10 in solution (PMID: 20368397), suggesting that the molecule will carry a +2 charge. Using the +2 charge is also in line with previous MD studies (PMID: 32762841). We have added the following text to the Methods (lines 506-509):

      ‘The pentamidine molecule used existing parameters available in the CHARMM36 database under the name PNTM with a charge state of +2 to reflect the predicted pKas of >10 for these groups [73] and in line with previous MD studies[17].’

      We note that accounting for the impact of the microenvironment is an excellent point – future studies might employ constant pH calculations to address this.
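
      The reasoning behind the +2 charge can be illustrated with the Henderson-Hasselbalch relation for a basic group. The sketch assumes two independent, equivalent amidine groups with a pKa of exactly 10 (the lower bound given above) at pH 7.3 in bulk solution; it deliberately ignores the microenvironment effects raised by the referee.

```python
def fraction_protonated(ph, pka):
    """Fraction of a basic group in its protonated (charged) form at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


PH = 7.3            # pH quoted for bloodstream-form measurements
PKA_LOWER = 10.0    # assumption: lower bound of the predicted pKa (>10)

single = fraction_protonated(PH, PKA_LOWER)
both = single ** 2  # assumes the two groups titrate independently

print(round(single, 4))  # fraction of one amidine group carrying +1
print(round(both, 4))    # fraction of molecules carrying the full +2 charge
```

      Because the true pKa is expected to be higher than 10, these fractions are conservative; in bulk solution essentially all molecules would be doubly charged.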

      The authors state that this RMSD is small for the substrate and show plots in Figure S7a, with the bottom plot being presumably done for the substrate (the legends are misleading, though), levelling off at ~0.15 nm RMSD. However, in Figure S7a, we see one trace (light blue) deviating from the initial position by more than 0.2 nm - that would surely result in an RMSD larger than 0.15, but this is somewhat not reflected in the RMSD plots. 

      The bottom plot of Fig. S9a (previously Fig. S7a) is indeed the RMSD of the drug (in relation to the protein). We have clarified the legend with the following text (lines 1037-1038): ‘… or for the pentamidine molecule itself, i.e. in relation to the Cα of the channel (bottom).’ 

      With regard to the second comment, we assume the referee is referring to the light blue trace from Fig. S9c. These data are actually for the monomeric channel rather than the tetramer. We apologise for not making this clearer in the legend. We have added the word ‘monomeric’ (line 1041).

      Reviewer #3 (Public review): 

      Summary: 

      Recent studies have established that trypanocidal drugs, including pentamidine and melarsoprol, enter the trypanosomes via the glyceroaquaporin AQP2 (TbAQP2). Interestingly, drug resistance in trypanosomes is, at least in part, caused by recombination with the neighbouring gene, AQP3, which is unable to permeate pentamidine or melarsoprol. The effect of the drugs on cells expressing chimeric proteins is significantly reduced. In addition, controversy exists regarding whether TbAQP2 permeates drugs like an ion channel, or whether it serves as a receptor that triggers downstream processes upon drug binding. In this study the authors set out to achieve three objectives: 

      (1) to determine if TbAQP2 acts as a channel or a receptor,

      We should clarify here that this was not an objective of the current manuscript as the transport activity has already been extensively characterised in the literature, as described in the introduction.

      (2) to understand the molecular interactions between TbAQP2 and glycerol, pentamidine, and melarsoprol, and 

      (3) to determine the mechanism by which mutations that arise from recombination with TbAQP3 result in reduced drug permeation. 

      Indeed, all three objectives are achieved in this paper. Using MD simulations and cryo-EM, the authors determine that TbAQP2 likely permeates drugs like an ion channel. The cryo-EM structures provide details of glycerol and drug binding, and show that glycerol and the drugs occupy the same space within the pore. Finally, MD simulations and lysis assays are employed to determine how mutations in TbAQP2 result in reduced permeation of drugs by making entry and exit of the drug relatively more energy-expensive. Overall, the strength of evidence used to support the authors' claims is solid.

      Strengths: 

      The cryo-EM portion of the study is strong, and while the overall resolution of the structures is in the 3.5Å range, the local resolution within the core of the protein and the drug binding sites is considerably higher (~2.5Å). 

      I also appreciated the MD simulations on the TbAQP2 mutants and the mechanistic insights that resulted from this data. 

      Weaknesses: 

      (1) The authors do not provide any empirical validation of the drug binding sites in TbAQP2. While the discussion mentions that the binding site should not be thought of as a classical fixed site, the MD simulations show that there's an energetically preferred slot (i.e., high occupancy interactions) within the pore for the drugs. For example, mutagenesis and a lysis assay could provide us with some idea of the contribution/importance of the various residues identified in the structures to drug permeation. This data would also likely be very valuable in learning about selectivity for drugs in different AQP proteins.

      On a philosophical level, we disagree with the requirement for ‘validation’ of a structure by mutagenesis. It is unclear what such mutagenesis would tell us beyond what was already shown experimentally through <sup>3</sup>H-pentamidine transport, drug sensitivity and lysis assays i.e. a given mutation will impact permeation to a certain extent. But on the structural level, what does mutagenesis tell us? If a bulky aromatic residue that makes many van der Waals interactions with the substrate is changed to an alanine residue and transport is reduced, what does this mean? It would confirm that the phenylalanine residue is very likely indeed making van der Waals contacts to the substrate, but we knew that already from the WT structure. And if it doesn’t have any effect? Well, it could mean that the van der Waals interactions with that particular residue are not that important or it could be that the substrate has changed its positions slightly in the channel and the new pose has similar energy of interactions to that observed in the wild type channel. Regardless of the result, any data from mutagenesis would be open to interpretation and therefore would not impact on the conclusions drawn in this manuscript. We might not learn anything new unless all residues interacting with the substrate are mutated, the structure of each mutant was determined and MD simulations were performed for all, which is beyond the scope of this work. Even then, the value for understanding clinical drug resistance would be limited, as this phenomenon has been linked to various chimeric rearrangements with adjacent TbAQP3 (references 12–15), each with a structure distinct from TbAQP2 with a single SNP. We also note that the recent paper by Chen et al. did not include any mutagenesis of the drug binding sites in TbAQP2 in their analysis of TbAQP2, presumably for similar reasons as discussed above.

      (2) Given the importance of AQP3 in the shaping of AQP2-mediated drug resistance, I think a figure showing a comparison between the two protein structures/AlphaFold structures would be beneficial and appropriate

      We agree that the comparison is of considerable interest and would contribute further to our understanding of the unique permeation capacities of TbAQP2. As such, we followed the reviewer’s suggestion and made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252.

      Although these observations are indeed interesting, on their own they are highly preliminary and extensive further work would be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel that, regrettably, this is just too speculative to add to our manuscript. 

      (3) A few additional figures showing cryo-EM density, from both full maps and half maps, would help validate the data. 

      Two new Supplementary Figures have been made, one showing the densities for each of the secondary structure elements (the new Figure S5) and one for the half maps showing the ligands (the new Figure S6). All the remaining supplementary figures have been renumbered accordingly.

      (4) Finally, this paper might benefit from including more comparisons with and analysis of data published in Chen et al (doi.org/10.1038/s41467-024-48445-4), which focus on similar objectives. Looking at all the data in aggregate might reveal insights that are not obvious from either paper on their own. For example, melarsoprol binds differently in structures reported in the two respective papers, and this may tell us something about the energy of drug-protein interactions within the pore. 

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 65 - I don't think that the authors have tested binding experimentally, and so rather than 'still bind', I think that 'are still predicted to bind' is more appropriate. 

      Changed as suggested

      (2) Line 69 - remove 'and' 

      Changed as suggested

      (3) Line 111 - clarify that it is the protein chain which is 'identical'. Ligands not. 

      Changed to read ‘The cryo-EM structures of TbAQP2 (excluding the drugs/substrates) were virtually identical…’

      (4) Line 186 - make the heading of this section more descriptive of the conclusion than the technique? 

      We have changed the heading to read: ‘Molecular dynamics simulations show impaired pentamidine transport in mutants’

      Reviewer #2 (Recommendations for the authors): 

      (1) Methods - a rate of 1 nm per ns is mentioned for pulling simulations, is that right? 

      Yes, for the generation of the initial frames for the umbrella sampling a pull rate of 1 nm/ns was used in either an upwards or downwards z-dimension

      (2) Figure S9 and S10 have their captions swapped. 

      The captions have been swapped to their proper positions.

      (3) Methods state "40 ns per window" yet also that "the first 50 ns of each window was discarded as equilibration". 

      Well spotted - this line should have read “the first 5 ns of each window was discarded as equilibration”. This has been corrected (line 541).
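
      The inconsistency the referee spotted is easy to guard against with a simple check on the window settings. The sketch below is a hypothetical helper, not part of any simulation package: it returns the production time per umbrella-sampling window and rejects an equilibration period that is as long as, or longer than, the window itself.

```python
def production_time_ns(window_ns, equilibration_ns):
    """Simulation time (ns) kept for analysis after discarding equilibration."""
    if equilibration_ns < 0 or equilibration_ns >= window_ns:
        raise ValueError(
            "equilibration must be non-negative and shorter than the window"
        )
    return window_ns - equilibration_ns


# Corrected settings: 40 ns per window with the first 5 ns discarded.
print(production_time_ns(40, 5))

# The original wording (50 ns discarded from a 40 ns window) is rejected.
try:
    production_time_ns(40, 50)
except ValueError as err:
    print("invalid settings:", err)
```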

      Reviewer #3 (Recommendations for the authors): 

      (1) Abstract, line 68-70: incomplete sentence.

      The sentence has been re-written: ‘The structures of drug-bound TbAQP2 represent a novel paradigm for drug-transporter interactions and are a new mechanism for targeting drugs in pathogens and human cells.’

      (2) Line 312-313: The paper you mention here came out in May 2024 - a year ago. I appreciate that they reported similar structural data, but for the benefit of the readers and the field, I would recommend a more thorough account of the points by which the two pieces of work differ. Is there some knowledge that can be gleaned by looking at all the data in the two papers together? For example, you report a glycerol-bound structure while the other group provides an apo one. Are there any mechanistic insights that can be gained from a comparison?

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      (3) Similarly, you can highlight the findings from your MD simulations on the TbAQP2 drug resistance mutants, which are unique to your study. How can this data help with solving the drug resistance problem?

      New drugs will need to be developed that can be transported by the mutant chimera AQP2s and the models from the MD simulations will provide a starting point for molecular docking studies. Further work will then be required in transport assays to optimise transport rather than merely binding. However, the fact that drug resistance can also arise through deletion of the AQP2 gene highlights the need for developing new drugs that target other proteins.

      (4) A glaring question that one has as a reader is why you have not attempted to solve the structures of the drug resistance mutants, either in complex with the two compounds or in their apo/glycerol-bound form? To be clear, I am not requesting this data, but it might be a good idea to bring this up in the discussion.

      TbAQP2 containing the drug-resistant mutants does not transport either melarsoprol or pentamidine (Munday et al., 2014; Alghamdi et al., 2020); there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. We therefore did not attempt to determine the structures of the mutant channels because we did not think that we would see any density for the drugs in the channel. Our MD data suggests that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2, supporting the view that getting these structures would be highly challenging, but of course until the experiment is tried we will not know for sure.

      We also do not think we would learn anything new about doing structures of the drug-free structures of the transport-negative mutants of TbAQP2. The MD simulations have given novel insights into why the drugs are not transported and we would rather expand effort in this direction and look at other mutants rather than expend further effort in determining new structures.

      (5) Line 152-156: Is there a molecular explanation for why the TbAQP2 has 2 glycerol molecules captured in the selectivity filter while the PfAQP2 and the human AQP7 and AQP10 have 3?

      The presence of glycerol molecules represents local energy minima for binding, which will depend on the local disposition of appropriate hydrogen bonding atoms and hydrophobic regions, in conjunction with the narrowness of the channel to effectively bind glycerol from all sides. It is noticeable that the extracellular region of the channel is wider in TbAQP2 than in AQP7 and AQP10, so this may be one reason why additional ordered glycerol molecules are absent, and only two are observed. Note also that the other structures were determined by X-ray crystallography, and the environment of the crystal lattice may have significantly decreased the rate of diffusion of glycerol, increasing the likelihood of observing their electron densities.

      (6) I would also think about including the 8JY7 (TbAQP2 apo) structure in your analysis.

      We included 8JY7 in our original analyses, but the results were identical to 8JY6 and 8JY8 in terms of the protein structure, and, in the absence of any modelled substrates in 8JY7 (the interesting part for our manuscript), we therefore have not included the comparison.

      (7) I also think, given the importance of AQP3 in this context, it would be really useful to have a comparison with the AQP3 AlphaFold structure in order to examine why it does not permeate drugs.

      We made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252. 

      Although these observations are interesting, on their own they are preliminary in the extreme and extensive further work will be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel this is just too speculative to add to our manuscript. 

      (8) To validate the densities representing glycerol and the compounds, you should show half-map densities for these.

      A new figure, Fig S6 has been made to show the half-map densities for the glycerol and drugs.

      (9) I would also like to see the density coverage of the individual helices/structural elements. 

      A new figure, Fig S5 has been made to show the densities for the structural elements.

      (10) While the LigPlot figure is nice, I think showing the data (including the cryo-EM density) is necessary validation.

      The LigPlot figure is a diagram (an interpretation of data) and does not need the densities as these have already been shown in Fig. 1c (the data).

      (11) I would recommend including a figure that illustrates the points described in lines 123-134.

      All of the points raised in this section are already shown in Fig. 2a, which was referred to twice in this section. We have added another reference to Fig.2a on lines 134-135 for completeness.

      (12) Line 202: I would suggest using "membrane potential/voltage" to avoid confusion with mitochondrial membrane potential. 

      We have changed this to ‘plasma membrane potential’ to differentiate it from mitochondrial membrane potential.

      (13) Figure 4: Label C.O.M. in the panels so that the figure corresponds to the legend. 

      We have altered the figure and added an explanation in the figure legend (lines 716-717):

      ‘Cyan mesh shows the density of the molecule across the MD simulation, and the asterisk shows the position of the centre of mass (COM).’

      (14) Figure S2: Panels d and e appear too similar, and it is difficult to see the stick representation of the compound. I would recommend either using different colours or showing a close-up of the site.

      We have clarified the figure by including two close-up views of the hot-spot region, one with melarsoprol overlaid and one with pentamidine overlaid

      (15) Figure S2: Typo in legend: 8YJ7 should be 8JY7.

      Changed as suggested  

      (16) Figure S3 and Figure S4: Please clarify which parts of the process were performed in cryoSPARC and which in Relion. 

      Figure S3 gives an overview of the processing and has been simplified to give the overall picture of the procedures. All of the details were included in the Methods section as other programmes are used, not just cryoSPARC and Relion. Given the complexities of the processing, we have referred the readers to the Methods section rather than giving confusing information in Fig. S3.

      We have updated the figure legend to Fig. S4 as requested.

      (17) Figure S9 and Figure S10: The legends are swapped in these two figures.

      The captions have been swapped to their proper positions.

      (18) For ease of orientation and viewing, I would recommend showing a vertical HOLE plot aligned with an image of the AQP2 pore. 

      The HOLE plot has been re-drawn as suggested (Fig. S2).

    1. Author response:

      Reviewer #1:

      Indicated the paper provided a strong analysis of RNAseq databases to provide a biological context and resource for the massive amounts of data in the field on RNA editing. The reviewer noted that future studies will be important to define the functional consequences of the individual edits and why the RNA editing rules we identified exist. We address these comments below.

      (1) The reviewer wondered about the role of noncanonical editing in neuronal protein expression.

      Indeed, the role of noncanonical editing has been poorly studied compared to the more common A-to-I ADAR-dependent editing. Most non-canonical coding edits we found actually caused silent changes at the amino acid level, suggesting evolutionary selection against this mechanism as a pathway for generating protein diversity. As such, we suspect that most of these edits are not altering neuronal function in significant ways. Two potential exceptions to this were non-canonical edits that altered conserved residues in the synaptic proteins Arc1 and Frequenin 1. The C-to-T coding edit in the activity-regulated Arc1 mRNA that encodes a retroviral-like Gag protein involved in synaptic plasticity resulted in a P124L amino acid change (see Author response image 1 panel A below). ~50% of total Arc1 mRNA was edited at this site in both Ib and Is neurons, suggesting a potentially important role if the P124L change alters Arc1 structure or function. Given Arc1 assembles into higher order viral-like capsids, this change could alter capsid formation or structure. Indeed, P124 lies in the hinge region separating the N- and C-terminal capsid assembly regions (panel B) and we hypothesize this change will alter the ability of Arc1 capsids to assemble properly. We plan to experimentally test this by rescuing Arc1 null mutants with edited versus unedited transgenes to see how the previously reported synaptic phenotypes are modified. We also plan to examine the ability of the change to alter Arc1 capsid assembly in a collaboration using cryo-EM.

      Author response image 1.

      A. AlphaFold predictions of Drosophila Arc1 and Frq1 with edit site noted. B. Structure of the Drosophila Arc1 capsid. Monomeric Arc1 conformation within the capsid is shown on the right with the location of the edit site indicated.

      The other non-canonical edit (G-to-A) that stood out was in Frequenin 1 (Frq1), a multi-EF hand containing Ca<sup>2+</sup> binding protein that regulates synaptic transmission, that resulted in a G2E amino acid substitution (location within Frq1 shown in panel A above). This glycine residue is conserved in all Frq homologs and is the site of N-myristoylation, a co-translational lipid modification to the glycine after removal of the initiator methionine by an aminopeptidase. Myristoylation tethers Frq proteins to the plasma membrane, with a Ca<sup>2+</sup>-myristoyl switch allowing some family members to cycle on and off membranes when the lipid domain is sequestered in the absence of Ca<sup>2+</sup>. Although the G2E edit is found at lower levels (20% in Ib MNs and 18% in Is MNs), it could create a pool of soluble Frq1 that alters its signaling. We plan to functionally assay the significance of this non-canonical edit as well. Compared to edits that alter amino acid sequence, determining how non-canonical editing of UTRs might regulate mRNA dynamics is a harder question at this stage and will require more experimental follow-up.

      (2) The reviewer noted the last section of the results might be better split into multiple parts as it reads as a long combination of two thoughts.

      We agree with the reviewer that the last section is important, but it was disconnected a bit from the main story and was difficult for us to know exactly where to put it. All the data to that point in the paper was collected from our own PatchSeq analysis from individual larval motoneurons. We wanted to compare these results to other large RNAseq datasets obtained from pooled neuronal populations and felt it was best to include this at the end of the results section, as it no longer related to the rules of RNA editing within single neurons. We used these datasets to confirm many of our edits, as well as find evidence for some developmental and neuron-specific cell type edits. We also took advantage of RNAseq from neuronal datasets with altered activity to explore how activity might alter the editing machinery. We felt it better to include that data in this final section given it was not collected from our original PatchSeq approach.

      Reviewer #2:

      Noted the study provided a unique opportunity to identify RNA editing sites and rates specific to individual motoneuron subtypes, highlighting the RNAseq data was robustly analyzed and high-confidence hits were identified and compared to other RNAseq datasets. The reviewer provided some suggestions for future experiments and requested a few clarifications.

      (1) The reviewer asked about Figure 1F and the average editing rate per site described later in the paper.

      Indeed, Figure 1F shows the average editing rate for each individual gene for all the Ib and Is cells, so we primarily use that to highlight the variability we find in overall editing rate from around 20% for some sites to 100% for others. The actual editing rate for each site for individual neurons is shown in Figure 4D that plots the rate for every edit site and the overall sum rate for that neuron in particular.

      (2) The reviewer also noted that it was unclear where in the VNC the individual motoneurons were located and how that might affect editing.

      The precise segment of the larvae for every individual neuron that was sampled by Patch-seq was recorded, and that data is accessible in the original Jetti et al. 2023 paper if the reader wants to explore any potential anterior to posterior differences in RNA editing. Due to the technical difficulty of the Patch-seq approach, we pooled all the Ib and Is neurons from each segment together to get more statistical power to identify edit sites. We don’t believe segmental identity would be a major regulator of RNA editing, but cannot rule it out.

      (3) The reviewer also wondered if including RNAs located both in the nucleus and cytoplasm would influence editing rate.

      Given our Patch-seq approach requires us to extract both the cytoplasm and nucleus, we would be sampling both nuclear and cytoplasmic mRNAs. However, as shown in Figure 8 – figure supplement 3 D-F, the vast majority of our edits are found in both polyA mRNA samples and nascent nuclear mRNA samples from other datasets, indicating the editing is occurring co-transcriptionally and within the nucleus. As such, we don't think the inclusion of cytoplasmic mRNA is altering our measured editing rates for most sites. This may not be true for all non-canonical edits, as we did see some differences there, indicating some non-canonical editing may be happening in the cytoplasm as well.

      Reviewer #3:

      Indicated the work provided a valuable resource to assess RNA editing in single neurons. The reviewer suggested the value of future experiments to demonstrate the effects of editing events on neuronal function. This will be a major effort for us going forward, as we have already begun to test the role of editing in mRNAs encoding several presynaptic proteins that regulate synaptic transmission. The reviewer also had several other comments as discussed below.

      (1) The reviewer noted that silent mutations could alter codon usage that would result in translational stalling and altered protein production.

      This is an excellent point, as silent mutations in the coding region could have a more significant impact if they generate non-preferred rare codons. This is not something we have analyzed, but it certainly is worth considering in future experiments. Our initial efforts are on testing the edits that cause predictive changes in presynaptic proteins based on the amino acid change and their locale in important functional domains, but it is worth considering the silent edits as well as we think about the larger picture of how RNA editing is likely to impact not only protein function but also protein levels.

      (2) The reviewer noted future studies could be done using tools like Alphafold to test if the amino acid changes are predicted to alter the structure of proteins with coding edits.

      This is an interesting approach, though we don’t have much expertise in protein modeling at that level. We could consider adding this to future studies in collaboration with other modeling labs.

      (3) The reviewer wondered if the negative correlation between edits and transcript abundance could indicate edits might be destabilizing the transcripts.

      This is an interesting idea, but would need to be experimentally tested. For the few edits we have generated already to begin functionally testing, including our published work with editing in the C-terminus of Complexin, we haven’t seen a change in mRNA levels caused by these edits. However, it would not be surprising to see some edits reducing transcript levels. A set of 5’UTR edits we have generated in Syx1A seem to be reducing protein production and may be acting in such a manner.

      (4) The reviewer wondered if the proportion of edits we report in many of the figures is normalized to the length of the transcript, as longer transcripts might have more edits by chance.

      The figures referenced by the reviewer (1, 2 and 7) show the number of high-confidence editing sites that fall into the 5’ UTR, 3’ UTR, or CDS categories. Our intention here was to highlight that the majority of the high confidence edits that made it through the stringent filtering process were in the coding region. This would still be true if we normalized to the length of the given gene region. However, it would be interesting to know if these proportions match the expected proportions of edits in these gene regions given a random editing rate per gene region length across the Drosophila genome, although we did not do this analysis.    
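      The length-normalised comparison mentioned above can be sketched as follows. This is a minimal illustration, not the analysis used in the paper: the function name, the region labels, and all counts and lengths are hypothetical, and it assumes a uniform per-base chance of an edit across regions as the null expectation.

```python
def region_enrichment(site_counts, region_lengths):
    """Compare observed edit-site proportions per region with the
    proportions expected if edits were spread uniformly per base.

    site_counts    -- hypothetical number of edit sites per region
    region_lengths -- hypothetical total length (bases) of each region
    """
    total_sites = sum(site_counts.values())
    total_length = sum(region_lengths.values())
    result = {}
    for region, count in site_counts.items():
        observed = count / total_sites
        expected = region_lengths[region] / total_length
        result[region] = {
            "observed": observed,
            "expected": expected,
            # enrichment > 1 means more sites than length alone predicts
            "enrichment": observed / expected,
        }
    return result


# Illustrative numbers only; they are not taken from the study.
example = region_enrichment(
    {"5UTR": 10, "CDS": 70, "3UTR": 20},
    {"5UTR": 200, "CDS": 500, "3UTR": 300},
)
```

      An enrichment above 1 for the CDS in such a comparison would indicate that the excess of coding-region sites is not explained by region length alone.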

      (5) The reviewer noted that future studies could expand on the work to examine miRNA or other known RBP binding sites that might be altered by the edits.

      This is another avenue we could pursue in the future. We did do this analysis for a few of the important genes encoding presynaptic proteins (these are the most interesting to us given the lab’s interest in the synaptic vesicle fusion machinery), but did not find anything obvious for this smaller subset of targets.

      (6) The reviewer suggested sequence context for Adar could also be investigated for the hits we identified.

      We haven’t pursued this avenue yet, but it would be of interest to do in the future. In a similar vein, it would be informative to identify intron-exon base pairing that could generate the dsRNA template on which ADAR acts.

      (7) The reviewer noted the disconnect between Adar mRNA levels and overall editing levels reported in Figure 4A/B.

      Indeed, the lack of correlation between overall editing levels and Adar mRNA abundance has been noted previously in many studies. For the type of single cell Patch-seq approach we took to generate our RNAseq libraries, the absolute amount of less abundant transcripts obtained from a single neuron can be very noisy. As such, the few neurons with no detectable Adar mRNA are likely to represent that single neuron noise in the sampling. Per the reviewer’s question, these figure panels only show A-to-I edits, so they are specific to ADAR.

      (8) The reviewer notes the scale in Figure 5D can make it hard to visualize the actual impact of the changes.

      The intention of Figure 5D was to address the question of whether sites with high Ib/Is editing differences were simply due to higher Ib or Is mRNA expression levels. If this was the case, then we would expect to see highly edited sites have large Ib/Is TPM differences. Instead, as the figure shows, the vast majority of highly-edited sites were in mRNAs that were NOT significantly different between Ib and Is (red dots in graph) and are therefore clustered together near “0 Difference in TPMs”. TPMs and editing levels for all edit sites can be found in Table 1, and a visualization of these data for selected sites is shown in Figure 5E.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary:

      The authors use the theory of planned behavior to understand intentions to use sex as a biological variable (SABV), as well as attitude (value), subjective norm (social pressure), and behavioral control (ability to conduct behavior), among scientists at a pharmacological conference. They also used an intervention (workshop) to determine the value of this workshop in changing perceptions and misconceptions. Attempts to understand the knowledge gaps were made.

      Strengths:

      The use of SABV is limited in terms of researchers using sex in the analysis as a variable of interest in the models (and not a variable to control). To understand how we can improve on the number of researchers examining the data with sex in the analyses, it is vital we understand the pressure points that researchers consider in their work. The authors identify likely culprits in their analyses. The authors also test an intervention (workshop) to address the main bias or impediments for researchers' use of sex in their analyses. 

      Weaknesses:

      There are a number of assumptions the authors make that could be revisited: 

      (1) that all studies should contain across sex analyses or investigations. It is important to acknowledge that part of the impetus for SABV is to gain more scientific knowledge on females. This will require within sex analyses and dedicated research to uncover how unique characteristics for females can influence physiology and health outcomes. This will only be achieved with the use of female-only studies. The overemphasis on investigations of sex influences limits the work done for women's health, for example, as within-sex analyses are equally important.

      The Sex and Gender Equity in Research (SAGER) guidelines (1) provide guidance that “Where the subjects of research comprise organisms capable of differentiation by sex, the research should be designed and conducted in a way that can reveal sex-related differences in the results, even if these were not initially expected.” This is a default position of inclusion wherever the sex can be determined, with the analysis assessing sex-related variability in response. This position underpins many of the funding bodies’ new policies on inclusion.

      However, we need to place this in the context of the driver of inclusion. The most common reason for including male and female samples is for those studies that are exploring the effect of a treatment and then the goal of inclusion is to assess the generalisability of the treatment effect (exploratory sex inclusion)(2). The second scenario is where sex is included because sex is one of the variables of interest and this situation will arise because there is a hypothesized sex difference of interest (confirmatory sex inclusion).  

      We would argue that the SABV concept was introduced to address the systematic bias of only studying one sex when assessing treatment effect to improve the generalisability of the research.  Therefore, it isn’t directly to gain more scientific knowledge on females.  However, this strategy will highlight when the effect is very different between male and female subjects which will potentially generate sex specific hypotheses.  

      Where research has a hypothesis that is specific to a sex (e.g. it is related to oestrogen levels) it would be appropriate to study only the sex of interest, in this case females. The recently published Sex Inclusive Research Framework gives some guidance here and allows an exemption for such a scenario classifying such proposals “Single sex study justified” (3).

      We have added an additional paragraph to the introduction to clarify the objectives behind inclusion and how this assists the research process. 

      (2) It should be acknowledged that although the variability within each sex is not different on a number of characteristics (as indicated by meta-analyses in rats and mice), this was not done on all variables, and behavioral variables were not included. In addition, across-sex variability may very well be different, which, in turn, would result in statistical sex significance. In addition, on some measures, there are sex differences in variability, as human males have more variability in grey matter volume than females. PMID: 33044802. 

      The manuscript was highlighting the common argument used to exclude the use of females, which is that females are inherently more variable as an absolute truth. We agree there might be situations where the variance is higher in one sex or another depending on the biology. We have extended the discussion here to reflect this, and we also linked to the Sex Inclusive Research Framework (3), which highlights that in these situations researchers can utilise this argument provided it is supported with data for the biology of interest.

      (3) The authors need to acknowledge that it can be important that the sample size is increased when examining more than one sex. If the sample size is too low for biological research, it will not be possible to determine whether or not a difference exists. Using statistical modelling, researchers have found that depending on the effect size, the sample size does need to increase. It is important to bear this in mind, as exploratory analyses with small sample sizes will be extremely limiting and may also discourage further study in this area (or indeed, as seen in the literature, an exploratory first study with the use of males and females with limited sample size, only to show there is no "significance" and to justify this as a reason to only use males for the further studies in the work).

      The reviewer raises a common problem: researchers have frequently argued that if they find no sex differences in a pilot then they can proceed to study only one sex. The SAGER guidelines (1), and now funder guidelines (4, 5), challenge that position. Instead, the expectation is for inclusion as the default in all experiments (exploratory inclusion strategy) to allow generalisable results to be obtained. When the results are very different between the male and female samples, then this can be determined. This perspective shift (2) requires a change in mindset and an understanding that the driver behind inclusion is generalisability, not exploration of sex differences. This has been added to the introduction as an additional paragraph exploring the drivers behind inclusion.

      We agree with the reviewer that if the researcher is interested in sex differences in an effect (confirmatory inclusion strategy, aka sex as a primary variable) then the N will need to be higher.  However, in this situation, one, of course, must have male and female samples in the same experiment to allow the simultaneous exploration to assess the dependency on sex. 
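      The point that a confirmatory test of sex dependence needs a larger N than an exploratory, pooled treatment estimate can be illustrated with a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not part of the study: it assumes a balanced 2 x 2 design (treatment x sex), equal variance in every cell, a two-sided test, and a normal approximation; the function names and the example effect size are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(effect, sd, variance_factor, alpha=0.05, power=0.8):
    """Approximate animals per cell in a balanced 2 x 2 design.

    variance_factor is Var(contrast) * n / sd**2 for the contrast tested:
      1.0 for the treatment effect averaged over both sexes,
      4.0 for the treatment-by-sex interaction (difference of differences).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(variance_factor * (z * sd / effect) ** 2)


def n_pooled_treatment_effect(effect, sd, **kwargs):
    # Exploratory inclusion: estimate the treatment effect across both sexes.
    return n_per_cell(effect, sd, 1.0, **kwargs)


def n_sex_interaction(effect, sd, **kwargs):
    # Confirmatory inclusion: test whether the treatment effect differs by sex.
    return n_per_cell(effect, sd, 4.0, **kwargs)


# Hypothetical example: contrast of one standard deviation in both cases.
n_main = n_pooled_treatment_effect(effect=1.0, sd=1.0)
n_inter = n_sex_interaction(effect=1.0, sd=1.0)
```

      Under these assumptions, an interaction of the same magnitude as the pooled treatment effect needs roughly four times as many animals per cell, which is why the two inclusion strategies lead to different sample size requirements.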

      Reviewer #2 (Public review): 

      Summary:

      The investigators tested a workshop intervention to improve knowledge and decrease misconceptions about sex inclusive research. There were important findings that demonstrate the difficulty in changing opinions and knowledge about the importance of studying both males and females. While interventions can improve knowledge and decrease perceived barriers, the impact was small. 

      Strengths:

      The investigators included control groups and replicated the study in a second population of scientists. The results appear to be well substantiated. These are valuable findings that have practical implications for fields where sex is included as a biological variable to improve rigor and reproducibility. 

      Thank you for your assessment and for highlighting these strengths. We appreciate your recognition of the value and practical implications of this work.

      Weaknesses:

      I found the figures difficult to understand and would have appreciated more explanation of what is depicted, as well as greater space between the bars representing different categories. 

      We have improved the figures and figure legends to improve clarity. 

      Reviewer #3 (Public review):

      Summary:

      This manuscript aims to determine cultural biases and misconceptions in inclusive sex research and evaluate the efficacy of interventions to improve knowledge and shift perceptions to decrease perceived barriers for including both sexes in basic research. 

      Overall, this study demonstrates that despite the intention to include both sexes and a general belief in the importance of doing so, relatively few people routinely include both sexes. Further, the perceptions of barriers to doing so are high, including misconceptions surrounding sample size, disaggregation, and variability of females. There was also a substantial number of individuals without the statistical knowledge to appropriately analyze data in studies inclusive of sex. Interventions increased knowledge and decreased perception of barriers. 

      Strengths:

      (1) This manuscript provides evidence for the efficacy of interventions for changing attitudes and perceptions of research.

      (2) This manuscript also provides a training manual for expanding this intervention to broader groups of researchers.

      Thank you for highlighting these strengths. We appreciate your recognition that the intervention was effective in changing attitudes and perceptions. We deliberately chose to share the material to provide the resources to allow wider engagement.

      Weaknesses:

      The major weakness here is that the post-workshop assessment is a single time point, soon after the intervention. As this paper shows, intention for these individuals is already high, so does decreasing perception of barriers and increasing knowledge change behavior, and increase the number of studies that include both sexes? Similarly, does the intervention start to shift cultural factors? Do these contribute to a change in behavior? 

      Measuring change in behaviour following an intervention is challenging and hence we had implemented an intention score as a proxy for behaviour. We appreciate the benefit of a long-term analysis, but it was beyond the scope of this study and would need a larger dataset size to allow for attrition. We agree that the strategy implemented has weaknesses. We have extended the limitation section in the discussion to include these. 

      Reviewer #1 (Recommendations for the authors):  

      I would ask them to think about alternative explanations and ask for free-form responses, and to revise with the caveats written above - sample size does need to be increased depending on effect size, and that within sex studies are also important. Not all studies should focus on sex influences.  

      The inclusion of the additional paragraph in the introduction to clarify the objective of inclusion and the resulting impact on experimental design should address these recommendations.   

      We have also added the free-form responses as an additional supplementary file.  

      Reviewer #2 (Recommendations for the authors):  

      This is an important set of studies. My only recommendation is to improve the data presentation so that it is clear what is depicted and how the analyses were conducted. I know it is in the methods, but reminding the reader would be helpful.

      We have revisited the figures and included more information in the legends to explain the analysis and improve clarity.   

      Reviewer #3 (Recommendations for the authors):  

      There are parts in the introduction which read as contradictory and as such are confusing - for example, in the 3rd paragraph it states that little progress on sex inclusive research has been made, and in the following sentences it states that the proportion of published studies across sex has improved. The references in these two statements are from the same time range, so has this improved? Or not?  

      The introduction does include a summation statement on the position: “Whilst a positive step forward, this proportion still represents a minority of studies, and notably this inclusion was not associated with an increase in the proportion of studies that included data analysed by sex.” We have reworded the text to ensure it is internally consistent with this summary statement and this should increase clarity.

      In discussing the results, it is sometimes confusing what the percentages mean. For example, "the researchers reported only conducting sex inclusive research in <=55% of their studies over the past 5 years (55% in study 1 general population and 35% study 2 pre-assessment)." Does that mean 55% of people are conducting sex inclusive research, or does this mean only half of their studies? These two options have very different implications.

      We agree that the sentence is confusing and it has been reworded.  

      Addressing long-term assessments in attitude and action (ie, performing sex inclusive research) is a crucial addition, with data if possible, but at least substantive discussion.  

      We have added this to the limitations section in the discussion.

      One minor but confusing point is the analogy comparing sex inclusive studies with attending the gym. The point is well taken - knowledge is not enough for behavior change. However, the argument here is that increasing sex inclusive research requires cultural change, whereas going to the gym requires motivation. This seems like an oranges-to-lemons comparison (same family, different outcome when you bite into it).

      At the core, both scenarios involve the challenge of changing established habits and cultural norms in action based on knowledge (the right thing to do). The exercise scenario is a primary example provided by the original authors to describe how aspects of the theory of planned behaviour (perceived behavioural control, attitude, and social norms) may influence behavioural change. Understanding which of these aspects may drive or influence change is why we used this framework to understand our study population. We therefore disagree that this is an oranges-to-lemons comparison.

      References

      (1) Heidari S, Babor TF, De Castro P, Tort S, Curno M. Sex and Gender Equity in Research: rationale for the SAGER guidelines and recommended use. Res Integr Peer Rev. 2016;1:2.

      (2) Karp NA. Navigating the paradigm shift of sex inclusive preclinical research and lessons learnt. Commun Biol. 2025;8(1):681.

      (3) Karp NA, Berdoy M, Gray K, Hunt L, Jennings M, Kerton A, et al. The Sex Inclusive Research Framework to address sex bias in preclinical research proposals. Nat Commun. 2025;16(1):3763.

      (4) MRC. Sex in experimental design - Guidance on new requirements https://www.ukri.org/councils/mrc/guidance-for-applicants/policies-and-guidance-forresearchers/sex-in-experimental-design/: UK Research and Innovation; 2022.

      (5) Clayton JA, Collins FS. Policy: NIH to balance sex in cell and animal studies. Nature. 2014;509(7500):282-3.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      General Statements

      We would like to thank the referees for their time and effort in giving feedback on our work, and their overall positive attitude towards the manuscript. Most of the referees' points were of clarifying and textual nature. We have identified three points which we think require more attention in the form of additional analyses, simulations or significant textual changes:

      (1) Within the manuscript we state that conserved non-coding sequences (CNSs) are a proxy for cis-regulatory elements (CREs). We proceed to use these terms interchangeably without explaining the underlying assumption, which is inaccurate. To improve on this point, we ensured that the new text is explicit about when we mean CNS or CRE. We also added a section to the discussion (‘Limitations of CNSs as CREs’) dedicated to this topic.

      (2) During stabilising selection (maintaining the target phenotype) DSD can occur fully neutrally, or through the evolution of either mutational or developmental robustness. We describe the evolutionary trajectories of our simulations as neutral once fitness mostly plateaued; however, as reviewer 3 points out, small gains in median fitness still occur, indicating that either development becomes more robust to noisy gene expression and tissue variation, and/or the GRNs become more robust to mutations. To discern between fully neutral evolution, where the fitness distribution of the population does not change, and the higher-order emergence of robustness, we performed additional analysis of the given results. Preliminary results showed that many (near-)neutral mutations affect the mutational robustness and developmental robustness, both positively and negatively. To investigate this further we will run an additional set of simulations without developmental stochasticity, which will take about a week. These simulations should allow us to more closely examine the role of stabilising selection (of developmental robustness) in DSD by removing the need to evolve developmental robustness. Additionally, we will set up simulations in which we change the total number of genes and the number of genes under selection, to investigate how this modelling choice influences DSD.

      (3) In the section on rewiring (‘Network redundancy creates space for rewiring’) we will analyse the mechanism allowing for rewiring in more depth, especially in the light of gene duplications and redundancy. We will extend this section with an additional analysis aimed at highlighting how and when rewiring is facilitated.

      We will describe the planned and incorporated revisions in detail below; we believe these have led to a greatly improved manuscript.

      Kind regards,

      Pjotr van der Jagt, Steven Oud and Renske Vroomans

      Description of the planned revisions

      Referee cross commenting (Reviewer 4)

      Reviewer 3's concern about DSD resulting from stabilising selection for robustness is something I missed -- this is important and should be addressed.

      We understand this concern, and agree that we should be more thorough in our analysis of DSD by assessing the higher-order effects of stabilising selection on mutational robustness and/or environmental (developmental) robustness (McColgan & DiFrisco 2024).

      We will 1) extend our analysis of fitness under DSD by computing the mutational and developmental robustness (similar to Figure 2F) over time for a number of ancestral lineages. By comparing these two measures over evolutionary time we will gain a much more fine-grained picture of the evolutionary dynamics and should be able to find adaptive trends through gain of either type of robustness. Preliminary results suggest that during the plateaued fitness phase both mutational robustness and developmental robustness undergo weak gains and losses, likely due to the pleiotropic nature of our GPM. Collectively, these weak gains and losses result in the gain observed in Figure S3. So, rather than fully neutral evolution, we should discern (near-)neutral regimes in which clear adaptive steps are absent, but in which the sum of the small steps is a net gain. These are interesting findings that we initially missed; they give insights into how this high-dimensional fitness landscape is traversed, and will be included in a future revised version of the manuscript.

      2) We will run extra simulations without stochasticity to investigate DSD in the absence of adaptation through developmental robustness, and include the comparison between these and our original simulations in a future revised version.

      Finally 3) we will address stabilising selection more prominently in the introduction and discussion to accommodate these additional simulations.

      Reviewer 3 suggests that the model construction may favor DSD because there are many genes (14) of which only two determine fitness. I agree that some discussion on this point is warranted, though I am not sure enough is known about "the possible difference in constraints between the model and real development" for such a discussion to be on firm biological footing. A genetic architecture commonly found in quantitative genetic studies is that a small number of genes have large effects on the phenotype/fitness, whereas a very large number of genes have effects that are individually small but collectively large (see, e.g. literature surrounding the "omnigenic model" of complex traits). Implementing such an architecture is probably beyond the scope of the study here. More generally, it would be natural to assume that the larger the number of genes, and the smaller the number of fitness-determining genes, the more likely DSD / re-wiring is to occur. That being said, I think the authors' choice of a 14-gene network is biologically defensible. It could be argued that the restriction of many modeling studies to small networks (often including just 3 genes) on the grounds of convenience artificially ensures that DSD will not occur in these networks.

      The choice of 14 genes does indeed stem from a compromise between constraining the number of available genes, but at the same time allowing for sufficient degrees of freedom and redundancy. We have added a ‘modelling choices’ section in the discussion in which we address this point. Additionally, it is important to note that, while the fitness criterion only measures the pattern of 2 genes, throughout the evolutionary lineage additional genes become highly important for the fitness of an individual, because these genes evolved to help generate the target pattern (see for example Figure 4); the other genes indeed reflect reviewer 4’s point that most genes have a small effect. Crucially, we observe that even the genes and interactions that are important for fitness undergo DSD.

      Nevertheless, we think it is interesting to investigate this point of the influence of this particular modelling choice on the potential for DSD, and have set up an extra set of simulations with fewer gene types, and one with additional fitness genes.

      Furthermore, we discuss the choice of our network architecture more in depth in a discussion section on our modelling choices: ‘Modelling assumptions and choices’.

      Reviewer 1

      The observation of DSD in the computational models remains rather high-level in the sense that no motifs, mechanisms, subgraphs, mutations or specific dynamics are reported to be associated to it ---with the exception of gene expression domains overlapping. Perhaps the authors feel it is beyond this study, but a Results section with a more in-depth "mechanistic" analysis on what enables DSD would (a) make a better case for the extensive and expensive computational models and (b) would push this paper to a next level. As a starting point, it could be nice to check Ohno's intuition that gene duplications are a creative "force" in evolution. Are they drivers of DSD? Or are TFBS mutations responsible for the majority of cases?

      We agree that some mechanistic analysis would strengthen the manuscript, and will therefore extend the section ‘Network redundancy creates space for rewiring’ to address how this redundancy is facilitated. For instance, in the rewiring examples given in Figure 4 we can highlight how this new interaction emerges, if this is through a gene mutation followed by rewiring and loss of a redundant gene, or if the gain, redundancy and loss are all on the level of TFBS mutations. Effectively we will investigate which route of the three in the following schematic is most prominent:

      Additionally, we will analyse the different effects of the transcription dynamics for each of these routes. (Note that this is not an exhaustive schematic, and combinations could be possible.)

      l171. You discuss an example here, would it be possible to generalize this analysis and quantify the amount of DSD amongst all cloned populations? And related question: of the many conserved interactions in Fig 4A, how many do the two clonal lineages share? None? All?

      We agree that this is a good idea. In a new supplementary figure, we will show the number of times a conserved interaction gets lost, and a new interaction is gained as a metric for DSD in every cloned population.

      The populations in Fig 4A are cloned at generation 50,000; any interaction that started before then and is still present at a given point in time is shared. Any interaction that started after generation 50,000 is unique (or at least independently gained).
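
      The shared/unique rule described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the manuscript: the `Interaction` record, its field names and any example values are hypothetical, and only the cloning generation of 50,000 is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

CLONING_GENERATION = 50_000  # generation at which the populations in Fig 4A were cloned


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of one regulatory interaction in a clonal lineage."""
    source: int          # regulating gene type
    target: int          # regulated gene type
    start: int           # generation at which the interaction appeared
    end: Optional[int]   # generation at which it was lost (None = never lost)


def is_present(inter: Interaction, generation: int) -> bool:
    """True if the interaction exists at the given generation."""
    return inter.start <= generation and (inter.end is None or generation < inter.end)


def classify(inter: Interaction, generation: int) -> str:
    """Label an interaction as 'shared', 'unique' or 'absent' at a time point."""
    if not is_present(inter, generation):
        return "absent"
    # Interactions that predate the cloning event were inherited by every clone;
    # later ones are unique to a lineage (or gained independently).
    return "shared" if inter.start < CLONING_GENERATION else "unique"
```

      Counting the `shared` labels for two clonal lineages at the same generation would give the number of conserved interactions they have in common.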

      - l269. What about phenotypic plasticity due to stochastic gene expression? Does it play a role in DSD in your model? I am thinking about https://pubmed.ncbi.nlm.nih.gov/24884746/ and https://pubmed.ncbi.nlm.nih.gov/21211007/

      We agree that this is an interesting point which should be included in the discussion. Following the comments of reviewer 3, we have set up extra simulations to investigate this in more detail. We will include these citations in the revised discussion once we have the results of those simulations.

      Reviewer 3

      Issue One: Interpretation of fitness gains under stabilising selection

      A central issue concerns how the manuscript defines and interprets developmental systems drift (DSD) in relation to evolution on the fitness landscape. The authors define DSD as the conservation of a trait despite changes in its underlying genetic basis, which is consistent with the literature. However, the manuscript would benefit from clarifying the relationship between DSD, genotype-to-phenotype maps, and fitness landscapes. Very simply, we can say that (i) DSD can operate along neutral paths in the fitness landscape, (ii) DSD can operate along adaptive paths in the fitness landscape. During DSD, these neutral or adaptive paths along the fitness landscape are traversed by mutations that change the gene regulatory network (GRN) and consequent gene expression patterns whilst preserving the developmental outcome, i.e., the phenotype. While this connection between DSD and fitness landscapes is referenced in the introduction, it is not fully elaborated upon. A complete elaboration is critical because, when I read the manuscript, I got the impression that the manuscript claims that DSD is prevalent along neutral paths in the fitness landscape, not just adaptive ones. If I am wrong and this is not what the authors claim, it should be explicitly stated in the results and discussed. Nevertheless, claiming DSD operates along neutral paths is a much more interesting statement than claiming it operates along adaptive paths. However, it requires sufficient evidence, which I have an issue with.

      The issue I have is about adaptations under stabilising selection. Stabilising selection occurs when there is selection to preserve the developmental outcome. Stabilising selection is essential to the results because evolutionary change in the GRN under stabilising selection should be due to DSD, not adaptations that change the developmental outcome. To ensure that the populations are under stabilising selection, the authors perform clonal experiments for 100,000 generations for 8 already evolved populations, 5 clones for each population. They remove 10 out of 40 clones because the fitness increase is too large, indicating that the developmental outcome changes over the 100,000 generations. However, the remaining 30 clonal experiments exhibit small but continual fitness increases over 100,000 generations. The authors claim that the remaining 30 are predominantly evolving due to drift, not adaptations (in the main text, line 137: "indicating predominantly neutral evolution", and section M: "too shallow for selection to outweigh drift"). The authors' evidence for this claim is a mathematical analysis showing that the fitness gains are too small to be caused by beneficial adaptations, so evolution must be dominated by drift. I found this explanation strange, given that every clone unequivocally increases in fitness throughout the 100,000 generations, which suggests populations are adapting. Upon closer inspection of the mathematical analysis (section M), I believe it will miss many kinds of adaptations possible in their model, as I now describe.

      The mathematical analysis treats fitness as a constant, but it's a random variable in the computational model. Fitness is a random variable because gene transcription and protein translation are stochastic (Wiener terms in Eqs. (1)-(5)) and cell positions change for each individual (Methods C). So, for a genotype G, the realised fitness F is picked from a distribution with mean μ_G and higher order moments (e.g., variance) that determine the shape of the distribution. I think these assumptions lead to two problems.

      The first problem with the mathematical analysis is that F is replaced by an absolute number f_q, with beneficial mutations occurring in small increments denoted "a", representing an additive fitness advantage. The authors then take a time series of the median population fitness from their simulations and treat its slope as the individual's additive fitness advantage "a". The authors claim that drift dominates evolution because this slope is lower than a drift-selection barrier, which they derive from the mathematical analysis. This analysis ignores that the advantage "a" is a distribution, not a constant, which means that it does not pick up adaptations that change the shape of the distribution. Adaptations that change the shape of the distribution can be adaptations that increase robustness to stochasticity. Since there are multiple sources of noise in this model, I think it is highly likely that robustness to noise is selected for during these 100,000 generations.
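
      The point about the shape of the distribution can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration and not part of the model or of Section M: the function names and the fitness draws are hypothetical. It compares two genotypes whose realised fitness has the same mean but a different spread, so a statistic based only on the mean reports no advantage while the variance clearly differs.

```python
from statistics import mean, pvariance
from typing import Sequence


def mean_advantage(mutant: Sequence[float], resident: Sequence[float]) -> float:
    """Additive advantage as seen by an analysis that tracks only the mean."""
    return mean(mutant) - mean(resident)


def spread_change(mutant: Sequence[float], resident: Sequence[float]) -> float:
    """Change in variance of realised fitness (negative = less sensitive to noise)."""
    return pvariance(mutant) - pvariance(resident)


# Hypothetical realised-fitness draws for two genotypes with identical means.
resident_draws = [0.5, 1.0, 1.5]
mutant_draws = [0.75, 1.0, 1.25]

print(mean_advantage(mutant_draws, resident_draws))  # no difference in the mean
print(spread_change(mutant_draws, resident_draws))   # negative: narrower distribution
```

      Whether such a narrowing is actually favoured depends on how selection acts on realised fitness in the model, which this sketch does not attempt to represent.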

      The second problem is that the mathematical analysis ignores traits that have higher-order effects on fitness. A trait has higher-order effects when it increases the fitness of the lineage (e.g., offspring) but not the parent. One possible trait that can evolve in this model with higher-order effects is mutational robustness, i.e., traits that lower the expected mutational load of descendants. Since many kinds of mutations occur in this model (Table 2), mutational robustness may also be evolving.

      Taken together, the analysis in Section M is set up to detect only immediate, deterministic additive gains in a single draw of fitness. It therefore cannot rule out weak but persistent adaptive evolution of robustness (to developmental noise and/or to mutations), and is thus insufficient evidence that DSD is occurring along neutral paths instead of adaptive paths. The small but monotonic fitness increases observed in all 40 clones are consistent with such adaptation (Fig. S3). The authors also acknowledge the evolution of robustness in lines 129-130 and 290-291, but the possibility of these adaptations driving DSD instead of neutral evolution is not discussed.

      To address the issue I have with adaptations during stabilising selection, the authors should, at a minimum, state clearly in their results that DSD is driven by both the evolution of robustness and drift. Moreover, a paragraph in the discussion should be dedicated to why this is the case, and why it is challenging to separate DSD through neutral evolution vs DSD through adaptations such as those that increase robustness.

      [OPTIONAL] A more thorough approach would be to make significant changes to the manuscript by giving sufficient evidence that the experimental clones are evolving by drift, or changing the model construction. One possible way to provide sufficient evidence is to improve the mathematical analysis. Another way is to show that the fitness distributions (both without and with mutations, like in Fig. 2F) do not significantly change throughout the 100,000 generations in experimental clones. It seems more likely that the model construction makes it difficult to separate the evolution of robustness from evolution by drift in the stabilising selection regime. Thus, I think the model should be constructed differently so that robustness against mutations and noise is much less likely to evolve after a "fitness plateau" is reached. This could be done by removing sources of noise from the model or reducing the kinds of possible mutations (related to issue two). In fact, I could not find justification in the manuscript for why these noise terms are included in the model, so I assume they are included for biological realism. If this is why noise is included, or if there is a separate reason why it is necessary, please write that in the model overview and/or the methods.

      We agree that we should be more precise about whether DSD operates along neutral vs adaptive paths in the fitness landscape, and have expanded our explanation of this distinction in the introduction. We also agree that it is worthwhile to distinguish between neutral evolution that does not change the fitness distribution of the population (either through changes in developmental or mutational robustness), higher-order evolutionary processes that increase developmental robustness, and drift along a neutral path in the fitness landscape towards regions of greater connectivity, resulting in mutational robustness (as described in Huynen et al., 1999). We have performed a preliminary analysis to identify changes in mutational robustness and developmental robustness over evolutionary time in the populations in which the maximum fitness has already plateaued. This analysis shows frequent weak gains and losses, in which clear adaptive steps are absent but a net gain in robustness can be seen, consistent with higher-order fitness effects.

      To investigate the role of stabilising selection more in depth we will run simulations without developmental noise in the form of gene expression noise and tissue connectivity variation, thus removing the effect of the evolution of developmental robustness. We will compare the evolutionary dynamics of the GRNs with our original set of simulations, and include both these types of analyses in a supplementary figure of the revised manuscript.

      Furthermore, we now discuss the limitations of the mathematical analysis with regard to adaptation vs neutrality in our simulations, in the supplementary section.

      Issue two: The model construction may favour DSD

      In this manuscript, fitness is determined by the expression pattern of two types of genes (genes 12 and 13 in Table 1). There are 14 types of genes in total that can all undergo many kinds of mutations, including duplications (Table 2). Thus, gene regulatory networks (GRNs) encoded by genomes in this model tend to contain large numbers of interactions. The results show that most of these interactions have minimal effect on reaching the target pattern in high fitness individuals (e.g. Fig. 2F). A consequence of this is that only a minimal number of GRN interactions are conserved through evolution (e.g. Fig. 2D). From these model constructions and results from evolutionary simulations, we can deduce that there are very few constraints on the GRN. By having very few constraints on the GRN, I think it makes it easy for a new set of pattern-producing traits to evolve and subsequently for an old set of pattern-producing traits to be lost, i.e., DSD. Thus, I believe that the model construction may favour DSD.

      I do not have an issue with the model favouring DSD because it reflects real multicellular GRNs, where it is thought that a minority fraction of interactions are critical for fitness and the majority are not. However, it is unknown whether the constraints GRNs face in the model are more or less constrained than real GRNs. Thus, it is not known whether the prevalence of DSD in this model applies generally to real development, where GRN constraints depend on so many factors. At a minimum, the possible difference in constraints between the model and real development should be discussed as a limitation of the model. A more thorough change to the manuscript would be to test the effect of changing the constraints on the GRN. I am sure there are many ways to devise such a test, but I will give my recommendation here.

      [OPTIONAL] My recommendation is that the authors should run additional simulations with simplified mutational dynamics by constraining the model to N genes (no duplications and deletions), of which M out of these N genes contribute to fitness via the specific pattern (with M=2 in the current model). The authors should then test the effect of changing N and M independently, and how this affects the prevalence of DSD. If the prevalence of DSD is robust to changes in N and M, it supports the authors argument that DSD is highly prevalent in developmental evolution. If DSD prevalence is highly dependent on M and/or N, then the claims made in the manuscript about the prevalence of DSD must change accordingly. I acknowledge that these simulations may be computationally expensive, and I think it would be great if the authors knew (or devised) a more efficient way to test the effect of GRN constraints on DSD prevalence. Nevertheless, these additional simulations would make for a potentially very interesting manuscript.

      We agree that these modelling choices likely influence the potential for DSD. We think that our model setup, where most transcription factors are not under direct selection for a particular pattern, more accurately reflects biological development, where the outcome of the total developmental process (a functional organism) is what is under selection, rather than each individual gene pattern. As also mentioned by the referee, in real multicellular development the majority of interactions are not crucial for fitness, similar to our model. We also observe that, as fitness increases, additional genes experience emergent selection for particular expression patterns or interaction structures in the GRN, resulting in their conservation. Nevertheless, we do agree that the effect of model construction on DSD is an unexplored avenue and this work lends itself to addressing this. We will run additional sets of simulations: one in which we reduce the size of the network (‘N’), and a second set where we double the number of fitness contributing genes (‘M’), and show the effect on the extent of DSD in a future supplementary figure.

      Description of the revisions that have already been incorporated in the transferred manuscript

      Referee cross commenting (Reviewer 4)

      Overall I agree with the comments of Reviewer 1, 2 and 3. I note that reviewers 1, 3, and 4 each pointed out the difficulties with assuming that CNSs = CREs, so this needs to be addressed. Two reviewers (3 and 4) also point out problems with equating bulk RNAseq with a conserved phenotype.

      We agree that caution is warranted with the assumption of CNSs = CREs. We have added a section to the discussion in which we discuss this more thoroughly, see ‘Limitations of CNSs as CREs’ in the revised manuscript.

      Additionally, we made textual changes to the statement of significance, abstract and results to better reflect when we talk about CNSs or CREs.

      I agree with Reviewer 1's hesitancy about the rhetorical framing of the paper potentially generalising too far from a computational model of plant meristem patterning.

      We agree that the title should reflect the scope of the manuscript, and our short title reflects that better than ‘ubiquitous’, which implies we investigated beyond plant (meristem) development. We have changed the title in the revised version to ‘System drift in the evolution of plant meristem development’.

      Reviewer 1

      It is system drift, not systems drift (see True and Haag 2001). No 's' after system.

      Thank you for catching this – we corrected this throughout.

      - I am afraid I have a problem with the manuscript title. I think "Ubiquitous" is misplaced, because it strongly suggests you have a long list of case studies across plants and animals, and some quantification of DSD in these two kingdoms. That would have been an interesting result, but it is not what you report. I suggest something along the lines of "System drift in the evolution of plant meristem development", similar to the short title used in the footer.

      - Alternatively, the authors may aim to say that DSD happens all over the place in computational models of development? In that case the title should reflect that the claim refers to modeling. (But what then about the data analysis part?)

      As remarked in the summary (point 2), we agree with this assessment and have changed the title to ‘System drift in the evolution of plant meristem development’.

      Multiple times in the Abstract and Introduction the authors make statements on "cis-regulatory elements" that are actually "conserved non-coding sequences" (CNS). Even if it is not uncommon for CNSs to harbor enhancers etc., I would be very hesitant to use the two as synonyms. As the authors state themselves, sequences, even non-coding, can be conserved for many reasons other than CREs. I would ask the authors to support better their use of "CREs" or adjust language. As roughly stated in their Discussion (lines 310-319), one way forward could be to show for a few CNS that are important in the analysis (of Fig 5), that they have experimentally-verified enhancers. Is that do-able or a bridge too far?

      We changed the text such that we use CNS instead of CRE when discussing the bioinformatic analysis. Additionally we added a section in the discussion to clarify the relationship between CNS and CRE.

      line 7. evo-devo is jargon

      We changed this to ‘…evolution of development (evo-devo) research…’.

      l9. I would think "using a computational model and data analysis"

      Yes, corrected.

      l13. Strictly speaking you did not look at CREs, but at conserved non-coding sequences.

      Indeed, we changed this to CNS.

      l14. "widespread" is exaggerated here, since you show for a single organ in a handful of plant species. You may extrapolate and argue that you do not see why it should not be widespread, but you did not show it. Or tie in all the known cases that can be found in literature.

      We understand that ‘widespread’ seems to suggest that we have investigated a broader range of species and organs. To be more accurate we changed the wording to ‘prevalent’.

      l16. "simpler" than what?

      We added the example of RNA folding.

      l27. Again the tension between CREs and non-coding sequence.

      Changed to conserved non-coding sequence.

      l28. I don't understand the use of "necessarily" here.

      This is indeed confusing and unnecessary; we removed it.

      l34-35. A very general biology statement is backed up by two modeling studies. I would have expected also a few based on comparative analyses (e.g., fossils, transcriptomics, etc).

      We added extra citations and a discussion of more experimental work.

      l36. I was missing the work on "phenogenetic drift" by Weiss; and Pavlicev & Wagner 2012 on compensatory mutations.

      Changed the text to:

      This phenomenon is called developmental system drift (DSD) (True and Haag, 2001; McColgan and DiFrisco, 2024), or phenogenetic drift (Weiss and Fullerton, 2000), and can occur when multiple genotypes which are separated by few mutational steps encode the same phenotype, forming a neutral (Wagner, 2008a; Crombach et al., 2016) or adaptive path (Johnson and Porter, 2007; Pavlicev and Wagner, 2012).

      l38. Kimura and Wagner never had a developmental process in mind, which is much bigger than a single nucleotide or a single gene, respectively. First paper that I am aware of that explicitly connects DSD to evolution on genotype networks is my own work (Crombach 2016), since the editor of that article (True, of True and Haag 2001) highlighted that point in our communications.

      Added citation and moved Kimura to the theoretical examples of protein folding DSD.

      l40. While Huynen and Hogeweg definitely studied the GP map in many of their works, the term goes back to Pere Alberch (1991).

      Added citation.

      l54-55. I'm missing some motivation here. If one wants to look at multicellular structures that display DSD, vulva development in C. elegans and related worms is an "old" and extremely well-studied example. Also, studies on early fly development by Yogi Jaeger and his co-workers are not multicellular, but at least multi-nuclear. Obviously these are animal-based results, so to me it would make sense to make a contrast animal-plant regarding DSD research and take it from there.

      Indeed, DSD has been found in these species and we now reference some of this work; the principle is better known in animals. Nevertheless, within the theoretical literature there is a continuing debate on the importance/extent of DSD.

      Changed text:

      ‘For other GPMs, such as those resulting from multicellular development, it has been suggested that complex phenotypes are sparsely distributed in genotype space, and have low potential for DSD because the number of neutral mutations anti-correlates with phenotypic complexity (Orr, 2000; Hagolani et al., 2021). On the other hand, theoretical and experimental studies in nematodes and fruit flies have shown that DSD is present in a phenotypically complex context (Verster et al., 2014; Crombach et al., 2016; Jaeger, 2018). It therefore remains debated how much DSD actually occurs in species undergoing multicellular development. DSD in plants has received little attention. One multicellular structure which …’

      l66-86. It is a bit of a style-choice, but this is a looong summary of what is to come. I would not have done that. Instead, in the Introduction I would have expected a bit more digging into the concept of DSD, mention some of the old animal cases, perhaps summarize where in plants it should be expected. More context, basically.

      We extended the paragraph on empirical examples of DSD by adding the animal cases and condensed our summary.

      l108. Could you quantify the conserved interactions shared between the populations? Or is each simulation so different that they are pretty much unique?

      Each simulation here is independent of the other simulations, so a per-interaction comparison would be uninformative. After cloning they do share ancestry, but that is much later in the manuscript, and here the quantification of the conserved interactions would be the inverse of the divergence as shown in, for instance, Figure 3B.

      l169. "DSD driving functional divergence" needs some context, since DSD is supposed to not affect function (of the final phenotype). Or am I misunderstanding?

      This is indeed a confusing sentence. We mean to say that DSD allows for divergence to such an extent that the underlying functional pathway is changed. So instead of a mere substitution of the underlying network, in which the topology and relative functions stay conserved, a different network structure is found. We have modified the line to read “Taken together, we found that DSD can drive functional divergence in the underlying GRN resulting in novel spatial expression dynamics of the genes not directly under selection.”

      l176. Say which interaction it is. Is it 0->8, as mentioned in the next paragraph?

      It is indeed 0->8, we have clarified this in the text.

      l197. Bulk RNAseq has the problem of averaging gene expression over the population of cells. How do you think that impacts your test for rewiring? If you would do a similar "bulk RNA" style test on your computational models, would you pick up DSD?

      The rewiring is based on the CNSs, whereas the RNAseq is used as the phenotype, so it does not impact the test for rewiring.

      The averaging of bulk RNAseq does, however, mean that we cannot show conservation/divergence of the phenotype within the tissues, only between the different tissues.

      The most important implication of doing this in our model would be the definition of the ‘phenotype’ which undergoes DSD. Currently the phenotype is a gene expression pattern on a cellular level; for bulk RNA this phenotype would change to tissue-level gene expression.

      This change in what we measure as phenotype affects how we interpret our results, but would not hinder us in picking up DSD; it just has a different meaning than DSD on a cellular and single-tissue scale.

      We added clarification of the roles of the datasets at the start of the paragraph.

      ‘The Conservatory Project collects conserved non-coding sequences (CNSs) across plant genomes, which we used to investigate the extent of GRN rewiring in flowering plants. Schuster et al. measured gene expression in different homologous tissues of several species via bulk RNAseq, which we used to test for gene expression (phenotype) conservation, and how this relates to the GRN rewiring inferred from the CNSs.’

      l202. I do not understand the "within" of a non-coding sequence within an orthogroup. How are non-coding sequences inside an orthogroup of genes?

      We clarify this sentence by saying ‘A CNS is defined as a non-coding sequence conserved within the upstream/downstream region of genes within an orthogroup’, to more clearly separate the CNS from the orthogroup of genes. We also updated Figure 5A to reflect this better.

      l207-217. This paragraph is difficult to read and would benefit from rephrasing. Plant-specific jargon, numbers do not add up (line 211), statements are rather implicit (9 deeply conserved CNS are the 3+6? Where do I see them in Fig 5B? And where do I see the lineage-specific losses?).

      We added extra annotations to the figure to make the plant jargon (angiosperm, eudicot, Brassicaceae) clear, and show the loss more clearly in the figure. We also clarified the text by splitting up 9 to 3 and 6.

      l223. Looking at the shared CNS between SEP1-2, can you find a TF binding site or another property that can be interpreted as regulatory importance?

      Reliably showing an active TF binding site would require experimental data, which we don’t have. We do mention in the discussion the need for datasets which could help address this gap.

      l225. My intuition says that the continuity of the phenotype may not be necessary if its loss can be compensated for somehow by another part of the organism. I.e., DSD within DSD. It is a poorly elaborated thought, I leave it here for your information. Perhaps a Discussion point?

      Although very interesting, we think this discussion might be outside the scope of this work, and would benefit from a standalone discussion, especially since the capacity for such compensation might differ between animals and plants (which are more “modular” organisms). This is our interpretation:

      First, let’s take a step back from ‘genotype’ and ‘phenotype’ and redefine DSD more generally: in a system with multiple organisational levels, where a hierarchical mapping between them exists, DSD consists of changes on one organisational level which do not alter the outcome at the ‘higher’ organisational level. In other words, DSD can exist in any many-to-one mapping in which the members of a set of many (which map to the same one) are within a certain distance of each other in space, which we generally define as a single mutational step.

      Within this (slightly) more general definition we can extend DSD to the level of phenotype and function, in which phenotype describes the ‘many’ layer, and multiple phenotypes can fulfil the same function. When we are freed from the constraint of ‘genotype’ and ‘phenotype’, and DSD is defined at the level of this mapping, then it becomes an easy exercise to have multiple mappings (genotype→phenotype→function) and thus ‘DSD within DSD’.
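
      The many-to-one view can be made concrete with a toy example. The sketch below is purely illustrative and unrelated to the GRN model in the manuscript: genotypes are bit tuples, a mutational step is a single bit flip, and `toy_phenotype` is a hypothetical mapping chosen only because it is many-to-one.

```python
from typing import Callable, Hashable, Iterator, List, Tuple

Genotype = Tuple[int, ...]


def one_step_mutants(genotype: Genotype) -> Iterator[Genotype]:
    """Yield every genotype that differs from the input by exactly one bit flip."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


def toy_phenotype(genotype: Genotype) -> int:
    """Hypothetical many-to-one map: the trait is present if at least two sites are on."""
    return int(sum(genotype) >= 2)


def neutral_neighbours(
    genotype: Genotype, mapping: Callable[[Genotype], Hashable]
) -> List[Genotype]:
    """One-step mutants that map to the same outcome as the focal genotype."""
    outcome = mapping(genotype)
    return [m for m in one_step_mutants(genotype) if mapping(m) == outcome]


print(neutral_neighbours((1, 1, 0), toy_phenotype))  # [(1, 1, 1)]
```

      Because `neutral_neighbours` accepts any mapping, the same function could be applied one level up (phenotype to function) with a suitable notion of a single step, which is the sense in which mappings can be stacked.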

      l233. "rarely"? I don't see any high Pearson distances.

      True, in the given example there are no high Pearson distances; however, some of the supplementary figures do show them, so ‘rarely’ felt like the most honest description. We changed the text to refer to these supplementary figures.
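
      For readers unfamiliar with the measure, a Pearson distance between two expression profiles can be sketched as below. This assumes the common definition of one minus the Pearson correlation coefficient; the exact definition used in the manuscript is not restated in this response, and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed definition: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson_correlation(x, y)
```

      Under this definition, two genes whose expression rises and falls together across tissues have a distance near 0 regardless of absolute expression level.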

      Fig 4. Re-order of panels? I was expecting B at C and vice versa.

      Agreed, we swapped the order of the panels.

      Fig 5B. Red boxes not explained. Mention that it is an UpSet plot?

      We added clarification to the figure caption.

      Fig 5D. It would be nice to quantify the minor and major diffs between orthologs and paralogs.

      We quantify the similarities (and thus differences) in Figure F, but we do indeed not show orthologs vs paralogs explicitly. We have extended Figure F to distinguish which comparisons are between orthologs vs paralogs with different tick marks, which shows their different distributions quite clearly.

      - l247. Over-generalization. In a specific organ of plants...

      Changed to vascular plant meristem.

      - l249. Where exactly is this link between diverse expression patterns and the Schuster dataset made? I suggest the authors to make it more explicit in the Results.

      We are slightly overambitious in this sentence. The Schuster dataset confirms the preservation of expression where the CNS dataset shows rewiring. That this facilitates diversification of expression patterns in traits not under selection is solely an outcome of the computational model. We have changed the text to reflect this more clearly.

      - l268. Final sentence of the paragraph left me puzzled. Why talk about opposite function?

      The goal here was to highlight regulatory rewiring which, in the most extreme case, would achieve an opposite function for a given TF within development. We agree that this was formulated vaguely so we rewrote this to be more to the point.

      These examples demonstrate that whilst the function of pathways is conserved, their regulatory wiring often is not.

      - l269. What about time scales generated by the system? Looking at Fig 2C and 2D, the elbow pattern is pretty obvious. That means interactions sort themselves into either short-lived or long-lived. Worth mentioning?

      Added a sentence to highlight this.

      - l291. Evolution in a *constant* fitness landscape increases robustness.

      Changed.

      - l296. My thoughts, for your info: I suspect morphogenesis as single parameters instead of as mechanisms makes for a brittle landscape, resulting in isolated parts of the same phenotype.

      We agree, and now include citations to different models in which morphogenesis evolves which seem to display a more connected landscape.

      Reviewer 2

      Every computational model necessarily makes some simplifying assumptions. It would be nice if the authors could summarise in a paragraph in the Discussion the main assumptions made by their model, and which of those are most worth revisiting in future studies. In the current draft, some assumptions are described in different places in the manuscript, which makes it hard for a non-expert to evaluate the limitations of this model.

      We added a section to the discussion: ‘Modelling assumptions and choices’

      I did not find any mention of potential energetic constraints or limitations in this model. For example, I would expect high levels of gene expression to incur significant energy costs, resulting in evolutionary trade-offs. Could the authors comment on how taking energy limitations into account might influence their results?

      This would put additional constraints on the evolution/fitness landscape. Some paths/regions of the fitness landscape which are currently accessible will not be traversable anymore. On the other hand, an energy constraint might reduce certain high fitness areas to a more even plane and thus make it more traversable. During analysis of our data there were no signs of extremely high gene expression levels.

      Figure 3C lists Gene IDs 1, 2, 8, and 11, but the caption refers to genes 1, 2, 4, and 11.

      Thank you for catching this.

      Reviewer 3

      The authors present an analysis correlating conserved non-coding sequence (CNS) composition with gene expression to investigate developmental systems drift. One flaw of this analysis is that it uses deeply conserved sequences as a proxy for the entire cis-regulatory landscape. The authors acknowledge this flaw in the discussion.

      Another potential flaw is equating the bulk RNA-seq data with a conserved phenotype. In lines 226-227 of the manuscript, it is written that "In line with our computational model, we compared gene expression patterns to measure changes in phenotype." I am not sure if there is an equivalence between the two. In the computational model, the developmental outcome determining fitness is a spatial pattern, i.e., an emergent product of gene expression and cell interactions. In contrast, the RNA-seq data shows bulk measurements in gene expression for different organs. It is conceivable that, despite having very similar bulk measurements, the developmental outcome in response to gene expression (such as a spatial pattern or morphological shape) changes across species. I think this difference should be explicitly addressed in the discussion. The authors may have intended to discuss this in lines 320-326, although it is unclear to me.

      It is correct that the CNS data and RNA-seq data have certain limitations, and the brief discussion of some of these limitations in lines 320-326 is not sufficient. We have been more explicit on this point in the discussion.

      The gene expression data used in this study represents bulk expression at the organ level, such as the vegetative meristem (Schuster et al., 2024). This limits our analysis of the phenotypic effects of rewiring to comparisons between organs, which is different to our computational simulations where we look at within organ gene expression. Additionally, the bulk RNA-seq does not allow us to discern whether the developmental outcome of similar gene expression is the same in all these species. More fine-grained approaches, such as single-cell RNA sequencing or spatial transcriptomics, will provide a more detailed understanding of how gene expression is modulated spatially and temporally within complex tissues of different organisms, allowing for a closer alignment between computational predictions and experimental observations.

      Can the authors justify using these six species in the discussion or the results? Are there any limitations with choosing four closely related and two distantly related species for this analysis, in contrast to, say, six distantly related species? If so, please elaborate in the discussion.

      The use of these six species is mainly limited by the datasets we have available. Nevertheless, the combination of four closely related species, and two more distantly related species gives a better insight into the short vs long term divergence dynamics than six distantly related species would. We have noted this when introducing the datasets:

      This set of species contains both closely (A. thaliana, A. lyrata, C. rubella, E. salsugineum) and more distantly related species (M. truncatula, B. distachyon), which should give insight in short and long term divergence.

      In Figure S7, some profiles show no conservation across the six species. Can we be sure that a stabilising selection pressure conserves any CNSs? Is it possible that the deeply conserved CNSs mentioned in the main text are conserved by chance, given the large number of total CNSs? A brief comment on these points in the results or discussion would be helpful.

      In our simulations, we find that even CREs that were under selection for a long time can disappear; however, in our neutral simulations, CREs were not conserved, suggesting that deep conservation is the result of selection. When it comes to CNSs, the assumption is that they often contain CREs that are under selection. We have added a more elaborate section on CNSs in the discussion. See ‘Limitations of CNSs as CREs’.

      Line 7-8: I thought this was a bit difficult to read. The connection between (i) evolvability of complex phenotypes, (ii) neutral/beneficial change hindered by deleterious mutations, and (iii) DSD might not be so simple for many readers, so I think it should be rewritten. The abstract was well written, though.

      We made the connection to DSD and evolvability clearer and removed the specific mutational outcomes:

      *A key open question in evolution of development (evo-devo) is the evolvability of complex phenotypes. Developmental system drift (DSD) may contribute to evolvability by exploring different genotypes with similar phenotypic outcome, but with mutational neighbourhoods that have different, potentially adaptive, phenotypes. We investigated the potential for DSD in plant development using a computational model and data analysis.*

      Line 274 vs 276: Is there a difference between regulatory dynamics and regulatory mechanisms?

      No, we should use the same terminology. We have changed this to be clearer.

      Figure S4: Do you expect the green/blue lines to approach the orange line in the long term? In some clonal experiments, it seems like it will. In others, it seems like it has plateaued. Under continual DSD, I assume they should converge. It would be interesting to see simulations run sufficiently long to see if this occurs.

      In principle yes; however, this might take a considerable amount of time, given that some conserved interactions take >75000 generations to be rewired.

      Line 27: Evolutionarily instead of evolutionary?

      Changed

      Line 67-68: References in brackets?

      Changed

      Line 144: Capitalise "fig"

      Changed

      Fig. 3C caption: correct "1, 2, 4, 11" (should be 8)

      Changed

      Line 192: Reference repeated

      Changed

      Fig. 5 caption: Capitalise "Supplementary figure"

      Changed

      Line 277: Correct "A previous model Johnson.."

      Changed

      Line 290: Brackets around reference

      Changed

      Line 299: Correct "will be therefore be"

      Changed

      Line 394: Capitalise "table"

      Changed

      Line 449: Correct "was build using"

      Changed

      Fig. 5B: explain the red dashed boxes in the caption

      Added explanation to the caption

      Some of the Figure panels might benefit from further elaboration in their respective captions, such as 3C and 5F.

      Improved the figure captions.

      Reviewer 4

      Statement of significance. The logical connection between the first two sentences is not clear. What does developmental system drift have to do with neutral/beneficial mutations?

      This is indeed an unclear jump. Changed such that the connection between evolvability of complex phenotypes and DSD is more clear:

      *A key open question in evolution of development (evo-devo) is the evolvability of complex phenotypes. Developmental system drift (DSD) contributes to evolvability by exploring different genotypes with similar phenotypic outcome, but with mutational neighbourhoods that have different, potentially adaptive, phenotypes. We investigated the potential for DSD in plant development using a computational model and data analysis.*

      l 41 - "DSD is found to ... explain the developmental hourglass." Caution is warranted here. Wotton et al 2015 claim that "quantitative system drift" explains the hourglass pattern, but it would be more accurate to say that shifting expression domains and strengths allows compensatory regulatory change to occur with the same set of genes (gap genes). It is far from clear how DSD could explain the developmental hourglass pattern. What does DSD imply about the causes of differential conservation of different developmental stages? It's not clear there is any connection here.

      We should indeed be more cautious here. DSD is indeed not in itself an explanation of the hourglass model, but only a mechanism by which the developmental divergence observed in the hourglass model could have emerged. As per Pavlicev and Wagner, 2012, compensatory changes resulting from other shifts would fall under DSD, and can explain how the patterning outcome of the gap gene network is conserved. However, this does not explain why some stages are under stronger selection than others. We changed the text to reflect this.

      ‘...be a possible evolutionary mechanism involved in the developmental hourglass model (Wotton et al., 2015; Crombach et al., 2016)...’

      ll 51-53 - "Others have found that increased complexity introduces more degrees of freedom, allowing for a greater number of genotypes to produce the same phenotype and potentially allowing for more DSD (Schiffman and Ralph, 2022; Greenbury et al., 2022)." Does this refer to increased genomic complexity or increased phenotypic complexity? It is not clear that increased phenotypic complexity allows a greater number of genotypes to produce the same phenotype. Please explain further.

      The paragraph discusses complexity in the GPM as a whole, where the first few examples in the paragraph regard phenotypic complexity, and the ones in l51-53 refer to genomic complexity. This is currently not clear so we clarified the text.

      ‘For other GPMs, such as those resulting from multicellular development, it has been suggested that complex phenotypes are sparsely distributed in genotype space, and have low potential for DSD because the number of neutral mutations anti-correlates with phenotypic complexity (Orr, 2000; Hagolani et al., 2021). Others have found that increased genomic complexity introduces more degrees of freedom, allowing for a greater number of genotypes to produce the same phenotype and potentially allowing for more DSD (Schiffman and Ralph, 2022; Greenbury et al., 2022).’

      It was not clear why some gene products in the model have the ability to form dimers. What does this contribute to the simulation results? This feature is introduced early on, but is not revisited. Is it necessary?

      *Fitness. The way in which fitness is determined in the model was not completely clear to me.*

      Dimers are not necessary, but as they have been found to play a role in actual SAM development we added them to increase the realism of the developmental simulations. In some simulations the patterning mechanism involves the dimer, in others it does not, suggesting that dimerization is not essential for DSD.

      We have made changes to the methods to clarify fitness.

      Lines 103-104 say: "Each individual is assigned a fitness score based on the protein concentration of two target genes in specific regions of the SAM: one in the central zone (CZ), and one in the organizing center (OC)." How are these regions positionally defined in the simulation?

      We have defined bounding boxes to define cells as either CZ, OC or both. We have added these bounds in the figure description and more clearly in the revised methods.
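
      The bounding-box assignment described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the `Box` type, the `cell_zones` helper, and every coordinate are hypothetical, chosen so that the two regions overlap and a cell can be CZ, OC, or both.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in tissue coordinates (hypothetical units)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        # A cell belongs to the box if its position lies inside or on the border.
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical bounds, NOT taken from the manuscript; the boxes overlap
# for 6.0 <= y <= 7.0 so that some cells count as both CZ and OC.
CZ_BOX = Box(-2.0, 2.0, 6.0, 10.0)
OC_BOX = Box(-2.0, 2.0, 4.0, 7.0)


def cell_zones(x: float, y: float) -> set:
    """Return the set of zone labels ('CZ', 'OC') that a cell position falls in."""
    zones = set()
    if CZ_BOX.contains(x, y):
        zones.add("CZ")
    if OC_BOX.contains(x, y):
        zones.add("OC")
    return zones
```

      Defining the zones purely by position keeps the cell type independent of the evolved expression pattern, which is what allows a fitness penalty for incorrect expression in any cell.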

      In Section F, one reads (l. 385): "Fitness depends on the correct protein concentration of the two fitness genes in each cell, pcz and poc respectively." This sounds like fitness is determined by the state of all cells rather than the state of the two specific regions of the SAM. Please clarify.

      A fitness penalty is given for incorrect expression so it is true that the fitness is determined by the state of all cells. We agree that it is phrased unclearly and have clarified this in the text.

      The authors use conserved non-coding sequences as a proxy for cis-regulatory elements. More specification of how CNSs were assigned to an orthogroup seems necessary in this section. Is assignment based on proximity to the coding region? Of course the authors will appreciate that regulatory elements can be located far from the gene they regulate. This data showed extensive gains and losses of CNS. It might be interesting to consider how much of this is down to transposons, in which case rapid rearrangement is not unexpected. A potential problem with the claim that the data supports the simulation results follows from the fact that DSD is genetic divergence despite trait conservation, but conserved traits appear to have only been defined or identified in the case of the SEP genes. It can't be ruled out that divergence in CNSs and in gene expression captured by the datasets is driven by straightforward phenotypic adaptation, thus not by DSD. Further caution on this point is needed.

      CNSs are indeed assigned based on proximity, up to 50 kb; the full methods are described in detail in Hendelman et al. (2021). CREs can be located further than 50 kb away, but evidence suggests that this is rare for species with smaller genomes.

      In the cases where both gene expression and the CNSs diverged it can indeed not be ruled out that there has been phenotypic adaptation. We clarified in the text that the lower Pearson distances are informative for DSD as they highlight conserved phenotypes.
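
      For readers unfamiliar with the measure, a Pearson distance between two expression profiles can be sketched as below. This assumes the common definition, one minus the Pearson correlation coefficient; the manuscript may normalise or define it differently, and the function name is hypothetical.

```python
from math import sqrt


def pearson_distance(profile_a, profile_b):
    """Return 1 - r, where r is the Pearson correlation of two equal-length profiles.

    0 means perfectly correlated profiles, 2 means perfectly anti-correlated.
    """
    n = len(profile_a)
    if n != len(profile_b) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    # Sums of centred products; the 1/n factors cancel in the ratio below.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / sqrt(var_a * var_b)
```

      Under this definition a low distance means that expression across organs rises and falls together in the two species, regardless of absolute expression level.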

      l. 290-291 - "However, evolution has been shown to increase mutational robustness over time, resulting in the possibility for more neutral change." It is doubtful that there is any such unrestricted trend. If mutational robustness only tended to increase, new mutations would not affect the phenotype, and phenotypes would be unable to adapt to novel environments. Consider rethinking this statement.

      We have reformulated this statement, since it is indeed not expected that this trend is indefinite. Infinite robustness would indeed result in the absence of evolvability; however, it has been shown for other genotype-phenotype maps that mutational robustness, where a proportion of mutations is neutral, aids the evolution of novel traits. The evolution of mutational robustness also depends on population size and mutation rate. This trend will, most probably, also be stronger in modelling work where the fitness function is fixed, compared to a real life scenario where ‘fitness’ is much less defined and subject to continuous change. We added ‘constant’ to the fitness landscape to highlight this disparity.

      ll. 316-317 "experimental work investigating the developmental role of CREs has shown extensive epistasis - where the effect of a mutation depends on the genetic background - supporting DSD." How does extensive epistasis support DSD? One can just as easily imagine scenarios where high interdependence between genes would prevent DSD from occurring. Please explain further.

      We should be more clear. Experimental work has shown that the effect of mutating a particular CRE strongly depends on the genetic background, also known as epistasis. Counterintuitively, this indirectly supports the presence of DSD, since it means that different species or strains have slightly different developmental mechanisms, resulting in these different mutational effects. We have shown how epistatic effects shift over evolutionary time.

      Overall I found the explanation of the Methods, especially the formal aspects, to be unclear at times and would recommend that the authors go back over the text to improve its clarity.

      We rewrote parts of the methods and some of the equations to be more clear and cohesive throughout the text.

      C. Tissue Generation. Following on the comment on fitness above, it would be advisable to provide further details on how cell positions are defined. How much do the cells move over the course of the simulation? What is the advantage of modelling the cells as "springs" rather than as a simple grid?

      The tissue generation is purely a process to generate a database of tissue templates: the random positions, springs and Voronoi method serve the purpose of having similar but different tissues to prevent unrealistic overfitting of our GRNs on a single topology. For each individual’s development, however, only one unchanging template is used. We clarified this in the methods.

      E. Development of genotype into phenotype. The diffusion term in the SDE equations is hard to understand as no variable for spatial position (x) is included in the equation. It seems this equation should rather be an SPDE with a position variable and a specified boundary condition (i.e. the parabola shape). In eq. 5 it should be noted that the Wi are independent. Also please justify the choice of how much noise/variance is being stipulated here.

      We have rewritten parts of this section for clarity and added citations.

      F. Fitness function. I must say I found formula 7 to be unclear. It looks like fi is the fitness of cell(s) but, from Section G, fitness is a property of the individual. It seems formula 7 should define fi as a sum over the cell types or should capture the fitness contribution of the cell types.

      Correct. We have rewritten this equation. We’ll define fi as the fitness contribution of a cell, F as the sum of fi, so the fitness of an individual, and use F in function 8.
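
      In symbols, the relationship described here can be written as below. This is only a restatement of the sentence above; the index range and the symbol n for the number of cells are assumptions, and the form of each f_i is left to the revised methods.

```latex
% F: fitness of an individual; f_i: fitness contribution of cell i;
% n: number of cells in the tissue (symbol assumed here for illustration).
\[
  F = \sum_{i=1}^{n} f_i
\]
```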

      What is the basis for the middle terms (fractions) in the equation? After plugging in the values for pcz and poc, this yields a number, but how does that number assign a cell to one of the types? If a reviewer closely scrutinizing this section cannot make sense of it, neither will readers. Please explain further.

      The cell type is assigned based on the spatial location of the cell, and the correct fitness function for each of these cell types is described in this equation. We have clarified the text and functions.

      A minor note: it would be best practice not to re-use variables to refer to different things within the same paper. For example p refers to protein concentration but also probability of mutation.

      Corrected

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #3

      Evidence, reproducibility and clarity

      Summary:

      This manuscript uses an Evo-Devo model of the plant apical meristem to explore the potential for developmental systems drift (DSD). DSD occurs when the genetic underpinnings of development change through evolution while reaching the same developmental outcome. The mechanisms underlying DSD are theoretically intriguing and highly relevant for our understanding of how multicellular species evolve. The manuscript shows that DSD occurs extensively and continuously in their evolutionary simulations whilst populations evolve under stabilising selection. The authors examine regulatory rewiring across plant angiosperms to link their theoretical model with real data. The authors claim that, despite the conservation of genetic wiring in angiosperm species over shorter evolutionary timescales, this genetic wiring changes over long evolutionary timescales due to DSD, which is consistent with their theoretical model.

      Major comments:

      I enjoyed reading the authors' approach to understanding DSD and the link to empirical data. I think it is a very important line of investigation that deserves more theoretical and experimental attention. All the data and methods are clearly presented, and the software for the research is publicly available. Sufficient information is given to reproduce all results. However, I have two major issues relating to the theoretical part of the research.

      Issue One: Interpretation of fitness gains under stabilising selection

      A central issue concerns how the manuscript defines and interprets developmental systems drift (DSD) in relation to evolution on the fitness landscape. The authors define DSD as the conservation of a trait despite changes in its underlying genetic basis, which is consistent with the literature. However, the manuscript would benefit from clarifying the relationship between DSD, genotype-to-phenotype maps, and fitness landscapes. Very simply, we can say that (i) DSD can operate along neutral paths in the fitness landscape, (ii) DSD can operate along adaptive paths in the fitness landscape. During DSD, these neutral or adaptive paths along the fitness landscape are traversed by mutations that change the gene regulatory network (GRN) and consequent gene expression patterns whilst preserving the developmental outcome, i.e., the phenotype. While this connection between DSD and fitness landscapes is referenced in the introduction, it is not fully elaborated upon. A complete elaboration is critical because, when I read the manuscript, I got the impression that the manuscript claims that DSD is prevalent along neutral paths in the fitness landscape, not just adaptive ones. If I am wrong and this is not what the authors claim, it should be explicitly stated in the results and discussed. Nevertheless, claiming DSD operates along neutral paths is a much more interesting statement than claiming it operates along adaptive paths. However, it requires sufficient evidence, which I have an issue with.

      The issue I have is about adaptations under stabilising selection. Stabilising selection occurs when there is selection to preserve the developmental outcome. Stabilising selection is essential to the results because evolutionary change in the GRN under stabilising selection should be due to DSD, not adaptations that change the developmental outcome. To ensure that the populations are under stabilising selection, the authors perform clonal experiments for 100,000 generations for 8 already evolved populations, 5 clones for each population. They remove 10 out of 40 clones because the fitness increase is too large, indicating that the developmental outcome changes over the 100,000 generations. However, the remaining 30 clonal experiments exhibit small but continual fitness increases over 100,000 generations. The authors claim that the remaining 30 are predominantly evolving due to drift, not adaptations (in the main text, line 137: "indicating predominantly neutral evolution", and section M: "too shallow for selection to outweigh drift"). The authors' evidence for this claim is a mathematical analysis showing that the fitness gains are too small to be caused by beneficial adaptations, so evolution must be dominated by drift. I found this explanation strange, given that every clone unequivocally increases in fitness throughout the 100,000 generations, which suggests populations are adapting.

      Upon closer inspection of the mathematical analysis (section M), I believe it will miss many kinds of adaptations possible in their model, as I now describe. The mathematical analysis treats fitness as a constant, but it's a random variable in the computational model. Fitness is a random variable because gene transcription and protein translation are stochastic (Wiener terms in Eqs. (1)-(5)) and cell positions change for each individual (Methods C). So, for a genotype G, the realised fitness F is picked from a distribution with mean μ_G and higher order moments (e.g., variance) that determine the shape of the distribution. I think these assumptions lead to two problems.

      The first problem with the mathematical analysis is that F is replaced by an absolute number f_q, with beneficial mutations occurring in small increments denoted "a", representing an additive fitness advantage. The authors then take a time series of the median population fitness from their simulations and treat its slope as the individual's additive fitness advantage "a". The authors claim that drift dominates evolution because this slope is lower than a drift-selection barrier, which they derive from the mathematical analysis. This analysis ignores that the advantage "a" is a distribution, not a constant, which means that it does not pick up adaptations that change the shape of the distribution. Adaptations that change the shape of the distribution can be adaptations that increase robustness to stochasticity. Since there are multiple sources of noise in this model, I think it is highly likely that robustness to noise is selected for during these 100,000 generations.

      The second problem is that the mathematical analysis ignores traits that have higher-order effects on fitness. A trait has higher-order effects when it increases the fitness of the lineage (e.g., offspring) but not the parent. One possible trait that can evolve in this model with higher-order effects is mutational robustness, i.e., traits that lower the expected mutational load of descendants. Since many kinds of mutations occur in this model (Table 2), mutational robustness may also be evolving.

      Taken together, the analysis in Section M is set up to detect only immediate, deterministic additive gains in a single draw of fitness. It therefore cannot rule out weak but persistent adaptive evolution of robustness (to developmental noise and/or to mutations), and is thus insufficient evidence that DSD is occurring along neutral paths instead of adaptive paths. The small but monotonic fitness increases observed in all 40 clones are consistent with such adaptation (Fig. S3). The authors also acknowledge the evolution of robustness in lines 129-130 and 290-291, but the possibility of these adaptations driving DSD instead of neutral evolution is not discussed.

      To address the issue I have with adaptations during stabilising selection, the authors should, at a minimum, state clearly in their results that DSD is driven by both the evolution of robustness and drift. Moreover, a paragraph in the discussion should be dedicated to why this is the case, and why it is challenging to separate DSD through neutral evolution vs DSD through adaptations such as those that increase robustness.

      [OPTIONAL] A more thorough approach would be to make significant changes to the manuscript by giving sufficient evidence that the experimental clones are evolving by drift, or changing the model construction. One possible way to provide sufficient evidence is to improve the mathematical analysis. Another way is to show that the fitness distributions (both without and with mutations, like in Fig. 2F) do not significantly change throughout the 100,000 generations in experimental clones. It seems more likely that the model construction makes it difficult to separate the evolution of robustness from evolution by drift in the stabilising selection regime. Thus, I think the model should be constructed differently so that robustness against mutations and noise is much less likely to evolve after a "fitness plateau" is reached. This could be done by removing sources of noise from the model or reducing the kinds of possible mutations (related to issue two). In fact, I could not find justification in the manuscript for why these noise terms are included in the model, so I assume they are included for biological realism. If this is why noise is included, or if there is a separate reason why it is necessary, please write that in the model overview and/or the methods.
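
      The suggested check on fitness distributions could be sketched along the following lines. This is an illustration under stated assumptions, not an analysis from the manuscript: the two-sample Kolmogorov-Smirnov statistic is one possible way to compare distribution shapes, the function name is hypothetical, and the fitness values are invented so that the median stays fixed while the spread narrows.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 for identical samples, 1.0 for samples that do not overlap at all.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is sufficient to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Invented realised-fitness draws for one clone at an early and a late
# generation (hypothetical values): same median, narrower spread later.
early = [0.80, 0.85, 0.90, 0.95, 1.00]
late = [0.88, 0.89, 0.90, 0.91, 0.92]

median_shift = median(late) - median(early)  # no change in the median
shape_shift = ks_statistic(early, late)      # but the distributions differ
```

      In this toy case the median does not move at all, yet the statistic is clearly above zero, which is the kind of change in shape that a slope-of-the-median analysis cannot register. A real test would need many more draws per time point and a significance threshold.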

      Issue two: The model construction may favour DSD

      In this manuscript, fitness is determined by the expression pattern of two types of genes (genes 12 and 13 in Table 1). There are 14 types of genes in total that can all undergo many kinds of mutations, including duplications (Table 2). Thus, gene regulatory networks (GRNs) encoded by genomes in this model tend to contain large numbers of interactions. The results show that most of these interactions have minimal effect on reaching the target pattern in high fitness individuals (e.g. Fig. 2F). A consequence of this is that only a minimal number of GRN interactions are conserved through evolution (e.g. Fig. 2D). From these model constructions and results from evolutionary simulations, we can deduce that there are very few constraints on the GRN. By having very few constraints on the GRN, I think it makes it easy for a new set of pattern-producing traits to evolve and subsequently for an old set of pattern-producing traits to be lost, i.e., DSD. Thus, I believe that the model construction may favour DSD. I do not have an issue with the model favouring DSD because it reflects real multicellular GRNs, where it is thought that a minority fraction of interactions are critical for fitness and the majority are not. However, it is unknown whether the constraints GRNs face in the model are more or less constrained than real GRNs. Thus, it is not known whether the prevalence of DSD in this model applies generally to real development, where GRN constraints depend on so many factors. At a minimum, the possible difference in constraints between the model and real development should be discussed as a limitation of the model. A more thorough change to the manuscript would be to test the effect of changing the constraints on the GRN. I am sure there are many ways to devise such a test, but I will give my recommendation here.

      [OPTIONAL] My recommendation is that the authors should run additional simulations with simplified mutational dynamics by constraining the model to N genes (no duplications and deletions), of which M out of these N genes contribute to fitness via the specific pattern (with M=2 in the current model). The authors should then test the effect of changing N and M independently, and how this affects the prevalence of DSD. If the prevalence of DSD is robust to changes in N and M, it supports the authors' argument that DSD is highly prevalent in developmental evolution. If DSD prevalence is highly dependent on M and/or N, then the claims made in the manuscript about the prevalence of DSD must change accordingly. I acknowledge that these simulations may be computationally expensive, and I think it would be great if the authors knew (or devised) a more efficient way to test the effect of GRN constraints on DSD prevalence. Nevertheless, these additional simulations would make for a potentially very interesting manuscript.

      Minor comments:

      1. The authors present an analysis correlating conserved non-coding sequence (CNS) composition with gene expression to investigate developmental systems drift. One flaw of this analysis is that it uses deeply conserved sequences as a proxy for the entire cis-regulatory landscape. The authors acknowledge this flaw in the discussion. Another potential flaw is equating the bulk RNA-seq data with a conserved phenotype. In lines 226-227 of the manuscript, it is written that "In line with our computational model, we compared gene expression patterns to measure changes in phenotype." I am not sure if there is an equivalence between the two. In the computational model, the developmental outcome determining fitness is a spatial pattern, i.e., an emergent product of gene expression and cell interactions. In contrast, the RNA-seq data shows bulk measurements in gene expression for different organs. It is conceivable that, despite having very similar bulk measurements, the developmental outcome in response to gene expression (such as a spatial pattern or morphological shape) changes across species. I think this difference should be explicitly addressed in the discussion. The authors may have intended to discuss this in lines 320-326, although it is unclear to me.
      2. Can the authors justify using these six species in the discussion or the results? Are there any limitations with choosing four closely related and two distantly related species for this analysis, in contrast to, say, six distantly related species? If so, please elaborate in the discussion.
      3. In Figure S7, some profiles show no conservation across the six species. Can we be sure that a stabilising selection pressure conserves any CNSs? Is it possible that the deeply conserved CNSs mentioned in the main text are conserved by chance, given the large number of total CNSs? A brief comment on these points in the results or discussion would be helpful.
      4. Line 7-8: I thought this was a bit difficult to read. The connection between (i) evolvability of complex phenotypes, (ii) neutral/beneficial change hindered by deleterious mutations, and (iii) DSD might not be so simple for many readers, so I think it should be rewritten. The abstract was well written, though.
      5. Line 274 vs 276: Is there a difference between regulatory dynamics and regulatory mechanisms?
      6. Figure S4: Do you expect the green/blue lines to approach the orange line in the long term? In some clonal experiments, it seems like it will. In others, it seems like it has plateaued. Under continual DSD, I assume they should converge. It would be interesting to see simulations run sufficiently long to see if this occurs.
      7. Line 27: Evolutionarily instead of evolutionary?
      8. Line 67-68: References in brackets?
      9. Line 144: Capitalise "fig"
      10. Fig. 3C caption: correct "1, 2, 4, 11" (should be 8)
      11. Line 192: Reference repeated
      12. Fig. 5 caption: Capitalise "Supplementary figure"
      13. Line 277: Correct "A previous model Johnson.."
      14. Line 290: Brackets around reference
      15. Line 299: Correct "will be therefore be"
      16. Line 394: Capitalise "table"
      17. Line 449: Correct "was build using"
      18. Fig. 5B: explain the red dashed boxes in the caption
      19. Some of the Figure panels might benefit from further elaboration in their respective captions, such as 3C and 5F.

      Significance

      General Assessment:

      This manuscript tackles a fundamental evolutionary problem of developmental systems drift (DSD). Its primary strength lies in its integrative approach, combining a multiscale evo-devo model with a comparative genomic analysis in angiosperms. This integrative approach provides a new way of investigating how developmental mechanisms can evolve even while the resulting phenotype is conserved. The details of the theoretical model are well defined and succinctly combined across scales. The manuscript employs several techniques to analyse the conservation and divergence of the theoretical model's gene regulatory networks (GRNs), which are rigorous yet easy to grasp. This study provides a strong platform for further integrative approaches to tackle DSD and multicellular evolution.

      The study's main limitations are due to the theoretical model construction and the interpretation of the results. The central claim that DSD occurs extensively through predominantly neutral evolution is not sufficiently supported, as the analysis does not rule out an alternative: DSD is caused by adaptive evolution for increased robustness to developmental or mutational noise. Furthermore, constructing the model with a high-dimensional GRN space and a low-dimensional phenotypic target may create particularly permissive conditions for DSD, raising questions about the generality of the theoretical conclusions. However, these limitations could be resolved by changes to the model and further simulations, although these require extensive research. The genomic analysis uses cis-regulatory elements as a proxy for the entire regulatory landscape, a limitation the authors are aware of and discuss. The genomic analysis uses bulk RNA-seq as a proxy for the developmental outcome, which may not accurately reflect differences in plant phenotypes.

      Advance:

      The concept of DSD is well-established, but mechanistic explorations of its dynamics in complex multicellular models are still relatively rare. This study represents a mechanistic advance by providing a concrete example of how DSD can operate continuously under stabilising selection. I found the evolutionary simulations and subsequent analysis of mechanisms underlying DSD in the theoretical model interesting, and these simulations and analyses open new pathways for studying DSD in theoretical models. To my knowledge, the attempt to directly link the dynamics from such a complex evo-devo model to patterns of regulatory element conservation across a real phylogeny (angiosperms) is novel. However, I think that the manuscript does not have sufficient evidence to show a high prevalence of DSD through neutral evolution in their theoretical model, which would be a highly significant conceptual result. The manuscript does have sufficient evidence to show a high prevalence of DSD through adaptive evolution under stabilising selection, which is a conceptually interesting, albeit somewhat expected, result.

      Audience:

      This work will be of moderate interest to a specialised audience in the fields of evolutionary developmental biology (evo-devo), systems biology, and theoretical/computational biology. Researchers in these areas will be interested in the model and the dynamics of GRN conservation and divergence. The results may interest a broader audience across the fields of evolutionary biology and molecular evolution.

      Expertise:

      My expertise is primarily in theoretical and computational models of biology and biophysics. While I have sufficient background knowledge in bioinformatics to assess the logic of the authors' genomic analysis and its connection to their theoretical model, I do not have sufficient expertise to critically evaluate the technicalities of the bioinformatic methods used for the identification of conserved non-coding sequences (CNSs) or analysis of RNA-seq data. A reviewer with expertise in plant comparative genomics would be better suited to judge the soundness of these specific methods.

    1. Abstract. Background: Soil ecosystems have long been recognized as hotspots of microbial diversity, but most estimates of their complexity remain speculative, relying on limited data and extrapolation from shallow sequencing. Here, we revisit this question using one of the deepest metagenomic sequencing efforts to date, applying 148 Gbp of Nanopore long-read and 122 Gbp of Illumina short-read data to a single forest soil sample. Results: Our hybrid assembly reconstructed 837 metagenome-assembled genomes (MAGs), including 466 high- and medium-quality genomes, nearly all lacking close relatives among cultivated taxa. Rarefaction and k-mer analyses reveal that, even at this depth, we capture only a fraction of the extant diversity: nonparametric models project that over 10 Tbp would be required to approach saturation. These findings offer a quantitative, technology-enabled update to long-standing diversity estimates and demonstrate that conventional metagenomic sequencing efforts likely miss the majority of microbial and biosynthetic potential in soil. We further identify over 11,000 biosynthetic gene clusters (BGCs), >99% of which have no match in current databases, underscoring the breadth of unexplored metabolic capacity. Conclusions: Taken together, our results emphasize both the power and the present limitations of metagenomics in resolving natural microbial complexity, and they provide a new baseline for evaluating future advances in microbial genome recovery, taxonomic classification, and natural product discovery.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf135), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Ameet Pinto

      The manuscript provides long-read mock community datasets from GridION and PromethION sequencing platforms along with draft genomes of mock community organisms sequenced on the Illumina Platform. The entire dataset is available for reuse by the research community and this is an extremely valuable resource that the authors have made available. While there are some analyses of the data included in the current manuscript, it is largely limited to summary statistics (which seems appropriate for a Data Note type manuscript) and some analyses of interest to the field (e.g., de novo metagenome assembly). It would have been helpful to have a more detailed evaluation of the de novo assembly and parameter optimization, but this may have been outside the scope of a Data Note type manuscript. I have some minor comments below to improve clarity of the manuscript.

      Minor comments: 1. Line 28-29: Would suggest that the authors provide the citation (15) without the statement in parenthesis or revised version of statement in parenthesis.

      "DNA extraction protocol" section 2. The last few lines were a little bit unclear. For instance: "45 ul (Even) and 225ul (Log) of the supernatant retained earlier…" It was a bit confusing. Possibly because the line "The standard was spun…before removing the supernatant and retaining." seems incomplete. I would suggest that the authors consider posting the entire protocol on protocols.io, as it is quite possible that other groups may want to reproduce the sequencing step for these mock community standards. This would be particularly helpful as the authors suggest that the protocol was modified to increase fragment length.

      "Illumina sequencing" section: 3. Suggest that the authors improve clarity in this section by re-structuring this paragraph. For instance, early in paragraph it is stated that the pooled library was sequenced on four lanes on Illumina HiSeq 1500, but later stated that the even community was sequenced on a MiSeq.

      "Nanopore sequencing metrics" in results: 4. Table 2, Figure 3a. - please fix this to Figure 1a. 5. Figure 1B: The x-axis is "accuracy" while in this section Figure 1b is referred to as providing "quality scores". Please replace "quality scores" with "accuracy" for consistency. 6. Figure 1C: Please provide a legend mapping colors to "even" and "log". I realize this information is in Figure 1B, but it would be helpful for the reader. Finally, there is no significant trend in sequencing speed over time. Considering this, it would be easier to remove the Time component and just have a single panel with the GridION and PromethION sequencing speed for both even and log community in the same panel. It would make it easier to compare the difference in sequencing speeds visually.

      "Illumina sequencing metrics" in results: 7. Table 5 is mentioned before Tables 3 and 4. Please correct this.

      "Nanopore mapping statistics" in results: 8. For Figure 2, consider also providing a figure for the even community. 9. Further, it would be helpful to get clarity on where the data for Figure 2 is coming from. Is this from mapping of long reads to the mock community draft genomes (I think so) or from the Kraken analyses?

      "Nanopore metagenome assemblies" in results: 1. It is unclear how the genome completeness was estimated. 2. The consensus accuracy data is provided for all assemblies combined. It would be helpful if there was some discussion on accuracy of assemblies as a function of the wtdbg2 parameters tested. There is some discussion of this in the "Discussion section", but it would be helpful if this was laid out clearly in the results, with an additional appropriate figure/table.

    1. ABSTRACT The workflow management system Nextflow, together with the nf-core community, forms an essential ecosystem in bioinformatics. However, ensuring the correctness and reliability of large and complex pipelines is challenging, since a unified and automated unit-style testing framework specific to Nextflow is still missing. To provide this crucial component to the community, we developed the testing framework nf-test. It introduces a modular approach that enables pipeline developers to test individual process blocks, workflow patterns and entire pipelines in isolation. nf-test is based on a similar syntax as Nextflow DSL 2 and provides unique features such as snapshot testing and smart testing to save resources by testing only changed modules. We show on different pipelines that these improvements minimize development time, reduce test execution time by up to 80% and enhance software quality by identifying bugs and issues early. Already adopted by dozens of pipelines, nf-test improves the robustness and reliability in pipeline development.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf130), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Jose Espinosa-Carrasco

      The article presents nf-test, a new modular and automated testing framework designed specifically for Nextflow workflows, a widely used workflow management system in bioinformatics. nf-test aims to help developers improve the reliability and maintainability of complex Nextflow pipelines. The framework includes very useful features such as snapshot testing, which assesses the computational repeatability of the results produced by the execution of a pipeline or its components, and smart testing, which optimises computational resources by only executing tests on the parts of the pipeline that were modified, reducing overall run time. Notably, nf-test can be integrated into CI workflows and has already been adopted by the nf-core community, demonstrating its utility and maturity in real-world scenarios.
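
      The idea behind snapshot testing described above can be illustrated with a minimal Python sketch. This is a conceptual illustration only, not nf-test's syntax or implementation: the helper names (`md5_of`, `check_snapshot`, `demo`) and the JSON snapshot layout are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def check_snapshot(outputs: list[Path], snapshot_file: Path) -> bool:
    """Compare output files against a stored snapshot of their checksums.

    On the first run no snapshot exists, so one is recorded and the check
    passes. On later runs the check passes only if every checksum matches.
    """
    current = {p.name: md5_of(p) for p in outputs}
    if not snapshot_file.exists():
        snapshot_file.write_text(json.dumps(current, indent=2, sort_keys=True))
        return True
    recorded = json.loads(snapshot_file.read_text())
    return recorded == current


def demo() -> tuple[bool, bool, bool]:
    """Run three checks: record, unchanged output, changed output."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.txt"
        snap = Path(tmp) / "result.snap.json"  # hypothetical snapshot name
        out.write_text("chr1\t100\tA\tG\n")
        first = check_snapshot([out], snap)   # records the snapshot
        second = check_snapshot([out], snap)  # same output, should match
        out.write_text("chr1\t100\tA\tT\n")
        third = check_snapshot([out], snap)   # changed output, should differ
        return first, second, third
```

      A passing check in this sketch only shows that the output is identical to a previous run; it says nothing about whether that earlier output was correct.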

      General comments:

      The manuscript could benefit from reordering some sections to follow a more consistent structure and by removing redundant explanations. I think it would be nice to include one limitation of nf-test, the fact that reproducing previous results does not necessarily imply biological correctness. This point is not entirely clear in the current version of the manuscript (see my comment below). Another aspect that could improve the manuscript is the inclusion of at least one reference or explanation of how nf-test can be applied outside nf-core pipelines, as all the provided examples are currently restricted to nf-core.

      Specific comments:

      On page 3, the sentence "Thus, maintenance requires substantial time and effort to manually verify that the pipeline continues to produce scientifically valid results" could be more precise. I would argue that identical results across versions do not guarantee scientific validity; they merely confirm consistency with previous outputs. True scientific validity requires comparison against a known ground truth or standard.

      On page 4, in the sentence "It is freely available, and extensive documentation is provided on the website", I think it would be nice to include the link to the documentation.

      In the "Evaluation and Validation" section (page 8), it would be helpful to briefly state the goal of each evaluated test, as is done with the nf-gwas example. You could include something similar for the nf-core/fetchngs and modules examples (e.g. to assess resource optimization through smart testing). Also, the paragraph references the "--related-tests" option, which could benefit from a short explanation of what it does. Lastly, the order in which the pipelines are presented in this section differs from the order in the Results, which makes the structure a bit confusing.

      The sections titled "Unit testing in nf-test", "Test case execution", "Smart testing and parallelization", "Snapshot testing", and "Extensions for bioinformatics" seem more appropriate for the Materials and Methods section, as they describe the design and functionality of nf-test rather than reporting actual results. Please ignore this comment if the current structure follows specific journal formatting requirements that I may not be aware of.

      The Snapshot testing discussion in the Results section feels somewhat repetitive with its earlier explanation. Consider combining both discussions or restructuring the content to reduce duplication.

      On page 11, the sentence "In these cases, MD5 sums cannot be used and validating the dynamic output content can be time-intensive" is not entirely clear to me. Does it mean that it is time-consuming to implement the test for this kind of file, or that the validation of the files is time-consuming?
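
      For readers unfamiliar with the problem the quoted sentence refers to, the following Python sketch shows why a raw checksum fails on output with dynamic content, and one possible workaround (masking volatile fields before hashing). The timestamp pattern, the example file contents, and the function names are all assumptions made for illustration; this is not how nf-test or the manuscript handles such files.

```python
import hashlib
import re

# Hypothetical pattern for volatile content: ISO-like timestamps.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def raw_md5(text: str) -> str:
    """MD5 of the text exactly as written."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def normalized_md5(text: str) -> str:
    """MD5 of the text after replacing timestamps with a fixed placeholder."""
    return raw_md5(TIMESTAMP.sub("<TIMESTAMP>", text))


# Two runs with identical results but different creation times.
run_a = "##created=2025-01-01 10:00:00\nchr1\t100\tA\tG\n"
run_b = "##created=2025-01-02 11:30:45\nchr1\t100\tA\tG\n"

print(raw_md5(run_a) == raw_md5(run_b))                # checksums differ
print(normalized_md5(run_a) == normalized_md5(run_b))  # checksums agree
```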

      On page 12, the sentence "Second, we analyzed the last 500 commits..." is confusing because this is actually the third point in the "Evaluation and Validation" section, as mentioned before. Reordering would improve clarity.

      On page 14, the authors state "However, changes (b) and (c) lead to incorrect output results without breaking the pipeline. Thus, these are the worst-case scenarios for a pipeline developer." While this is mostly true, I would also add that a change in parameters may produce different, but not necessarily incorrect, results; some may even be more biologically meaningful. I suggest acknowledging this.

      Typos:

      In the abstract: "Build on a similar syntax as Nextflow DSL2" should be corrected to "Built on a similar syntax as Nextflow DSL2".

      In the legend of Figure 2 (page 19): "nf-tet" should be "nf-test".

      In the legend of Table 2: "Time savings areis calculated..." should be "Time savings are calculated..."

      Recommendation:

      Given the relevance and technical contributions of the manuscript, I recommend its publication after addressing the minor revisions summarized above.

    1. ABSTRACT Nanopore sequencing is a widespread and important method in genomics science. The raw electrical current signal data from a typical nanopore sequencing experiment is large and complex. This can be stored in two alternative file formats that are presently supported: POD5 is a signal data file format used by default on instruments from Oxford Nanopore Technologies (ONT); SLOW5 is an open-source file format originally developed as an alternative to ONT’s previous file format, which was known as FAST5. The choice of format may have important implications for the cost, speed and simplicity of nanopore signal data analysis, management and storage. To inform this choice, we present a comparative evaluation of POD5 vs SLOW5. We conducted benchmarking experiments assessing file size, analysis performance and usability on a variety of different computer architectures. SLOW5 showed superior performance during sequential and non-sequential (random access) file reading on most systems, manifesting in faster, cheaper basecalling and other analysis, and we could find no instance in which POD5 file reading was significantly faster than SLOW5. We demonstrate that SLOW5 file writing is highly parallelisable, thereby meeting the demands of data acquisition on ONT instruments. Our analysis also identified differences in the complexity and stability of the software libraries for SLOW5 (slow5lib) and POD5 (pod5), including a large discrepancy in the number of underlying software dependencies, which may complicate the pod5 compilation process. In summary, many of the advantages originally conceived for SLOW5 remain relevant today, despite the replacement of FAST5 with POD5 as ONT’s core file format.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf118), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Wouter De Coster

      The authors describe the SLOW5 format and its benefits compared to the standard POD5 format for storing raw sequencing data from nanopore sequencers. The paper is well written and easy to understand. The advantages of the SLOW5 format are clear, and the comparison is adequately executed and described. However, the developers seem unable to persuade others to adopt it widely, and change might need to come from ONT themselves, who may be most concerned about disrupting their existing workflows, especially for parallel writing during sequencing. Nevertheless, the authors seem to have also addressed that issue, as demonstrated with a simulation experiment.

      Please find my specific suggestions below.

      Sincerely, Wouter De Coster

      Major: While I understand that the software name SLOW5 was an initial variation of the FAST5 format, I don't think that the words 'slow' or the number '5' are particularly appropriate descriptions or helpful in making a case for using the file format, as it is neither slow nor related to HDF5. However, once a name is chosen, I understand the reluctance to change it. Additionally, it seems the evaluations are conducted using the binary BLOW5 format. Wouldn't it then make more sense to emphasize BLOW5 in the text and title?

      Minor: I would italicize the 'make' tool for users unfamiliar with build tools in the Usability section, as it is a rather strange sentence if reading 'make' as a verb, not a tool. Perhaps the same could be applied to other dependencies in that section for consistency. Then again, the primary target audience will probably understand what 'make' means in this context.

      There is a typo in the benchmarking procedure section: 'confoudning'.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      This study investigates the sex determination mechanism in the clonal ant Ooceraea biroi, focusing on a candidate complementary sex determination (CSD) locus, one of the key mechanisms supporting haplodiploid sex determination in hymenopteran insects. Using whole genome sequencing, the authors analyze diploid females and the rarely occurring diploid males of O. biroi, identifying a 46 kb candidate region that is consistently heterozygous in females and predominantly homozygous in diploid males. This region shows elevated genetic diversity, as expected under balancing selection. The study also reports the presence of an lncRNA near this heterozygous region, which, though only distantly related in sequence, resembles the ANTSR lncRNA involved in female development in the Argentine ant, Linepithema humile (Pan et al. 2024). Together, these findings suggest a potentially conserved sex determination mechanism across ant species. However, while the analyses are well conducted and the paper is clearly written, the insights are largely incremental. The central conclusion - that the sex determination locus is conserved in ants - was already proposed and experimentally supported by Pan et al. (2024), who included O. biroi among the studied species and validated the locus's functional role in the Argentine ant. The present study thus largely reiterates existing findings without providing novel conceptual or experimental advances.

      Although it is true that Pan et al., 2024 demonstrated (in Figure 4 of their paper) that the synteny of the region flanking ANTSR is conserved across aculeate Hymenoptera (including O. biroi), Reviewer 1’s claim that that paper provides experimental support for the hypothesis that the sex determination locus is conserved in ants is inaccurate. Pan et al., 2024 only performed experimental work in a single ant species (Linepithema humile) and merely compared reference genomes of multiple species to show synteny of the region, rather than functionally mapping or characterizing these regions.

      Other comments:

      The mapping is based on a very small sample size: 19 females and 16 diploid males, and these all derive from a single clonal line. This implies a rather high probability for false-positive inference. In combination with the fact that only 11 out of the 16 genotyped males are actually homozygous at the candidate locus, I think a more careful interpretation regarding the role of the mapped region in sex determination would be appropriate. The main argument supporting the role of the candidate region in sex determination is based on the putative homology with the lncRNA involved in sex determination in the Argentine ant, but this argument was made in a previous study (as mentioned above).

      Our main argument supporting the role of the candidate region in sex determination is not based on putative homology with the lncRNA in L. humile. Instead, our main argument comes from our genetic mapping (in Fig. 2), and the elevated nucleotide diversity within the identified region (Fig. 4). Additionally, we highlight that multiple genes within our mapped region are homologous to those in mapped sex determining regions in both L. humile and Vollenhovia emeryi, possibly including the lncRNA.

      In response to the Reviewer’s assertion that the mapping is based on a small sample size from a single clonal line, we want to highlight that we used all diploid males available to us. Although the primary shortcoming of a small sample size is to increase the probability of a false negative, small sample sizes can also produce false positives. We used two approaches to explore the statistical robustness of our conclusions. First, we generated a null distribution by randomly shuffling sex labels within colonies and calculating the probability of observing our CSD index values by chance (shown in Fig. 2). Second, we directly tested the association between homozygosity and sex using Fisher’s Exact Test (shown in Supplementary Fig. S2). In both cases, the association of the candidate locus with sex was statistically significant after multiple-testing correction using the Benjamini-Hochberg False Discovery Rate. These approaches are clearly described in the “CSD Index Mapping” section of the Methods.
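
      As a rough illustration of the second approach described above, the following Python sketch runs a two-sided Fisher's Exact Test on a single 2x2 table and applies a Benjamini-Hochberg adjustment to a list of p-values. The table is assembled from the counts quoted in the review (19 females, 16 diploid males, 11 of them homozygous) under the assumption that all 19 females are heterozygous at the locus; the function names and the example p-values are hypothetical, and this is not the authors' actual genome-wide analysis.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Rows: diploid males, females. Columns: homozygous, heterozygous.
# Assumption for illustration: all 19 females are heterozygous.
p_locus = fisher_exact_two_sided(11, 5, 0, 19)

# Hypothetical p-values from several tested loci, adjusted together.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

      In a genome-wide scan, one such test would be run per locus or window and all resulting p-values would be adjusted together, which is why the correction step matters.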

      We also note that, because complementary sex determination loci are expected to evolve under balancing selection, our finding that the mapped region exhibits a peak of nucleotide diversity lends orthogonal support to the notion that the mapped locus is indeed a complementary sex determination locus.
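
      For reference, nucleotide diversity is commonly defined as the average number of pairwise differences per site among a set of aligned sequences. The Python sketch below computes it for a toy alignment; the function name and the sequences are made up for illustration, and it is not the authors' method, whose details are not given in this response.

```python
from itertools import combinations


def nucleotide_diversity(seqs: list[str]) -> float:
    """Average pairwise differences per site among aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatched positions over every pair of sequences.
    diffs = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


# Toy alignment: three sequences, one variable site out of four.
toy = ["ACGT", "ACGA", "ACGT"]
pi = nucleotide_diversity(toy)
```

      Under balancing selection, alleles are maintained at intermediate frequencies for long periods, so windows around the selected site tend to show higher values of this statistic than the genomic background.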

      The fourth paragraph of the results and the sixth paragraph of the discussion are devoted to explaining the possible reasons why only 11/16 genotyped males are homozygous in the mapped region. The revised manuscript will include an additional sentence (in what will be lines 384-388) in this paragraph that includes the possible explanation that this locus is, in fact, a false positive, while also emphasizing that we find this possibility to be unlikely given our multiple lines of evidence.

      In response to Reviewer 1’s suggestion that we carefully interpret the role of the mapped region in sex determination, we highlight our careful wording choices, nearly always referring to the mapped locus as a “candidate sex determination locus” in the title and throughout the manuscript. For consistency, the revised manuscript version will change the second results subheading from “The O. biroi CSD locus is homologous to another ant sex determination locus but not to honeybee csd” to “O. biroi’s candidate CSD locus is homologous to another ant sex determination locus but not to honeybee csd,” and will add the word “candidate” in what will be line 320 at the beginning of the Discussion, and will change “putative” to “candidate” in what will be line 426 at the end of the Discussion.

      In the abstract, it is stated that CSD loci have been mapped in honeybees and two ant species, but we know little about their evolutionary history. But CSD candidate loci were also mapped in a wasp with multi-locus CSD (study cited in the introduction). This wasp is also parthenogenetic via central fusion automixis and produces diploid males. This is a very similar situation to the present study and should be referenced and discussed accordingly, particularly since the authors make the interesting suggestion that their ant also has multi-locus CSD and neither the wasp nor the ant has tra homologs in the CSD candidate regions. Also, is there any homology to the CSD candidate regions in the wasp species and the studied ant?

      In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of diploid males being produced via losses of heterozygosity during asexual reproduction, the revised manuscript will include (in what will be lines 123-126) the highlighted portion of the following sentence: “Therefore, if O. biroi uses CSD, diploid males might result from losses of heterozygosity at sex determination loci (Fig. 1C), similar to what is thought to occur in other asexual Hymenoptera that produce diploid males (Rabeling and Kronauer 2012; Matthey-Doret et al. 2019).”

      We note, however, that in their 2019 study, Matthey-Doret et al. did not directly test the hypothesis that diploid males result from losses of heterozygosity at CSD loci during asexual reproduction, because the diploid males they used for their mapping study came from inbred crosses in a sexual population of that species.

      We address this further below, but we want to emphasize that we do not intend to argue that O. biroi has multiple CSD loci. Instead, we suggest that the existence of additional, undetected CSD loci is one possible explanation for the absence of diploid males from any clonal line other than clonal line A. In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of multilocus CSD, the revised manuscript version will include the following additional sentence in the fifth paragraph of the discussion (in what will be lines 372-374): “Multi-locus CSD has been suggested to limit the extent of diploid male production in asexual species under some circumstances (Vorburger 2013; Matthey-Doret et al. 2019).”

      Regarding Reviewer 1’s question about homology between the putative CSD loci from the (Matthey-Doret et al. 2019) study and O. biroi, we note that there is no homology. The revised manuscript version will have an additional Supplementary Table (which will be the new Supplementary Table S3) that will report the results of this homology search. The revised manuscript will also include the following additional sentence in the Results, in what will be lines 172-174: “We found no homology between the genes within the O. biroi CSD index peak and any of the genes within the putative L. fabarum CSD loci (Supplementary Table S3).”

      The authors used different clonal lines of O. biroi to investigate whether heterozygosity at the mapped CSD locus is required for female development in all clonal lines of O. biroi (L187-196). However, given the described parthenogenesis mechanism in this species conserves heterozygosity, additional females that are heterozygous are not very informative here. Indeed, one would need diploid males in these other clonal lines as well (but such males have not yet been found) to make any inference regarding this locus in other lines.

      We agree that a full mapping study including diploid males from all clonal lines would be preferable, but as stated earlier in that same paragraph, we have only found diploid males from clonal line A. We stand behind our modest claim that “Females from all six clonal lines were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.” In the revised manuscript version, this sentence (in what will be lines 199-201) will be changed slightly in response to a reviewer comment below: “All females from all six clonal lines (including 26 diploid females from clonal line B) were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.”

      Reviewer #2 (Public review):

      The manuscript by Lacy et al. is well written, with a clear and compelling introduction that effectively conveys the significance of the study. The methods are appropriate and well-executed, and the results, both in the main text and supplementary materials, are presented in a clear and detailed manner. The authors interpret their findings with appropriate caution.

      This work makes a valuable contribution to our understanding of the evolution of complementary sex determination (CSD) in ants. In particular, it provides important evidence for the ancient origin of a non-coding locus implicated in sex determination, and shows that, remarkably, this sex locus is conserved even in an ant species with a non-canonical reproductive system that typically does not produce males. I found this to be an excellent and well-rounded study, carefully analyzed and well contextualized.

      That said, I do have a few minor comments, primarily concerning the discussion of the potential 'ghost' CSD locus. While the authors acknowledge (line 367) that they currently have no data to distinguish among the alternative hypotheses, I found the evidence for an additional CSD locus presented in the results (lines 261-302) somewhat limited and at times a bit difficult to follow. I wonder whether further clarification or supporting evidence could already be extracted from the existing data. Specifically:

      We agree with Reviewer 2 that the evidence for a second CSD locus is limited. In fact, we do not intend to advocate for there being a second locus, but we suggest that a second CSD locus is one possible explanation for the absence of diploid males outside of clonal line A. In our initial version, we intentionally conveyed this ambiguity by titling this section “O. biroi may have one or multiple sex determination loci.” However, we now see that this leads to undue emphasis on the possibility of a second locus. In the revised manuscript, we will split this into two separate sections: “Diploid male production differs across O. biroi clonal lines” and “O. biroi lacks a tra-containing CSD locus.”

      (1) Line 268: I doubt the relevance of comparing the proportion of diploid males among all males between lines A and B to infer the presence of additional CSD loci. Since the mechanisms producing these two types of males differ, it might be more appropriate to compare the proportion of diploid males among all diploid offspring. This ratio has been used in previous studies on CSD in Hymenoptera to estimate the number of sex loci (see, for example, Cook 1993, de Boer et al. 2008, 2012, Ma et al. 2013, and Chen et al., 2021). The exact method might not be applicable to clonal raider ants, but I think comparing the percentage of diploid males among the total number of (diploid) offspring produced between the two lineages might be a better argument for a difference in CSD loci number.

      We want to re-emphasize here that we do not wish to advocate for there being two CSD loci in O. biroi. Rather, we want to explain that this is one possible explanation for the apparent absence of diploid males outside of clonal line A. We hope that the modifications to the manuscript described in the previous response help to clarify this.

      Reviewer 2 is correct that comparing the number of diploid males to diploid females does not apply to clonal raider ants. This is because males are vanishingly rare among the vast numbers of females produced. We do not count how many females are produced in laboratory stock colonies, and males are sampled opportunistically. Therefore, we cannot report exact numbers. However, we will add the highlighted portion of the following sentence (in what will be lines 268-270) to the revised manuscript: “Despite the fact that we maintain more colonies of clonal line B than of clonal line A in the lab, all the diploid males we detected came from clonal line A.”

      (2) If line B indeed carries an additional CSD locus, one would expect that some females could be homozygous at the ANTSR locus but still viable, being heterozygous only at the other locus. Do the authors detect any females in line B that are homozygous at the ANTSR locus? If so, this would support the existence of an additional, functionally independent CSD locus.

      We thank the reviewer for this suggestion, and again we emphasize that we do not want to argue in favor of multiple CSD loci. We just want to introduce it as one possible explanation for the absence of diploid males outside of clonal line A.

      The 26 sequenced diploid females from clonal line B are all heterozygous at the mapped locus, and the revised manuscript will clarify this in what will be lines 199-201. Previously, only six of those diploid females were included in Supplementary Table S2, and that will be modified accordingly.

      (3) Line 281: The description of the two tra-containing CSD loci as "conserved" between Vollenhovia and the honey bee may be misleading. It suggests shared ancestry, whereas the honey bee csd gene is known to have arisen via a relatively recent gene duplication from fem/tra (10.1038/nature07052). It would be more accurate to refer to this similarity as a case of convergent evolution rather than conservation.

      In the sentence that Reviewer 2 refers to, we are representing the assertion made by Miyakawa and Mikheyev (2015). Regarding their mapping of a candidate CSD locus that contains two linked tra homologs, they write in the abstract: “these data support the prediction that the same CSD mechanism has indeed been conserved for over 100 million years.” In that same paper, Miyakawa and Mikheyev write in the discussion section: “As ants and bees diverged more than 100 million years ago, sex determination in honey bees and V. emeryi is probably homologous and has been conserved for at least this long.”

      As noted by Reviewer 2, this appears to conflict with a previously advanced hypothesis: because fem and csd were found in Apis mellifera, Apis cerana, and Apis dorsata, but only fem was found in Melipona compressipes, Bombus terrestris, and Nasonia vitripennis, the csd gene evolved after the honeybee (Apis) lineage diverged from other bees (Hasselmann et al. 2008). However, it remains possible that the csd gene evolved after ants and bees diverged from N. vitripennis, but before the divergence of ants and bees, and then was subsequently lost in B. terrestris and M. compressipes. This view was previously put forward based on bioinformatic identification of putative orthologs of csd and fem in bumblebees and in ants (Schmieder et al. 2012; see also Privman et al. 2013). However, subsequent work disagreed and argued that the duplications of tra found in ants and in bumblebees represented convergent evolution rather than homology (Koch et al. 2014). Distinguishing between these possibilities will be aided by additional sex determination locus mapping studies and functional dissection of the underlying molecular mechanisms in diverse Aculeata.

      Distinguishing between these competing hypotheses is beyond the scope of our paper, but the revised manuscript will include additional text to incorporate some of this nuance. We will include these modified lines below (in what will be lines 287-295), with the additions highlighted:

      “A second QTL region identified in V. emeryi (V.emeryiCsdQTL1) contains two closely linked tra homologs, similar to the closely linked honeybee tra homologs, csd and fem (Miyakawa and Mikheyev 2015). This, along with the discovery of duplicated tra homologs that undergo concerted evolution in bumblebees and ants (Schmieder et al. 2012; Privman et al. 2013) has led to the hypothesis that the function of tra homologs as CSD loci is conserved with the csd-containing region of honeybees (Schmieder et al. 2012; Miyakawa and Mikheyev 2015). However, other work has suggested that tra duplications occurred independently in honeybees, bumblebees, and ants (Hasselmann et al. 2008; Koch et al. 2014), and it remains to be demonstrated that either of these tra homologs acts as a primary CSD signal in V. emeryi.”

      (4) Finally, since the authors successfully identified multiple alleles of the first CSD locus using previously sequenced haploid males, I wonder whether they also observed comparable allelic diversity at the candidate second CSD locus. This would provide useful supporting evidence for its functional relevance.

      As is already addressed in the final paragraph of the results and in Supplementary Fig. S4, there is no peak of nucleotide diversity in any of the regions homologous to V.emeryiCsdQTL1, which is the tra-containing candidate sex determination locus (Miyakawa and Mikheyev 2015). In the revised manuscript, the relevant lines will be 307-310. We want to restate that we do not propose that there is a second candidate CSD locus in O. biroi, but we simply raise the possibility that multi-locus CSD *might* explain the absence of diploid males from clonal lines other than clonal line A (as one of several alternative possibilities).

      Overall, these are relatively minor points in the context of a strong manuscript, but I believe addressing them would improve the clarity and robustness of the authors' conclusions.

      Reviewer #3 (Public review):

      Summary:

      The sex determination mechanism governed by the complementary sex determination (CSD) locus is one of the mechanisms that support the haplodiploid sex determination system evolved in hymenopteran insects. While many ant species are believed to possess a CSD locus, it has only been specifically identified in two species. The authors analyzed diploid females and the rarely occurring diploid males of the clonal ant Ooceraea biroi and identified a 46 kb CSD candidate region that is consistently heterozygous in females and predominantly homozygous in males. This region was found to be homologous to the CSD locus reported in distantly related ants. In the Argentine ant, Linepithema humile, the CSD locus overlaps with an lncRNA (ANTSR) that is essential for female development and is associated with the heterozygous region (Pan et al. 2024). Similarly, an lncRNA is encoded near the heterozygous region within the CSD candidate region of O. biroi. Although this lncRNA shares low sequence similarity with ANTSR, its potential functional involvement in sex determination is suggested. Based on these findings, the authors propose that the heterozygous region and the adjacent lncRNA in O. biroi may trigger female development via a mechanism similar to that of L. humile. They further suggest that the molecular mechanisms of sex determination involving the CSD locus in ants have been highly conserved for approximately 112 million years. This study is one of the few to identify a CSD candidate region in ants and is particularly noteworthy as the first to do so in a parthenogenetic species.

      Strengths:

      (1) The CSD candidate region was found to be homologous to the CSD locus reported in distantly related ant species, enhancing the significance of the findings.

      (2) Identifying the CSD candidate region in a parthenogenetic species like O. biroi is a notable achievement and adds novelty to the research.

      Weaknesses:

      (1) Functional validation of the lncRNA's role is lacking, and further investigation through knockout or knockdown experiments is necessary to confirm its involvement in sex determination.

      See response below.

      (2) The claim that the lncRNA is essential for female development appears to reiterate findings already proposed by Pan et al. (2024), which may reduce the novelty of the study.

      We do not claim that the lncRNA is essential for female development in O. biroi, but simply mention the possibility that, as in L. humile, it is somehow involved in sex determination. We do not have any functional evidence for this, so this is purely based on its genomic position immediately adjacent to our mapped candidate region. We agree with the reviewer that the study by Pan et al. (2024) decreases the novelty of our findings. Another way of looking at this is that our study supports and bolsters previous findings by partially replicating the results in a different species.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      L307-308 should state homozygous for either allele in THE MAJORITY of diploid males.

      This will be fixed in the revised manuscript, in what will be line 321.

      Reviewer #3 (Recommendations for the authors):

      The association between heterozygosity in the CSD candidate region and female development in O. biroi, along with the high sequence homology of this region to CSD loci identified in two distantly related ant species, is not sufficient to fully address the evolution of the CSD locus and the mechanisms of sex determination.

      Given that functional genetic tools, such as genome editing, have already been established in O. biroi, I strongly recommend that the authors investigate the role of the lncRNA through knockout or knockdown experiments and assess its impact on the sex-specific splicing pattern of the downstream tra gene.

      Although knockout experiments of the lncRNA would be illuminating, the primary signal of complementary sex determination is heterozygosity. As is clearly stated in our manuscript and that of Pan et al. (2024), it does not appear to be heterozygosity within the lncRNA that induces female development, but rather heterozygosity in non-transcribed regions linked to the lncRNA. Therefore, future mechanistic studies of sex determination in O. biroi, L. humile, and other ants should explore how homozygosity or heterozygosity of this region impacts the sex determination cascade, rather than focusing (exclusively) on the lncRNA.

      With this in mind, we developed three sets of guide RNAs that cut only one allele within the mapped CSD locus, with the goal of producing deletions in the highly variable region of the mapped locus. This would lead to functional hemizygosity or homozygosity within this region, depending on how the cuts were repaired. We also developed several sets of PCR primers to assess the heterozygosity of the resultant animals. After injecting 1,162 eggs over several weeks and genotyping the hundreds of resultant animals with PCR, we confirmed that we could induce hemizygosity or homozygosity within this region, at least in ~1/20 of the injected embryos. Although it is possible to assess the sex-specificity of the splice isoform of tra as a proxy for sex determination phenotypes (as done by Pan et al. 2024), the ideal experiment would assess male phenotypic development at the pupal stage. Therefore, over several more weeks, we injected hundreds more eggs with these reagents and reared the injected embryos to the pupal stage. However, substantial mortality was observed, with only 12 injected eggs developing to the pupal stage. All of these were female, and none of them had been successfully mutated.

      In conclusion, we agree with the reviewer that functional experiments would be useful, and we made extensive attempts to conduct such experiments. However, these experiments turned out to be extremely challenging with the currently available protocols. Ultimately, we therefore decided to abandon these attempts.  

      We opted not to include these experiments in the paper itself because we cannot meaningfully interpret their results. However, we are pleased that, in this response letter, we can include a brief description for readers interested in attempting similar experiments.

      Since O. biroi reproduces parthenogenetically and most offspring develop into females, observing a shift from female- to male-specific splicing of tra upon early embryonic knockout of the lncRNA would provide much stronger evidence that this lncRNA is essential for female development. Without such functional validation, the authors' claim (lines 36-38) seems to reiterate findings already proposed by Pan et al. (2024) and, as such, lacks sufficient novelty.

      We have responded to the issue of “lack of novelty” above. But again, the actual CSD locus in both O. biroi and L. humile appears to be distinct from (but genetically linked to) the lncRNA, and we have no experimental evidence that the putative lncRNA in O. biroi is involved in sex determination at all. Because of this, and given the experimental challenges described above, we do not currently intend to pursue functional studies of the lncRNA.

      References

      Hasselmann M, Gempe T, Schiøtt M, Nunes-Silva CG, Otte M, Beye M. 2008. Evidence for the evolutionary nascence of a novel sex determination pathway in honeybees. Nature 454:519–522.

      Koch V, Nissen I, Schmitt BD, Beye M. 2014. Independent Evolutionary Origin of fem Paralogous Genes and Complementary Sex Determination in Hymenopteran Insects. PLOS ONE 9:e91883.

      Matthey-Doret C, van der Kooi CJ, Jeffries DL, Bast J, Dennis AB, Vorburger C, Schwander T. 2019. Mapping of multiple complementary sex determination loci in a parasitoid wasp. Genome Biology and Evolution 11:2954–2962.

      Miyakawa MO, Mikheyev AS. 2015. QTL mapping of sex determination loci supports an ancient pathway in ants and honey bees. PLOS Genetics 11:e1005656.

      Pan Q, Darras H, Keller L. 2024. LncRNA gene ANTSR coordinates complementary sex determination in the Argentine ant. Science Advances 10:eadp1532.

      Privman E, Wurm Y, Keller L. 2013. Duplication and concerted evolution in a master sex determiner under balancing selection. Proceedings of the Royal Society B: Biological Sciences 280:20122968.

      Rabeling C, Kronauer DJC. 2012. Thelytokous parthenogenesis in eusocial Hymenoptera. Annual Review of Entomology 58:273–292.

      Schmieder S, Colinet D, Poirié M. 2012. Tracing back the nascence of a new sex-determination pathway to the ancestor of bees and ants. Nature Communications 3:1–7.

      Vorburger C. 2013. Thelytoky and Sex Determination in the Hymenoptera: Mutual Constraints. Sexual Development 8:50–58.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper describes a number of patterns of epistasis in a large fitness landscape dataset recently published by Papkou et al. The paper is motivated by an important goal in the field of evolutionary biology to understand the statistical structure of epistasis in protein fitness landscapes, and it capitalizes on the unique opportunities presented by this new dataset to address this problem. 

      The paper reports some interesting previously unobserved patterns that may have implications for our understanding of fitness landscapes and protein evolution. In particular, Figure 5 is very intriguing. However, I have two major concerns detailed below. First, I found the paper rather descriptive (it makes little attempt to gain deeper insights into the origins of the observed patterns) and unfocused (it reports what appears to be a disjointed collection of various statistics without a clear narrative). Second, I have concerns with the statistical rigor of the work.

      (1) I think Figures 5 and 7 are the main, most interesting, and novel results of the paper. However, I don't think that the statement "Only a small fraction of mutations exhibit global epistasis" accurately describes what we see in Figure 5. To me, the most striking feature of this figure is that the effects of most mutations at all sites appear to be a mixture of three patterns. The most interesting pattern noted by the authors is of course the "strong" global epistasis, i.e., when the effect of a mutation is highly negatively correlated with the fitness of the background genotype. The second pattern is a "weak" global epistasis, where the correlation with background fitness is much weaker or non-existent. The third pattern is the vertically spread-out cluster at low-fitness backgrounds, i.e., a mutation has a wide range of mostly positive effects that are clearly not correlated with fitness. What is very interesting to me is that all background genotypes fall into these three groups with respect to almost every mutation, but the proportions of the three groups are different for different mutations. In contrast to the authors' statement, it seems to me that almost all mutations display strong global epistasis in at least a subset of backgrounds. A clear example is C>A mutation at site 3. 

      (1a) I think the authors ought to try to dissect these patterns and investigate them separately rather than lumping them all together and declaring that global epistasis is rare. For example, I would like to know whether those backgrounds in which mutations exhibit strong global epistasis are the same for all mutations or whether they are mutation- or perhaps position-specific. Both answers could be potentially very interesting, either pointing to some specific site-site interactions or, alternatively, suggesting that the statistical patterns are conserved despite variation in the underlying interactions.

      (1b) Another rather remarkable feature of this plot is that the slopes of the strong global epistasis patterns seem to be very similar across mutations. Is this the case? Is there anything special about this slope? For example, does this slope simply reflect the fact that a given mutation becomes essentially lethal (i.e., produces the same minimal fitness) in a certain set of background genotypes? 

      (1c) Finally, how consistent are these patterns with some null expectations? Specifically, would one expect the same distribution of global epistasis slopes on an uncorrelated landscape? Are the pivot points unusually clustered relative to an expectation on an uncorrelated landscape? 

      (1d) The shapes of the DFE shown in Figure 7 are also quite interesting, particularly the bimodal nature of the DFE in high-fitness (HF) backgrounds. I think this bimodality must be a reflection of the clustering of mutation-background combinations mentioned above. I think the authors ought to draw this connection explicitly. Do all HF backgrounds have a bimodal DFE? What mutations occupy the "moving" peak? 

      (1e) In several figures, the authors compare the patterns for HF and low-fitness (LF) genotypes. In some cases, there are some stark differences between these two groups, most notably in the shape of the DFE (Figure 7B, C). But there is no discussion about what could underlie these differences. Why are the statistics of epistasis different for HF and LF genotypes? Can the authors at least speculate about possible reasons? Why do HF and LF genotypes have qualitatively different DFEs? I actually don't quite understand why the transition between bimodal DFE in Figure 7B and unimodal DFE in Figure 7C is so abrupt. Is there something biologically special about the threshold that separates LF and HF genotypes? My understanding was that this was just a statistical cutoff. Perhaps the authors can plot the DFEs for all backgrounds on the same plot and just draw a line that separates HF and LF backgrounds so that the reader can better see whether the DFE shape changes gradually or abruptly.

      (1f) The analysis of the synonymous mutations is also interesting. However, I think a few additional analyses are necessary to clarify what is happening here. I would like to know the extent to which synonymous mutations are more often neutral compared to non-synonymous ones. Then, do synonymous pairs interact in the same way as non-synonymous pairs (i.e., plot Figure 1 for synonymous pairs)? Do synonymous or non-synonymous mutations that are neutral exhibit less epistasis than non-neutral ones? Finally, do non-synonymous mutations alter epistasis among other mutations more often than synonymous mutations do? What about synonymous-neutral versus synonymous-non-neutral? Basically, I'd like to understand the extent to which a mutation that is neutral in a given background is more or less likely to alter epistasis between other mutations than a non-neutral mutation in the same background.

      (2) I have two related methodological concerns. First, in several analyses, the authors employ thresholds that appear to be arbitrary. And second, I did not see any account of measurement errors. For example, the authors chose the 0.05 threshold to distinguish between epistasis and no epistasis, but why this particular threshold was chosen is not justified. Another example: whether the product s12 × (s1 + s2) is greater or smaller than zero for any given mutation is uncertain due to measurement errors. Presumably, how to classify each pair of mutations should depend on the precision with which the fitness of mutants is measured. These thresholds could well be different across mutants. We know, for example, that low-fitness mutants typically have noisier fitness estimates than high-fitness mutants. I think the authors should use a statistically rigorous procedure to categorize mutations and their epistatic interactions. I think it is very important to address this issue. I got very concerned about it when I saw on LL 383-388 that synonymous stop codon mutations appear to modulate epistasis among other mutations. This seems very strange to me and makes me quite worried that this is a result of noise in LF genotypes.

      Thank you for your review of the manuscript. In the revised version, we have addressed both major criticisms, as detailed below.

      When carefully examining the plots in Figure 5 independently, we indeed observe that the fitness effect of a mutation on different genetic backgrounds can be classified into three characteristic patterns. Our reasoning for these patterns is as follows:

      Strong correlation: Typically observed when the mutation is lethal across backgrounds. Linear regression of mutations exhibiting strong global epistasis shows slopes close to −1 and pivot points near −0.7 (Table S4). Since the reported fitness threshold is −0.508, these mutations push otherwise functional backgrounds into the non-functional range, consistent with lethal effects.

      Weak correlation: Observed when a mutation has no significant effect on fitness across backgrounds, consistent with neutrality.

      No correlation: Out of the 261,333 reported variants, 243,303 (93%) lie below the fitness threshold of −0.508, indicating that the low-fitness region is densely populated by nonfunctional variants. The “strong correlation” and “weak correlation” lines intersect in this zone. Most mutations in this region have little effect (neutral), but occasional abrupt fitness increases correspond to “resurrecting” mutations, the converse of lethal changes. For example, mutations such as X→G at locus 4 or X→A at locus 5 restore function, while the reverse changes (e.g. C→A at locus 3) are lethal.

      Thus, the “no-correlation” pattern is largely explained by mutations that reverse the effect of lethal changes, effectively resurrecting non-functional variants. In the revised manuscript, we highlight these nuances within the broader classification of fitness effect versus background fitness (pp. 10–13).
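
      The link between a slope near −1, a pivot point near −0.7, and lethality can be illustrated with a small sketch. This is not the authors' analysis code. It assumes that the pivot point is the background fitness at which the fitted line crosses zero effect, and it uses made-up fitness values; the function name fit_global_epistasis is hypothetical. When the slope is −1, background fitness plus mutation effect equals the pivot for every background, so a pivot of −0.7 places every mutant below the −0.508 threshold.

```python
import numpy as np

FITNESS_THRESHOLD = -0.508  # functional cutoff quoted in the response


def fit_global_epistasis(background_fitness, mutation_effect):
    """Fit effect = slope * background + intercept by least squares.

    Returns (slope, pivot). The pivot is taken to be the background fitness
    at which the fitted effect is zero (an assumed definition).
    """
    slope, intercept = np.polyfit(background_fitness, mutation_effect, 1)
    pivot = -intercept / slope
    return slope, pivot


# Hypothetical data: a mutation that sends every background to fitness -0.7.
backgrounds = np.array([-0.4, -0.3, -0.2, -0.1, 0.0])
effects = -0.7 - backgrounds  # effect = mutant fitness - background fitness

slope, pivot = fit_global_epistasis(backgrounds, effects)
mutant_fitness = backgrounds + effects  # constant at the pivot when slope is -1
```

      In this toy case every functional background (fitness above −0.508) ends up non-functional after the mutation, which is the behaviour the response attributes to the strong-correlation class.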

      Additional analyses included in the revision:

      Synonymous vs. non-synonymous pairs: We repeated the Figure 1 analysis for synonymous–synonymous pairs. As expected, synonymous pairs exhibit lower overall frequencies of epistasis, consistent with their greater neutrality. However, the qualitative spectrum remains similar: positive and negative epistasis dominate, while sign epistasis is rare (Supplementary Figs. S6–S7, S9–S10).

      Fitness effect vs. epistasis change: We tested whether the mean fitness effect of a mutation correlates with the percent of cases in which it changes the nature of epistasis. No correlation was found (R² ≈ 0.11), and this analysis is now included in the revised manuscript.

      Epistasis-modulating ability: Non-synonymous mutations more frequently alter the interactions between other mutations than synonymous substitutions. Within synonymous substitutions, the subset with measurable fitness effects disproportionately contributes to epistasis modulation. Thus, the ability of synonymous substitutions to modulate epistasis arises primarily from the non-neutral subset.

      These analyses clarify the role of synonymous mutations in reshaping epistasis on the folA landscape.

      Revision of statistical treatment of epistasis:

      In our original submission, we used an arbitrary threshold of 0.05 to classify the presence or absence of epistasis, following Papkou et al., who based conclusions on a single experimental replicate. However, as the reviewer correctly noted, this does not adequately account for measurement variability across different genotypes.

      In the revised manuscript, we adopt a statistically rigorous framework that incorporates replicate-based error directly. Specifically, we now use the mean fitness across six independent replicates, together with the corresponding standard deviation, to classify fitness peaks and epistasis. This eliminates arbitrary thresholds and ensures that epistatic classifications reflect the precision of measurements for each genotype.

      This revision led to both quantitative and qualitative changes:

      For high-fitness genotypes, the core patterns of higher-order (“fluid”) epistasis remain robust (Figures 2–3).

      For low-fitness genotypes, incorporating replicate-based error removed spurious fluidity effects, yielding a more accurate characterization of epistasis (Figures 2–3; Supplementary Figs. S6–S7, S9–S10).

      We describe these methodological changes in detail in the revised Methods section and provide updated code.
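
      To illustrate what a replicate-based classification can look like, the sketch below propagates replicate error through the double mutant cycle. It is an illustration under stated assumptions, not the authors' updated code: the response does not spell out the decision rule, so the use of the standard error of the mean, the cutoff z = 2, the assumption of independent errors, and all names and fitness values are hypothetical.

```python
import numpy as np


def epistasis_with_error(ref, a, b, ab):
    """Return mean epistasis and its propagated standard error.

    Each argument is a sequence of replicate fitness values for one genotype.
    Epistasis is ab - a - b + ref, so independent errors from the four
    genotypes add in quadrature.
    """
    reps = [np.asarray(x, dtype=float) for x in (ref, a, b, ab)]
    means = [r.mean() for r in reps]
    sems = [r.std(ddof=1) / np.sqrt(r.size) for r in reps]
    eps = means[3] - means[1] - means[2] + means[0]
    se = float(np.sqrt(sum(s ** 2 for s in sems)))
    return float(eps), se


def classify(eps, se, z=2.0):
    """Call epistasis only when it exceeds z propagated standard errors."""
    if abs(eps) <= z * se:
        return "none"
    return "positive" if eps > 0 else "negative"


# Hypothetical six-replicate measurements (same noise pattern reused for brevity).
noise = np.array([0.0, 0.02, -0.02, 0.01, -0.01, 0.0])
ref, a, b = noise, noise - 0.1, noise - 0.2

strong = classify(*epistasis_with_error(ref, a, b, noise - 0.10))
weak = classify(*epistasis_with_error(ref, a, b, noise - 0.28))
```

      With these values the first double mutant is classified as positive epistasis, while the second, whose deviation of about 0.02 is smaller than twice the propagated error, is classified as no epistasis. Comparing epistasis between two backgrounds involves eight genotypes, so the error on that difference is larger still.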

      Together, these revisions directly address the reviewer’s concerns. They improve the statistical rigor of our analysis, strengthen the robustness of our conclusions, and underscore the importance of accounting for measurement error in large-scale fitness landscape studies—a point we now emphasize in the manuscript.

      Reviewer #2 (Public review): 

      Significance: 

      This paper reanalyzes an experimental fitness landscape generated by Papkou et al., who assayed the fitness of all possible combinations of 4 nucleotide states at 9 sites in the E. coli DHFR gene, which confers antibiotic resistance. The 9 nucleotide sites make up 3 amino acid sites in the protein, of which one was shown to be the primary determinant of fitness by Papkou et al. This paper sought to assess whether pairwise epistatic interactions differ among genetic backgrounds at other sites and whether there are major patterns in any such differences. They use a "double mutant cycle" approach to quantify pairwise epistasis, where the epistatic interaction between two mutations is the difference between the measured fitness of the double-mutant and its predicted fitness in the absence of epistasis (which equals the sum of individual effects of each mutation observed in the single mutants relative to the reference genotype). The paper claims that epistasis is "fluid," because pairwise epistatic effects often differ depending on the genetic state at the other site. It also claims that this fluidity is "binary," because pairwise effects depend strongly on the state at nucleotide positions 5 and 6 but weakly on those at other sites. Finally, they compare the distribution of fitness effects (DFE) of single mutations for starting genotypes with similar fitness and find that despite the apparent "fluidity" of interactions this distribution is well-predicted by the fitness of the starting genotype.
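
      A minimal sketch of the double mutant cycle calculation summarized above may help. The function name and all fitness values are hypothetical; the code illustrates the definition only and does not reproduce the paper's pipeline.

```python
def pairwise_epistasis(f_ref, f_a, f_b, f_ab):
    """Double-mutant-cycle epistasis for mutations a and b.

    The predicted double-mutant fitness without epistasis is the reference
    fitness plus the two single-mutant effects; epistasis is the observed
    double-mutant fitness minus that prediction.
    """
    effect_a = f_a - f_ref
    effect_b = f_b - f_ref
    predicted_ab = f_ref + effect_a + effect_b
    return f_ab - predicted_ab


# The same pair of mutations measured in two hypothetical genetic backgrounds.
eps_bg1 = pairwise_epistasis(f_ref=0.0, f_a=-0.1, f_b=-0.2, f_ab=-0.1)
eps_bg2 = pairwise_epistasis(f_ref=-0.3, f_a=-0.4, f_b=-0.5, f_ab=-0.8)

# A sign change across backgrounds is what the paper would call "fluid".
sign_changes = (eps_bg1 > 0) != (eps_bg2 > 0)
```

      In this toy case the pair shows epistasis of about +0.2 in the first background and about −0.2 in the second, so the interaction changes sign between backgrounds.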

      The paper addresses an important question for genetics and evolution: how complex and unpredictable are the effects and interactions among mutations in a protein? Epistasis can make the phenotype hard to predict from the genotype and also affect the evolutionary navigability of a genotype landscape. Whether pairwise epistatic interactions depend on genetic background - that is, whether there are important high-order interactions -- is important because interactions of order greater than pairwise would make phenotypes especially idiosyncratic and difficult to predict from the genotype (or by extrapolating from experimentally measured phenotypes of genotypes randomly sampled from the huge space of possible genotypes). Another interesting question is the sparsity of such high-order interactions: if they exist but mostly depend on a small number of identifiable sequence sites in the background, then this would drastically reduce the complexity and idiosyncrasy relative to a landscape on which "fluidity" involves interactions among groups of all sites in the protein. A number of papers in the recent literature have addressed the topics of high-order epistasis and sparsity and have come to conflicting conclusions. This paper contributes to that body of literature with a case study of one published experimental dataset of high quality. The findings are therefore potentially significant if convincingly supported. 

      Validity: 

      In my judgment, the major conclusions of this paper are not well supported by the data. There are three major problems with the analysis. 

      (1) Lack of statistical tests. The authors conclude that pairwise interactions differ among backgrounds, but no statistical analysis is provided to establish that the observed differences are statistically significant, rather than being attributable to error and noise in the assay measurements. It has been established previously that the methods the authors use to estimate high-order interactions can result in inflated inferences of epistasis because of the propagation of measurement noise (see PMID 31527666 and 39261454). Error propagation can be extreme because first-order mutation effects are calculated as the difference between the measured phenotype of a single-mutant variant and the reference genotype; pairwise effects are then calculated as the difference between the measured phenotype of a double mutant and the sum of the differences described above for the single mutants. This paper claims fluidity when this latter difference itself differs when assessed in two different backgrounds. At each step of these calculations, measurement noise propagates. Because no statistical analysis is provided to evaluate whether these observed differences are greater than expected because of propagated error, the paper has not convincingly established or quantified "fluidity" in epistatic effects. 
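
      To make the error-propagation argument concrete, here is a rough sketch under two simplifying assumptions that are not taken from the paper: every genotype is measured with the same independent standard deviation `sigma`, and the eight genotypes in the two cycles are distinct. The value of `sigma` is hypothetical.

```python
import math


def propagated_sd(sds):
    # SD of a signed sum of independent measurements:
    # the square root of the summed variances.
    return math.sqrt(sum(s * s for s in sds))


sigma = 0.02  # hypothetical per-genotype measurement SD

sd_single_effect = propagated_sd([sigma] * 2)  # f_a - f_ref
sd_pairwise = propagated_sd([sigma] * 4)       # f_ab - f_a - f_b + f_ref
sd_between_bg = propagated_sd([sigma] * 8)     # epsilon_bg2 - epsilon_bg1

# How large is a hypothetical 0.02 difference in epistasis between two
# backgrounds relative to its propagated noise?
z_score = 0.02 / sd_between_bg

print(sd_single_effect, sd_pairwise, sd_between_bg, z_score)
```

      Under these assumptions the noise on a between-background difference is about 2.8 times the noise on a single measurement, which is why a formal test is needed before calling such a difference real.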

      (2) Arbitrary cutoffs. Many of the analyses involve assigning pairwise interactions into discrete categories, based on the magnitude and direction of the difference between the predicted and observed phenotypes for a pairwise mutant. For example, the authors categorize as a positive pairwise interaction if the apparent deviation of phenotype from prediction is >0.05, negative if the deviation is <-0.05, and no interaction if the deviation is between these cutoffs. Fluidity is diagnosed when the category for a pairwise interaction differs among backgrounds. These cutoffs are essentially arbitrary, and the effects are assigned to categories without assessing statistical significance. For example, an interaction of 0.06 in one background and 0.04 in another would be classified as fluid, but it is very plausible that such a difference would arise due to error alone. The frequency of epistatic interactions in each category as claimed in the paper, as well as the extent of fluidity across backgrounds, could therefore be systematically overestimated or underestimated, affecting the major conclusions of the study. 

      (3) Global nonlinearities. The analyses do not consider the fact that apparent fluidity could be attributable to the fact that fitness measurements are bounded by a minimum (the fitness of cells carrying proteins in which DHFR is essentially nonfunctional) and a maximum (the fitness of cells in which some biological factor other than DHFR function is limiting for fitness). The data are clearly bounded; the original Papkou et al. paper states that 93% of genotypes are at the low-fitness limit at which deleterious effects no longer influence fitness. Because of this bounding, mutations that are strongly deleterious to DHFR function will therefore have an apparently smaller effect when introduced in combination with other deleterious mutations, leading to apparent epistatic interactions; moreover, these apparent interactions will have different magnitudes if they are introduced into backgrounds that themselves differ in DHFR function/fitness, leading to apparent "fluidity" of these interactions. This is a well-established issue in the literature (see PMIDs 30037990, 28100592, 39261454). It is therefore important to adjust for these global nonlinearities before assessing interactions, but the authors have not done this. 
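
      The bounding argument can be illustrated with a toy model. The sketch below assumes, purely for illustration, that mutations combine additively on an unobserved latent scale and that the assay reports this value clipped to a lower and an upper bound. The bounds, effect sizes and function names are all hypothetical.

```python
def observed_fitness(latent, lower=0.0, upper=1.0):
    # Hypothetical bounded readout: the assay cannot report values
    # outside the interval [lower, upper].
    return max(lower, min(upper, latent))


def apparent_epistasis(background, effect_a, effect_b):
    # Mutations are strictly additive on the latent scale;
    # epistasis is computed from the bounded, observed values.
    f_ref = observed_fitness(background)
    f_a = observed_fitness(background + effect_a)
    f_b = observed_fitness(background + effect_b)
    f_ab = observed_fitness(background + effect_a + effect_b)
    return f_ab - f_a - f_b + f_ref


# The same two deleterious mutations introduced into three backgrounds.
for background in (1.0, 0.6, 0.2):
    print(background, round(apparent_epistasis(background, -0.5, -0.5), 3))
```

      With these values the apparent pairwise epistasis is 0, 0.4 and 0.2 in the three backgrounds, even though the latent model contains no specific interactions at all. The bound alone generates both epistasis and its background-dependence.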

      This global nonlinearity could explain much of the fluidity claimed in this paper. It could explain the observation that epistasis does not seem to depend as much on genetic background for low-fitness backgrounds, and the latter is constant (Figure 2B and 2C): these patterns would arise simply because the effects of deleterious mutations are all epistatically masked in backgrounds that are already near the fitness minimum. It would also explain the observations in Figure 7. For background genotypes with relatively high fitness, there are two distinct peaks of fitness effects, which likely correspond to neutral mutations and deleterious mutations that bring fitness to the lower bound of measurement; as the fitness of the background declines, the deleterious mutations have a smaller effect, so the two peaks draw closer to each other, and in the lowest-fitness backgrounds, they collapse into a single unimodal distribution in which all mutations are approximately neutral (with the distribution reflecting only noise). Global nonlinearity could also explain the apparent "binary" nature of epistasis. Sites 4 and 5 change the second amino acid, and the Papkou paper shows that only 3 amino acid states (C, D, and E) are compatible with function; all others abolish function and yield lower-bound fitness, while mutations at other sites have much weaker effects. The apparent binary nature of epistasis in Figure 5 corresponds to these effects given the nonlinearity of the fitness assay. Most mutations are close to neutral irrespective of the fitness of the background into which they are introduced: these are the "non-epistatic" mutations in the binary scheme. For the mutations at sites 4 and 5 that abolish one of the beneficial mutations, however, these have a strong background-dependence: they are very deleterious when introduced into a high-fitness background but their impact shrinks as they are introduced into backgrounds with progressively lower fitness. 
      The apparent "binary" nature of global epistasis is likely to be a simple artifact of bounding and the bimodal distribution of functional effects: neutral mutations are insensitive to background, while the magnitude of the fitness effect of deleterious mutations declines with background fitness because they are masked by the lower bound. The authors' statement is that "global epistasis often does not hold." This is not established. A more plausible conclusion is that global epistasis imposed by the phenotype limits affects all mutations, but it does so in a nonlinear fashion. 

      In conclusion, most of the major claims in the paper could be artifactual. Much of the claimed pairwise epistasis could be caused by measurement noise, the use of arbitrary cutoffs, and the lack of adjustment for global nonlinearity. Much of the fluidity or higher-order epistasis could be attributable to the same issues. And the apparently binary nature of global epistasis is also the expected result of this nonlinearity. 

      We thank the reviewer for raising this important concern. We fully agree that the use of arbitrary thresholds in the earlier version of the manuscript, together with the lack of an explicit treatment of measurement error, could compromise the rigor of our conclusions. To address this, we have undertaken a thorough re-analysis of the folA landscape.

      (1)  Incorporating measurement error and avoiding noise-driven artifacts

      In the original version, we followed Papkou et al. in using a single experimental replicate and applying fixed thresholds to classify epistasis. As the reviewer correctly notes, this approach allows noise to propagate from single-mutant measurements to double-mutant effects, and ultimately to higher-order epistasis.

      In the revised analysis, we now:

      Use the mean fitness across all six independent replicates for each genotype.

      Incorporate the corresponding standard deviation as a measure of experimental error.

      Classify epistatic interactions only when differences between a genotype and its neighbors exceed combined error margins, rather than using a fixed cutoff.

      This ensures that observed changes in epistasis are statistically distinguishable from noise. Details are provided in the revised Methods section and updated code.
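
      The three steps above can be sketched as follows. This is an illustration rather than the submitted analysis code: the response does not spell out how the "combined error margin" is formed, so the sketch assumes it is the square root of the summed replicate variances of the four genotypes in the cycle, scaled by a hypothetical factor `k`. The replicate values are invented.

```python
import math
import statistics


def mean_sd(replicates):
    # Mean and sample standard deviation across replicate measurements.
    return statistics.mean(replicates), statistics.stdev(replicates)


def classify_epistasis(ref, a, b, ab, k=1.0):
    # Each argument is a list of replicate fitness values for one genotype.
    (m0, s0), (ma, sa), (mb, sb), (mab, sab) = map(mean_sd, (ref, a, b, ab))
    epsilon = mab - ma - mb + m0
    # Assumed combined margin: propagated SD of the four means' spreads.
    margin = k * math.sqrt(s0 ** 2 + sa ** 2 + sb ** 2 + sab ** 2)
    if epsilon > margin:
        return "positive"
    if epsilon < -margin:
        return "negative"
    return "none"


# Six hypothetical replicates scattered around a chosen centre.
offsets = [0.0, 0.01, -0.01, 0.0, 0.02, -0.02]


def reps(centre):
    return [centre + d for d in offsets]


print(classify_epistasis(reps(1.0), reps(0.9), reps(0.8), reps(0.90)))
print(classify_epistasis(reps(1.0), reps(0.9), reps(0.8), reps(0.71)))
print(classify_epistasis(reps(1.0), reps(0.9), reps(0.8), reps(0.60)))
```

      Because the margin is computed per cycle, the effective threshold widens automatically for noisy genotypes and narrows for precisely measured ones.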

      (2) Replacing arbitrary thresholds with error-based criteria

      Previously, we used an arbitrary ±0.05 cutoff to define the presence/absence of epistasis. As the reviewer notes, this could misclassify interactions (e.g. labeling an effect as “fluid” when the difference lies within error). In the revised framework, these thresholds have been eliminated. Instead, interactions are classified based on whether their distributions overlap within replicate variance.

      This approach scales naturally with measurement precision, which differs between high-fitness and low-fitness genotypes, and removes the need for a universal cutoff.

      (3) Consequences of re-analysis

      Implementing this revised framework produced several important updates:

      High-fitness backgrounds: The qualitative picture of higher-order (“fluid”) epistasis remains robust. The patterns reported originally are preserved.

      Low-fitness backgrounds: Accounting for replicate variance revealed that part of the previously inferred “fluidity” arose from noise. These spurious effects are now removed, giving a more conservative but more accurate view of epistasis in non-functional regions.

      Fitness peaks: Our replicate-aware analysis identifies 127 peaks, compared to 514 in Papkou et al. Importantly, all 127 peaks occur in functional regions of the landscape. This difference highlights the importance of replicate-based error treatment: relying on a single run without demonstrating repeatability can yield artifacts.

      (4) Addressing bounding effects and terminology

      We also agree with the reviewer that bounding effects, arising from the biological limits of fitness, can create apparent nonlinearities in the genotype–phenotype map. To clarify this, we made the following changes:

      Terminology: We now use the term higher-order epistasis instead of fluid epistasis, emphasizing that the observed background-dependence involves more than two mutations and cannot be explained by global nonlinearities alone.

      We also clarify the definitions of sign-epistasis used in this work.

      By replacing arbitrary cutoffs with replicate-based error estimates and by explicitly considering bounding effects, we have substantially increased the rigor of our analysis. While this reanalysis led to both quantitative and qualitative changes in some regions, the central conclusion remains unchanged: higher-order epistasis is pervasive in the folA landscape, especially in functional backgrounds.

      All analysis scripts and code are provided as Supplementary Material.

      Reviewer #3 (Public review): 

      Summary: 

      The authors have studied a previously published large dataset on the fitness landscape of a 9 base-pair region of the folA gene. The objective of the paper is to understand various aspects of epistasis in this system, which the authors have achieved through detailed and computationally expensive exploration of the landscape. The authors describe epistasis in this system as "fluid", meaning that it depends sensitively on the genetic background, thereby reducing the predictability of evolution at the genetic level. However, the study also finds two robust patterns. The first is the existence of a "pivot point" for a majority of mutations, which is a fixed growth rate at which the effect of mutations switches from beneficial to deleterious (consistent with a previous study on the topic). The second is the observation that the distribution of fitness effects (DFE) of mutations is predicted quite well by the fitness of the genotype, especially for high-fitness genotypes. While the work does not offer a synthesis of the multitude of reported results, the information provided here raises interesting questions for future studies in this field. 

      Strengths: 

      A major strength of the study is its detailed and multifaceted approach, which has helped the authors tease out a number of interesting epistatic properties. The study makes a timely contribution by focusing on topical issues like the prevalence of global epistasis, the existence of pivot points, and the dependence of DFE on the background genotype and its fitness. The methodology is presented in a largely transparent manner, which makes it easy to interpret and evaluate the results. 

      The authors have classified pairwise epistasis into six types and found that the type of epistasis changes depending on background mutations. Switches happen more frequently for mutations at functionally important sites. Interestingly, the authors find that even synonymous mutations in stop codons can alter the epistatic interaction between mutations in other codons. Consistent with these observations of "fluidity", the study reports limited instances of global epistasis (which predicts a simple linear relationship between the size of a mutational effect and the fitness of the genetic background in which it occurs). Overall, the work presents some evidence for the genetic context-dependent nature of epistasis in this system. 

      Weaknesses: 

      Despite the wealth of information provided by the study, there are some shortcomings of the paper which must be mentioned. 

      (1) In the Significance Statement, the authors say that the "fluid" nature of epistasis is a previously unknown property. This is not accurate. What the authors describe as "fluidity" is essentially the prevalence of certain forms of higher-order epistasis (i.e., epistasis beyond pairwise mutational interactions). The existence of higher-order epistasis is a well-known feature of many landscapes. For example, in an early work, (Szendro et al., J. Stat. Mech., 2013), the presence of a significant degree of higher-order epistasis was reported for a number of empirical fitness landscapes. Likewise, (Weinreich et al., Curr. Opin. Genet. Dev., 2013) analysed several fitness landscapes and found that higher-order epistatic terms were on average larger than the pairwise term in nearly all cases. They further showed that ignoring higher-order epistasis leads to a significant overestimate of accessible evolutionary paths. The literature on higher-order epistasis has grown substantially since these early works. Any future versions of the present preprint will benefit from a more thorough contextual discussion of the literature on higher-order epistasis.

      (2) In the paper, the term 'sign epistasis' is used in a way that is different from its well-established meaning. (Pairwise) sign epistasis, in its standard usage, is said to occur when the effect of a mutation switches from beneficial to deleterious (or vice versa) when a mutation occurs at a different locus. The authors require a stronger condition, namely that the sum of the individual effects of two mutations should have the opposite sign from their joint effect. This is a sufficient condition for sign epistasis, but not a necessary one. The property studied by the authors is important in its own right, but it is not equivalent to sign epistasis. 
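
      The two conditions contrasted above can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the fitness values are invented, and the second predicate encodes the condition as this review describes it rather than the manuscript's own code.

```python
def sign(x):
    # Returns 1, 0 or -1.
    return (x > 0) - (x < 0)


def has_sign_epistasis(f_ref, f_a, f_b, f_ab):
    # Standard pairwise definition: the sign of a mutation's effect
    # depends on the state at the other locus.
    a_flips = sign(f_a - f_ref) * sign(f_ab - f_b) < 0
    b_flips = sign(f_b - f_ref) * sign(f_ab - f_a) < 0
    return a_flips or b_flips


def sum_opposes_joint(f_ref, f_a, f_b, f_ab):
    # Condition described in the review: the summed single-mutant effects
    # and the joint effect of both mutations have opposite signs.
    summed = (f_a - f_ref) + (f_b - f_ref)
    joint = f_ab - f_ref
    return sign(summed) * sign(joint) < 0


# Hypothetical cycle: mutation b is beneficial alone but deleterious
# once mutation a is present.
example = (1.00, 1.20, 1.10, 1.15)

print(has_sign_epistasis(*example))
print(sum_opposes_joint(*example))
```

      For this example the standard definition is met while the sum-based condition is not, which illustrates that the sum-based condition is not necessary for sign epistasis.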

      (3) The authors have looked for global epistasis in all 108 (9x12) mutations, out of which only 16 showed a correlation of R^2 > 0.4. 14 out of these 16 mutations were in the functionally important nucleotide positions. Based on this, the authors conclude that global epistasis is rare in this landscape, and further, that mutations in this landscape can be classified into one of two binary states - those that exhibit global epistasis (a small minority) and those that do not (the majority). I suspect, however, that a biologically significant binary classification based on these data may be premature. Unsurprisingly, mutational effects are stronger at the functional sites as seen in Figure 5 and Figure 2, which means that even if global epistasis is present for all mutations, a statistical signal will be more easily detected for the functionally important sites. Indeed, the authors show that the means of DFEs decrease linearly with background fitness, which hints at the possibility that a weak global epistatic effect may be present (though hard to detect) in the individual mutations. Given the high importance of the phenomenon of global epistasis, it pays to be cautious in interpreting these results. 
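
      The detection issue raised here can be illustrated with a toy regression. In the sketch below, two hypothetical mutations both have effects that decline linearly with background fitness, differing only in slope, and both carry the same deterministic "noise" pattern. All numbers are invented; only the R^2 > 0.4 cutoff comes from the manuscript as described above.

```python
def r_squared(xs, ys):
    # Coefficient of determination for a simple linear regression of ys on xs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical background fitness values from 0.0 to 1.0.
background = [i / 10 for i in range(11)]
# Deterministic stand-in for measurement noise (alternating +/- 0.05).
noise = [0.05 if i % 2 == 0 else -0.05 for i in range(11)]

# Both mutations follow a global-epistasis trend; only the slope differs.
strong_effect = [-0.80 * x + e for x, e in zip(background, noise)]
weak_effect = [-0.05 * x + e for x, e in zip(background, noise)]

r2_strong = r_squared(background, strong_effect)
r2_weak = r_squared(background, weak_effect)

print(r2_strong > 0.4, r2_weak > 0.4)
```

      Only the large-effect mutation clears the cutoff, although the same kind of trend underlies both. A fixed R^2 threshold therefore measures detectability as much as presence.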

      (4) The study reports that synonymous mutations frequently change the nature of epistasis between mutations in other codons. However, it is unclear whether this should be surprising, because, as the authors have already noted, synonymous mutations can have an impact on cellular functions. The reader may wonder if the synonymous mutations that cause changes in epistatic interactions in a certain background also tend to be non-neutral in that background. Unfortunately, the fitness effect of synonymous mutations has not been reported in the paper. 

      (5) The authors find that DFEs of high-fitness genotypes tend to depend only on fitness and not on genetic composition. This is an intriguing observation, but unfortunately, the authors do not provide any possible explanation or connect it to theoretical literature. I am reminded of work by (Agarwala and Fisher, Theor. Popul. Biol., 2019) as well as (Reddy and Desai, eLife, 2023) where conditions under which the DFE depends only on the fitness have been derived. Any discussion of possible connections to these works could be a useful addition.  

      We thank the reviewer for the summary of our work and for highlighting both its strengths and areas for improvement. We have carefully considered the points raised and revised the manuscript accordingly. The revised version:

      (1) Clarifies the conceptual framework. We emphasize the distinction between background-dependent, higher-order epistasis and global nonlinearities. To avoid ambiguity, we have replaced the term “fluid” epistasis with higher-order epistasis throughout, in line with prior literature (e.g. Szendro et al., 2013; Weinreich et al., 2013). We now explicitly situate our results in the context of these studies and clarify our definitions of epistasis, correcting the earlier error where “strong sign epistasis” was used in place of “sign epistasis.”

      (2) Improves statistical rigor. We now incorporate replicate variance and statistical error criteria in place of arbitrary thresholds. This ensures that classification of epistasis reflects experimental precision rather than fixed, arbitrary cutoffs.

      (3) Expands treatment of synonymous mutations. We now explicitly analyze synonymous mutations, separating those that are neutral from those that are non-neutral. Our results show that non-neutral synonymous mutations are disproportionately responsible for altering epistatic interactions, while neutral synonymous mutations rarely do so. We also report the fitness effects of synonymous mutations directly and include new analyses showing that there is no correlation between the mean fitness effect of a synonymous mutation and the frequency with which it alters epistasis (Supplementary Fig. S11).

      These revisions strengthen both the rigor and the clarity of the manuscript. We hope they address the reviewer’s concerns and make the significance of our findings, particularly the site-resolved quantification of higher-order epistasis in the folA landscape, including in synonymous mutations, more apparent.

      Reviewing Editor Comments: 

      Key revision suggestions: 

      (1) Please quantify the impact of measurement noise on your conclusions, and perform statistical analysis to determine whether the observed differences of epistasis due to different backgrounds are statistically significant. 

      (2) Please investigate how your conclusions depend on the cutoffs, and consider choosing them based on statistical criteria. 

      (3) Please reconsider the possible role of global epistasis, in particular the effect of bounds on fitness values. All reviewers are concerned that all claims, including about global epistasis, may be consistent with a simple null model where most low fitness genotypes are non-functional and variation in their fitness is simply driven by measurement noise. Please provide a convincing argument rejecting this model. 

      More generally, we recommend that you consider all suggestions by reviewers, including those about results, but also those about terminology and citing relevant works. 

      Thank you for your guidance. We have substantially revised the manuscript to incorporate the reviewers’ suggestions. In addition to addressing the three central issues raised, we have refined terminology, expanded the discussion of prior work, and clarified the presentation of our main results. We believe these changes significantly strengthen both the rigor and the impact of the study. We are grateful to the Reviewing Editor and reviewers for their constructive feedback.

      In the revised manuscript, we address the three major points as follows:

      (1) Quantifying measurement noise and statistical significance. We now use the average of six independent experimental runs for each genotype, together with the corresponding standard deviations, to explicitly quantify measurement uncertainty. Pairwise and higher-order epistasis are assessed relative to these error estimates, rather than against fixed thresholds. This ensures that differences across genetic backgrounds are statistically distinguishable from noise.

      (2) Replacing arbitrary cutoffs with statistical criteria. We have eliminated the use of arbitrary thresholds. Instead, classification of interactions (positive, negative, or neutral epistasis) is based on whether fitness differences exceed replicate variance. This approach scales naturally with measurement precision. While some results change quantitatively for high-fitness backgrounds and qualitatively for low-fitness backgrounds, our central conclusions remain robust.

      (3) Analysis of synonymous mutations. We now separately analyze synonymous mutations to test their role in altering epistasis. Our results show that there is no correlation between the average fitness effect of a synonymous mutation and the frequency with which it changes epistatic interactions.

      We have revised terminology for clarity (replacing “fluid” with higher-order epistasis) and updated the Discussion to place our work in the broader context of the literature on higher-order epistasis.

      Finally, we have rewritten the entire manuscript to improve clarity, refine the narrative flow, and ensure that the presentation more crisply reflects the subject of the study.

      Reviewer #1 (Recommendations for the authors): 

      MINOR COMMENTS 

      (1) Lines 102-107. Papkou's definition of non-functional genotypes makes sense since it is based on the fact that some genotypes are statistically indistinguishable in terms of fitness from mutants with premature stop codons in folA. It doesn't really matter whether to call them low fitness or non-functional, but it would be helpful to explain the basis for this distinction. 

      Thank you for raising this point. To maintain consistency with the original dataset and analysis, we retain Papkou et al.’s nomenclature and refer to these genotypes as “functional” or “non-functional.” 

      (2) Lines 111-112. I think the authors need to briefly explain here how they define the absence of epistasis. They do so in the Methods, but this information is essential and needs to be conveyed to the reader in the Results as well. 

      Thank you for the suggestion. We agree that this definition is essential for readers to follow the Results. In the revised manuscript, we have added a brief explanation at the start of the Results section clarifying how we define the absence of epistasis. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is self-contained, while full details remain in the Methods.

      (3) Lines 142 and elsewhere. The authors introduce the qualifier "fluid" to describe the fact that the value or sign of pairwise epistasis changes across genetic backgrounds. I don't see a need for this new terminology, since it is already captured adequately by the term "higher-order epistasis". The epistasis field is already rife with jargon, and I would prefer if new terms were introduced only when absolutely necessary. 

      Thank you for this helpful suggestion. We agree that introducing new terminology is unnecessary here. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon.

      (4) Figure 6. I don't think this is the best way of showing that the pivot points are clustered. A histogram would be more appropriate and would take less space. It would also allow the authors to display a null distribution to demonstrate that this clustering is indeed surprising. 

      (5) Lines 320-321. Mann-Whitney U tests whether one distribution is systematically shifted up or down relative to the other. Please change the language here. It looks like the authors also performed the Kolmogorov-Smirnov test, which is appropriate, but it doesn't look like the results are reported anywhere. Please report. 

      (6) Lines 330-334. The fact that HF genotypes seem to have more similar DFEs than LF genotypes is somewhat counterintuitive. Could this be an artifact of the fact that any two random HF genotypes are more similar to each other than any two randomly sampled LF genotypes? 

      (7) Lines 427. The sentence "The set of these selected variants are assigned their one hamming distance neighbours to construct a new 𝑛-base sequence space" is confusing. I think it is pretty clear how to construct a n-base sequence space, and this sentence adds more confusion than it removes. 

      We now start the results section of the manuscript with a brief description of how each type of epistasis is defined. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within the error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is self-contained, while full details remain in the Methods.

      We also agree that introducing new terminology is unnecessary. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon. Finally, we concur that the identified sentence was unnecessary and potentially confusing; it has been removed from the revised manuscript to improve clarity. In fact, we have rewritten the entire manuscript for better flow and readability. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Supplementary Figure S2A and S3 seem to be the same. 

      (3) The classification scheme for reciprocal sign/single sign/other sign epistasis differs from convention and should be made more explicit or renamed. 

      (4) Re the claim that high and low fitness backgrounds have different frequencies of the various types of epistasis: 

      Are the differences in the frequency distributions of the types of epistasis between high- and low-fitness backgrounds statistically significant? It seems that they follow similar general patterns, and the sample size is much smaller for high-fitness backgrounds, so more variance in their distributions is expected. 

      Does bounding of fitness measurements play a role in generating the differences in types of epistasis seen in high- vs. low-fitness backgrounds? If many variants are at the lower bound of the fitness assay, then positive epistasis might simply be less detectable for these backgrounds (which seems to be the biggest difference between high/low fitness backgrounds). 

      (5) In Figure 4B, points are not independent, because the mutation effects are calculated for all mutations in all backgrounds, rather than with reference to a single background or fluorescence value. The same mutations are therefore counted many times. 

      (6) It is not clear how the "pivot growth rate" was calculated or what the importance of this metric is. 

      (7) In the introduction, the justification for reanalyzing the Papkou et al dataset in particular is not clear. 

      (8) Epistasis at the nucleotide level is expected because of the genetic code: fitness and function are primarily affected by amino acid changes, and nucleotide mutations will affect amino acids depending on the state at other nucleotide sites in the same codon. For the most part, this is not explicitly taken account of in the paper. I recommend separating apparent epistasis due to the genetic code from that attributable to dependence among codons. 

      Thank you for noting this. Figure S2A shows results for high-fitness peaks only, whereas Figure S3 shows results for all peaks across the landscape. We have now made this distinction explicit in the figure legends and main text of the revised manuscript. 

      In the revised analysis, peaks are defined using the average fitness across six experimental replicates along with the corresponding standard deviation. Each genotype is compared with all single-step neighbors, and it is classified as a peak only if its mean fitness is significantly higher than all neighbors (p < 0.05). This procedure explicitly accounts for measurement error and replaces the arbitrary thresholding used previously. Full details are now described in the Methods.
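
      A minimal sketch of this kind of replicate-aware peak test is shown below. It is not the submitted code. The response gives the p < 0.05 criterion but does not name the statistical test, so the sketch substitutes a Welch t statistic compared against a hypothetical fixed critical value `t_crit`. The toy two-nucleotide landscape and all fitness values are invented.

```python
import math
import statistics
from itertools import product

BASES = "ACGT"


def neighbors(seq):
    # All sequences at Hamming distance 1 from seq.
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def welch_t(x, y):
    # Welch t statistic for the difference in means of two replicate sets.
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx / len(x) + vy / len(y))


def is_peak(seq, replicates, t_crit=2.0):
    # replicates maps each genotype to its list of replicate fitness values.
    # t_crit is a placeholder threshold, not the cutoff used in the paper.
    return all(welch_t(replicates[seq], replicates[n]) > t_crit for n in neighbors(seq))


# Toy landscape over 2 nucleotides with six hypothetical replicates each.
offsets = [0.0, 0.01, -0.01, 0.0, 0.02, -0.02]


def toy_landscape(centres, default=0.5):
    genotypes = ("".join(g) for g in product(BASES, repeat=2))
    return {g: [centres.get(g, default) + d for d in offsets] for g in genotypes}


clear_peak = toy_landscape({"AA": 1.00})
ambiguous = toy_landscape({"AA": 1.00, "AC": 0.99})

print(is_peak("AA", clear_peak))
print(is_peak("AA", ambiguous))
```

      In the second landscape the genotype `AA` still has the highest mean, but it is not distinguishable from its neighbor `AC` given the replicate spread, so it is not counted as a peak. Under a mean-only comparison it would have been.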

      To avoid confusion, we have corrected our definitions and now state them explicitly at the start of the analysis. We define sign epistasis as the case in which at least one mutation switches from being beneficial to deleterious. 

      We have clarified our motivation in the Introduction. The Papkou et al. dataset is the most comprehensive experimental map of a complete 9-bp region of folA and provides six independent replicates, making it uniquely suited for testing hypotheses about background-dependent epistasis. Importantly, Papkou et al. based their conclusions on a single run, whereas our reanalysis incorporates replicate means and variances, leading to substantive differences, for example a reduction in reported peaks from 514 to 127. By recalibrating the analysis, we provide a more rigorous account of this landscape and highlight how methodological choices affect conclusions.

      We also agree that some nucleotide-level epistasis reflects the structure of the genetic code (i.e., codon degeneracy and context-dependence of amino acid substitutions). In the revised manuscript, we explicitly separate epistasis attributable to codon structure from epistasis arising among codons. For example, synonymous mutations that alter epistasis within codons are treated separately from those affecting interactions across codons, and this distinction is now clearly indicated in the Results.

      Reviewer #3 (Recommendations for the authors): 

      (1) The analysis of peak density and accessibility in the paragraph starting on line 96 seems a bit out of context. Its connection with the various forms of epistasis treated in the rest of the paper is unclear. 

      (2) As mentioned in the Public Review, the term 'sign epistasis' has been used in a non-standard way. My suggestion would be to use a different term. Even a slightly modified term, such as "strong sign epistasis", should help to avoid any confusion. 

      (3) I mentioned in the Public Review that it is not clear whether the synonymous mutations that change the type of epistasis also tend to be non-neutral. This issue could be addressed by computing, for example, the fitness effects of all synonymous mutations for backgrounds and mutation pairs where a switch in epistasis occurs, and comparing it with fitness effects where no such switch occurs.

      (4) Do the authors have any proposal for why synonymous mutations seem to cause more frequent changes in epistasis in low-fitness backgrounds? Related to this, is there any systematic difference between the types of switch caused by synonymous mutations in the low- versus high-fitness backgrounds? 

      (5) It is unclear exactly how the pivot points were determined, especially since the data for many mutations is noisy. The protocol should be provided in the Methods section. 

      (6) Line 303: possible typo, "accurate" --> "inaccurate". 

      (7) The value of Delta used for the "phenotypic DFE" has not been mentioned in the main text (including Methods).

      We agree that the connection needed to be clearer. In the revised manuscript, we (i) relocate and retitle this material as a brief “Landscape overview” preceding the epistasis analyses, (ii) explicitly link multi-peakedness and path accessibility to epistasis (e.g., multi-peak structure implies the presence of sign/reciprocal-sign epistasis; accessibility is shaped by background-dependent effects), and (iii) move derivations to the Supplement. We also recomputed peak density and accessibility using replicate-averaged fitness with replicate SDs, so the overview and downstream epistasis sections now use a single, error-aware landscape (updated in Figs. 1–3, with cross-references in the text).

      We have aligned our terminology and now state definitions upfront. 

      After replacing fixed cutoffs with replicate-based error criteria, switches are more frequent in high-fitness backgrounds (Fig. 3). Mechanistically, near the lower fitness bound, deleterious effects are masked (global nonlinearity), reducing apparent switching. Functional/high-fitness backgrounds allow both beneficial and deleterious outcomes, so background-dependent (higher-order) interactions manifest more readily. Switch types also vary by background fitness: high-fitness backgrounds show more sign/strong-sign switches, whereas low-fitness backgrounds show mostly magnitude reclassifications (Fig. 3C; Supplement Fig. Sx).

      Finally, we corrected a typo by replacing “accurate” with “inaccurate” and now define Δ (equal to 0.05) in the main text (in Results and Figure 8 caption).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      The study design used reversal learning (i.e. the CS+ becomes the CS- and vice versa), while the title mentions 'fear learning and extinction'. In my opinion, the paper does not provide insight into extinction and the title should be changed.

      Thank you for this important point. We agree that our paradigm focuses more directly on reversal learning than on standard extinction, as the test phases represent extinction in the absence of a US but follow a reversal phase. To better reflect the core of our investigation, we have changed the title.

      Proposed change in manuscript (Title): Original Title: Distinct representational properties of cues and contexts shape fear learning and extinction 

      New Title: Distinct representational properties of cues and contexts shape fear and reversal learning

      Secondly, the design uses 'trace conditioning', whereas the neuroscientific research and synaptic/memory models are rather based on 'delay conditioning'. However, given the limitations of this design, it would still be possible to make the implications of this paper relevant to other areas, such as declarative memory research.

      This is an excellent point, and we thank you for highlighting it. Our design, where a temporal gap exists between the CS offset and US onset, is indeed a form of trace conditioning. We also agree that this feature, particularly given the known role of the hippocampus in trace conditioning, strengthens the link between our findings and the broader field of episodic memory.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 218-220): "It is important to note that the temporal gap between the CS offset and potential US delivery (see Figure 1A) indicates that our paradigm employs a trace conditioning design. This form of learning is known to be hippocampus-dependent and has been distinguished from delay conditioning."

      Proposed change in manuscript (Discussion): We added the following to the discussion (lines 774-779): "Furthermore, our use of a trace conditioning paradigm, which is known to engage the hippocampus more than delay conditioning does, may have facilitated the detection of item-specific, episodic-like memory traces and their interaction with context. This strengthens the relevance of our findings for understanding the interplay between aversive learning and mechanisms of episodic memory."

      The strength of the evidence at this point would be described as 'solid'. In order to increase the strength (to convincing), analyses including FWE correction would be necessary. I think exploratory (and perhaps some FDR-based) analyses have their valued place in papers, but I agree that these should be reported as such. The issue of testing multiple independent hypotheses also needs to be addressed to increase the strength of evidence (to convincing). Evaluating the design with 4 cues could lead to false positives if, for example, current valence, i.e. (CS++ and CS-+) > (CS+- and CS--), and past valence (CS++ > CS+-) > (CS-+ > CS--) are tested as independent tests within the same data set. Authors need to adjust their alpha threshold.

      We fully agree. As summarized in our general response, we have implemented two major changes to our statistical approach to address these concerns comprehensively. These, as stated above, are the following:

      (1) Correction for Multiple Hypotheses: We previously used FWER-corrected p-values that were obtained through permutation testing. We have now applied a Bonferroni adjustment to the FWER-corrected threshold (previously 0.05) used in our searchlight analyses. For instance, in the acquisition phase, since 2 independent tests (contrasts) were conducted, the significance threshold of each of these searchlight maps was set to p < 0.025 (after FWE correction estimated through non-parametric permutation testing); in reversal, 4 tests were conducted, hence the significance threshold was set to p < 0.0125. This change is now clearly described in the Methods section (section “Searchlight approach”, lines 477-484). This change had no impact on our searchlight results, given that all clusters that were previously reported as significant at the FWER alpha of 0.05 were also significant at the new, Bonferroni-adjusted thresholds; we also now report the cluster-specific corrected p-values in the cluster tables in Supplementary Material.

      (2) ROI Analyses: Our ROI-based analyses used FDR-based correction within each item reinstatement/generalized reinstatement pair of each ROI. We now explicitly state in the abstract, methods and results sections that these ROI-based analyses are exploratory and secondary to the primary whole-brain results, given that the correction method used is more liberal, in accordance with the exploratory character of these analyses.

      We are confident that these changes ensure both the robustness and transparency of our reported findings.

      Reviewer #1 (Public Review):

      (1) I had a difficult time unpacking lines 419-420: "item stability represents the similarity of the neural representation of an item to other representations of this same item."

      We thank the reviewer for pointing out this lack of clarity. We have revised the definition to be more intuitive and have ensured it is introduced earlier in the manuscript.

      Proposed change in manuscript (Introduction, lines 144-150): We introduced the concept earlier and more clearly: "Furthermore, we can measure the consistency of a neural pattern for a given item across multiple presentations. This metric, which we refer to as “item stability”, quantifies how consistently a specific stimulus (e.g., the image of a kettle) is represented in the brain across multiple repetitions of the same item. Higher item stability has been linked to successful episodic memory encoding (Xue et al., 2010)."

      Proposed change in manuscript (Methods, Section "Item stability and generalization of cues"): Original text: "Thus, item stability represents the similarity of the neural representation of an item to other representations of this same item (Xue, 2018), or the consistency of neural activity across repetitions (Sommer et al., 2022)."

      Revised text (lines 434-436): "Item stability is defined as the average similarity of neural patterns elicited by multiple presentations of the same item (e.g., the kettle). It therefore measures the consistency of an item's neural representation across repeated encounters."
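
      As an illustration of this definition, item stability can be sketched as the mean pairwise similarity between the activity patterns evoked by repeated presentations of one item. The sketch below assumes Pearson correlation as the similarity measure and omits steps such as Fisher z-transformation or searchlight extraction; the function names `pearson` and `item_stability` are hypothetical and do not refer to the analysis code used in the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def item_stability(patterns):
    """Average similarity over all pairs of presentations of the same item.

    `patterns` holds one voxel-activity vector per presentation
    (e.g., one trial-wise estimate per repetition of the kettle image).
    """
    pairs = list(combinations(patterns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

      With three presentations there are three pairs; values near 1 indicate that the item evokes a consistent pattern across repetitions, and values near 0 indicate no consistency.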

      (2) The authors use the phrase "representational geometry" several times in the paper without clearly defining what they mean by this.

      We apologize for this omission. We have now added a clear and concise definition of "representational geometry" in the Introduction, citing the foundational work by Kriegeskorte et al. (2008).

      Proposed change in manuscript (Introduction): We inserted the following text (lines 117-125): "By contrast, multivariate pattern analyses (MVPA), such as representational similarity analysis (RSA; Kriegeskorte et al., 2008), have emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – that is, the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct."

      (3) The abstract is quite dense and will likely be challenging to decipher for those without a specialized knowledge of both the topic (fear conditioning) and the analytical approach. For instance, the goal of the study is clearly articulated in the first few sentences, but then suddenly jumps to a sentence stating "our data show that contingency changes during reversal induce memory traces with distinct representational geometries characterized by stable activity patterns across repetitions..." this would be challenging for a reader to grok without having a clear understanding of the complex analytical approach used in the paper.

      We agree with your assessment. We have rewritten the abstract to be more accessible to a general scientific audience by focusing on the conceptual findings rather than methodological jargon.

      Proposed change in manuscript (Abstract): We revised the abstract to be clearer. It now reads: "When we learn that something is dangerous, a fear memory is formed. However, this memory is not fixed and can be updated through new experiences, such as learning that the threat is no longer present. This process of updating, known as extinction or reversal learning, is highly dependent on the context in which it occurs. How the brain represents cues, contexts, and their changing threat value remains a major question. Here, we used functional magnetic resonance imaging and a novel fear learning paradigm to track the neural representations of stimuli across fear acquisition, reversal, and test phases. We found that initial fear learning creates generalized neural representations for all threatening cues in the brain’s fear network. During reversal learning, when threat contingencies switched for some of the cues, two distinct representational strategies were observed. On the one hand, we still identified generalized patterns for currently threatening cues, whereas on the other hand, we observed highly stable representations of individual cues (i.e., item-specific) that changed their valence, particularly in the precuneus and prefrontal cortex. Furthermore, we observed that the brain represents contexts more distinctly during reversal learning. Moreover, additional exploratory analyses showed that the degree of this context specificity in the prefrontal cortex predicted the subsequent return of fear, providing a potential neural mechanism for fear renewal. Our findings reveal that the brain uses a flexible combination of generalized and specific representations to adapt to a changing world, shedding new light on the mechanisms that support cognitive flexibility and the treatment of anxiety disorders via exposure therapy."

      (4) Minor: I believe it is STM200 not the STM2000.

      Thank you for pointing this out. We have corrected it in the Methods section.

      Proposed change in manuscript (Methods, Page 5, Line 211): Original: STM2000 -> Corrected: STM200

      (5) Line 146: "...could be particularly fruitful as a means to study the influence of fear reversal or extinction on context representations, which have never been analyzed in previous fear and extinction learning studies." I direct the authors to Hennings et al., 2020, Contextual reinstatement promotes extinction generalization in healthy adults but not PTSD, as an example of using MVPA to decipher reinstatement of the extinction context during test.

      Thank you for pointing us towards this relevant work. We have revised the sentence to reflect the state of the literature more accurately.

      Proposed change in manuscript (Introduction, Page 3): Original text: "...which have never been analyzed in previous fear and extinction learning studies." 

      Revised text (lines 154-157): "...which, despite some notable exceptions (e.g., Hennings et al., 2020), have been less systematically investigated than cue representations across different learning stages."

      (6) This is a methodological/conceptual point, but it appears from Figure 1 that the shock occurs 2.5 seconds after the CS (and context) goes off the screen. This would seem to be more like a trace conditioning procedure than a standard delay fear conditioning procedure. This could be a trivial point, but there have been numerous studies over the last several decades comparing differences between these two forms of fear acquisition, both behaviorally and neurally, including differences in how trace vs delay conditioning is extinguished.

      Thank you for this pertinent observation; this was also pointed out by the editor. As detailed in our response to the editor, we now explicitly acknowledge that our paradigm uses a trace conditioning design, and have added statements to this effect in the Methods and Discussion sections (lines 218-220, and 774-779).

      (7) In Figure 4, it would help to see the individual data points derived from the model used to test significance between the different conditions (reinstatement between Acq, reversal, and test-new).

      We agree that this would improve the transparency of our results. We have revised Figure 4 to include individual data points, which are now plotted over the bar graphs. 

      Reviewer #2 (Public Review & Recommendations)

      Use a more stringent method of multiple comparison correction: voxel-wise FWE instead of FDR; Holm-Bonferroni across multiple hypothesis tests. If FDR is chosen then the exploratory character of the results should be transparently reported in the abstract.

      Thank you for these critical comments regarding our statistical methods. As detailed in the general response and response to the editor (Comment 3), we have thoroughly revised our approach to ensure its rigor. We now clarify that our whole-brain analyses consistently use FWER-corrected p-values. Additionally, these FWER-corrected p-values (obtained through permutation testing), which were previously evaluated against a default threshold of 0.05, are now compared with a Bonferroni-adjusted threshold, i.e., 0.05 divided by the number of tested contrasts in each experimental phase. We have modified the revised manuscript accordingly, in the Methods section (lines 473-484) and in the Supplementary Material, where we added the p-values (FWER-corrected) of each cluster, evaluated against the new Bonferroni-adjusted thresholds. Note that this had no impact on our searchlight results, given that all clusters that were previously reported as significant with the alpha threshold of 0.05 were also significant at the new, corrected thresholds.

      Proposed change in manuscript (Methods): We revised the relevant paragraphs (lines 473-484): "Significance corresponding to the contrast between conditions of the maps of interest was FWER-corrected using nonparametric permutation testing at the cluster level (10,000 permutations) to estimate significant cluster size. Additionally, we adjusted the alpha threshold against which we assessed the significance of the cluster-specific FWER-corrected p-values using Bonferroni correction. To this end, we divided the default corrected alpha threshold of 0.05 by the number of statistical comparisons that were conducted in each experimental phase. For example, for fear acquisition, we compared the CS+>CS- contrast for both item stability and cue generalization, resulting in 2 comparisons and hence a corrected alpha threshold of 0.025. Only clusters that had a FWER-corrected p-value below the Bonferroni-adjusted threshold were deemed significant. All searchlight analyses were restricted to a gray matter mask."
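
      The threshold arithmetic in this paragraph can be sketched as follows. The function names and the cluster labels in the usage note are hypothetical and serve only to illustrate the rule; the cluster-level FWER-corrected p-values themselves are assumed to come from the permutation test and are not computed here.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Divide the default corrected alpha by the number of contrasts in a phase."""
    return alpha / n_tests


def significant_clusters(cluster_pvalues, n_tests, alpha=0.05):
    """Keep clusters whose FWER-corrected p-value is below the adjusted threshold.

    `cluster_pvalues` maps a cluster label to its FWER-corrected p-value.
    """
    threshold = bonferroni_threshold(n_tests, alpha)
    return {name: p for name, p in cluster_pvalues.items() if p < threshold}
```

      For acquisition (2 contrasts) the threshold is 0.05 / 2 = 0.025, and for reversal (4 contrasts) it is 0.05 / 4 = 0.0125. A hypothetical cluster with a corrected p-value of 0.03 would pass at 0.05 but fail in either of these phases.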

      The authors report fMRI results from line 96 onwards; all of these refer exclusively to mass-univariate fMRI which could be mentioned more transparently... The authors contrast "activation fMRI" with "RSA" (line 112). Again, I would suggest mentioning "mass-univariate fMRI", and contrasting this with "multivariate" fMRI, of which RSA is just one flavour. For example, there is some work that is clear and replicable, demonstrating human amygdala involvement in fear conditioning using SVM-based analysis of high-resolution amygdala signals (one paper is currently cited in the discussion).

      Thank you for this important clarification. We have revised the manuscript to incorporate your suggestions. We now introduce our initial analyses as "mass-univariate" and contrast them with the "multivariate pattern analysis" (MVPA) approach of RSA.

      Proposed change in manuscript (Introduction): We revised the relevant paragraphs (lines 113-125): "While mass-univariate functional magnetic resonance imaging (fMRI) activation studies have been instrumental in identifying the brain regions involved in fear learning and extinction, they are insensitive to the patterns of neural activity that underlie the stimulus-specific representations of threat cues and contexts. By contrast, multivariate pattern analysis methods, such as representational similarity analysis (RSA; Kriegeskorte et al., 2008), have emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – i.e., the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct."

      Line 177: unclear how incomplete data was dealt with. If there are 30 subjects and 9 incomplete data sets, then how do they end up with 24 in the final sample?

      We apologize for the unclear wording in our original manuscript. We have clarified the participant exclusion pipeline in the Methods section.

      Proposed change in manuscript (Methods, Section "Participants"): Original text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. Of the 30 participants who completed the first session, four did not return for the second day and thus had incomplete data across the four experimental phases. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses." 

      Revised text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses."

      Typo in line 201.  

      Thank you for your comment. We have re-examined line 201 (“interval (Figure 1A). A total of eight CSs were presented during each phase and”) and the surrounding text but were unable to identify a clear typographical error in the provided quote. However, in the process of revising the manuscript for clarity, we have rephrased this section.

      it would be good to see all details of the US calibration procedure, and the physical details of the electric shock (e.g. duration, ...).

      Thank you for your comment. We have expanded the Methods section to include these important details.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 225-230): "Electrical stimulation was delivered via two Ag/AgCl electrodes attached to the distal phalanx of the index and middle fingers of the non-dominant hand. The intensity of the electrical stimulation was calibrated individually for each participant prior to the experiment. Using a stepping procedure, the voltage was gradually increased until the participant rated the sensation as 'unpleasant but not painful'."

      "beta series modelling" is a jargon term used in some neuroimaging software but not others. In essence, the authors use trial-by-trial BOLD response amplitude estimates in their model. Also, I don't think this requires justification - using the raw BOLD signal would seem outdated for at least 15 years.

      Thank you for this helpful suggestion. We have simplified the relevant sentences for improved clarity.

      Proposed change in manuscript (Methods, Section "RSA"): Original text: "...an approach known as beta-series modeling (Rissman et al., 2004; Turner et al., 2012)." 

      Revised text (lines 391-393): "...an approach that allows for the estimation of trial-by-trial BOLD response amplitudes, often referred to as beta-series modeling (Rissman et al., 2004). Specifically, we used a Least Squares Separate (LSS) approach..."

      I found the use of "Pavlovian trace" a bit confusing. The authors are coming from memory research where "memory trace" is often used; however, in associative learning the term "trace conditioning" means something else. Perhaps this can be explained upon first occurrence, and "memory trace" instead of "Pavlovian trace" might be more common.

      We are grateful for this comment, as it highlights a critical point of potential confusion, especially given that we now acknowledge our paradigm uses a trace conditioning design. To eliminate this ambiguity, we have replaced all instances of "Pavlovian trace" with "lingering fear memory trace" throughout the manuscript (lines 542 and 599).

      I would suggest removing evaluative statements from the results (repeated use of "interesting").

      Thank you for this valuable suggestion. We have reviewed the Results section and removed subjective evaluative words to maintain a more objective tone. 

      Line 882: one of these references refers to a multivariate BOLD analysis using SVM, not explicitly using temporal information in the signal (although they do show session-by-session information).

      Thank you for this correction. We have re-examined the cited paper (Bach et al., 2011) and removed it from the text accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      Recommendations for improvement:

      (1) Address data presentation, editing, and other issues of lack of clarity as pointed out by the reviewers.

      We have now addressed all comments from reviewers that identify editing errors and lack of clarity issues. Regarding data presentation we have made some changes, for example including a combined heatmap to show consistency between row names (Figure 2 - figure supplement 2), but also kept some stylistic features such as the balance between main and supplemental figures that we think fits more naturally with the story of the paper.

      (2) Inclusion of requested and critical details in the methodology section, an important component for broad applicability of a new methodology by other investigators.

      We have added the requested details to the methods section, specifically the RCA protocol.

      (3) More in-depth discussion of the limitations of the methodology and approach to capture important but more complex components of tissues of interest, for example, sexual dimorphism.

      We have now edited the ‘pitfalls of study’ section in the discussion to include further detail of the limitations of the number of genes that can be used to deeply profile transcriptomic types, including sexual dimorphism. Regarding its use in other tissues of interest, we have now included a reference in the discussion (Bintu et al., 2025) where a similar strategy has been used to profile cells in the olfactory epithelium and olfactory bulb. We have also used hamFISH in other brain areas (as commented in our public reviews responses) but as this is unpublished work we will refrain from mentioning it in the main text.

      Reviewer #1 (Recommendations for the authors):

      The manuscript by Edwards et al. would benefit from minor revisions. Here, we outline several points that could / should be addressed:

      (1) General balance of data presentation between main and supplementary figures

      (a) quantifications were often missing from main figures and only presented in the supplements

      Thank you for raising this point. We believe that the balance of panels between the main and supplemental figures matches our story and results section well with quantifications included in the main figures where appropriate.

      (b) more informative figure legends in supplements (e.g.: Supplementary Figure I - Figure 3)

      We have now revised the figure legends and added more description where appropriate.

      (c) missing subpanel in Figure 3; figure legend describes 3H, which is missing in the figure

      We thank the reviewer for pointing this out and have now amended the subpanel.

      stand-alone figure on inhibitory neuron cluster i3 cells

      We agree that this is an important characterisation of i3 cells but decided to place this figure in the supplement as it does not fall within the main storyline (defining transcriptomic characterisation of cell types in a multimodal fashion), but rather acts as accessory information for those specifically interested in these inhibitory cell types.

      statistical tests used (e.g.: Figure 1 C -, Supplementary Figure 3 - Figure 2)/ graphs shown (Supplementary Figure 1 - 1 D)

      The statistical tests used are described in the figure legends.

      t-SNE dimensionality reduction of positional parameters

      Explanations of the t-SNE dimensionality reduction of positional parameters can be found in the materials and methods.

      (d) heatmaps similarly informative and more convincing

      We have included an extra heatmap (Figure 2 - figure supplement 2) in response to Reviewer 3’s comment (see below) in order to more easily follow genes across all the different clusters. We hope this helps to make the heatmaps more convincing and informative.

      code availability

      Code availability is described in the methods section of the manuscript.

      page 6, 3rd paragraph wrong description of PMCo abbreviation

      We thank the reviewer for identifying the mistake and we have now amended it.

      Reviewer #2 (Recommendations for the authors):

      The pre-existing scRNA-seq dataset on which the manuscript is based is an older Drop-seq dataset for which minimal QC information is provided. The authors should include QC information (genes/cells and UMIs/cells) in the Methods. Moreover, the Seurat clustering of these cells and depiction of marker genes in feature plots are not shown.

      It is therefore difficult to determine how the authors selected their 31 genes for their hamFISH panel, or how selective they are to the original Drop-seq clusters.

      The QC information of this dataset can be found in the original publication (Chen et al., 2019) with our clustering methods described in the materials and methods section. We have not included individual gene names in our heatmap plots for presentation purposes (there are over 200 rows), but the data and cluster descriptions can be found in supplemental tables.

      Reviewer #3 (Recommendations for the authors):

      (1) The imaging modality is not entirely clear in the methods. The microscopy technique is referenced to prior work and involves taking z-stacks, but analysis appears to be done on maximum z-projections, which seems like it would introduce the risk of false attribution of gene expression to cells that are overlapping in "z".

      Thank you for pointing out the technical limitation of the microscopy. For imaging we used epifluorescence microscopy with 14 × 500 nm z-steps to collect our raw data and generate a maximum intensity projection for further analysis. Because of the thin sections (10 µm) used for the imaging, the overlap between cells in z is expected to be minimal. However, we cannot completely rule out the misattribution raised in the comment. The Methods section contains this information.

      (2) Supplemental Figure 1 - Figure Supplement 2B: RCA looks significantly different when compared to v2 smFISH from the representative image, although it is written as comparable. Additionally, there is no information about RCA mentioned in the Materials and Methods section. Supplemental Figure 1 - Figure Supplement 2B: The figure label for RCA is missing.

      By "comparable" we are referring to the signal intensity rather than the pattern, as mentioned in the Results section. We did not analyze the number of spots. It is true that the pattern of RCA signal is much sparser due to its inherently lower sensitivity compared with hamFISH. We thank the reviewer for identifying the lack of a methodological RCA description and have amended the manuscript to include this. We have also now added the missing RCA label to the figure.

      (3) Figure 2C and associated supplement: The rows (each gene) are not consistent across the subpanels (i.e. they do not line up left-to-right), this makes it difficult for the reader to follow the patterns that distinguish the cell types in each subset.

      We have done this as we believe it makes for an easier interpretation of inhibitory vs excitatory clusters for the reader. However, we agree with the reviewer that one may wish to look at the dataset as a whole with a consistent gene order, and we have now provided this in the corresponding supplemental figure.  

      (4) "Consistent with previous work, most inhibitory classes are localized in the dorsal and ventral subdivisions of the MeA, whereas excitatory neurons occupy primarily the ventral MeA (Figure 2D, Figure 2 - Figure Supplement 2C, Figure 1D)". - The reference to Figure 1D seems to be an error.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (5) Supplemental Figure 2 - Figure Supplement 1, "published by Chen et al." - should have a proper reference number to be compatible with the rest of the manuscript. Also, the lack of gene info makes it difficult to understand Panel A. Finally, the text on Panel B refers to "hamMERFISH" which seems an error.

      We thank the reviewer for identifying the mistake on Panel B, it has now been amended. We have also changed the reference format. Regarding the lack of gene information in panel A, it is difficult to present all row names due to the large number of rows (>200), but this information can be found in supplemental table 2.

      (6) Supplemental Figure 2 - Figure Supplement 1: there are thin dividing lines drawn on each section, but these are not described or defined, making it difficult to understand what is being delineated.

      We thank the reviewer for identifying this omission and have now edited the figure legend to contain a description.

      (7) Page 4, "...we found 26 clusters in cells that are positive for Slc32a1 (inhibitory) or Slc17a6 (encoding Vglut2 and therefore excitatory) positive (Figure 2 - figure supplement 1A, Table S2)."

      This seems to be an error as Figure 2 - figure supplement 1A does not show this.

      We double-checked that this description describes the panel accurately.

      (8) "The clustering revealed that inhibitory and excitatory classes generally have different spatial properties (Figure 1E, left), although the salt-and-pepper, sparse nature of e10 (Nts+) cells is more similar to inhibitory cells than other excitatory classes".

      The reference to Figure 1E should be to Figure 2E.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (9) "Comparison of the proportion of all cells that are cluster X vs projection neurons labelled by CTB that are cluster X". Please explain cluster X in this context.

      We have now rephrased this sentence in the figure legend for clarity.

      (10) Figure 3 - figure supplement 3: There appears to be quite a bit of heterogeneity in the patterns of activity across clusters even within behavioral contexts (e.g. the bottom 2 animals paired with females). It might be worth commenting on (or quantifying) whether there were any evident differences in the social behaviors observed (e.g. mating or not?) in individuals demonstrating these patterns.

      We thank the reviewer for this observation. We unfortunately did not quantify the behaviors, but we agree that more work is needed to link the pattern of c-fos activity with incrementally measured behavioral variables. At least, we did not include animals that did not display the anticipated social behaviours (as described in the materials and methods) in the in situ transcriptomic profiling work.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediate sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment. 

      Reviewer 2 (Public Review):

      Summary: 

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate value-guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths: 

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment. 

      Weaknesses: 

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer 3 (Public Review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment. 

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012). Furthermore, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012; Parkes et al., 2015). The marginal impairment may therefore reflect inadvertent silencing of a small number of NAc-C D1-SPNs rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

      Reviewer 1 (Recommendations for the Author):

      My main concerns and comments are listed below.

      (1) Could the authors provide the "raw" data of the PIT tests, such as PreSame vs Same vs PreDifferent vs Different? Could the authors clarify how the Net responding was calculated? Was it Same minus PreSame & Different minus PreDifferent, or was the average of PreSame and PreDifferent used in this calculation?

      The raw data for PIT testing across all experiments are now included in the Supplemental Figures (Supplemental Figures S1E, S2E, S3E, and S4E). Baseline responding was quantified as the average number of lever presses per minute for both actions during the two-minute period (i.e., average of PreSame and PreDifferent) preceding each stimulus presentation. This methodology has been clarified in the revised manuscript (page 7).
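
      The baseline definition above can be sketched as a short calculation. This is a minimal illustration that assumes net responding is the response rate during the stimulus minus the shared baseline; the function name and the example rates are hypothetical and are not data from the study.

```python
def net_responding(same, different, pre_same, pre_different):
    """Return (net_same, net_different) in presses per minute.

    Assumption: one shared baseline, the average of the two
    pre-stimulus rates, is subtracted from each stimulus-period rate.
    """
    baseline = (pre_same + pre_different) / 2
    return same - baseline, different - baseline

# Hypothetical rates (lever presses per minute), for illustration only.
net_same, net_different = net_responding(
    same=12.0, different=6.0, pre_same=5.0, pre_different=3.0
)
print(net_same, net_different)
```

      With a shared baseline, the difference between net same and net different responding equals the difference between the raw same and different rates, so the choice of baseline shifts both measures equally without changing their contrast.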

      (2) While both sexes are utilized in the current study, no statistical analysis is provided. Can the authors please comment on this point and provide these analyses (for both training and tests)?

      As noted in the original manuscript, the final sample sizes for female and male rats were insufficient to provide adequate statistical power for sex-based analyses (page 15). To address this limitation, we have now cited a previous study from our laboratory (Burton et al., 2014) that conducted such analyses with sufficient power in identical behavioural tasks. That study identified only marginal sex differences in performance, with female rats exhibiting slightly higher magazine entry rates during Pavlovian conditioning. Importantly, no differences were observed in outcome-specific PIT or value-based choice performance between sexes.

      (3) Regarding Figure 1 - Anterograde tracing in D1-Cre and A2a-Cre rats (from line 976), I have one major and one minor question:

      (3.1) I do not understand the rationale of showing anterograde tracing from the Dorsal Striatum (DS) as this region is not studied in the current work. Moreover, sagittal micrographs of D1-Cre and A2a-Cre would be relevant here. Could the authors please provide these micrographs and explain the rationale for doing tracing in DS?

      We included dorsal striatum (DS) tracing data as a reference because the projection patterns of D1 and D2 SPNs in this region are well-established and extensively characterized, in contrast to the more limited literature on these cell types in the NAc-S. Regarding the comment about sagittal micrographs, we are uncertain of the specific concern as these images are presented in Figure 1B.

      If the reviewer is requesting sagittal micrographs for NAc-S anterograde tracing, we did not employ this approach because: (1) the NAc-S and ventral pallidum are anatomically adjacent regions and (2) the medial-lateral coordinates of the ventral pallidum and lateral hypothalamus do not align optimally with those of the NAc-S, limiting the utility of sagittal analysis for these projections.

      (3.2) There is no description about how the quantifications were done: manually? Automatically? What script or plugin was used? If automated, what were the thresholding conditions? How many brain sections along the anteroposterior axis? What was the density of these subpopulations? Can the authors include a methodological section to address this point?

      We apologize for the omission of quantification methods used to assess viral transduction specificity. This methodological description has now been added to the revised manuscript (page 22). Briefly, we employed a manual procedure in two sections per rat, and cell counts were completed in a defined region of interest located around the viral infusion site.

      (4) Lex A & Hauber (2008) Dopamine D1 and D2 receptors in the nucleus accumbens core and shell mediate Pavlovian-instrumental transfer. Learning & memory 15:483- 491, should be cited and discussed. It also seems that the contribution of the main dopaminergic source of the brain, the ventral tegmental area, is not cited, while it has been investigated in PIT in at least 3 studies regarding sPIT only, notably the VP-VTA pathway (Leung & Balleine 2015, accurately cited already).

      We did not include the Lex & Hauber (2008) study because its experimental design (single lever and single outcome) prevents differentiation between the effects of Pavlovian stimuli on action performance (general PIT) versus action selection (outcome-specific PIT, as examined in the present study). Drawing connections between their findings and our results would require speculative interpretations regarding whether their observed effects reflect general or outcome-specific PIT mechanisms, which could distract from the core findings reported in the article.

      Several studies examining the role of the VTA in outcome-specific PIT were referenced in the manuscript's introduction. Following the reviewer's recommendation, these references have also been incorporated into the discussion section (page 13). 

      (5) While not directly the focus of this study, it would be interesting to highlight the accumbens dissociation between General vs Specific PIT, and how the dopaminergic system (differentially?) influences both forms of PIT.

      We agree with the reviewer that the double dissociation between nucleus accumbens core/shell function and general/specific PIT is an interesting topic. However, the present manuscript does not examine this dissociation, the nucleus accumbens core, or general PIT. Similarly, our study does not directly investigate the dopaminergic system per se. We believe that discussing these topics would distract from our core findings and substantially increase manuscript length without contributing novel data directly relevant to these areas. 

      (6) While authors indicate that conditioned response to auditory stimuli (magazine visits) are preserved in all groups, suggesting intact sensitivity to the general motivational properties of reward-predictive stimuli (lines 344, 360), authors can't conclude about the specificity of this behavior i.e. does the subject use a mental representation of O1 when experiencing S1, leading to a magazine visits to retrieve O1 (and same for S2-O2), or not? Two food ports would be needed to address this question; also, authors should comment on the fact that competition between instrumental & Pavlovian responses does not explain the deficits observed.

      We agree with the Reviewer that magazine entry data cannot be used to draw conclusions about specificity, and we do not make such claims in our manuscript. We are therefore unclear about the specific concern being raised. Following the Reviewer’s recommendation, we have commented on the fact that response competition could not explain the results obtained (page 11, see also supplemental discussion). 

      The minor comments are listed below.

      (7) A high number of rats were excluded (> 32 total), and the number of rats excluded for NAc-S D1-SPNs-VP is not indicated.

      We apologize for omitting the number of rats excluded from the experiment examining NAc-S D1-SPN projections to the ventral pallidum. This information has been added to the revised manuscript (page 22).

      (7.1) Can authors please comment on the elevated number of exclusions?

      A total of 133 rats were used across the reported experiments, with 40 rats excluded based on post-mortem analyses. This represents an attrition rate of approximately 30%, which we consider reasonable given that most animals received two separate viral infusions and two separate fiber-optic cannula implantations, and that the inclusion of both female and male rats contributed to some variability in coordinates and therefore in targeting.
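
      The attrition figure follows directly from the two counts reported above, as this short check shows. The variable names are illustrative.

```python
total_rats = 133      # rats used across the reported experiments
excluded_rats = 40    # rats excluded after post-mortem analyses

attrition_rate = excluded_rats / total_rats
retained_rats = total_rats - excluded_rats

print(f"Attrition: {attrition_rate:.1%}; retained: {retained_rats} rats")
```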

      (7.2) Can authors please present the performance of these animals during the tasks (OFF conditions, and for control ones, both ON & OFF conditions)?

      Rats were excluded after assessing the spread of viral infusions, placement of fibre-optic cannulas and potential damage due to the surgical procedures (page 21). The requested data are presented below and plotted in the same manner as in Figures 3-6. The pattern of performance in excluded animals was highly variable. 

      Author response image 1.

       

      (8) For tracing, only males were used, and for electrophysiology, only females were used.

      (8.1) Can authors please comment on not using both sexes in these experiments? 

      We agree that equal allocation of female and male rats in the experiments presented in Figures 1-2 would have been preferable. Animal availability was the sole factor determining these allocations. Importantly, both female and male D1-Cre and A2A-Cre rats were used for the NAc-S tracing studies, and no sex differences were observed in the projection patterns. The article describing the two transgenic lines of rats did not report any sex difference (Pettibone et al., 2019).

      (8.2) Is there evidence in the literature that the electrophysiological properties of female versus male SPNs could differ?

      The literature indicates that there is no sex difference in the electrophysiological properties of NAc-S SPNs (Cao et al., 2018; Willett et al., 2016).

      (8.3) It seems like there is a discrepancy between the number of animals used as presented in the Figure 2 legend versus what is described in the main text. In the Figure legend, I understand that 5 animals were used for D1-Cre/DIO-eNpHR3.0 validation, and 7 animals for A2a-Cre/DIO-eNpHR3.0; however, the main text indicates the use of a total of 8 animals instead of the 12 presented in the Figure legend. Can authors please address this mismatch or clarify?

      The number of rats reported in the main text and Figure 2 legend was correct. However, recordings sometimes involved multiple cells from the same animal, and this aspect of the data was incorrectly reported and generated confusion. We have clarified the numbers in both the main text and Figure 2 legend to distinguish between animal counts and cell counts. 

      (9) Overall, in the study, have the authors checked for outliers?

      Performance across all training and testing stages was inspected to identify potential behavioral outliers in each experiment. Abnormal performance during a single session within a multi-session stage was not considered sufficient grounds for outlier designation. Based on these criteria, no subjects remaining after post-mortem analyses exhibited performance patterns warranting exclusion through statistical outlier analysis. However, we have conducted the specific analyses requested by the Reviewer, as described below.

      (9.1) In Figure 3, it seems that one female in the eYFP group, in the OFF situation, for the different condition, has a higher level of responding than the others. Can authors please confirm or refute this visual observation with the appropriate statistical analysis?

      Statistical analysis (z-score) confirmed the reviewer's observation regarding responding of the different action in the OFF condition for this subject (|z| = 2.58). Similar extreme responding was observed in the ON condition (|z| = 2.03). Analyzing responding on the different action in isolation is not informative in the context of outcome-specific PIT. Additional analyses revealed |z| < 2 when examining the magnitude of choice discrimination in outcome-specific PIT (i.e., net same versus net different responding) in both ON and OFF conditions. Furthermore, this subject showed |z| < 2 across all other experimental stages. Based on these analyses, we conclude that the subject should be kept in all analyses.
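
      For readers unfamiliar with this kind of screening, a z-score check can be sketched as follows. This is a hedged illustration, not the analysis script used for the manuscript: the response rates are hypothetical, the use of the sample standard deviation is an assumption, and the |z| > 2 cut-off is inferred from the wording of the response rather than stated as a formal criterion.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise each value against the group mean and SD.

    Assumption: the sample standard deviation (n - 1) is used; the
    response does not say which estimator was applied.
    """
    group_mean = mean(values)
    group_sd = stdev(values)
    return [(v - group_mean) / group_sd for v in values]

# Hypothetical response rates (presses per minute) for one group.
rates = [2.0, 3.0, 2.5, 3.5, 3.0, 2.0, 9.0, 2.5]

# Flag subjects whose absolute z-score exceeds the assumed cut-off of 2.
flags = [abs(z) > 2 for z in z_scores(rates)]
print(flags)
```

      Note that a flag on one measure does not by itself justify exclusion; as the response argues, the same subject can fall within range on the measure that actually matters for the hypothesis.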

      (9.2) In Figure 5, it seems that one male, in the ON situation, in the different condition, has a quite higher level of responding - is this subject an outlier? If so, how does it affect the statistical analysis after being removed? And who is this subject in the OFF condition?

      The reviewer has identified two different male rats infused with the eNpHR3.0 virus and has asked for closer examination of their performance.

      The first rat showed outlier-level responding on the different action in the ON condition (|z| = 2.89) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.55 when examining choice discrimination magnitude in outcome-specific PIT during the ON condition but not during the OFF condition (|z| = 0.62). This subject exhibited |z| < 2 across all other experimental stages.

      The second rat showed outlier-level responding on the same action in the OFF condition (|z| = 2.02) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.12 when examining choice discrimination magnitude in outcome-specific PIT during the OFF condition but not during the ON condition (|z| = 0.67). This subject also exhibited |z| < 2 across all other experimental stages.

      We excluded these two subjects and conducted the same analyses as described in the original manuscript. Baseline responding did not differ between groups (p = 0.14), allowing us to examine the net effect of the stimuli. Overall lever presses were greater in the eYFP rats (Group: F(1,16) = 6.08, p < 0.05; η² = 0.28) and were reduced by LED activation (LED: F(1,16) = 9.52, p < 0.01; η² = 0.44) and this reduction depended on the group considered (Group x LED: F(1,16) = 12.125, p < 0.001; η² = 0.43). Lever press rates were higher on the action earning the same outcome as the stimuli compared to the action earning the different outcome (Lever: F(1,16) = 49.32; η² = 0.76; p < 0.001), regardless of group (Group x Lever: p = 0.14). There was a Lever by LED light condition interaction (Lever x LED: F(1,16) = 5.25; η² = 0.24; p < 0.05) but no interaction between group, LED light condition, and Lever during the presentation of the predictive stimuli (p = 0.10). Given the significant Group x LED and Lever x LED interactions, additional analyses were conducted to determine the source of these interactions. In eYFP rats, LED activation had no effect (LED: p = 0.70) and lever presses were greater on the same action (Lever: F(1,9) = 23.94, p < 0.001; η² = 0.79) regardless of LED condition (LED x Lever: p = 0.72). By contrast, in eNpHR3.0 rats, lever presses were reduced by LED activation (LED: F(1,9) = 23.97, p < 0.001; η² = 0.73), were greater on the same action (Lever: F(1,9) = 16.920, p < 0.001; η² = 0.65) and the two factors interacted (LED x Lever: F(1,9) = 9.12, p < 0.01; η² = 0.50). These rats demonstrated outcome-specific PIT in the OFF condition (F(1,9) = 27.26, p < 0.001; η² = 0.75) but not in the ON condition (p = 0.08).

      Overall, excluding these two rats altered the statistical analyses, but both the original and revised analyses yielded the same outcome: silencing the NAc-S D1-SPN to VP pathway disrupted PIT. More importantly, we do not believe there are sufficient grounds to exclude the two rats identified by the reviewer. These animals did not display outlier-level responding across training stages or during the choice test. Their potential classification as outliers would be based on responding during only one LED condition and not the other, with notably opposite patterns between the two rats despite belonging to the same experimental group.

      (10) I think it would be appreciable if in the cartoons from Figure 5.A and 6.A, the SPNs neurons were color-coded as in the results (test plots) and the supplementary figures (histological color-coding), such as D1- in blue & D2-SPNs in red.

      Our current color-coding system uses blue for D1-SPNs transduced with eNpHR3.0 and red for D2-SPNs transduced with eNpHR3.0. The D1-SPNs and D2-SPNs shown in Figures 5A and 6A represent cells transduced with either eYFP (control) or eNpHR3.0 virus and therefore cannot be assigned the blue or red color, which is reserved for eNpHR3.0-transduced cells specifically. The micrographs in the Supplemental Figures maintain consistency with the color-coding established in the main figures.

      (11) As there are (relatively small) variations in the control performance in terms of Net responding (from ~3 to ~7 responses per min), I wonder what would be the result of pooling eYFP groups from the two first experiments (Figures 3 & 4) and from the two last ones (Figures 5 & 6) - would the same statistical results stand or vary (as eYFP vs D1-Cre vs A2a-Cre rats)? In particular for Figures 3 & 4, with and without the potential outlier, if it's indeed an outlier.

      We considered the Reviewer’s recommendation but do not believe the requested analysis is appropriate. The Reviewer is requesting the pooling of data from subjects of distinct transgenic strains (D1-Cre and A2A-Cre rats) that underwent surgical and behavioral procedures at different time points, sometimes months apart. Each experiment was designed with necessary controls to enable adequate statistical analyses for testing our specific hypotheses.

      (12) Presence of cameras in operant cages is mentioned in methods, but no data is presented regarding recordings, though authors mention that they allow for real-time observations of behavior. I suggest removing "to record" or adding a statement about the fact that no videos were recorded or used in the present study.

      We have removed “to record” from the manuscript (page 18). 

      (13) In all supplementary Figures, "F" is wrongly indicated as "E".

      We thank the Reviewer for reporting these errors, which have been corrected. 

      (14) While the authors acknowledge that the efficacy of optogenetic inhibition of terminals is questionable, I think that more details are required to address this point in the discussion (existing literature?). Maybe, the combination of an anterograde tracer from SPNs to VP, to label VP neurons (to facilitate patching these neurons), and the Cre-dependent inhibitory opsin in the NAc Shell, with optogenetic illumination at the level of the VP, along with electrophysiological recordings of VP neurons, could help address this question but may, reasonably, seem challenging technically.

      Our manuscript does not state that optogenetic inhibition of terminals is questionable. It acknowledges that we do not provide any evidence about the efficacy of the approach. Regardless, we have provided additional details and suggestions to address this lack of evidence (page 13).

      (15) A nice addition could be an illustration of the proposed model (from line 374), but it may be unnecessary.

      We have carefully considered the reviewer's recommendation. The proposed model is detailed in three published articles, including one that is freely accessible, which we have cited when presenting the model in our manuscript (page 14). This reference should provide interested readers with easy access to a comprehensive illustration of the model.

      Reviewer 2 (Recommendations for the Author):

      As noted in my public comments, this is a truly excellent and compelling study. I have only a few minor comments.

      (1) I could not find the coordinates/parameters for the dorsal striatal AAV injections for that component of the tract tracing experiment.

      We apologize for this omission, which has now been corrected (page 16). 

      (2) Please add the final group sizes to the figure captions.

      We followed the Reviewer’s recommendation and added group sizes in the main figure captions. 

      (3) The discussion of group exclusions (p 21 line 637) seems to accidentally omit (n = X) the number of NAc-S D1-SPNs-VP mice excluded.

      We apologize for this omission, which has now been corrected (page 22). 

      (4) There were some labeling issues in the supplementary figures (perhaps elsewhere, too). Specifically, panel E was listed twice (once for F) in captions.

      We apologize for this error, which has now been corrected.  

      (5) Inspection of the magazine entry data from PIT tests suggests that the optogenetic manipulations may have had some eNects on this behavior and would encourage the authors to probe further. There was a significant group diNerence for D1-SPN inhibition and a marginal group eNect for D2-SPNs. The fact that these eNects were in opposite directions is intriguing, although not easily interpreted based on the canonical D1/D2 model. Of course, the eNects are not specific to the light-on trials, but this could be due to carryover into light-oN trials. An analysis of trial-order eNects seems crucial for interpreting these eNects. One might also consider normalizing for pre-test baseline performance. Response rates during Pavlovian conditioning seem to suggest that D2eNpHR mice showed slightly higher conditioned responding during training, which contrasts with their low entry rates at test. I don't see any of this as problematic -- but more should be done to interpret these findings.

      We thank the reviewer for raising this interesting point regarding magazine entry rates. Since these data are presented in the Supplemental Figures, we have added a section in the Supplemental Material file that elaborates on these findings. This section does not address trial order effects, as trial order was fully counterbalanced in our experiments and the relevant statistical analyses would lack adequate power. Baseline normalization was not conducted because the reviewer's suggestion was based on their assumption that eNpHR3.0 rats in the D2-SPNs experiment showed slightly higher magazine entries during Pavlovian training. However, this was not the case. In fact, like the eNpHR3.0 rats in the D1-SPNs experiment, they tended to display lower magazine entries during training. The added section therefore focuses on the potential role of response competition during outcome-specific PIT tests. Although we concluded that response competition cannot explain our findings, we believe it may complicate interpretation of magazine entry behavior. Thus, we recommend that future studies examine the role of NAc-S SPNs using purely Pavlovian tasks. It is worth noting that we have recently completed experiments (unpublished) examining NAc-S D1- and D2-SPN silencing during stimulus presentation in a Pavlovian task identical to the one used here. Silencing of either SPN population had no effect on magazine entry behavior.

      Reviewer 3 (Recommendations for the Author):

      Broad comments:

      Throughout the manuscript, the authors draw parallels between the effect established via pharmacological manipulations and those shown here with optogenetic manipulation. I understand using the pharmacological data to launch this investigation, but these two procedures address very different physiological questions. In the case of a pharmacological manipulation, the targets are receptors, wherever they are expressed, and in the case of D2 receptors, this means altering function in both pre-synaptically expressed autoreceptors and post-synaptically expressed D2 MSN receptors. In the case of an optogenetic approach, the target is a specific cell population with a high degree of temporal control. So I would just caution against comparing results from these types of studies too closely.

      Related to this point is the consideration of the physiological relevance of the manipulation. Under normal conditions, dopamine acts at D1-like receptors to increase the probability of cell firing via Ga signaling. In contrast, dopamine binding of D2-like receptors decreases the cell's firing probability (signaling via Gi/o). Thus, shunting D1MSN activation provides a clear impression of the role of these cells and, putatively, the role of dopamine acting on these cells. However, inhibiting D2-MSNs more closely mimics these cells' response to dopamine (though optogenetic manipulations are likely far more impactful than Gi signaling). All this is to say that when we consider the results presented here in Experiment 2, it might suggest that during PIT testing, normal performance may require a halting of DA release onto D2-MSNs. This is highly speculative, of course, just a thought worth considering.

      We agree with the comments made by the Reviewer, and the original manuscript included statements acknowledging that pharmacological approaches are limited in their capacity to inform about the function of NAc-S SPNs (pages 4 and 9). As noted by the Reviewer, these limitations are especially salient when considering NAc-S D2-SPNs. Based on the Reviewer’s comment, we have modified our discussion to further underscore these limitations (page 12). Finally, we agree with the suggestion that PIT may require a halting of DA release onto D2-SPNs. This is consistent with the model presented, whereby D2-SPN function is required to trigger enkephalin release (page 13).

      Section-Specific Comments and Questions:

      Results:

      Anterograde tracing and ex vivo cell recordings in D1 Cre and A2a Cre rats: Why are there no statistics reported for the e-phys data in this section? Was this merely a qualitative demonstration? I realize that the A2a-Cre condition only shows 3 recordings, so I appreciate the limitations in analyzing the data presented.

      The reviewer is correct that we initially intended to provide a qualitative demonstration. However, we have now included statistical analyses for the ex vivo recordings. It is important to note that there were at least 5 recordings per condition, though overlapping data points may give the impression of fewer recordings in certain conditions. We have provided the exact number of recordings in both the main text (page 5) and figure legend. 

      What does the trial-by-trial analysis look like? In addition to the effects of extinction, do you know if the responsiveness of the opsin to light stimulation is altered after repeated exposures, or whether the cells themselves become compromised in any way with repeated light-inhibition, particularly given the relatively long 2-min duration of the trial?

      The Reviewer raises an interesting point, and we provide complete trial-by-trial data for each experiment below. As identified by the Reviewer, there is some evidence for extinction, although it remained modest. Importantly, the data suggest that light stimulation did not affect the physiology of the targeted cells. In eNpHR3.0 rats, performance across OFF trials remained stable (both for Same and Different) even though they were preceded by ON trials, indicating no carryover effects from optical stimulation.

      Author response image 2.

       

      The statistics for the choice test are not reported for eNpHR-D1-Cre rats, but do show a weakening of the instrumental devaluation effect "Group x Lever x LED: F1,18 = 10.04, p < 0.01, η<sup>2</sup> = 0.36". The post hoc comparisons showed that all groups showed devaluation, but it is evident that there is a weakening of this effect when the LED was on (η<sup>2</sup> = 0.41) vs off (η<sup>2</sup> = 0.78), so I think the authors should soften the claim that NAcS-D1s are not involved in value-based decision-making. (Also, there is a typo in the legend in Figure S1, where the caption for panel "F" is listed as "E".) I also think that this could be potentially interesting in light of the fact that with circuit manipulation, this same weakening of the instrumental devaluation effect was not observed. To me, this suggests that D1-NAcS that project to a different region (not VP) contribute to value-based decision making.

      This comment overlaps with one made in the Public Review, for which we have already provided a response. Given its importance, we have added a section addressing this point in the supplemental discussion of the Supplementary Material file, which aligns with the location of the relevant data. The caption labelling error has been corrected.

      Materials and Methods:

      Subjects:

      Were these heterozygous or homozygous rats? If hetero, what rats were used for crossbreeding (sex, strain, and vendor)? Was genotyping done by the lab or outsourced to commercial services? If genotyping was done within the lab, please provide a brief description of the protocol used. How was food restriction established and maintained (i.e., how many days to bring weights down, and was maintenance achieved by rationing or by limiting ad lib access to food for some period in the day)?

      The information requested by the Reviewer has been added to the subjects section (pages 15-16).

      Were rats pair/group housed after implantation of optic fibers?

      We have clarified that rats were group housed throughout (see subjects section; pages 15-16).

      Behavioral Procedures:

      How long did each 0.2ml sucrose infusion take? For pellets, for each US delivery, was it a single pellet or two in quick succession?

      We have modified the method section to indicate that the sucrose was delivered across 2 seconds and that a single pellet was provided (page 17). 

      The CS to ITI duration ratio is quite low. Is there a reason such a short ratio was used in training?

      These parameters are those used in all our previous experiments on outcome-specific PIT. There is no specific reason for using such a ratio, except that it shortens the length of the training session. 

      Relative to the end of training, when were the optical implantation surgeries conducted, and how much recovery time was given before initiating reminder training and testing?

      Fibre-optic implantation was conducted 3-4 days after training and another 3-4 days were given for recovery. This has been clarified in the Materials and methods section (pages 15-16).

      I think a diagram or schematic showing the timeline for surgeries, training, and testing would be helpful to the audience.

      We opted for a text-based experimental timeline rather than a diagram due to slight temporal variations across experiments (page 15).

      On trials, when the LED was on, was light delivered continuously or pulsed? Do these opto-receptors 'bleach' within such a long window?

      We apologize for the lack of clarity; the light was delivered continuously. We have modified the manuscript (pages 6 and 19) and figure legend accordingly. The postmortem analysis did not provide evidence for photobleaching (Supplemental Figures) and as noted above, the behavioural results do not indicate any negative physiological impact on cell function.  

      Immunofluorescence: The blocking solution used during IHC is described as "NHS"; is this normal horse serum?

      The Reviewer is correct; NHS stands for normal horse serum. This has been added (page 21). 

      Microscopy and imaging:

      For the description of rats excluded due to placement or viral spread problems, an n=X is listed for the NAc S D1 SPNs --> VP silencing group. Is this a typo, or was that meant to read as n=0? Also, was there a major sex difference in the attrition rate? If so, I think reporting the sex of the lost subjects might be beneficial to the scientific community, as it might reflect a need for better guidance on sex-specific coordinates for targeting small nuclei.

      We apologize for the error regarding the number of excluded animals. This error has been corrected (page 23). There were no major sex differences in the attrition rate. The manuscript has been updated to provide information about the sex of excluded animals (page 23).

      References

      Cao, J., Willett, J. A., Dorris, D. M., & Meitzen, J. (2018). Sex Differences in Medium Spiny Neuron Excitability and Glutamatergic Synaptic Input: Heterogeneity Across Striatal Regions and Evidence for Estradiol-Dependent Sexual Differentiation. Front Endocrinol (Lausanne), 9, 173. https://doi.org/10.3389/fendo.2018.00173

      Corbit, L. H., Muir, J. L., & Balleine, B. W. (2001). The role of the nucleus accumbens in instrumental conditioning: Evidence of a functional dissociation between accumbens core and shell. J Neurosci, 21(9), 3251-3260. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=11312310&retmode=ref&cmd=prlinks

      Corbit, L. H., & Balleine, B. W. (2011). The general and outcome-specific forms of Pavlovian-instrumental transfer are differentially mediated by the nucleus accumbens core and shell. J Neurosci, 31(33), 11786-11794. https://doi.org/10.1523/JNEUROSCI.2711-11.2011

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Laurent, V., Leung, B., Maidment, N., & Balleine, B. W. (2012). μ- and δ-opioid-related processes in the accumbens core and shell differentially mediate the influence of reward-guided and stimulus-guided decisions on choice. J Neurosci, 32(5), 1875-1883. https://doi.org/10.1523/JNEUROSCI.4688-11.2012

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & Bertran-Gonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

      Parkes, S. L., Bradfield, L. A., & Balleine, B. W. (2015). Interaction of insular cortex and ventral striatum mediates the effect of incentive memory on choice between goal-directed actions. J Neurosci, 35(16), 6464-6471. https://doi.org/10.1523/JNEUROSCI.4153-14.2015

      Pettibone, J. R., Yu, J. Y., Derman, R. C., Faust, T. W., Hughes, E. D., Filipiak, W. E., Saunders, T. L., Ferrario, C. R., & Berke, J. D. (2019). Knock-In Rat Lines with Cre Recombinase at the Dopamine D1 and Adenosine 2a Receptor Loci. eNeuro, 6(5). https://doi.org/10.1523/ENEURO.0163-19.2019

      Willett, J. A., Will, T., Hauser, C. A., Dorris, D. M., Cao, J., & Meitzen, J. (2016). No Evidence for Sex Differences in the Electrophysiological Properties and Excitatory Synaptic Input onto Nucleus Accumbens Shell Medium Spiny Neurons. eNeuro, 3(1), ENEURO.0147-15.2016. https://doi.org/10.1523/ENEURO.0147-15.2016

    1. Author response:

      Reviewer #1 (Public review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPN projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPN projections, mediate sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPN engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPN plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (see discussion), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer #3 (Public review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). Notably, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). The marginal impairment may therefore reflect inadvertent silencing of a small NAc-C D1-SPN population rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

    1. Reviewer #2 (Public review):

      Summary:

      This work by den Bakker and Kloosterman contributes to the vast body of research exploring the dynamics governing the communication between the hippocampus (HPC) and the medial prefrontal cortex (mPFC) during spatial learning and navigation. Previous research showed that population activity of mPFC neurons is replayed during HPC sharp-wave ripple events (SWRs), which may therefore correspond to privileged windows for the transfer of learned navigation information from the HPC, where initial learning occurs, to the mPFC, which is thought to store this information long term. Indeed, it was also previously shown that the activity of mPFC neurons contains task-related information that can inform about the location of an animal in a maze, which can predict the animals' navigational choices. Here, the authors aim to show that the mPFC neurons that are modulated by HPC activity (SWRs and theta rhythms) are distinct from those "encoding" spatial information. This result could suggest that the integration of spatial information originating from the HPC within the mPFC may require the cooperation of separate sets of neurons.

      This observation may be useful to further extend our understanding of the dynamics regulating the exchange of information between the HPC and mPFC during learning. However, my understanding is that this finding is mainly based upon a negative result, which cannot be statistically proven by the failure to reject the null hypothesis. Moreover, in my reading, the rest of the paper mainly replicates phenomena that have already been described, with the original reports not correctly cited. My opinion is that the novel elements should be precisely identified and discussed, while the current phrasing in the manuscript, in most cases, leads readers to think that these results are new. Detailed comments are provided below.

      Major concerns:

      ORIGINAL COMMENT: (1) The main claim of the manuscript is that the neurons involved in predicting upcoming choices are not the neurons modulated by the HPC. This is based upon the evidence provided in Figure 5, which is a negative result that the authors employ to claim that predictive non-local representations in the mPFC are not linked to hippocampal SWRs and theta phase. However, it is important to remember that in a statistical test, the failure to reject the null hypothesis does not prove that the null hypothesis is true. Since this claim is so central in this work, the authors should use appropriate statistics to demonstrate that the null hypothesis is true. This can be accomplished by showing that there is no effect above some size that is so small that it would make the effect meaningless (see https://doi.org/10.1177/070674370304801108).
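The equivalence logic requested in this comment can be illustrated with a minimal sketch. It assumes two hypothetical samples of a per-session measure and an equivalence margin chosen in advance; the function names, the percentile bootstrap, and the margin are illustrative assumptions, not the analysis used in the manuscript. The idea is that "no meaningful effect" is supported only when a 90% confidence interval for the difference lies entirely inside the margin, which corresponds to two one-sided tests at the 5% level.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_boot=5000, alpha=0.10, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_equivalent(a, b, margin, **kwargs):
    """True only if the whole interval lies strictly inside (-margin, +margin)."""
    lower, upper = bootstrap_mean_diff_ci(a, b, **kwargs)
    return -margin < lower and upper < margin
```

A non-significant difference test would not license this conclusion; only the interval falling inside the pre-specified margin does, and the margin itself has to be justified on scientific grounds.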

      AUTHOR RESPONSE: We would like to highlight a few important points here. (1) We indeed do not intend to claim that the SWR-modulated neurons are not at all involved in predicting upcoming choice, just that the SWR-unmodulated neurons may play a larger role. We have rephrased the title and abstract to make this clearer.

      REVIEWER COMMENT: The title has been rephrased but still conveys the same substantive claim. The abstract sentence also does not clearly state what was found. Using "independently" in the new title continues to imply that SWR modulation and prediction of upcoming choices are separate phenomena. By contrast, in your response here in the rebuttal you state only that "SWR-unmodulated neurons may play a larger role," which is a much more tempered claim than what the manuscript currently argues. Why is this clarification not adopted in the article? Moreover, the main text continues to use the same arguments as before; beyond the cosmetic changes of title and abstract, the claim itself has not materially changed.

      AUTHOR RESPONSE: (2) The hypothesis that we put forward is based not only on a negative effect, but on the findings that: the SWR-unmodulated neurons show higher spatial tuning (Fig 3b), more directional selectivity (Fig 3d), more frequent encoding of the upcoming choice at the choice point (new analysis, added in Fig 4d), and higher spike rates during the representations of the upcoming choice (Fig 5b). This is further highlighted by the fact that the representations of upcoming choice in the PFC are not time locked to SWRs (whereas the hippocampal representations of upcoming choice are; see Fig 5a and Fig 6a), and not time-locked to hippocampal theta phase (whereas the hippocampal representations are; see Fig 5c and Fig 6c). Finally, the representations of upcoming and alternative choices in the PFC do not show a large overlap in time with the representations in the hippocampus (see updated Fig 4e where we added a statistical test to show the likelihood of the overlap of decoded timepoints). All these results together lead us to hypothesize that SWR-modulation is not the driving factor behind non-local decoding in the PFC.

      REVIEWER COMMENT: I do not see how these precisions address my remark. The main claim in the title used to be "Neurons in the medial prefrontal cortex that are not modulated by hippocampal sharp-wave ripples are involved in spatial tuning and signaling upcoming choice." It is now "Neurons in the medial prefrontal cortex are involved in spatial tuning and signaling upcoming choice independently from hippocampal sharp-wave ripples." The substance has not changed. This specific claim is supported solely by Figure 5.

      The other analyses cited describe functional characteristics of SWR-unmodulated neurons but, unless linked by explicit new analyses, do not substantiate independence/orthogonality between SWR modulation and non-local decoding in PFC. If there is an analysis that makes this link explicit, it should be clearly presented; as it stands, I cannot find an explanation in the manuscript for why "all these results together" justify the conclusion that "All these results together lead us to hypothesize that SWR-modulation is not the driving factor behind non-local decoding in the PFC". Also: is the main result of this work a "hypothesis"? If so, this should be clearly differentiated from a conclusion supported by results and analyses.

      AUTHOR RESPONSE: (3) Based on the reviewer's suggestion, we have added a statistical test to compare the locking of the non-local decoding to hippocampal SWRs and theta phase against shuffled posterior probabilities. Instead of looking at all SWRs in a -2 to 2 second window, we have now only selected the closest SWR in time within that window, and did the statistical comparison in the bin of 0-20 ms from SWR onset. With this new analysis we are looking more directly at the time-locking of the decoded segments to SWR onset (see updated Fig 5a and 6a).
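The closest-SWR procedure described in this response can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: event and SWR onset times are plain lists of seconds, all function names are hypothetical, and the null distribution is built by jittering event times uniformly, which stands in for the shuffle of posterior probabilities mentioned in the response.

```python
import bisect
import random


def closest_lags(event_times, swr_onsets, window=2.0):
    """Lag (event minus nearest SWR onset) for events with an SWR within +/- window s."""
    onsets = sorted(swr_onsets)
    lags = []
    for t in event_times:
        i = bisect.bisect_left(onsets, t)
        # Only the onsets just before and just after t can be the nearest one.
        candidates = onsets[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda s: abs(t - s))
        if abs(t - nearest) <= window:
            lags.append(t - nearest)
    return lags


def fraction_in_bin(lags, lo=0.0, hi=0.020):
    """Fraction of lags falling in the [lo, hi) bin after SWR onset."""
    if not lags:
        return 0.0
    return sum(lo <= lag < hi for lag in lags) / len(lags)


def shuffle_p_value(event_times, swr_onsets, n_shuffles=1000, jitter=1.0, seed=0):
    """One-sided p-value: how often jittered events match or exceed the observed fraction."""
    rng = random.Random(seed)
    observed = fraction_in_bin(closest_lags(event_times, swr_onsets))
    at_least = 0
    for _ in range(n_shuffles):
        jittered = [t + rng.uniform(-jitter, jitter) for t in event_times]
        if fraction_in_bin(closest_lags(jittered, swr_onsets)) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_shuffles + 1)
```

Restricting the comparison to a single 20 ms bin makes the statistic sensitive to the bin choice, so a sketch like this is most informative when repeated over several bin widths. A large p-value here would indicate only that no locking was detected under this particular null.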

      REVIEWER COMMENT: I appreciate the added analysis focusing on the closest SWR and a 0-20 ms bin. My understanding is that you consider the revised analyses in Figures 5a and 6a sufficient to show that predictive non-local representations in mPFC are not linked to hippocampal SWRs and theta phase.

      First, the manuscript should explicitly explain the rationale for this analysis and why it is sufficient to support the claim. From the main text it is not possible to understand what was done; the Methods are hard to follow, and the figure legends are not clearly described (e.g. the shuffle is not even defined there).

      Specific points I could not reconcile:

      i) The gray histograms in the revised Figures 5a and 6a now show a peak at zero lag, whereas in the previous version they were flat, although they are said to plot the same data. What changed?

      ii) Why choose a 20 ms bin? A single narrow bin invites false negatives. Please justify this choice.

      iii) Comparing to a shuffle is a useful control, but when the p-value is non-significant we only learn that no difference was detected under that shuffle, not that there is no difference or that the processes are independent.

      ORIGINAL COMMENT: (2) The main claim of the work is also based on Figure 3, where the authors show that SWRs-unmodulated mPFC neurons have higher spatial tuning, and higher directional selectivity scores, and a higher percentage of these neurons show theta skipping. This is used to support the claim that SWRs-unmodulated cells encode spatial information. However, it must be noted that in this kind of task, it is not possible to disentangle space and specific task variables involving separate cognitive processes from processing spatial information such as decision-making, attention, motor control, etc., which always happen at specific locations of the maze. Therefore, the results shown in Figure 3 may relate to other specific processes rather than encoding of space and it cannot be unequivocally claimed that mPFC neurons "encode spatial information". This limitation is presented by Mashoori et al (2018), an article that appears to be a major inspiration for this work. Can the authors provide a control analysis/experiment that supports their claim? Otherwise, this claim should be tempered. Also, the authors say that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space. How do they reconcile it with their results?

      AUTHOR RESPONSE: The reviewer is right to assert caution when talking about claims such as spatial tuning where other factors may also be involved. Although we agree that there may be some other factors influencing what we are seeing as spatial tuning, it is very important to note that the behavioral task is executed on a symmetrical 4-armed maze, where two of the arms are always used for the start of the trajectory, and the other two arms (North and South) function as the goal (reward) arms. Therefore, if the PFC is encoding cognitive processes such as task phases related to decision-making and reward, we would not be able to differentiate between the two start arms and the two goal arms, as these represent the same task phases. Note also that the North and South arm are illuminated in a pseudo-random order between trials and during cue-based rule learning this is a direct indication of where the reward will be found. Even in this phase of the task, the PFC encodes where the animal will turn on a trial-to-trial basis (meaning the North and South arm are still differentiated correctly on each trial even though the illumination and associated reward are changing).

      REVIEWER COMMENT: I appreciate that the departure location was pseudorandomized. However, this control does not rule out that PFC activity reflects motor preparation (left vs right turns) and associated perceptual decision-making/attentional processes that are inherently tied to a specific action. As such, it cannot by itself support the claim that PFC neurons "encode spatial information." Moreover, the authors acknowledge here that "other factors may also be involved," yet this caveat is not reflected in the manuscript. Why?

      AUTHOR RESPONSE: Secondly, importantly, the reviewer mentions that we claimed that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space, but this is incorrect. Jadhav et al. (2016) showed that SWR-unmodulated neurons had lower spatial coverage, meaning that they are more spatially selective (congruent with our results). We have rephrased this in the text to be clearer.

      REVIEWER COMMENT: Thanks for clarifying this.

      ORIGINAL COMMENT: (3) My reading is that the rest of the paper mainly consists of replications or incremental observations of already known phenomena with some not necessarily surprising new observations: a) Figure 2 shows that a subset of mPFC neurons is modulated by HPC SWRs and theta (already known), that vmPFC neurons are more strongly modulated by SWRs (not surprising given anatomy), and that theta phase preference is different between vmPFC and dmPFC (not surprising given the fact that theta is a travelling wave).

      AUTHOR RESPONSE: The finding that vmPFC neurons are more strongly modulated by SWRs than dmPFC indeed matches what we know from anatomy, but that does not make it a trivial finding. A lot remains unknown about the mPFC subregions and their interactions with the hippocampus, and not every finding will be directly linked to the anatomy. Therefore, in our view this is a significant finding which has not been studied before due to the technical complexity of large-scale recordings along the dorsal-ventral axis of the mPFC.

      REVIEWER COMMENT: This finding is indeed non-trivial; however, it seems completely irrelevant to the paper's main claim unless the Authors can argue otherwise.

      AUTHOR RESPONSE: Similarly, theta being a traveling wave (which in itself is still under debate), does not mean we should assume that the dorsal and ventral mPFC should follow this signature and be modulated by different phases of the theta cycle. Again, in our view this is not at all trivial, but an important finding which brings us closer to understanding the intricate interactions between the hippocampus and PFC in spatial learning and decision-making.

      REVIEWER COMMENT: Yes, but in what way does this support the manuscript's primary claim? This is unclear to me.

      ORIGINAL COMMENT: b) Figure 4 shows that non-local representations in mPFC are predictive of the animal's choice. This is mostly an increment to the work of Mashoori et al (2018). My understanding is that in addition to what had already been shown by Mashoori et al., here it is shown how the upcoming choice can be predicted. The authors may want to emphasize this novel aspect.

      AUTHOR RESPONSE: In our view our manuscript focuses on a completely different aspect of learning and memory than the paper the reviewer is referring to (Mashoori et al. 2018). Importantly, the Mashoori et al. paper looked at choice evaluation at reward sites and shows that disappointing reinforcements are associated with reactivations in the ACC of the unselected target. This points to the role of the ACC in error detection and evaluation. Although this is an interesting result, it is in essence unrelated to what we are focusing on here, which is decision making and prediction of upcoming choices. The fact that the turning direction of the animal can be predicted on a trial-to-trial basis, and even precedes the behavioral change over the course of learning, sheds light on the role of the PFC in these important predictive cognitive processes (as opposed to post-choice reflective processes).

      REVIEWER COMMENT: Indeed, as I said, the new element here is that the upcoming choice can be predicted. This appears only incremental and could belong to another story; as the manuscript is currently written, it does not support the article's main claim. I would like to specify that, regarding this and the other points above, my inability to see how these minor results support the Authors' claim may reflect my misunderstanding; nevertheless, this suggests that the manuscript should be extensively rewritten and reorganized to make the Authors' meaning clear.

      ORIGINAL COMMENT: c) Figure 6 shows that prospective activity in the HPC is linked to SWRs and theta oscillations. This has been described in various forms since at least the works of Johnson and Redish in 2007, Pastalkova et al 2008, and Dragoi and Tonegawa (2011 and 2013), as well as in earlier literature on splitter cells. These foundational papers on this topic are not even cited in the current manuscript.

      AUTHOR RESPONSE: We have added these citations to the introduction (line 37).

      REVIEWER COMMENT: This is an example of how the Authors fail to acknowledge the underlying problem with how the manuscript is written; the issue has not been addressed except with a cosmetic change like the one described above. The Results section contains a series of findings that are well-known phenomena described previously (see below). Prior results should be acknowledged at the beginning of each relevant paragraph, followed by an explicit statement of what is new, so that readers can distinguish replication from novelty. Here, I pointed specifically to the results of Figure 6, and the Authors deemed it sufficient simply to add the citations I indicated to an existing sentence in the Introduction, while keeping the Results description unchanged. As written, this reads as if these phenomena are being described for the first time. This is incorrect. It is hard to avoid the impression that the Authors did not take this concern seriously; the same issue appears elsewhere in the manuscript, and I fail to see how the Authors "have improved clarity of the text throughout to highlight the novelty of our results better."

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors used high-density probe recordings in the medial prefrontal cortex (PFC) and hippocampus during a rodent spatial memory task to examine functional sub-populations of PFC neurons that are modulated vs. unmodulated by hippocampal sharp-wave ripples (SWRs), an important physiological biomarker that is thought to have a role in mediating information transfer across hippocampal-cortical networks for memory processes. SWRs are associated with the reactivation of representations of previous experiences, and associated reactivation in hippocampal and cortical regions has been proposed to have a role in memory formation, retrieval, planning, and memory-guided behavior. This study focuses on awake SWRs that are prevalent during immobility periods during pauses in behavior. Previous studies have reported strong modulation of a subset of prefrontal neurons during hippocampal SWRs, with some studies reporting prefrontal reactivation during SWRs that have a role in spatial memory processes. The study seeks to extend these findings by examining the activity of SWR-modulated vs. unmodulated neurons across PFC sub-regions, and whether there is a functional distinction between these two kinds of neuronal populations with respect to representing spatial information and supporting memory-guided decision-making.

      Strengths:

      The major strength of the study is the use of Neuropixels 1.0 probes to monitor activity throughout the dorsal-ventral extent of the rodent medial prefrontal cortex, permitting an investigation of functional distinction in neuronal populations across PFC sub-regions. They are able to show that SWR-unmodulated neurons, in addition to having stronger spatial tuning than SWR-modulated neurons as previously reported, also show stronger directional selectivity and theta-cycle skipping properties.

      Weaknesses:

      (1) While the study is able to extend previous findings that SWR-modulated PFC neurons have significantly lower spatial tuning than SWR-unmodulated neurons, the evidence presented does not support the main conclusion of the paper that only the unmodulated neurons are involved in spatial tuning and signaling upcoming choice, implying that SWR-modulated neurons are not involved in predicting upcoming choice, as stated in the abstract. This conclusion makes a categorical distinction between two neuronal populations, that SWR-unmodulated neurons are involved and SWR-modulated neurons are not involved in predicting upcoming choice, which requires evidence that clearly shows this absolute distinction. However, in the analyses showing non-local population decoding in PFC for predicting upcoming choice, the results show that SWR-unmodulated neurons have higher firing rates than SWR-modulated neurons, which is not a categorical distinction. Higher firing rates do not imply that only SWR-unmodulated neurons are contributing to the non-local decoding. They may contribute more than SWR-modulated neurons, but there are no follow-up analyses to assess the contribution of the two sub-populations to non-local decoding.

      We agree with the reviewer that this is indeed not a categorical distinction, and do not wish to claim that the SWR-modulated neurons have absolutely no role in non-local decoding and signaling upcoming choice. We have adjusted this in the title, abstract and text to clarify this for the reader. Furthermore, we have performed additional analyses to elucidate the role of SWR-modulated neurons in non-local decoding by creating separate decoding models for SWR-modulated and unmodulated PFC neurons respectively. These analyses show that the SWR-unmodulated neurons are indeed encoding representations of the upcoming choice more often than the alternative choice, whereas the SWR-modulated neurons do not reliably differentiate the upcoming and alternative choices in non-local decoding at the choice point (see new Fig 4d).

      (2) Further, the results show that during non-local representations of the hippocampus of the upcoming options, SWR-excited PFC neurons were more active during hippocampal representations of the upcoming choice, and SWR-inhibited PFC neurons were less active during hippocampal representations of the alternative choice. This clearly suggests that SWR-modulated neurons are involved in signaling upcoming choice, at least during hippocampal non-local representations, which contradicts the main conclusion of the paper.

      This does not contradict the main conclusion of the paper, but in fact strengthens the hypothesis we are putting forward: that the SWR-modulated neurons are more linked to the hippocampal non-local representations, whereas the SWR-unmodulated neurons seem to have their own encoding of upcoming choice which is not linked to the signatures in the hippocampus (almost no time overlap with hippocampal representations, no phase locking to hippocampal theta, no time locking to hippocampal SWRs, no increased firing during hippocampal representations of upcoming choice).

      (3) Similarly, one of the analyses shows that PFC nonlocal representations show no preference for hippocampal SWRs or hippocampal theta phase. However, the examples shown for non-local representations clearly show that these decodes occur prior to the start of the trajectory, or during running on the central zone or start arm. The time period of immobility prior to the start arm running will have a higher prevalence of SWRs and that during running will have a higher prevalence of theta oscillations and theta sequences, so non-local decoded representations have to be sub-divided according to these known local-field potential phenomena for this analysis, which is not followed.

      These analyses are in fact separated based on proximity to SWRs (only segments that occurred within 2 seconds of SWR onset were included, see Methods) and theta periods respectively (selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus, see Methods). We have clarified this in the main text.
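
      The two selection criteria named in this response can be illustrated with a minimal sketch. It is not the authors' code: the function names and the "other" label are hypothetical, and it assumes that "absence of SWRs" means no SWR onset within the same 2-second window. Only the 2 s and 5 cm/s thresholds come from the response.

```python
from bisect import bisect_left

SWR_WINDOW_S = 2.0    # from the response: segments within 2 s of SWR onset
MIN_SPEED_CM_S = 5.0  # from the response: running speed above 5 cm/s


def nearest_swr_lag(t, swr_onsets):
    """Return t minus the closest SWR onset in seconds, or None if no SWRs exist.

    `swr_onsets` must be sorted in ascending order (hypothetical input format).
    """
    if not swr_onsets:
        return None
    i = bisect_left(swr_onsets, t)
    # Only the onsets just before and just after t can be the closest one.
    candidates = swr_onsets[max(i - 1, 0): i + 1]
    closest = min(candidates, key=lambda s: abs(t - s))
    return t - closest


def classify_segment(t, speed_cm_s, swr_onsets):
    """Label a decoded segment as 'swr', 'theta', or 'other'.

    Assumption: a segment close to an SWR is assigned to the SWR analysis,
    and only SWR-free running segments are assigned to the theta analysis.
    """
    lag = nearest_swr_lag(t, swr_onsets)
    if lag is not None and abs(lag) <= SWR_WINDOW_S:
        return "swr"
    if speed_cm_s > MIN_SPEED_CM_S:
        return "theta"
    return "other"
```

      Under these assumptions every decoded segment enters at most one of the two analyses, which is the separation the reviewer asked for.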

      (4) The primary phenomenon that the manuscript relies on is the modulation of PFC neurons by hippocampal SWRs, so it is necessary to perform the PFC population decoding analyses during SWRs (or examine non-local decoding that occurs specifically during SWRs), as reported in previous studies of PFC reactivation during SWRs, to see if there is any distinction between modulated and unmodulated neurons in this reactivation. Even in the case of independent PFC reactivation as reported by one study, this PFC reactivation was still reported to occur during hippocampal SWRs, therefore decoding during SWRs has to be examined. Similarly, the phenomenon of theta cycle skipping is related to theta sequence representations, so decoding during PFC and hippocampal theta sequences has to be examined before coming to any conclusions.

      The histograms shown in Figure 5a (see updated Fig 5a where we look at the closest SWR in time and compare the occurrence with shuffled data) show that there is no increased prevalence of decoding upcoming and alternative choices in the PFC during hippocampal SWRs. The lack of overlap of non-local decoding between the hippocampus and PFC further shows that these non-local representations occur at different timepoints in the PFC and hippocampus (see updated Fig 4e where we added a statistical test to show the likelihood of the overlap between the decoded segments in the PFC and hippocampus). Based on the reviewer's suggestion, we have additionally decoded the information in the PFC during hippocampal SWRs exclusively, and found that the direction on the maze could not be predicted based on the decoding of SWR time points in the PFC. See figure below. Similarly, we can see from the histograms in Figure 5c that there is no phase locking to the hippocampal theta phase for non-local representations in the PFC, and in contrast there is phase locking of the hippocampal encoding of upcoming choice to the rising phase of the theta cycle (Fig 6c), further highlighting the separation between these two regions in the non-local decoding.

      Reviewer #2 (Public review):

      Summary:

      This work by den Bakker and Kloosterman contributes to the vast body of research exploring the dynamics governing the communication between the hippocampus (HPC) and the medial prefrontal cortex (mPFC) during spatial learning and navigation. Previous research showed that population activity of mPFC neurons is replayed during HPC sharp-wave ripple events (SWRs), which may therefore correspond to privileged windows for the transfer of learned navigation information from the HPC, where initial learning occurs, to the mPFC, which is thought to store this information long term. Indeed, it was also previously shown that the activity of mPFC neurons contains task-related information that can inform about the location of an animal in a maze, which can predict the animals' navigational choices. Here, the authors aim to show that the mPFC neurons that are modulated by HPC activity (SWRs and theta rhythms) are distinct from those "encoding" spatial information. This result could suggest that the integration of spatial information originating from the HPC within the mPFC may require the cooperation of separate sets of neurons.

      This observation may be useful to further extend our understanding of the dynamics regulating the exchange of information between the HPC and mPFC during learning. However, my understanding is that this finding is mainly based upon a negative result, which cannot be statistically proven by the failure to reject the null hypothesis. Moreover, in my reading, the rest of the paper mainly replicates phenomena that have already been described, with the original reports not correctly cited. My opinion is that the novel elements should be precisely identified and discussed, while the current phrasing in the manuscript, in most cases, leads readers to think that these results are new. Detailed comments are provided below.

      Major concerns:

      (1) The main claim of the manuscript is that the neurons involved in predicting upcoming choices are not the neurons modulated by the HPC. This is based upon the evidence provided in Figure 5, which is a negative result that the authors employ to claim that predictive non-local representations in the mPFC are not linked to hippocampal SWRs and theta phase. However, it is important to remember that in a statistical test, the failure to reject the null hypothesis does not prove that the null hypothesis is true. Since this claim is so central in this work, the authors should use appropriate statistics to demonstrate that the null hypothesis is true. This can be accomplished by showing that there is no effect above some size that is so small that it would make the effect meaningless (see https://doi.org/10.1177/070674370304801108).
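
      The idea of showing that "there is no effect above some size" can be sketched as an equivalence check. The example below is only an illustration: it swaps in a percentile bootstrap interval rather than the procedure of the cited reference, and the function names and the `margin` value are hypothetical. The margin would have to be justified on scientific grounds. A 90% interval is used because two one-sided tests at alpha = 0.05 correspond to a (1 - 2*alpha) confidence interval.

```python
import random
from statistics import mean


def bootstrap_ci_of_mean_difference(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - 2*alpha) interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lower = diffs[int(alpha * n_boot)]
    upper = diffs[int((1 - alpha) * n_boot) - 1]
    return lower, upper


def is_equivalent(a, b, margin, **kwargs):
    """True if the whole interval lies strictly inside (-margin, +margin)."""
    lower, upper = bootstrap_ci_of_mean_difference(a, b, **kwargs)
    return -margin < lower and upper < margin
```

      A non-significant difference test alone would not pass this check. The interval has to be narrow enough to exclude every effect larger than the chosen margin.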

      We would like to highlight a few important points here. (1) We indeed do not intend to claim that the SWR-modulated neurons are not at all involved in predicting upcoming choice, just that the SWR-unmodulated neurons may play a larger role. We have rephrased the title and abstract to make this clearer. (2) The hypothesis that we put forward is based not only on a negative effect, but on the findings that: the SWR-unmodulated neurons show higher spatial tuning (Fig 3b), more directional selectivity (Fig 3d), more frequent encoding of the upcoming choice at the choice point (new analysis, added in Fig 4d), and higher spike rates during the representations of the upcoming choice (Fig 5b). This is further highlighted by the fact that the representations of upcoming choice in the PFC are not time-locked to SWRs (whereas the hippocampal representations of upcoming choice are; see Fig 5a and Fig 6a), and not time-locked to hippocampal theta phase (whereas the hippocampal representations are; see Fig 5c and Fig 6c). Finally, the representations of upcoming and alternative choices in the PFC do not show a large overlap in time with the representations in the hippocampus (see updated Fig 4e where we added a statistical test to show the likelihood of the overlap of decoded timepoints). All these results together lead us to hypothesize that SWR-modulation is not the driving factor behind non-local decoding in the PFC. (3) Based on the reviewer's suggestion, we have added a statistical test that compares the locking of the non-local decoding to hippocampal SWRs and to theta phase against shuffled posterior probabilities. Instead of looking at all SWRs in a -2 to 2 second window, we have now selected only the closest SWR in time within that window, and performed the statistical comparison in the bin of 0-20 ms from SWR onset. With this new analysis we are looking more directly at the time-locking of the decoded segments to SWR onset (see updated Fig 5a and 6a).
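
      The closest-SWR analysis in point (3) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The response states that posterior probabilities were shuffled, whereas this sketch swaps in a circular time shift of the decoded segments as a simpler surrogate. All function names and the `session_duration` parameter are hypothetical. Only the ±2 s window and the 0-20 ms bin come from the response.

```python
import random
from bisect import bisect_left


def lag_to_closest_swr(t, swr_onsets, window=2.0):
    """Lag (s) from the closest SWR onset to t, or None if none lies within ±window.

    `swr_onsets` must be sorted in ascending order.
    """
    if not swr_onsets:
        return None
    i = bisect_left(swr_onsets, t)
    candidates = swr_onsets[max(i - 1, 0): i + 1]
    lag = t - min(candidates, key=lambda s: abs(t - s))
    return lag if abs(lag) <= window else None


def fraction_locked(segment_times, swr_onsets, bin_start=0.0, bin_end=0.020):
    """Fraction of SWR-adjacent segments whose lag falls in [bin_start, bin_end)."""
    lags = [lag_to_closest_swr(t, swr_onsets) for t in segment_times]
    lags = [lag for lag in lags if lag is not None]
    if not lags:
        return 0.0
    return sum(bin_start <= lag < bin_end for lag in lags) / len(lags)


def shuffle_p_value(segment_times, swr_onsets, session_duration,
                    n_shuffles=1000, seed=0):
    """One-sided p-value of the observed fraction against circular-shift surrogates."""
    rng = random.Random(seed)
    observed = fraction_locked(segment_times, swr_onsets)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        shift = rng.uniform(0.0, session_duration)
        shifted = [(t + shift) % session_duration for t in segment_times]
        if fraction_locked(shifted, swr_onsets) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_shuffles + 1)
```

      A circular shift preserves the intervals between decoded segments while breaking their relation to SWR onsets. Whether it is an adequate null for these data is a separate question from the one the authors' shuffle addresses.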

      (2) The main claim of the work is also based on Figure 3, where the authors show that SWRs-unmodulated mPFC neurons have higher spatial tuning, and higher directional selectivity scores, and a higher percentage of these neurons show theta skipping. This is used to support the claim that SWRs-unmodulated cells encode spatial information. However, it must be noted that in this kind of task, it is not possible to disentangle space and specific task variables involving separate cognitive processes from processing spatial information such as decision-making, attention, motor control, etc., which always happen at specific locations of the maze. Therefore, the results shown in Figure 3 may relate to other specific processes rather than encoding of space and it cannot be unequivocally claimed that mPFC neurons "encode spatial information". This limitation is presented by Mashoori et al (2018), an article that appears to be a major inspiration for this work. Can the authors provide a control analysis/experiment that supports their claim? Otherwise, this claim should be tempered. Also, the authors say that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space. How do they reconcile it with their results?

      The reviewer is right to assert caution when talking about claims such as spatial tuning where other factors may also be involved. Although we agree that there may be some other factors influencing what we are seeing as spatial tuning, it is very important to note that the behavioral task is executed on a symmetrical 4-armed maze, where two of the arms are always used for the start of the trajectory, and the other two arms (North and South) function as the goal (reward) arms. Therefore, if the PFC is encoding cognitive processes such as task phases related to decision-making and reward, we would not be able to differentiate between the two start arms and the two goal arms, as these represent the same task phases. Note also that the North and South arm are illuminated in a pseudo-random order between trials and during cue-based rule learning this is a direct indication of where the reward will be found. Even in this phase of the task, the PFC encodes where the animal will turn on a trial-to-trial basis (meaning the North and South arm are still differentiated correctly on each trial even though the illumination and associated reward are changing).

      Secondly, importantly, the reviewer mentions that we claimed that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space, but this is incorrect. Jadhav et al. (2016) showed that SWR-unmodulated neurons had lower spatial coverage, meaning that they are more spatially selective (congruent with our results). We have rephrased this in the text to be clearer.

      (3) My reading is that the rest of the paper mainly consists of replications or incremental observations of already known phenomena with some not necessarily surprising new observations:

      (a) Figure 2 shows that a subset of mPFC neurons is modulated by HPC SWRs and theta (already known), that vmPFC neurons are more strongly modulated by SWRs (not surprising given anatomy), and that theta phase preference is different between vmPFC and dmPFC (not surprising given the fact that theta is a travelling wave).

      The finding that vmPFC neurons are more strongly modulated by SWRs than dmPFC indeed matches what we know from anatomy, but that does not make it a trivial finding. A lot remains unknown about the mPFC subregions and their interactions with the hippocampus, and not every finding will be directly linked to the anatomy. Therefore, in our view this is a significant finding which has not been studied before due to the technical complexity of large-scale recordings along the dorsal-ventral axis of the mPFC.

      Similarly, theta being a traveling wave (which in itself is still under debate), does not mean we should assume that the dorsal and ventral mPFC should follow this signature and be modulated by different phases of the theta cycle. Again, in our view this is not at all trivial, but an important finding which brings us closer to understanding the intricate interactions between the hippocampus and PFC in spatial learning and decision-making.

      (b) Figure 4 shows that non-local representations in mPFC are predictive of the animal's choice. This is mostly an increment to the work of Mashoori et al (2018). My understanding is that in addition to what had already been shown by Mashoori et al., here it is shown how the upcoming choice can be predicted. The authors may want to emphasize this novel aspect.

      In our view our manuscript focuses on a completely different aspect of learning and memory than the paper the reviewer is referring to (Mashoori et al. 2018). Importantly, the Mashoori et al. paper looked at choice evaluation at reward sites and shows that disappointing reinforcements are associated with reactivations in the ACC of the unselected target. This points to the role of the ACC in error detection and evaluation. Although this is an interesting result, it is in essence unrelated to what we are focusing on here, which is decision making and prediction of upcoming choices. The fact that the turning direction of the animal can be predicted on a trial-to-trial basis, and even precedes the behavioral change over the course of learning, sheds light on the role of the PFC in these important predictive cognitive processes (as opposed to post-choice reflective processes).

      (c) Figure 6 shows that prospective activity in the HPC is linked to SWRs and theta oscillations. This has been described in various forms since at least the works of Johnson and Redish in 2007, Pastalkova et al 2008, and Dragoi and Tonegawa (2011 and 2013), as well as in earlier literature on splitter cells. These foundational papers on this topic are not even cited in the current manuscript.

      We have added these citations to the introduction (line 37).

      Although some previous work is cited, the current narrative of the results section may lead the reader to think that these results are new, which I think is unfair. Previous evidence of the same phenomena should be cited all along the results and what is new and/or different from previous results should be clearly stated and discussed. Pure replications of previous works may actually just be supplementary figures. It is not fair that the titles of paragraphs and main figures correspond to notions that are well established in the literature (e.g., Figure 2, 2nd paragraph of results, etc.).

      We have changed the title of paragraph 2 and Figure 2 to highlight more clearly the novel result (the difference between the dorsal and ventral mPFC), and have improved clarity of the text throughout to highlight the novelty of our results better.

      (d) My opinion is that, overall, the paper gives the impression of being somewhat rushed and lacking attention to detail. Many figure panels are difficult to understand due to incomplete legends and visualizations with tiny, indistinguishable details. Moreover, some previous works are not correctly cited. I tried to make a list of everything I spotted below.

      We have addressed all the comments in the Recommendations for Authors.

      Reviewer #1 (Recommendations for the authors):

      (1) Expanding on the points above, one of the strengths of the study is expanding the previous result that SWR-unmodulated neurons are more spatially selective (Jadhav et al., 2016), across prefrontal sub-regions, and showing that these neurons are more directionally selective and show more theta cycle skipping. Theta cycle skipping is related to theta sequence representations and previous studies have established PFC theta sequences in parallel to hippocampal theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and the theta cycle skipping result suggests that SWR-unmodulated neurons should show stronger participation than SWR-modulated neurons in PFC theta sequences that decode to upcoming or alternative location, which can be tested in this high-density PFC physiology data. This is still unlikely to make a categorical distinction that only SWR-unmodulated neurons participate in theta sequence decoding, but will be useful to examine.

      We thank the reviewer for their suggestion and have now included results based on separate decoding models that only use SWR-modulated or SWR-unmodulated mPFC neurons. From this analysis we see that indeed SWR-unmodulated neurons are not the only group contributing to theta sequence decoding, but they do distinguish more strongly between the upcoming and alternative arms at the choice point (see new Fig 4d).

      (2) Non-local decoding in 50ms windows on a theta timescale is a valid analysis, but ignoring potential variability in the internal state during running vs. immobility, and as indicated by LFPs by the presence of SWRs or theta oscillations, is incorrect especially when conclusions are being made about decoding during SWRs and theta oscillation phase, and in light of previous evidence that these are distinct states during behavior. There are multiple papers on PFC theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and on PFC reactivation during SWRs (Shin et al., 2019; Kaefer et al., 2020; Jarovi et al., 2023), and this dataset of high-density prefrontal recordings using Neuropixels 1.0 provides an opportunity to investigate these phenomena in detail. Here, it should be noted that although Kaefer et al. reported independent prefrontal reactivation from hippocampal reactivation, these PFC reactivation events still occurred during hippocampal SWRs in their data, and were linked to memory performance.

      From our data we see that the time segments that represent upcoming or alternative choice in the prefrontal cortex are in fact not time-locked to hippocampal SWRs (updated Fig 5a where we look only at the closest SWR in time and compare this to shuffled data). In addition, these segments do not overlap much with the decoded segments in the hippocampus (see updated Fig 4e where we added a shuffling procedure to assess the likelihood of the overlap with hippocampal decoded segments). Importantly, we are not ignoring the variability during running and immobility, as theta segments were selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus (see Methods), ensuring that the theta and SWR analyses were done on the two different behavioral states respectively. We have clarified this in the main text.

      (3) The majority of rodent studies make the distinction between ACC, PrL, and IL, although as the authors noted, there are arguments that rodent mPFC is a continuum (Howland et al., 2022), or even that rodent mPFC is a unitary cingulate cortical region (van Heukelum et al., 2020). The authors choose to present the results as dorsal (ACC + dorsal PrL) vs. ventral mPFC (ventral PrL + IL), however, in my opinion, it will be more useful to the field to see results separately for ACC, PrL, and IL, given the vast literature on connectivity and functional differences in these regions.

      We appreciate the reviewer’s suggestion. Initially, we did perform all analyses separately for the ACC, PLC and ILC subregions. However, we observed that the differences between subregions (strength of SWR-modulation and the phase locking to theta) varied uniformly along the dorsal-ventral axis, i.e., the PLC showed a profile of SWR-modulation and theta phase locking that fell in between that of the ACC and the ILC. This is also highlighted in paragraph 3 of the introduction (lines 52-56). For that reason, and for the sake of reducing the number of variables, increasing statistical power, and improving readability, we focused on the dorsal-ventral distinction instead, as this is where the main differences were seen.

      (4) I suggest that the authors refrain from making categorical distinctions as in their title and abstract, such as "neurons that are involved in predicting upcoming choice are not the neurons that are modulated by hippocampal sharp-wave ripples" when the evidence presented can only support gradation of participation of the two neuronal sub-populations, not an absolute distinction. The division of SWR-modulated and SWR-unmodulated neurons itself is determined by the statistic chosen to divide the neurons into one or two sub-classes and will vary with the statistical threshold employed. Further, previous studies have suggested that SWR-excited and SWR-inhibited neurons comprise distinct functional sub-populations based on their activity properties (Jadhav et al., 2016; Tang et al., 2017), but it is not clear to what degree SWR-modulated neurons are a distinct and singular functional sub-population. In the absence of connectivity information and cross-correlation measures within and across sub-populations, it is prudent to be conservative about this interpretation of SWR-unmodulated neurons.

      We agree with the reviewer that the distinction is not categorical and have changed the wording in the title and abstract. We also do not intend to claim that the SWR-modulated neurons are a distinct and singular functional sub-population, and for that reason the firing rates from the SWR-excited and SWR-inhibited groups are reported separately throughout the paper.

      Reviewer #2 (Recommendations for the authors):

      Minor detailed remarks:

      (1) The authors should provide a statistical test, perhaps against shuffled data, for Figures 5a,c and 6a,c.

      We thank the reviewer for their suggestion and have added statistical tests in Figures 5a, 5c, 6a and 6c.

      (2) The behavioral task is explained only in the legend of Figure 1c, and the explanation is quite vague. In this type of article format, readers need to have a clear understanding of the task without having to refer to the methods section. A clear understanding of the task is crucial for interpreting all subsequent analyses. In my opinion, the word 'trial' in the figure is misleading, as these are sessions composed of many trials.

      We have added a more thorough description of the behavioral task, both in the main text and the Figure legend.

      (3) Figure 1d, legend of markers missing.

      We have added a legend for the markers.

      (4) When there are multiple bars and a single p-value is presented, it is unclear which group comparisons the p-value pertains to. For instance, Figures 2c-f and 3b, d, f (right parts), and 5b...

      For all p-values we have added lines to the figures that indicate the groups that were compared and have added descriptions of the statistical test to the figure legends to indicate what each p-value represents.

      (5) In Figure 3c, the legend does not explain what the colored lines represent, and the lines themselves are very small and almost indistinguishable.

      We have changed the colored lines to quadrants on the maze to clarify what each direction represents.

      (6) Figure 4a is too small, and the elements are so tiny that it is impossible to distinguish them and their respective colors. The term 'segment' has not been unequivocally explained in the text. All the different elements of the panel should be explicitly explained in the legend to make it easily understandable. What do the pictograms of the maze on the left represent? What does the dashed vertical line indicate?

      We have added the definition of a segment in the text (lines 283-286) and have improved the clarity and readability of Figure 4a.

      (7) In Figure 5, what do the red dots on the right part relate to? The legend should explicitly explain what is shown in the left and right parts, respectively. What comparisons do the p-values relate to?

      We have adjusted the legend to explain the left and right parts of the figure and we have added the statistical test that was used to get to the p-value (in addition to the text which already explained this).

      (8) Panels b of Figures 5 and 6 should have the same y-axis scale for comparison. The position of the p-values should also be consistent. With the current arrangement in Figure 6, it is unclear what the p-values relate to.

      We have adjusted the y-scale to be the same for Figures 5 and 6, and we have added a description of the statistical test to the legend.

      (9) Multiple studies have previously shown that mPFC activity contains spatial information (e.g., refs 24-27). It is important that, throughout the paper, the authors frame their results in relation to previous findings, highlighting what is novel in this work.

      We thank the reviewer for this valuable suggestion. In the revised manuscript, we have indicated more clearly which results replicate previous findings and highlighted novel results.

      (10) Please note that Peyrache et al. (2009) do not show trajectory replay, nor do they decode location. I am not familiar with all the cited literature, but this makes me think that the authors may want to double-check their citations to ensure they assign the correct claims to each past work.

      We have adjusted the reference to the work to exclude the word ‘trajectory’ and double-checked our other citations.

      (11) The authors perform theta-skipping analysis, first described by Kay et al., but do not cite the original paper until the discussion.

      Thank you for pointing out this oversight. We have now included this citation earlier in the paper (line 231).

      (12) Additionally, some parts of the text are difficult to grasp, and there are English vocabulary and syntax errors. I am happy to provide comments on the next version of the text, but please include page and line numbers in the PDF. The authors may also consider using AI to correct English mistakes and improve the fluency and readability of their text.

      We have carefully gone through the text to correct any errors. We have now also included page and line numbers, and we will be happy to address any specific issues the reviewer may spot in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      This study presents evidence that remote memory in the APP/PS1 mouse model of Alzheimer's disease (AD) is associated with PV interneuron hyperexcitability and increased inhibition of cortical engram cells. Its strength lies in the fact that it explores a neglected aspect of memory research - remote memory impairments related to AD (for which the primary research focus is usually on recent memory impairments) - which has received minimal attention to date. While the findings are intriguing, the weakness of the paper hovers around purely correlational types of evidence and superficial data analyses, which require substantial revisions as outlined below. 

      We thank the reviewer for their feedback, and we appreciate the recognition of the study’s novelty in addressing remote memory impairments in AD. We acknowledge the reviewer’s concerns and have implemented revisions to strengthen the manuscript.

      Major concerns: 

      (1) In light of previous work, including that by the authors themselves, the data in Figure 1 should be implemented by measurements of recent memory recall in order to assess whether remote memories are exclusively impaired or whether remote memory recall merely represents a continuation of recent memory impairments.

      We agree with the reviewer that this is an important point. In line with their suggestion in minor comment 1, we have now omitted the statement on recent memory in the results (previously on lines 109-111 and 117). Nonetheless, previous independent experiments from our group have repeatedly shown recent memory deficits in APP/PS1 mice at 12 weeks of age, including a recent article published in 2023. We refer the reviewer to figure 2c in Végh et al. (2014) and figure 2i in Kater et al. (2023). We have added a reference to the latter paper to our discussion section (lines 458-459). Therefore, we are confident that the recent memory deficit at 12 weeks of age is a stable phenotype in our APP/PS1 mice.

      With these data in mind, we argue that the remote memory recall impairment is not a continuation of recent memory impairments. Recent memory deficits emerge already at 12 weeks of age, and when remote memory is assessed at 16 weeks (4 weeks after training at 12 weeks of age), APP/PS1 mice are still capable of forming and retrieving a remote memory. This suggests that remote memory retrieval can occur even when recent memory is compromised, arguing against the idea that the remote memory deficit observed at 20 weeks is a continuation of earlier recent memory impairments. We have clarified this point in the revised manuscript by adding the following sentence to the discussion section (line 462-465): 

      ‘This suggests that a remote memory can be formed even when recent memory expression is already compromised, indicating that the remote memory deficit in 20-week-old APP/PS1 mice is not a continuation of earlier recent memory impairments.’

      (2) Figure 2 shows electrophysiological properties of PV cells in the mPFC that correlate with the behavior shown in Figure 1. However, the mice used in Figure 2 are different than the mice used in Figure 1. Thus, the data are correlative at best, and the authors need to confirm that behavioral impairments in the APP/PS1 mice crossed to PV-Cre (and SST-Cre mice) used in Figure 2 are similar to those of the APP/PS1 mice used in Figure 1. Without that, no conclusions between behavioral impairments and electrophysiological as well as engram reactivation properties can be made, and the central claims of the paper cannot be upheld. 

      We thank the reviewer for raising this concern. Indeed, the remote memory impairment and PV hyperexcitability are correlative data, and therefore we do not make causal claims based on these data. However, please note that most of our key findings, including behavioural impairments, characterization of the engram ensemble and reactivation thereof, as well as inhibitory input measurements, were acquired using the same mouse line (APP/PS1), strengthening the coherence of our conclusions. Also, our electrophysiological findings in APP/PS1 (enhanced sIPSC frequency) and APP/PS1-PV-Cre-tdTomato (enhanced PV cell excitability) mice align well. Direct comparisons between the transgenic mouse lines APP/PS1 and APP/PS1 Parv-Cre were performed in our previous studies, confirming that these lines are similar in terms of behaviour and pathology. Specifically, we demonstrated that APP/PS1 mice display spatial memory impairments at 16 weeks of age, Fig 4a-d, consistent with the deficits observed in APP/PS1 Parv-Cre mice at 16 weeks of age, Fig 5a-c (Hijazi et al., 2020a). Additionally, Hijazi et al. (2020a) showed that soluble and insoluble Aβ levels do not differ between APP/PS1 Parv-Cre and APP/PS1 mice (sFig. 1), indicating comparable levels of pathology between these lines. While we do not have a similar characterization of the APP/PS1 SST-Cre line, we should mention that we also did not observe excitability differences in SST cells. We now acknowledge the limitation in the revised discussion section (line 480-487), and stress that our electrophysiology and behavioural findings are correlative in nature:

      ‘Although the excitability measurements were performed in APP/PS1-PV-Cre-tdTomato mice, and not in the APP/PS1 parental line, we previously found that these transgenic mouse lines exhibit comparable amyloid pathology (both soluble and insoluble amyloid beta levels) as well as similar spatial memory deficits (Hijazi et al., 2020a; Kater et al., 2023). Thus, our observations indicate that the APP/PS1 PV-Cre-tdTomato and APP/PS1 lines are similar in terms of pathology and behaviour. Nonetheless, further work is needed to identify a causal link between PV cell hyperexcitability and remote memory impairment.’ 

      (3) The reactivation data starting in Figure 3 should be analysed in much more depth: 

      a) The authors restrict their analysis to intra-animal comparisons, but additional ones should be performed, such as inter-animal (WT vs APP/PS1) as well as inter-age (12-16w vs 16-20w). In doing so, reactivation data should be normalized to chance levels per animal, to account for differences in labelling efficiency - this is standard in the field (see original Tonegawa papers and for a reference). This could highlight differences in total reactivation that are already apparent, such as for instance in WT vs APP/PS1 at 20w (Figure 3o) and highlight a decrease in reactivation in AD mice at this age, contrary to what is stated in lines 213-214. 

      We would like to thank the reviewer for the valuable input on the reactivation data in Figure 3. 

      We agree with the reviewer and now depict the data as normalized to chance levels (Figure 3). The original figures are now supplemental (sFig. 5). The reactivation data normalized to chance are similar to the original results, i.e. no difference was observed in the reactivation of the mPFC engram ensemble between genotypes. The reviewer may have overlooked that we did perform inter-animal (WT vs. APP/PS1) comparisons, however these were not significantly different. We have made this clearer in the main text, lines 277, 288-289, 294-295 and 303-304. Moreover, the reviewer recommended including inter-age group comparisons, which have now been added to the supplemental figures (sFig. 6). No genotype-dependent differences were observed. While a main effect of age group did emerge, indicating that there is a potential increased overlap between Fos+ and mCherry+ in animals aged 16-20 weeks, we caution against overinterpreting this finding. These experimental groups were processed in separate cohorts, with viral injection and 4TM-induced tagging performed at different moments in time, which may have contributed to the observed differences in overlap. We have addressed this point in the revised discussion (line 612-617):

      ‘Furthermore, we also observed an increase in the amount of overlap between Fos+ and mCherry+ engram cells when comparing the 12-16w and 16-20w age groups. This finding should be interpreted with caution, as the experimental groups were processed in separate cohorts, with viral injections and 4TM-induced tagging performed at different moments in time. This may have contributed to the observed differences between ages.’
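
      The chance-level normalization referred to in this response can be illustrated with a minimal sketch. It assumes the convention commonly used in engram studies, in which chance overlap is the product of the tagged (mCherry+) and active (Fos+) fractions of all counted neurons; the response does not state the exact formula used in the manuscript, so the class, field, and function names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnimalCounts:
    """Cell counts for one animal (hypothetical field names)."""
    total_neurons: int  # all counted neurons, e.g. Nissl+ cells
    tagged: int         # mCherry+ cells (engram-tagged)
    active: int         # Fos+ cells (active at retrieval)
    overlap: int        # cells that are both mCherry+ and Fos+


def chance_overlap(counts: AnimalCounts) -> float:
    """Overlap fraction expected if tagging and activity were independent."""
    tagged_fraction = counts.tagged / counts.total_neurons
    active_fraction = counts.active / counts.total_neurons
    return tagged_fraction * active_fraction


def normalized_reactivation(counts: AnimalCounts) -> float:
    """Observed overlap fraction divided by the chance overlap fraction.

    Values above 1 indicate more overlap than expected by chance.
    """
    chance = chance_overlap(counts)
    if chance == 0:
        raise ValueError("chance overlap is zero; normalization is undefined")
    observed = counts.overlap / counts.total_neurons
    return observed / chance


if __name__ == "__main__":
    # Made-up counts for illustration only, not data from the study.
    example = AnimalCounts(total_neurons=1000, tagged=100, active=50, overlap=15)
    print(f"reactivation relative to chance: {normalized_reactivation(example):.2f}")
```

      Because the ratio is computed per animal, differences in labelling efficiency between animals or cohorts scale the observed and the chance overlap together, which is the reason this normalization is usually preferred over raw overlap percentages.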

      b) Comparing the proportion of mcherry+ cells in PV- and PV+ is problematic, considering that the PV- population is not "pure" like the PV+, but rather likely to represent a mix of different pyramidal neurons (probably from several layers), other inhibitory neurons like SST and maybe even glial cells. Considering this, the statement on line 218 is misleading in saying that PVs are overrepresented. If anything, the same populations should be compared across ages or groups.  

      We thank the reviewer for their insightful comment and agree that the PV- population of cells is likely more heterogenous than the PV+ population. However, we would like to clarify that all quantified cells were selected based on Nissl immunoreactivity, and to exclude non-neuronal cells, stringent thresholding was applied in the script that was used to identify Nissl+ cells. The threshold information has now been added to the methods section (line 758-760). Thus, although heterogenous, the analysed PV- population reflects a neuronal subset. In response to the reviewer’s suggestion, we have now included overlap measurements relative to chance levels (Figure 3). These analyses did not reveal differences with the original analyses, i.e., there are no genotype specific differences. We have also incorporated the suggested inter-age group comparisons (sFig. 6) and found no differences between age groups. In light of the raised concerns, we have removed the statement that PV cells were overrepresented in the engram ensemble.

      c) A similar concern applies to the mcherry- population in Figure 4, which could represent different types of neurons that were never active, compared to the relatively homogeneous engram mcherry+ population. This could be elegantly fixed by restricting the comparison to mCherry+Fos+ vs mCherry+Fos- ensembles and could indicate engram reactivation-specific differences in perisomatic inhibition by PV cells. 

      The comparison the reviewer suggests, comparing mCherry+Fos+ to mCherry+Fos-, is indeed conceptually interesting and could provide more insight into engram reactivation and PV input. However, there are practical limitations to performing this analysis, as neurons in close proximity need to be compared in a pairwise manner to account for local variability in staining intensity. As shown in Figure 3c+k and Figure 4a+b, d+e, PV immunostaining intensity varies to a certain extent within a given image. While pairwise comparisons of neighbouring neurons were feasible when analysing mCherry+ and mCherry- cells, they are unfortunately not feasible for the mCherry+Fos+ vs. mCherry+Fos- comparison. The occurrence of spatially adjacent mCherry+Fos+ and mCherry+Fos- neurons is too sparse for a pairwise comparison. This analysis would therefore result in substantial under-sampling and limit the reliability of the analysis. Nonetheless, we agree with the reviewer that the mCherry- population may be more heterogenous than the mCherry+ population, despite the fact that PV+ neurons and non-neuronal cells were excluded from both populations in the analyses. We therefore added a statement to the discussion to acknowledge this limitation (lines 536-539): 

      ‘Although PV+ cells were not included in this analysis and we excluded non-neuronal cells based on the area of the Nissl stain, the mCherry- population was potentially more heterogenous than the mCherry+ population, which may have contributed to the differences we observed.’

      (4) At several instances, there are some doubts about the statistical measures having been employed: 

      a) In Figure 4f, it is unclear why a repeated measurement ANOVA was used as opposed to a regular ANOVA. 

      b) In Supplementary Figure 2b, a Mann-Whitney test was used, supposedly because the data were not normally distributed. However, when looking at the individual data points, the data does seem to be normally distributed. Thus, the authors need to provide the test details as to how they measured the normalcy of distribution. 

      a) Because neighbouring neurons were compared pairwise within animals, the data in Figure 4f were analysed with a repeated-measures ANOVA. 

      b) We thank the reviewer for their comment on Supplementary Figure 2b. The data is indeed normally distributed, and we have analysed it using a D’Agostino & Pearson test. We have corrected this in the supplemental figure. 

      Minor concerns: 

      (1) Line 117: The authors cite a recent memory impairment here, as shown by another paper. However, given the notorious difficulty in replicating behavioral findings, in particular in APP/PS1 mice (number of backcrossings, housing conditions, etc., might differ between laboratories), such a statement cannot be made. The authors should either show in their own hands that recent memory is indeed affected at 12 weeks of age, or they should omit this statement. 

      We thank the reviewer for this thoughtful comment. As noted in our response to major concern (1), we have addressed this concern by providing additional information and clarification in the discussion (lines 462-465) regarding the possibility that remote memory impairments are a continuation of recent memory impairments. As mentioned in our response, we have added a reference to a more recent study from our lab (Kater et al., 2023). These findings are consistent with the earlier report from our lab (Végh et al., 2014), underscoring the reproducibility of this phenotype across independent cohorts and time. Notably, the experiments in the 2023 and present study were performed using the same housing and experimental conditions. Nevertheless, in light of the reviewer’s suggestion, and to avoid overstatement or speculation, we have now omitted the sentence referring to recent memory impairments at 12 weeks of age from the results section.

      (2) Pertaining to Figure 3, low-resolution images of the mPFC should be provided to assess the spread of injection and the overall degree of double-positive cells.  

      We agree with the reviewer and have added images of the mPFC as a supplemental figure (sFig. 3) that show the spread of the injection. Unfortunately, it is not possible to visualize the overall degree of double-positive cells at a lower magnification (or low-resolution). Representative examples of colocalization are presented in Figure 3.

      Reviewer #2 (Public review): 

      This study presents a comprehensive investigation of remote memory deficits in the APP/PS1 mouse model of Alzheimer's disease. The authors convincingly show that these deficits emerge progressively and are paralleled by selective hyperexcitability of PV interneurons in the mPFC. Using viral-TRAP labeling and patch-clamp electrophysiology, they demonstrate that inhibitory input onto labeled engram cells is selectively increased in APP/PS1 mice, despite unaltered engram size or reactivation. These findings support the idea that alterations in inhibitory microcircuits may contribute to cognitive decline in AD. 

      However, several aspects of the study merit further clarification. Most critically, the central paradox, i.e., increased inhibitory input without an apparent change in engram reactivation, remains unresolved. The authors propose possible mechanisms involving altered synchrony or impaired output of engram cells, but these hypotheses require further empirical support. Additionally, the study employs multiple crossed transgenic lines without reporting the progression of amyloid pathology in the mPFC, which is important for interpreting the relationship between circuit dysfunction and disease stage. Finally, the potential contribution of broader network dysfunction, such as spontaneous epileptiform activity reported in APP/PS1 mice, is also not addressed. 

      We thank the reviewer for their evaluation and appreciate the positive assessment of our study’s contribution to understanding remote memory deficits and the dysfunction of inhibitory microcircuits in AD. We also acknowledge the relevant points raised and have revised the manuscript to clarify our interpretations. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 68: What are "APP23xPS45" mice? This is most likely a typo.

      This line is a previously reported double transgenic amyloid beta mouse model that was obtained by crossing APP23 (overexpressing human amyloid precursor protein with the Swedish double mutation at position 670/671) with PS45 (carrying a transgene for mutant Presenilin 1, G384A mutation) (Busche et al., 2008; Grienberger et al., 2012). 

      (2) Line 148: The authors should also briefly describe in the main text that APP/PS1 x SST-Cre mice were generated and used here.  

      We thank the reviewer for their comment and have added their suggestion to the main text (line 166-168):

      ‘To do this, APP/PS1 mice were crossed with SST-Cre mice to generate APP/PS1 SST-Cre mice. Following microinjection of AAV-hSyn::DIO-mCherry into the mPFC, recordings were obtained from SST neurons.’

      (3) The discussion should be condensed because of redundancies on several occasions. For example, memory allocation is discussed starting on line 371, then again on line 392. This should be combined. Likewise, how the correlative nature of the findings about PV interneurons could be further functionally addressed is discussed on lines 413 and 454, and should be condensed into one paragraph. 

      We thank the reviewer for this suggestion and have revised the discussion to remove the redundancies as proposed.  

      Reviewer #2 (Recommendations for the authors): 

      To strengthen the manuscript, the following points should be addressed: 

      (1) Quantify amyloid pathology: It is essential to assess amyloid-β levels (soluble and insoluble) in the mPFC of APP/PS1-PV-Cre-tdTomato mice at the studied ages. This would help determine whether the observed circuit-level changes track with disease progression as seen in canonical APP/PS1 models. 

      We thank the reviewer for this valuable suggestion and agree that assessing Aβ levels in the mPFC is important to determine whether the observed circuit level alterations in APP/PS1 mice coincide with the progression of amyloid pathology. Therefore, we assessed the amyloid plaque load in the mPFC of APP/PS1 mice at 16 and 20 weeks of age (new supplemental figure sFig. 1) and observed no difference in plaque load between these two time points. This suggests that the increased excitability in the mPFC cannot be attributed to differences in plaque load (insoluble amyloid beta).

      In line with this, we previously studied both soluble and insoluble Aβ levels in the CA1 and reported that there are no differences between 12 and 16 weeks of age (Kater et al., 2023), while PV cell hyperexcitability is present at 16 weeks of age (Hijazi et al., 2020a). From 24 weeks onwards, the level of amyloid beta increases. Similarly, Végh et al. (2014) showed using immunoblotting that monomeric and low molecular weight oligomeric forms of soluble Aβ are already present as early as 6 weeks of age and become more prominent at 24 weeks of age. Although the soluble Aβ measurements were performed in the hippocampus, we think these findings can be extrapolated to cortical regions, as the APP and PS1 mutations in APP/PS1 mice are driven by a prion promoter, which should induce consistent expression across brain regions. Data from other research groups support this hypothesis (Kim et al., 2015; Zhang et al., 2011). Thus, large regional differences in soluble Aβ are not expected. The temporal progression suggests that increasing levels of soluble amyloid beta might contribute to the emergence of PV cell hyperexcitability. We have added this point to the manuscript (lines 585-591):

      ‘Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability.’

      Finally, we previously compared soluble and insoluble amyloid beta levels in APP/PS1 and APP/PS1 Parv Cre mice and show that these are similar (Hijazi et al., 2020a). While our current study shows the progression of amyloid beta accumulation in APP/PS1 mice, these mice also exhibit altered microcircuitry (enhanced sIPSC frequency on engram cells) at 20 weeks of age, the same age at which we observed PV cell hyperexcitability in APP/PS1 Parv Cre tdTomato mice. This further supports the generalizability of our findings across genotypes, between APP/PS1 and APP/PS1 Parv Cre tdTomato mice. 

      (2) Examine later disease stages: Since the current effects are modest, assessing memory performance, PV cell excitability, and engram inhibition at more advanced stages could clarify whether these alterations become more pronounced with disease progression. 

      We thank the reviewer for this thoughtful suggestion. Investigating advanced disease stages could indeed provide valuable insights into whether the observed alterations in memory performance, PV cell hyperexcitability and engram inhibition become more pronounced over time. Our previous work has shown that changes in pyramidal cell excitability emerge at a later stage than in PV cells, supporting the idea of progressive circuit dysfunction (Hijazi et al., 2020a). However, at these more advanced stages, additional pathological processes, such as increased gliosis (Janota, Brites, Lemere, & Brito, 2015; Kater et al., 2023) and synaptic loss (Alonso-Nanclares, Merino-Serrais, Gonzalez, & DeFelipe, 2013; Bittner et al., 2012), will likely contribute to both electrophysiological and behavioural measurements. Furthermore, we would like to point out that the current changes observed in memory performance, PV hyperexcitability and increased inhibitory input on engram cells at 16-20 weeks of age are not modest, but already quite substantial. Our focus on these early time points in APP/PS1 mice was intentional, as it helps us understand the initial changes in Alzheimer’s disease at a circuit level and identify therapeutic targets for early intervention. What happens at later stages is certainly of interest, but is beyond the scope of this study and should therefore be addressed in future studies. We have incorporated a discussion related to this point into the revised manuscript (lines 602-606):

      ‘Moreover, it is relevant to investigate whether changes in PV and PYR cell excitability, as well as input onto engram cells in the mPFC, become more pronounced at later disease stages. Nonetheless, by focussing on early disease timepoints in the present study, we aimed to understand the initial circuit-level changes in AD and identify targets for early therapeutic intervention.’

      (3) Address network hyperexcitability: Spontaneous epileptiform activity has been reported in APP/PS1 mice from 4 months of age (Reyes-Marin & Nuñez, 2017). Including EEG data or discussing this point in relation to your findings would help contextualize the observed inhibitory remodeling within broader network dysfunction. 

      We thank the reviewer for this valuable input and for highlighting the study by Reyes-Marin and Nuñez (2017). In line with this, we recently reported longitudinal local field potential (LFP) recordings in freely behaving APP/PS1 Parv-Cre mice and wild type control animals between the ages of 3 and 12 months (van Heusden et al., 2023). Weekly recordings were performed in the home cage under awake mobile conditions. These data showed no indications of epileptiform activity during wakefulness, consistent with previous findings that epileptic discharges in APP/PS1 mice predominantly occur during sleep (Gureviciene et al., 2019). Recordings were obtained from the prefrontal cortex (PFC), parietal cortex and the hippocampus. In contrast, the study by Reyes-Marin and Nuñez (2017) recorded from the somatosensory cortex in anesthetized animals. Here, during spontaneous recordings, no differences were observed in delta, theta or alpha frequency bands between APP/PS1 and WT mice. Interestingly, we observed an early increase in absolute power, particularly in the hippocampus and parietal cortex from 12 to 24 weeks of age in APP/PS1 mice. In the PFC we found a shift in relative power from lower to higher frequencies and a reduction in theta power. Connectivity analyses revealed a progressive, age-dependent decline in theta/alpha coherence between the PFC and both the parietal cortex and hippocampus. Given the well-established role of PV interneurons in network synchrony and in coordinating theta and gamma oscillations critical for cognitive function (Sohal, Zhang, Yizhar, & Deisseroth, 2009; Xia et al., 2017), these findings support the idea of early circuit dysfunction in APP/PS1 mice. Our findings, i.e. hyperexcitability of PV cells, align with these LFP-based network-level observations. These data suggest an early shift in the E/I balance, contributing to altered oscillatory dynamics and impaired inter-regional connectivity, possibly leading to alterations in memory. However, whether the observed PV hyperexcitability in our study directly contributes to alterations in power and synchrony remains to be elucidated. Furthermore, it would be interesting to determine the individual contribution of PV cell hyperexcitability in the hippocampus versus the mPFC to network changes and concurrent memory deficits. We have added a statement on network hyperexcitability to the discussion (lines 561-565). 

      ‘Interestingly, we recently found a progressive disruption of oscillatory network synchrony between the mPFC and hippocampus in APP/PS1 Parv-Cre mice (van Heusden et al., 2023). However, whether the observed PV cell hyperexcitability directly contributes to changes in inter-regional synchrony, and whether this leads to alterations at a network level, i.e. increased inhibitory input on engram cells, and consequently to memory deficits, remains to be elucidated in future studies.’ 

      (4) Mechanisms responsible for PV hyperexcitability: Related to the previous point, a discussion of the possible underlying mechanisms, e.g., direct effects of amyloid-β, inflammatory processes, or compensatory mechanisms, would strengthen the discussion. 

      We agree with the reviewer that this will strengthen the discussion. We have now added a comprehensive discussion in the revised manuscript to address potential mechanisms responsible for PV cell hyperexcitability (lines 579-594):

      ‘Prior studies have shown that neurons in the vicinity of amyloid beta plaques show increased excitability (Busche et al., 2008). We demonstrated that PV neurons in the CA1 are hyperexcitable and that treatment with a BACE1 inhibitor, i.e. reducing amyloid beta levels, rescues PV excitability (Hijazi et al., 2020a). In line with this, we also reported that addition of amyloid beta to hippocampal slices increases PV excitability, without altering pyramidal cell excitability (Hijazi et al., 2020a). Finally, applying amyloid beta to an induced mouse model of PV hyperexcitability further impairs PV function (Hijazi et al., 2020b). Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely to be the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability. We hypothesize that the hyperexcitability induced by amyloid beta may result from disrupted ion channel function, as PV neuron dysfunction can result from altered potassium (Olah et al., 2022) and sodium channel activity (Verret et al., 2012).’

      (5) Excitatory-inhibitory balance: While the main focus is on increased inhibition onto engram cells, the reported increase in sEPSC frequency (Figure 5g) across genotypes suggests the presence of excitatory remodelling as well. A brief discussion of how this may interact with increased inhibition would be valuable.  

      We thank the reviewer for this comment regarding the interaction between excitatory and inhibitory remodelling. We have now incorporated this discussion point into the revised manuscript (lines 528-534):

      ‘Interestingly, both WT and APP/PS1 mice showed an increase in sEPSC frequency onto engram cells, suggesting that increased excitatory input is a consequence of memory retrieval and not affected by genotype. However, only in APP/PS1 mice did the augmented excitatory input coincide with an elevation of inhibitory input onto engram cells. The resulting imbalance between excitation and inhibition could therefore potentially disrupt the precise control of engram reactivation and contribute to the observed remote memory impairment.’

      References

      Alonso-Nanclares, L., Merino-Serrais, P., Gonzalez, S., & DeFelipe, J. (2013). Synaptic changes in the dentate gyrus of APP/PS1 transgenic mice revealed by electron microscopy. J Neuropathol Exp Neurol, 72(5), 386-395. doi:10.1097/NEN.0b013e31828d41ec

      Bittner, T., Burgold, S., Dorostkar, M. M., Fuhrmann, M., Wegenast-Braun, B. M., Schmidt, B., . . . Herms, J. (2012). Amyloid plaque formation precedes dendritic spine loss. Acta Neuropathologica, 124(6), 797-807. doi:10.1007/s00401-012-1047-8

      Busche, M. A., Eichhoff, G., Adelsberger, H., Abramowski, D., Wiederhold, K. H., Haass, C., . . . Garaschuk, O. (2008). Clusters of hyperactive neurons near amyloid plaques in a mouse model of Alzheimer's disease. Science, 321(5896), 1686-1689. doi:10.1126/science.1162844

      Grienberger, C., Rochefort, N. L., Adelsberger, H., Henning, H. A., Hill, D. N., Reichwald, J., . . . Konnerth, A. (2012). Staged decline of neuronal function in vivo in an animal model of Alzheimer's disease. Nat Commun, 3, 774. doi:10.1038/ncomms1783

      Gureviciene, I., Ishchenko, I., Ziyatdinova, S., Jin, N., Lipponen, A., Gurevicius, K., & Tanila, H. (2019). Characterization of Epileptic Spiking Associated With Brain Amyloidosis in APP/PS1 Mice. Front Neurol, 10, 1151. doi:10.3389/fneur.2019.01151

      Hijazi, S., Heistek, T. S., Scheltens, P., Neumann, U., Shimshek, D. R., Mansvelder, H. D., . . . van Kesteren, R. E. (2020a). Early restoration of parvalbumin interneuron activity prevents memory loss and network hyperexcitability in a mouse model of Alzheimer's disease. Mol Psychiatry, 25(12), 3380-3398. doi:10.1038/s41380-019-0483-4

      Hijazi, S., Heistek, T. S., van der Loo, R., Mansvelder, H. D., Smit, A. B., & van Kesteren, R. E. (2020b). Hyperexcitable Parvalbumin Interneurons Render Hippocampal Circuitry Vulnerable to Amyloid Beta. iScience, 23(7), 101271. doi:10.1016/j.isci.2020.101271

      Janota, C. S., Brites, D., Lemere, C. A., & Brito, M. A. (2015). Glio-vascular changes during ageing in wild-type and Alzheimer's disease-like APP/PS1 mice. Brain Res, 1620, 153-168. doi:10.1016/j.brainres.2015.04.056

      Kater, M. S. J., Huffels, C. F. M., Oshima, T., Renckens, N. S., Middeldorp, J., Boddeke, E., . . . Verheijen, M. H. G. (2023). Prevention of microgliosis halts early memory loss in a mouse model of Alzheimer's disease. Brain Behav Immun, 107, 225-241. doi:10.1016/j.bbi.2022.10.009

      Kim, H. Y., Kim, H. V., Jo, S., Lee, C. J., Choi, S. Y., Kim, D. J., & Kim, Y. (2015). EPPS rescues hippocampus-dependent cognitive deficits in APP/PS1 mice by disaggregation of amyloid-β oligomers and plaques. Nature Communications, 6(1), 8997. doi:10.1038/ncomms9997

      Olah, V. J., Goettemoeller, A. M., Rayaprolu, S., Dammer, E. B., Seyfried, N. T., Rangaraju, S., . . . Rowan, M. J. M. (2022). Biophysical Kv3 channel alterations dampen excitability of cortical PV interneurons and contribute to network hyperexcitability in early Alzheimer’s. Elife, 11, e75316. doi:10.7554/eLife.75316

      Reyes-Marin, K. E., & Nuñez, A. (2017). Seizure susceptibility in the APP/PS1 mouse model of Alzheimer's disease and relationship with amyloid β plaques. Brain Res, 1677, 93-100. doi:10.1016/j.brainres.2017.09.026

      Sohal, V. S., Zhang, F., Yizhar, O., & Deisseroth, K. (2009). Parvalbumin neurons and gamma rhythms enhance cortical circuit performance. Nature, 459(7247), 698-702. doi:10.1038/nature07991

      van Heusden, F. C., van Nifterick, A. M., Souza, B. C., França, A. S. C., Nauta, I. M., Stam, C. J., . . . van Kesteren, R. E. (2023). Neurophysiological alterations in mice and humans carrying mutations in APP and PSEN1 genes. Alzheimers Res Ther, 15(1), 142. doi:10.1186/s13195-023-01287-6

      Végh, M. J., Heldring, C. M., Kamphuis, W., Hijazi, S., Timmerman, A. J., Li, K. W., . . . van Kesteren, R. E. (2014). Reducing hippocampal extracellular matrix reverses early memory deficits in a mouse model of Alzheimer's disease. Acta Neuropathol Commun, 2, 76. doi:10.1186/s40478-014-0076-z

      Verret, L., Mann, E. O., Hang, G. B., Barth, A. M., Cobos, I., Ho, K., . . . Palop, J. J. (2012). Inhibitory interneuron deficit links altered network activity and cognitive dysfunction in Alzheimer model. Cell, 149(3), 708-721. doi:10.1016/j.cell.2012.02.046

      Xia, F., Richards, B. A., Tran, M. M., Josselyn, S. A., Takehara-Nishiuchi, K., & Frankland, P. W. (2017). Parvalbumin-positive interneurons mediate neocortical-hippocampal interactions that are necessary for memory consolidation. Elife, 6. doi:10.7554/eLife.27868

      Zhang, W., Hao, J., Liu, R., Zhang, Z., Lei, G., Su, C., . . . Li, Z. (2011). Soluble Aβ levels correlate with cognitive deficits in the 12-month-old APPswe/PS1dE9 mouse model of Alzheimer's disease. Behavioural Brain Research, 222(2), 342-350. doi:10.1016/j.bbr.2011.03.072

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Chang et al. investigated the cell type-specific role of the integrin activator Shv in activity-dependent synaptic remodeling. Using the Drosophila larval neuromuscular junction as a model, they show that glial-secreted Shv modulates synaptic plasticity by maintaining the extracellular balance of neuronal Shv proteins and regulating ambient extracellular glutamate concentrations, which in turn affects postsynaptic glutamate receptor abundance. Furthermore, they report that genetic perturbation of glial morphogenesis phenocopies the defects observed with the loss of glial Shv. Altogether, their findings propose a role for glia in activity-induced synaptic remodeling through Shv secretion. While the conclusions are intriguing, several issues related to experimental design and data interpretation merit further discussion.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below. 

      Reviewer #2 (Public review):

      In this paper Chang et al follow up on their lab's previous findings about the secreted protein Shv and its role in activity-induced synaptic remodeling at the fly NMJ. Previously they reported that shv mutants have impaired synaptic plasticity. Normally a high stimulation paradigm should increase bouton size and GluR expression at synapses but this does not happen in shv mutants. The phenotypes relating to activity dependent plasticity were completely recapitulated when Shv was knocked down only in neurons and could be completely rescued by incubation in exogenously applied Shv protein. The authors also showed that Shv activation of integrin signaling on both the pre- and post- synapse was the molecular mechanism underlying its function. Here they extend their study to consider the role of Shv derived from glia in modulating synaptic features at baseline and remodeling conditions. This study is important to understand if and how glia contribute to these processes. Using cell-type specific knockdown of Shv only in glia causes abnormally high baseline GluR expression and prevents activity-dependent increases in bouton size or GluR expression post-stimulation. This does not appear to be a developmental defect as the authors show that knocking down Shv in glia after basic development has the same effects as lifelong knockdown, so Shv is acting in real time. Restoring Shv in ONLY glia in mutant animals is sufficient to completely rescue the plasticity phenotypes and baseline GluR expression, but glial-Shv does not appear to activate integrin signaling which was shown to be the mechanism for neuronally derived Shv to control plasticity. This led the authors to hypothesize that glial Shv works by controlling the levels of neuronal Shv and extracellular glutamate. They provide evidence that in the absence of glial Shv, synaptic levels of Shv go up overall, presumably indicating that neurons secrete more Shv. 
In this context, neuronal Shv could then work via integrin signaling as described to control plasticity. They use a glutamate sensor and observe decreased signal (extracellular glutamate) from the sensor in glial Shv KD animals; however, this background has extremely high GluR levels at the synapse, which may account for some or all of the decreases in sensor signal in this background. Additional controls to test if increased GluR density alone affects sensor readouts and/or independently modulating GluR levels in the glial KD background would help strengthen this data. In fact, glial-specific shv KD animals have baseline levels of GluR that are potentially high enough to have hit a ceiling of expression or detection that accounts for the inability of these levels to modulate any higher after strong stimulation, and such a ceiling effect should be considered when interpreting the data and conclusions of this paper. Several outstanding questions remain: why can't glial-derived Shv activate integrin pathways when exogenously applied recombinant Shv protein can? The effects of neuronal-specific rescue of shv in a shv mutant are not provided vis-à-vis GluR levels and bouton size to compare to the glial-only rescue. Inclusion of this data might provide more insight into outstanding questions of how and why the source of Shv seems to matter for some aspects of the phenotypes but not others, despite the fact that exogenous Shv can rescue in some experimental paradigms but not others.

      We appreciate your insightful comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please also see the enclosed point-by-point response.

      To address the question of whether altered GluR density alone affects sensor readouts, we expressed GluR using a mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensor dynamics and further support our conclusions.

      In regard to the question about ceiling effect, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state. This is based on results in Figure 6, which shows that GluR levels can increase up to fourfold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation. 

      To address the question of why exogenously applied Shv activates integrin while glial-derived Shv does not, we tested whether glia and neurons could differentially modify Shv. Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences between neuronal and glial Shv are not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform. However, while we did not detect obvious post-translational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through post-transcriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which will result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future. We have now included this in the Discussion section.

      Additionally, we have now included new data on neuronal Shv rescue of shv<sup>1</sup> mutants as suggested in the updated Figure 4. Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang and colleagues provides compelling evidence that glia-derived Shriveled (Shv) modulates activity-dependent synaptic plasticity at the Drosophila neuromuscular junction (NMJ). This mechanism differs from the previously reported function of neuronally released Shv, which activates integrin signaling. They further show that this requirement of Shv is acute and that glial Shv supports synaptic plasticity by modulating neuronal Shv release and the ambient glutamate levels. However, there are a number of conceptual and technical issues that need to be addressed.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below.

      Major comments:

      (1) From the images provided for Fig 2B +RU486, the bouton size appears to be bigger in shv RNAi + stimulation, especially judging from the outline of GluR clusters.

      Thank you for pointing this out. We have selected another image to better represent the data.

      (2) The shv result needs to be replicated with a separate RNAi.

      We have used another independent RNAi line targeting shv to confirm our findings (BDSC 37507). This shv-RNAi<sup>37507</sup> line also showed the same phenotype, including increased GluR levels and impaired activity-induced synaptic remodeling (new Figure 2 – figure supplement 1A).

      (3) The phenotype of shv mutant resembles that of neuronal shv RNAi - no increased GluR baseline. Any insights why that is the case?

      This is an interesting question. We speculate that neuronal Shv normally has a dominant role in maintaining GluR levels during development, mainly through its ability to activate integrin signaling. Consistent with this, we have shown that mutations in integrin lead to a drastic reduction in GluR levels at the NMJ (Lee et al., 2017). While we have shown that neuronal knockdown of shv elevates Shv from glia (Fig. 5E), glial Shv cannot activate integrin signaling (Fig. 5B, 5C). Additionally, high levels of glial Shv will elevate ambient glutamate concentrations (Figure 6A), which will likely reduce GluR abundance and impair synaptic remodeling (Augustin et al., 2007; Chen et al., 2009; Figure 6B). Therefore, neuronal knockdown of Shv resulted in the same phenotype as the shv<sup>1</sup> mutant.

      (4) In Fig 3B, SPG shv RNAi has elevated GluR baseline, while PG shv RNAi has a lower baseline. In both cases, there is no activity induced GluR increase. What could explain the different phenotypes?

      SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007), therefore having a stronger effect. We have now mentioned this in the text. 

      (5) In Fig 4C, the rescue of PTP is only partial. Does that suggest neuronal shv is also needed to fully rescue the deficit of PTP in shv mutants?

      This is indeed a possibility. We have shown that neuronal and glial Shv each contribute to activity-induced synaptic remodeling through different mechanisms. It will be interesting to test this in the future.

      (6) The observation in Fig 5D is interesting. While there is a reduction in Shv release from glia after stimulation, it is unclear what the mechanism could be. Is there a change in glial shv transcription, translation or the releasing machinery? It will be helpful to look at the full shv pool vs the released ones. 

      Thank you for the suggestion. To address this, we monitored the levels of intracellular Shv using a permeabilized preparation (we found that the addition of detergent to permeabilize the sample strips away extracellular Shv). Combined with the extracellular staining results, we can get an idea about the total amount of Shv. As shown in the updated Figure 5D, intracellular Shv levels (permeabilized) remained unchanged following stimulation, indicating that there is no intracellular accumulation and that the observed decrease in extracellular Shv is unlikely due to impaired release machinery.

      (7) In Fig 5E, what will happen after stimulation? Will the elevated glial Shv after neuronal shv RNAi be retained in the glia? 

      Thank you for the interesting question. We agree that examining Shv distribution following neuronal activity would be highly informative. While we plan to perform time-lapse experiments in future studies to address this, we feel that such analyses are beyond the scope of the current manuscript.

      (8) It would be interesting to see if the localization of shv differs based on if it is released by neuron or glia, which might be able to explain the difference in GluR baseline. For example, by using glia-Gal4>UAS-shv-HA and neuronal-QF>QUAS-shv-FLAG. It seems important to determine if they mix together after release? It is unclear if the two shv pools are processed differently.

      We agree that investigating whether neuronal and glial Shv pools colocalize or are differentially processed is an important future direction. We hope to examine how each pool responds to stimulation in the shv<sup>1</sup> mutant background using LexA and Gal4 systems in the future.

      (9) Alternatively, do neurons and glia express and release different Shv isoforms, which would bind different receptors?

      Thank you for the questions. We have now addressed this in the discussion and also enclosed below:

      Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences between neuronal and glial Shv are not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform (Ozturk-Colak et al., 2024). However, while we did not detect obvious post-translational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through post-transcriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which could result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future.

      (10) It is claimed that Sup Fig 2 shows no observable change in gross glial morphology, further bolstering support that glial Shv does not activate integrin. This seems quite an overinterpretation. There is only one image for each condition without quantification. It is hard to judge if glia, which is labeled by GFP (presumably by UAS-eGFP?), is altered or not.

      Thank you for raising this concern. To strengthen our claim, we now include additional images (Figure 5, figure supplement 2). No obvious change in overall glial morphology was observed, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons (Figure 5, figure supplement 2). These observations suggest that glial Shv is not essential for maintaining normal glial structure or survival, and are consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers.

      (11) The hypothesis that glutamate regulates GluR level as a homeostatic mechanism makes sense. What is the explanation of the increased bouton size in the control after glutamate application in Fig 6?

      We speculate that it could be due to a retrograde signaling mechanism activated by elevated extracellular glutamate, allowing neurons to modulate bouton morphology in response to synaptic demand. It will be interesting to investigate this possibility in the future.  

      (12) What could be a mechanism that prevents elevated glial released Shv to activate integrin signaling after neuronal shv RNAi, as seen in Fig 5E?

      One potential mechanism is post-translational or post-transcriptional processing of Shv. Although our Western blots did not reveal differences in the molecular weight of glial vs. neuronal Shv, we cannot exclude the possibility that modifications not readily detectable by this method are responsible. Additionally, as mentioned in the Discussion section, post-transcriptional processing such as A-to-I RNA editing could introduce changes in the Shv protein, potentially altering its ability to interact with or activate integrin. 

      (13) Any speculation on how the released Shv pool is sensed?

      The same RNA editing modification mentioned earlier or post-translational modifications in Shv may also influence how it is sensed by target cells. 

      Reviewer #1 (Recommendations for the authors):

      Issues Regarding Cell Type-Specific Secretion and the Role of Shv:

      Extracellular Secretion of Shv:

      (1) The data in Figure 1 suggest that Shv is not secreted under resting conditions, challenging the proposed extracellular role of Shv. It remains unclear whether Shv secretion can be confirmed using Shv-eGFP (knock-in) following high K+ stimulation.

      We apologize for not being clear. In Figure 1, the Shv signals shown are from a permeabilized preparation, which preferentially labels intracellular Shv. We do observe secreted Shv-eGFP following stimulation (Figure 5E), consistent with our hypothesis. However, the endogenous extracellular Shv-eGFP signal is very weak, and was therefore detected using the GFP antibody and amplified with a fluorescent secondary antibody. We have now also included additional controls in Figure 5E to demonstrate the specificity of the staining.

      (2) In Figure 5D, total Shv staining should be included to evaluate potential presynaptic accumulation of intracellular Shv, which may lead to extracellular secretion upon stimulation. Additionally, the representative images of glial rescue do not seem to align with the quantification data; more extracellular Shv signals were observed after stimulation.

      Thank you for the comments. We monitored the levels of intracellular Shv using a permeabilized preparation (detergent treatment stripped away extracellular Shv signal). When combined with non-permeabilized extracellular staining, this approach provides insights into total Shv levels. We found no intracellular accumulation of Shv and the intracellular levels remained unchanged following stimulation (updated Figure 5D), suggesting that reduced extracellular Shv is not likely due to impaired release. Additionally, we have selected another image for glial rescue by avoiding the trachea region, which better represents the quantification data.

      (3) In Figure 5E, "extracellular" Shv staining in repo>shv-RNAi samples appears localized within synaptic boutons. This raises concerns about the staining protocol potentially labeling intracellular proteins. Control experiments using presynaptic cytosolic markers are needed to confirm staining specificity.

      Thank you for the thoughtful suggestion. To validate that our staining protocol is selective for extracellular proteins, we also stained for cysteine string protein (CSP), an intracellular synaptic vesicle protein predominantly located in the presynaptic terminals (Zinsmaier et al., 1990; Umbach et al., 1994), under the same conditions. CSP was detected only in the permeabilized condition (updated Figure 5E), suggesting that the non-permeabilizing protocol is selective for extracellular proteins. 

      (4) The study does not clarify why Shv knockdown in either perineurial glia or subperineurial glia abolishes stimulus-dependent synaptic remodeling. Does Shv secretion occur from PG, SPG, or both toward the synaptic bouton?

      Thank you for raising this point. SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007). Consistent with this, we observed a stronger effect on GluR levels when SPG was disrupted compared to PG. It will be interesting to distinguish whether Shv is released by PG or SPG in the future.

      (5) The possibility of an inter-glial role for Shv via integrin signaling in regulating glial morphogenesis is underexplored. The rough morphological characterization in Supplemental Figure 2 requires more detailed quantification and the use of sub-glial type-specific GAL4 drivers.

      We now include additional images (Figure 5, figure supplement 2) to examine the overall glial morphology. There was no obvious change in gross glial morphology, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons when shv is knocked down in glia (Figure 5, figure supplement 2). These observations suggest that glial Shv is not essential for maintaining normal glial structure or survival, and are consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers (Xie and Auld, 2011; Hunter et al., 2020).

      (6) While repo>shv rescues stimulus-dependent bouton size and GluR increases in the shv mutant (Figure 5), the interaction between neuronal and glial Shv remains unclear. Does neuronal Shv influence the expression or distribution of glial Shv?

      We agree that investigating whether neuronal and glial shv pools influence each other’s expression or distribution is an important future direction. We hope to investigate this in more detail in the future using LexA-LexOp and GAL4/UAS dual expression systems.

      Issues Regarding the Regulation of GluR and Perisynaptic Glutamate by Glial Shv:

      (7) The methodology for iGluSnFR measurement (Figure 6A) is inadequately described. If anti-HRP staining was used to normalize signals, it suggests the experiment may have involved fixed tissue. However, iGluSnFR typically measures glutamate levels in live cells, raising concerns about the validity of this approach in fixed samples.

      We apologize for not being clear about the method used to measure iGluSnFR. The original figure was generated from imaging iGluSnFR signals immediately following fixation. To address the reviewer’s concern and validate these results, we have now performed live imaging experiments using a water dipping objective to measure iGluSnFR intensity in unfixed preparations (new Figure 6A). To label synaptic boutons, we co-expressed mtdTomato using the neuronal driver, nSyb-GAL4. The results from the live imaging experiments confirmed our original observations that glial Shv is required to control ambient extracellular glutamate levels (see updated Fig. 6A and text). Additionally, to ascertain that the decrease in iGluSnFR signal reflects a decrease in ambient extracellular glutamate levels rather than glutamate depletion caused by high levels of GluR, we upregulated GluR levels using mhc-GluRIIA, which drives GluRIIA expression in muscles (Petersen et al., 1997). We found that mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. However, iGluSnFR signals at the synapse remained unchanged (Figure 6A), suggesting that elevated GluR density alone does not reduce signals. Taken together, these results suggest that glial Shv plays a critical role in controlling ambient extracellular glutamate levels.

      (8) As shown in Figure 2, repo>shv-RNAi increases GluR levels before high K+ stimulation, potentially saturating postsynaptic GluR expression and precluding further increases upon stimulation.

      Our data in Figure 6 show that GluR levels can increase up to four-fold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation. Thus, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state, but rather reflects a requirement for glial Shv in activity-dependent modulation.
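      The arithmetic behind this argument can be sketched as follows. This is only an illustration, assuming that both fold changes quoted above are expressed relative to the same unstimulated control and that the roughly four-fold level is a lower bound on what the synapse can reach; the function and variable names are hypothetical and are not taken from the manuscript.

```python
def remaining_headroom(baseline_fold: float, max_observed_fold: float) -> float:
    """Fold increase still available before reaching the highest observed level.

    Both arguments are assumed to be normalized to the same control (control = 1.0).
    """
    if baseline_fold <= 0 or max_observed_fold <= 0:
        raise ValueError("fold changes must be positive")
    return max_observed_fold / baseline_fold


# Approximate values quoted in the response (assumed to share one normalization).
baseline_knockdown = 2.0  # ~2-fold baseline GluR in repo>shv-RNAi
max_observed = 4.0        # up to ~4-fold GluR after stimulation with glutamate

headroom = remaining_headroom(baseline_knockdown, max_observed)
print(f"Remaining capacity: about {headroom:.1f}-fold above the knockdown baseline")
```

      Under these assumptions, a headroom value above 1 means the knockdown baseline sits below the highest level observed elsewhere in the dataset, which is the basis for arguing against a saturated postsynaptic state.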

      (9) Despite glial shv knockdown lowering extracellular glutamate levels, GluR levels unexpectedly increase (Figure 6B). This contradicts the known requirement for high ambient glutamate concentrations to promote GluR clustering and membrane expression (Chen et al., 2009). Furthermore, adding 2 mM glutamate reverses these increases, suggesting additional complexity in the regulation of Shv synaptic remodeling.

      Thank you for the comment and the opportunity to clarify this point. While it may seem counterintuitive at first glance, our observations are in line with previous reports showing that low ambient glutamate levels significantly elevate GluR intensity at the Drosophila NMJ (Chen et al., 2009), and that this increase can be reversed by glutamate supplementation (Augustin et al., 2007; Chen et al., 2009). We have revised the text to more clearly reflect this connection.

      (10) If glial Shv promotes GluR expression, why does the increased extracellular Shv from neuronal shv knockdown (elav>shv-RNAi, Figure 5E) fail to elicit stimulus-dependent GluR elevation?

      We speculate that this is because glial Shv does not activate integrin signaling (Figure 5B, C), and elevated glial Shv increases ambient glutamate concentration (Figure 6A), thereby reducing GluR expression (Augustin et al., 2007; Chen et al., 2009). This is indeed what we observed when shv is knocked down in neurons. 

      Additional Issues:

      (11) The type of bouton used for quantification (e.g., Ib or Is boutons) is not specified, which is critical for interpreting the results.

      We apologize for not being clear. We analyzed type Ib boutons as done previously (Lee et al., 2017 and Chang et al., 2024), and have now included this information in the Methods section.  

      (12) The extent of Shv protein depletion in the repo-GeneSwitch system needs validation to confirm the efficacy of the knockdown.

      Thank you for the suggestion. We confirmed the efficiency of acute shv knockdown by the repo-GeneSwitch system by performing Western blot analysis of dissected larval brains (Figure 2 – figure supplement 1B). Acute glial knockdown using the repo-GeneSwitch driver resulted in a 30% reduction in Shv levels, similar to the decrease observed with the repo-GAL4 driver, suggesting that the GeneSwitch driver is functional. Furthermore, knockdown of shv by the ubiquitous tubulin-GAL4 driver completely eliminated Shv protein, indicating that the RNAi construct is effective.  

      Reviewer #2 (Recommendations for the authors):

      (1) General comment on statistics/data presentation: The authors employ an unusual method of using both one-way ANOVA and multiple t-test stats for the same data. Would a 2-way ANOVA be the more appropriate solution to this problem (to analyze across genotype and stimulation condition)? Also a chart in the supplementals showing all comparisons rather than just the fraction explicitly reported in the graphs would be helpful (it is not clear if no indication on significance indicates no difference or just not reported between some of the baseline levels, especially since everything is presented as ratios and in some cases this could help with data interpretation of which baseline levels are different and how they compare to other baselines and other post-stim levels). Further, there are no sample sizes given for any experiment, nor are any values of means, SD, etc ever explicitly given.

      We appreciate the thoughtful suggestion. While a two-way ANOVA could be used to examine interaction effects between genotype and stimulation condition, our analysis was designed to address a specific biological question: whether each genotype, independent of baseline levels, is capable of undergoing activity-dependent synaptic remodeling. To this end, we used t-tests to directly compare unstimulated vs. stimulated conditions within each genotype, allowing us to determine whether stimulation produces a significant effect in an all-or-none manner. In parallel, we applied one-way ANOVA with post hoc tests to analyze differences among baseline (unstimulated) conditions across genotypes. This approach is justified by the fact that stimulation was applied acutely and separately, and therefore the baseline values should not be influenced by the stimulated condition. Because we were not aiming to compare the extent of synaptic remodeling between genotypes, we did not use a two-way ANOVA to analyze interaction effects across all conditions.
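
      The two-question design described above can be sketched as follows. This is a minimal illustration only: the intensity values are hypothetical and are not data from the study, the function names are invented for the example, and Welch's form of the t statistic is an assumption, since the response does not state which t-test variant was used. Post hoc tests and p-values are omitted.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(b) - mean(a)) / se


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA across the given groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical GluR intensities (arbitrary units); NOT data from the study.
data = {
    "control": {"unstim": [1.0, 1.1, 0.9, 1.0], "stim": [1.5, 1.6, 1.4, 1.5]},
    "repo>shv-RNAi": {"unstim": [1.9, 2.0, 2.1, 2.0], "stim": [2.0, 1.9, 2.1, 2.0]},
    "elav>shv-RNAi": {"unstim": [1.0, 0.9, 1.1, 1.0], "stim": [1.0, 1.1, 0.9, 1.0]},
}

# Question 1: does stimulation change GluR within each genotype?
within = {g: welch_t(d["unstim"], d["stim"]) for g, d in data.items()}

# Question 2: do baselines differ across genotypes? (unstimulated groups only)
baseline_f = one_way_anova_f([d["unstim"] for d in data.values()])
```

      Keeping the two questions in separate tests means a genotype with an elevated baseline is still assessed only on whether it responds to stimulation, which is the all-or-none readout described above.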

      In response to the reviewer’s suggestion, we have now added the sample number in the graphs. Additionally, in the Methods section, we include information that each sample represents a biological replicate, and that data are presented as fold-change relative to unstimulated controls from the same experimental batch. This normalization is necessary, as absolute GluR intensities can vary depending on microscope settings and staining conditions.
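
      A minimal sketch of this batch-wise normalization is shown below. The record layout, field names, and numbers are all assumptions made for illustration; they are not taken from the manuscript.

```python
from statistics import mean


def fold_change_by_batch(records):
    """Divide each intensity by the mean unstimulated control of its own batch."""
    baselines = {}
    for r in records:
        if r["genotype"] == "control" and r["condition"] == "unstim":
            baselines.setdefault(r["batch"], []).append(r["intensity"])
    batch_mean = {b: mean(v) for b, v in baselines.items()}
    return [
        {**r, "fold_change": r["intensity"] / batch_mean[r["batch"]]}
        for r in records
    ]


# Hypothetical measurements: batch 2 was imaged at roughly twice the gain.
records = [
    {"batch": 1, "genotype": "control", "condition": "unstim", "intensity": 100.0},
    {"batch": 1, "genotype": "control", "condition": "unstim", "intensity": 120.0},
    {"batch": 1, "genotype": "control", "condition": "stim", "intensity": 165.0},
    {"batch": 2, "genotype": "control", "condition": "unstim", "intensity": 200.0},
    {"batch": 2, "genotype": "control", "condition": "unstim", "intensity": 240.0},
    {"batch": 2, "genotype": "control", "condition": "stim", "intensity": 330.0},
]

normalized = fold_change_by_batch(records)
stim_fold = [r["fold_change"] for r in normalized if r["condition"] == "stim"]
```

      In this toy example the two stimulated samples differ two-fold in raw intensity but yield the same fold-change, which is the reason for normalizing within each batch.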

      (2) To clarify distinct roles of Shv coming from neurons vs glia it would help if the authors could include more data on the rescue of shv mutants with UAS-Shv in neurons alone. This data is never shown in the manuscript and data on what effect this rescue has on the pertinent phenotypes in this paper (bouton size and GluR staining) is not reported in the referred to 2017 paper. What this does and does not do for these phenotypes has important implications for how to interpret the glia-only rescue findings.

      Thank you for the suggestion. We have now included new data on neuronal Shv rescue in shv<sup>1</sup> mutants as suggested (updated Figure 4A). Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      (3) Figure 1C: Where are the images in the periphery taken? The morphology of the glia is odd in that "blobs" of glial membrane seemingly unattached to anything else are floating about? Perhaps these are a thin stack projection and so the connection to the main glia "stalks" are just cut off? Could a specific individual synapse be shown? Also consider HRP shown on its own so that where the actual boutons are could be more clear. It seems like both the Tomato and HRP channels are really overexposed making visualizing the morphology quite confusing. Also why not use the antibody against Shv to directly visualize expression which is more direct than a knock-in tagged version?

      Figure 1C shows a single optical slice of the NMJ at muscle segment 2, selected to clearly highlight Shv-eGFP localization at a branch in close contact with the glial membrane. The glial stalk is not visible in this image because it lies in a different focal plane from the branch of interest. We have now specified this information in the figure legend. In the original figure, the HRP signal (405 channel) was oversaturated, which interfered with visual clarity. In the updated Figure 1C, we reduced the intensity of overexposed channels to better reveal the weak Shv-eGFP signal and fine glial processes. While we have generated an antibody against Shv, the amount is extremely limited, and hence the Shv-eGFP fusion serves as a valuable tool for visualizing subcellular localization.

      (4) Do glutamate levels really rise in glia Shv KD? Although iGluSnFR signal changes could it be the high level of GluR at the synapse acting as sponges to sequester glutamate so that it can't stimulate the sensor as well? One way to test this would be to overexpress or KD GluRs in muscle in wildtype (or in the repo>Shv RNAi background) to see if that alone can modulate iGluSnfR signals?

      Thank you for suggesting this important control. To address the question of whether high GluR density alone could influence neuronal iGluSnFR sensor readouts, we expressed GluR using an mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensor dynamics and further support our conclusions.

      (5) The authors have some Shv constructs that can't be secreted or can't bind to integrins. Performing cell type specific rescues with these constructs might also help distinguish how source matters for each proposed sub-function of Shv though this may be outside the scope of this study. 

      Thank you for noticing the Shv constructs we have. We hope to further test subfunctions of Shv in the future.

      (6) At one point the authors discuss experiments that measure how much Shv is released by glia during neuronal stimulation. Then state that "These data indicate that glial Shv does not directly inhibit integrin signaling." But how this experiment relates to integrin signaling is not explained and unclear.

      We apologize for the confusion. We have now updated the text to better explain our logic: “This activity-induced decrease in glial Shv levels, along with reduced integrin activation (Fig. 5B), suggest that glial Shv does not act by directly inhibiting integrin signaling.”

      Reviewer #3 (Recommendations for the authors):

      Minor comments

      (1) Readers are left wondering what causes the increased baseline of GluR after glial shv RNAi at Fig 1, which is addressed much later. It would be helpful to preemptively mention this.

      Thank you for the suggestion. To maintain a logical flow, we chose to first present the phenotypic data in Figures 1 and 2 and then return to the mechanistic explanation once we introduced ambient glutamate measurements. 

      (2) Be consistent with eGFP vs EGFP.

      Thank you, we have corrected the inconsistencies.  

      (3) Scale bar for Fig 1B is missing in the low-magnification panel.

      Thank you for pointing this out. We have added the scale bar to Figure 1B.

      (4) Fig 1C, it would be helpful to elaborate on the anatomy. For example, what NMJ/abdominal segment is this? Why only some axons are surrounded by glia?

      Figure 1C presents a single optical slice of the NMJ at muscle segment 2, chosen to highlight Shv-eGFP localization at a branch closely juxtaposed to the glial membrane. The glial stalk is not shown in this image because it resides in a different focal plane than the branch being visualized. We have now included this information in the figure legend.

      (5) For Fig 3B, while it is stated that "we observed normal synaptic remodeling using alrmGAL4," the effect size is smaller. There seems to be a decrease in the amount of synaptic remodeling occurring?

      Thank you for pointing this out. Our primary goal was to determine whether each genotype, regardless of baseline GluR levels, is capable of undergoing activity-dependent synaptic remodeling in response to stimulation. For this reason, we focused on detecting the presence or absence of remodeling rather than comparing the extent of remodeling across genotypes. While a smaller effect on activity-induced bouton size was observed with alrm-GAL4, the change was still statistically significant, indicating that remodeling does occur in this genotype. Currently, we do not have a clear biological interpretation for differences in the magnitude of remodeling, and therefore chose not to emphasize cross-genotype comparisons.

    1. Author response:

      Reviewer #1 (Public review):

      Major Concerns:

      (1) Lack of Direct Evidence for RadD-NKp46 Interaction

      The central claim that RadD interacts with NKp46 is not formally demonstrated. A direct binding assay (e.g., Biacore, ELISA, or pull-down with purified proteins) is essential to support this assertion. The absence of this fundamental experiment weakens the mechanistic conclusions of the study.

      The reviewer is correct. Direct binding assays are currently not feasible because RadD is a very large protein whose purification would take years. Instead, we performed immunoprecipitation assays using NKp46-Ig (Author response images 1 and 2). Fusobacteria were lysed using RIPA buffer, and the lysates were centrifuged twice to separate the supernatant from the pellet (which contains the bacterial membranes). The resulting lysates were incubated overnight with 2.5 µg of purified NKp46 and protein G-beads. After thorough washing, the bound proteins were placed in sample buffer and heated at 95 °C for 8 minutes. The eluates were run on a 10% acrylamide gel and visualized by Coomassie blue staining. As can be seen, NKp46-Ig precipitated a protein band of around 350 kDa in both F. polymorphum ATCC10953 (Author response image 1) and F. nucleatum ATCC23726 (Author response image 2).

      Author response image 1. NKp46 immunoprecipitation with Fusobacterium polymorphum (ATCC 10953) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of control fusion protein (RBD-Ig) or with NKp46-Ig. 2.5 μg of purified fusion proteins were also run on the gel.

      Author response image 2. NKp46 immunoprecipitation with Fusobacterium nucleatum (ATCC 23726) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of control fusion protein (RBD-Ig) or with NKp46-Ig. 2.5 μg of purified fusion proteins were also run on the gel.

      (2) Figure 2: Binding Specificity and Bacterial Strains

      A CEACAM1-Ig control should be included in all binding experiments to distinguish between specific and non-specific Ig interactions. There is differential Ig binding between strains ATCC 23726 and 10953. The authors should quantify RadD expression in each strain to determine if the difference in binding is due to variation in RadD levels.

      No significant difference in mCEACAM-1-Ig binding was observed across multiple independent experiments. Author response image 3 shows a representative histogram showing mCEACAM-1-Ig binding to F. nucleatum ATCC 23726 and F. polymorphum ATCC 10953. Comparable binding levels were detected in both bacterial species (upper histogram). Similarly, NKp46-Ig and Ncr1-Ig fusion proteins exhibited comparable binding patterns (lower histogram). It is currently not possible to quantify RadD expression directly, as no anti-RadD antibody is available.

      Author response image 3. CEACAM-1 Ig binding to Fusobacterium ATCC 23726 and ATCC 10953. Upper histograms show staining with secondary antibody alone (gray) compared to CEACAM-1 Ig (black line). Lower histograms show binding of NKp46 and Ncr1 fusion proteins to the two Fusobacterium strains. Gray represents secondary antibody controls.

      (3) Figure 3: Flow Cytometry Inconsistencies and Missing Controls

      What do the FITC-negative, Ig-negative events represent? The authors should clarify whether these are background signals, bacterial aggregates, or debris.

      We now present the gating strategy used in these experiments (Author response image 4). The Ig-negative samples are bacteria stained only with the secondary antibody (anti-human AF647, detected in the APC channel). The FITC-negative events represent unlabeled bacteria.

      Author response image 4. Gating strategy for FITC-labeled Fusobacterium stained with fusion proteins. Bacteria were first gated as shown in the left panel. The gated population was then further analyzed in the right plot: the lower-left quadrant represents bacterial debris, the upper-left quadrant corresponds to FITC-stained bacteria only, and the upper-right quadrant shows bacteria double-positive for FITC and APC, indicating binding of the fusion proteins.

      Panel B, CEACAM1-Ig binding appears markedly increased compared to WT bacteria. The reason for this enhancement should be discussed-does it reflect upregulation of the bacterial ligand or an artifact of overexpression? Fluorescence compensation should be carefully reviewed for the NKp46/NCR1-Ig binding assays to ensure that the signals are not due to spectral overlap or nonspecific binding. Importantly, binding experiments using the FadI/RadD double knockout strain are missing and should be included. This control is essential.

      We do not know why CEACAM1-Ig binding is increased. We agree that the FadI/RadD double-knockout strain would be a valuable control, but we do not currently have it.

      In Panel E, the basis for calculating fold-change in MFI is unclear. Please indicate the reference condition to which the change is normalized.

      The mean fluorescence intensity (MFI) fold change was calculated by dividing the MFI obtained from staining with the fusion proteins by the MFI of the corresponding secondary antibody control (bacteria incubated without fusion proteins).
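
      As a minimal sketch of this calculation (the function name and the MFI values below are hypothetical and serve only to illustrate the ratio):

```python
def mfi_fold_change(mfi_fusion, mfi_secondary_only):
    """Fold change: MFI with fusion protein divided by MFI of the secondary-only control."""
    if mfi_secondary_only <= 0:
        raise ValueError("secondary-only control MFI must be positive")
    return mfi_fusion / mfi_secondary_only


# Hypothetical MFI values for one strain; not measurements from the study.
secondary_only = 300.0
fold = {
    "NKp46-Ig": mfi_fold_change(1200.0, secondary_only),
    "Ncr1-Ig": mfi_fold_change(900.0, secondary_only),
}
```

      A fold change of 1 therefore corresponds to staining no stronger than the secondary antibody alone.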

      (4) Figure 4: Binding Inhibition and Receptor Sensitivity

      Panel A lacks representative FACS plots and is currently difficult to interpret.

      Fusobacteria binding to CEACAM-1, NKp46, and NCR1 fusion proteins was tested in the presence of 5 and 10 mM L-arginine (Author response image 5). L-arginine inhibited the binding of NKp46-Ig and NCR1-Ig, whereas no effect was observed on CEACAM-1-Ig binding.

      Author response image 5. Fusobacterium binding inhibition by L-Arginine. The figure shows the binding of CEACAM1-Ig (left panel), NKp46-Ig (middle panel), and Ncr1-Ig (right panel) in the presence of 0 mM (black), 5 mM (red), and 10 mM (blue) L-arginine.

      Differences in the sensitivity of human vs. mouse NKp46 to arginine inhibition should be discussed, given species differences in receptor-ligand interactions.

      Ncr1, the murine orthologue of human NKp46, shares approximately 58% sequence identity with its human counterpart (1). The observed differences in arginine-mediated inhibition of bacterial binding between mouse and human NKp46 might stem from structural differences or distinct posttranslational modifications, such as glycosylation. Indeed, prediction algorithms combined with high-performance liquid chromatography analysis revealed that Ncr1 possesses two putative novel O-glycosylation sites, of which only one is conserved in humans (2).

      References

      (1) Biassoni R., Pessino A., Bottino C., Pende D., Moretta L., Moretta A. The murine homologue of the human NKp46, a triggering receptor involved in the induction of natural cytotoxicity. Eur J Immunol. 1999 Mar; 29(3).

      (2) Glasner A., Roth Z., Varvak A., Miletic A., Isaacson B., Bar-On Y., Jonjić S., Khalaila I., Mandelboim O. Identification of putative novel O-glycosylations in the NK killer receptor Ncr1 essential for its activity. Cell Discov. 2015 Dec 22; 1:15036.

      What are the inhibition results using F. nucleatum strains deficient in FadI?

      The inhibition pattern observed in the F. nucleatum ΔFadI mutant was comparable to that of the wild-type strain (Author response image 6). When cultured under identical conditions and exposed to increasing concentrations of arginine (0, 5, and 10 mM), the F. nucleatum ΔFadI strain also demonstrated a dose-dependent reduction in binding to NKp46 and Ncr1.

      Author response image 6. Arginine inhibition of NKp46-Ig and Ncr1-Ig binding in F. nucleatum ΔFadI. Histograms show NKp46-Ig (A, C) and Ncr1-Ig (B, D) binding to F. nucleatum ATCC10953 ΔFadI (A and B) and to F. nucleatum ATCC23726 ΔFadI (C and D) following exposure to 5 mM and 10 mM L-arginine. Panels (E) and (F) display the mean fluorescence intensity (MFI) quantification corresponding to (A and B) and (C and D), respectively.

      In Panel B, CEACAM1-Ig and RadD-deficient bacteria must be included as negative controls for binding specificity upon anti-NKp46 blocking.

      We appreciate the request to include CEACAM1-Ig and RadD-deficient bacteria as negative controls for specificity under anti-NKp46 blocking. We do not think this is necessary, for the following reasons. The 02 antibody is specific for NKp46. We also used other anti-NKp46 antibodies that did not block the interaction, as well as an irrelevant antibody. In addition, we showed that arginine produced a dose-dependent reduction in NKp46/Ncr1 binding, consistent with an arginine-inhibitable RadD interaction already shown in our manuscript (Fig. 4A). The ΔRadD strains we used already demonstrate loss of NKp46/Ncr1 binding and loss of NK-boosting activity (Figs. 3, 5). Collectively, these data establish that NKp46/Ncr1 recognition of a high-molecular-weight ligand consistent with RadD is specific and functionally relevant.

      Figure 5: Functional NK Activation and Tumor Killing

      In Panels B and C, the key control condition (NK cells + anti-NKp46, without bacteria) is missing. This is needed to evaluate if NKp46 recognition is involved in tumor killing. The authors should explicitly test whether pre-incubation of NK cells with bacteria enhances their anti-tumor activity.

      No significant difference in NK cell cytotoxicity was observed between untreated NK cells and NK cells incubated with anti-NKp46 antibody in the absence of bacteria. Therefore, the NK + anti-NKp46 (O2) group was included as an additional control alongside the other experimental conditions shown in Figures 5b and 5c, and is presented in Author response image 7 below.

      Author response image 7. NK cytotoxicity against breast cancer cell lines. NK cell cytotoxicity against T47D (left) and MCF7 (right) breast cancer cell lines. This experiment follows the format of Figure 5b and 5c, with the addition of the NK cells + O2 antibody group. No significant differences were observed when values were normalized to NK cells alone.

      Could bacteria induce stress signals in tumor cells that sensitize them to NK killing? This distinction is critical.

      It remains unclear whether the bacteria induce stress-related signals in tumor cells that render them more susceptible to NK cell–mediated cytotoxicity.

      (6) Figure 5D: Mechanism of Peripheral Activation

      It is suggested that contact between bacteria and NK cells in the periphery leads to their activation. Can the authors confirm whether this pre-activation leads to enhanced killing of tumor targets, or if bacteria-tumor co-localization is required? The literature indicates that F. nucleatum localizes intracellularly within tumor cells. If so, how is RadD accessible to NKp46 on infiltrating NK cells?

      We do not expect that pre-activation of NK cells with bacteria would enhance their tumor-killing capacity. In fact, when NK cells were co-incubated with bacteria, we occasionally observed NK cell death. Although F. nucleatum can reside intracellularly, bacterial entry requires prior adhesion to tumor cells. At this stage—before internalization—the bacteria are accessible for recognition and binding by NK cells.

      (8) Figure 5E and In Vivo Relevance

      Surprisingly, F. nucleatum infection is associated with increased tumor burden. Does this reflect an immunosuppressive effect? Are NK cells inhibited or exhausted in infected mice (TIGIT, SIGLEC7...)? If NK cell activation leads to reduced tumor control in the infected context, the role of RadD-induced activation needs further explanation. RadD-deficient bacteria, which do not activate NK cells, result in even poorer tumor control. This paradox needs to be addressed: how can NK activation impair tumor control while its absence also reduces tumor control?

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (9) NKp46-Deficient Mice: Inconsistencies

      In Ncr1⁻/⁻ mice, infection with WT or RadD-deficient F. nucleatum has no impact on tumor burden. This suggests that NKp46 is dispensable in this context and casts doubt on the physiological relevance of the proposed mechanism. This contradiction should be discussed more thoroughly.

      Ncr1 is also directly involved in mediating NK cell–dependent killing of tumor cells, even in the absence of bacterial infection. Therefore, in Ncr1-deficient mice, F. nucleatum has no additional effect on tumor progression (Glasner, A., Ghadially, H., Gur, C., Stanietsky, N., Tsukerman, P., Enk, J., Mandelboim, O. Recognition and prevention of tumor metastasis by the NK receptor NKp46/NCR1. J Immunol. 2012).

      Reviewer #2 (Public review):

      Weaknesses:

      (1) A previous study by this group (PMID: 38952680) demonstrated that RadD of F. nucleatum binds to NK cells via Siglec-7, thereby diminishing their cytotoxic potential. They further proposed that the RadD-Siglec-7 interaction could act as an immune evasion mechanism exploited by tumor cells. In contrast, the present study reports that RadD of F. nucleatum can also bind to the activating receptor NKp46 on NK cells, thereby enhancing their cytotoxic function.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. In contrast, NKp46 and its murine homologue, Ncr1, both recognize and bind the bacterium.

      While F. nucleatum-mediated tumor progression has been documented in breast and colon cancers, the current study proposes an NK-activating role for F. nucleatum in HNSC. However, it remains unclear whether tumor-infiltrating NK cells in HNSC exhibit differential expression of NKp46 compared to Siglec-7. Furthermore, heterogeneity within the NK cell compartment, particularly in the relative abundance of NKp46⁺ versus Siglec-7⁺ subsets, may differ substantially among breast, colon, and HNSC tumors. Such differences could have been readily investigated using publicly available single-cell datasets. A deeper understanding of this subset heterogeneity in NK cells would better explain why F. nucleatum is passively associated with a favorable prognosis in HNSC but correlates with poor outcomes in breast and colon cancers.

      Currently, there are no publicly available single-cell datasets suitable for characterizing NK cell heterogeneity in the context of F. nucleatum infection—particularly regarding the expression of Siglec-7, NKp46, or CEACAM1 and their potential association with poor clinical outcomes in breast, head and neck squamous cell carcinoma (HNSC), or colorectal cancer (CRC). Furthermore, no RNA-seq datasets are available for breast cancer cases specifically associated with F. nucleatum infection and poor prognosis. Therefore, we analyzed bulk RNA expression datasets for Siglec-7 and CEACAM1 and evaluated their associations with HNSC and CRC using the same patient databases utilized in our manuscript (Author response image 8). No significant differences in Siglec-7 expression were detected between HNSC and CRC samples (Author response image 8A). Although CEACAM1 mRNA levels did not differ between F. nucleatum–positive and –negative cases within either cancer type, its overall expression was higher in CRC compared to HNSC (Author response image 8B).

      Author response image 8. Siglec7 and Ceacam1 expression and the prognostic effect of F. nucleatum in a tumor-type-specific manner. Comparison of Siglec7 (A) and Ceacam1 (B) expression across HNSC and CRC tumors. Log₂ expression levels of Siglec7 and Ceacam1 mRNA were compared across HNSC and CRC cohorts, stratified by F. nucleatum status (positive or negative). Results were analyzed by one-way ANOVA with Bonferroni post hoc correction.

      (2) The in vivo tumor data (Figure 5D-F) appear to contradict the authors' claims. Specifically, Figure 5E suggests that WT mice engrafted with AT3 breast tumors and inoculated with WT F. nucleatum exhibited an even greater tumor burden compared to mice not inoculated with F. nucleatum, indicating a tumor-promoting effect. This finding conflicts with the interpretation presented in both the results and discussion sections.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (3) Although the authors acknowledge that F. nucleatum may have tumor context-specific roles in regulating NK cell responses, it is unclear why they chose a breast cancer model in which F. nucleatum has been reported to promote tumor growth. A more appropriate choice would have been the well-established preclinical oral cancer model, such as the 4-nitroquinoline 1-oxide (4NQO)-induced oral cancer model in C57BL/6 mice, which would more directly relate to HNSC biology.

      The tumor model we employed is, to date, the only model in which F. nucleatum has been shown to exert a measurable effect, which is why we selected it for our study (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun. 2020; 11: 3259). We have not tested the 4-nitroquinoline-1-oxide (4NQO)–induced oral cancer model, and we are uncertain whether its use would be ethically justified.

      (4) Since RadD of F. nucleatum can bind to both Siglec-7 and NKp46 on NK cells, exerting opposing functional effects, the expression profiles of both receptors on intratumoral NK cells should be evaluated. This would clarify the balance between activating and inhibitory signals in the tumor microenvironment and provide a more mechanistic explanation for the observed tumor context-dependent outcomes.

      This question was answered in Author response image 8 above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This is an interesting study on the role of FGF signaling in the induction of primitive streak-like cells (PS-LC) in human 2D-gastruloids. The authors use a previously characterized standard culture that generates a ring of PS-LCs (TBXT+) and correlate this with pERK staining. A requirement for FGF signaling in TBXT induction is demonstrated via pharmacological inhibition of MEK and FGFR activity. A second set of culture conditions (with no exogenous FGFs) suggests that endogenous FGFs are required for pERK and TBXT induction. The authors then characterize, via scRNA-seq, various components of the FGF pathway (genes for ligands, receptors, ERK regulators, and HSPG regulation). They go on to characterize the pFGFR1, receptor isoforms, and polarized localization of this receptor. Finally, they perform FGF4 inhibition and use a cell line with a limited FGF17 inactivation (heterozygous null) and show that loss of these FGFs reduces PS-LC and derivative cell types. 

      Strengths: 

      (1) As the authors point out, the role of FGF signaling in gastrulation is less well understood than other signaling pathways. Hence this is a valuable contribution to that field. 

      (2) The FGF4 and FGF17 loss-of-function experiments in Figure 5 are very intriguing. This is especially so given the intriguing observation that these FGFs appear to be dominating in this model of human gastrulation, in contrast to what FGFs dominate in mice, chicks, and frogs. 

      (3) In general this paper is valuable as a further development of the human gastruloid system and the role of FGF signaling in the induction of PS-LCs. The wide net that the authors cast in characterizing the FGF ligand genes, receptor isoforms, and downstream components provides a foundation for future work. As the authors write near the beginning of the Discussion "Many questions remain." 

      We thank the reviewer for these positive comments.

      Weaknesses: 

      (1) FGFs are cell survival factors in various aspects of development. The authors fail to address cell death due to loss of FGF signaling in their experiments. For example, in Figure 1E (which requires statistical analysis) and 1G (the bottom FGFRi row), there appears to be a significant amount of cell loss. Is this due to cell death? The authors should address the question of whether the role of FGF/ERK signaling is to keep the cells alive. 

      Indeed, FGF also strongly affects cell survival and it is an interesting question to what extent this depends on ERK. Our manuscript focuses instead on the role of FGF/ERK signaling in cell fate patterning. As mentioned in our discussion, Figure 1d,e shows that doxycycline-induced pERK leads to more TBXT+ cells than the control without restoring cell number, suggesting that the role of FGF in controlling cell number is independent of the requirement for FGF/ERK in PS-LC differentiation. To further support this, we have added data showing that low doses of MEKi are sufficient to inhibit differentiation without affecting cell number (Supp. Fig. 1i).

      To address the reviewer's question regarding the cause of cell loss, we now stained for BrdU and cleaved Cas3 to assess proliferation and apoptosis in the presence and absence of MEK and FGFR inhibition (new Supp. Fig. 1ef). This shows that the effect of these inhibitors on cell number is primarily due to a reduction in proliferation. We have also included statistical analysis in Fig.1e. 

      (2) Regarding the sparse cells in 1G, is there a reduction in cell number only with FGFRi and not MEKi? Is this reproducible? Gattiglio et al (Development, 2023, PMID: 37530863) present data supporting a "community effect" in the FGF-induced mesoderm differentiation of mouse embryonic stem cells. Could a community effect be at play in this human system (especially given the images in the bottom row of 1G)? If the authors don't address this experimentally they should at least address the ideas in Gattiglio et al. 

      Indeed, FGFRi reproducibly affects cell number more than MEKi, in line with the fact that pathways other than MAPK/ERK downstream of FGF (e.g. PI3K) play important roles in cell survival and growth. However, we think the lack of differentiation in MEKi and FGFRi in Fig.1g cannot be attributed to a loss of cells combined with a community effect. This is because, without FGFRi or MEKi, cells efficiently differentiate to primitive streak at much lower densities than those originally shown, consistent with the data we discuss in response to (1) arguing against a primarily indirect effect of FGF on PS-LC differentiation through cell density. In the context of directed differentiation (rather than 2D gastruloids), we have now shown in a controlled manner that the effect of MEKi and FGFRi does not depend on a community effect by repeating the experiment in Fig.1g while adjusting cell seeding densities to obtain similar final cell densities in all three conditions (new Fig.1g, new Supp Fig.1g). Furthermore, we have included new data showing that extremely sparse cells without MEKi or FGFRi still differentiate without problems (new Supp Fig 1h). We have also included Gattiglio et al in our revised discussion.

      (3) Do the FGF4 and FGF17 LOF experiments in Figure 5 affect cell numbers like FGFRi in Figure 1? 

      We did not observe major changes in cell number in the FGF4 and FGF17 loss of function experiments. This is in line with our observation that low levels of ERK signaling are sufficient to maintain proliferation (new Supp. Fig. 1i), and the fact that low levels of ERK signaling are maintained in the absence of FGF4 and FGF17 (Fig.5), likely by FGF2 (Fig. 2). In contrast, FGFRi treatment in Fig.1 leads to a nearly complete loss of FGF signaling (ERK and other pathways) that has a dramatic effect on cell number.

      Why examine PS-LC induction only in FGF17 heterozygous cells and not homozygous FGF17 nulls? 

      We were unable to obtain homozygous FGF17 nulls, and it is not clear whether there is a reason for this. In the absence of homozygous nulls, we have now further corroborated our findings with additional knockdown data (described in response to other comments below).

      (4) The idea that FGF8 plays a dominant role during gastrulation of other species but not humans is so intriguing it warrants deeper testing. The authors dismiss FGF8 because its mRNA "...levels always remained low." (line 363) as well as the data published in Zhai et al (PMID: 36517595) and Tyser et al (PMID: 34789876). But there are cases in mouse development where a gene was expressed at levels so low, that it might be dismissed, and yet LOF experiments revealed it played a role or even was required in a developmental process. The authors should consider FGF8 inhibition or inactivation to explore its potential role, despite its low levels of expression. 

      We thank the reviewer for this suggestion. We have now analyzed the role of FGF8 using FISH to visualize its expression and siRNA to understand its function (Fig.5d,f,h; Supp.Fig.5e,g,6e). We found that FGF8 expression is higher earlier in differentiation, preceding most expression of TBXT. Our scRNA-seq only analyzed samples at 42h so did not capture this. Furthermore, FGF8 expression localized inside the PS-like ring rather than coinciding with it like FGF4. Surprisingly, FGF8 knockdown led to an increase in primitive streak-like differentiation, suggesting it may counteract FGF4. The results are shown in the revised Fig. 5 and Supplemental Fig. 5. While this certainly merits further investigation, understanding the role of FGF8 in more detail is beyond the scope of the current work. 

      (5) Redundancy is a common feature in FGF genetics. What is the effect of inhibiting FGF4 in FGF17 LOF cells? 

      Further siRNA and shRNA experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 knockdown alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 acts downstream of FGF4-dependent PS-differentiation, with its primary role coming later, in mesoderm differentiation. FGF17 may still feed back positively to enhance further PS-differentiation, an effect we previously interpreted as partial redundancy.

      (6) I suggest that the authors take more caution in describing FGF gradients. For example, in one Results heading they write "Endogenous FGF4 and FGF17 gradients underly the ERK activity pattern.", implying an FGF protein gradient. However, they only present data for FGF mRNA, not protein. This issue would be clarified if they used proper nomenclature for gene, mRNA (italics), and protein (no italics) throughout the paper. 

      Thank you for the suggestion. We have edited the paper to more clearly distinguish protein and mRNA. We do think our data provide substantial indirect evidence for a protein gradient which is what the results heading is meant to convey. Receptor activation is high where ERK activity is high (Fig.3), and receptor activation is limited by ligands, since creating a scratch to let exogenous FGF reach the basal side of cells in the center leads to receptor activation (Fig.4). This strongly suggests ERK activity reflects an FGF protein gradient. 

      Reviewer #2 (Public review): 

      Summary: 

      The role of FGFs in embryonic development and stem cell differentiation has remained unclear due to its complexity. In this study, the authors utilized a 2D human stem cell-based gastrulation model to investigate the functions of FGFs. They discovered that FGF-dependent ERK activity is closely linked to the emergence of primitive streak cells. Importantly, this 2D model effectively illustrates the spatial distribution of key signaling effectors and receptors by correlating these markers with cell fate markers, such as T and ISL1. Through inhibition and loss-of-function studies, they further corroborated the needs of FGF ligands. Their data shows that FGFR1 is the primary receptor, and FGF2/4/17 are the key ligands for primitive streak development, which aligns with observations in primate embryos. Additional experiments revealed that the reduction of FGF4 and FGF17 decreases ERK activity. 

      Strengths: 

      This study provides comprehensive data and improves our understanding of the role of FGF signaling in primate primitive streak formation. The authors provide new insights related to the spatial localization of the key components of FGF signaling and attempt to reveal the temporal dynamics of the signal propagation and cell fate decision, which has been challenging. 

      Weaknesses: 

      Given the solid data, the work only partially clarifies the complex picture of FGF signaling, so details remain somewhat elusive. The findings lack a strong punchline, which may limit their broader impact. 

      We thank this reviewer for their valuable feedback and compliment on the solidity of our data. The punchline of our work is that FGF4 and FGF17-dependent ERK signaling plays a key role in differentiation of human PS-like cells and mesoderm, and that these are different FGFs than those thought to drive mouse gastrulation. A second key point is that like BMP and TGFβ signaling, FGF signaling is restricted to the basolateral sides of pluripotent stem cell colonies due to polarized receptor expression, which is crucial for understanding the response to exogenous ligands added to the cell medium. Indeed, many facets of FGF signaling remain to be investigated in the future, such as how FGF regulates and is regulated by other signals, which we will dedicate a different manuscript to. 

      Reviewer #3 (Public review): 

      Jo and colleagues set out to investigate the origins and functions of localized FGF/ERK signaling for the differentiation and spatial patterning of primitive streak fates of human embryonic stem cells in a well-established micropattern system. They demonstrate that endogenous FGF signaling is required for ERK activation in a ring domain in the micropatterns, and that this localized signaling is directly required for differentiation and spatial patterning of specific cell types. Through high-resolution microscopy and transwell assays, they show that cells receive FGF signals through basally localized receptors. Finally, the authors find that there is a requirement for exogenous FGF2 to initiate primitive streak-like differentiation, but endogenous FGFs, especially FGF4 and FGF17, fully take over at later stages. 

      Even though some of the authors' findings - such as the localized expression of FGF ligands during gastrulation and the importance of FGF/ERK signaling for cell differentiation in the primitive streak - have been reported in model organisms before, this is one of the first studies to investigate the role of FGF signaling during primitive streak-like differentiation of human cells. In doing so, the paper reports a number of interesting and valuable observations, namely the basal localization of FGF receptors which mirrors that of BMP and Nodal receptors, as well as the existence of a positive feedback loop centered on FGF signaling that drives primitive-streak differentiation. The authors also perform a comparison of the role of different FGFs across species and try to assign specific functions to individual FGFs. In the absence of clean genetic loss-of-function cell lines, this part of the work remains less strong. 

      We thank the reviewer for emphasizing the value of our findings in a human model for gastrulation. We agree that more loss-of-function experiments would provide further insight into the role of different FGFs. While we did not manage to create knockout cell lines, we have now performed both siRNA and shRNA knockdown of FGF4 and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made FGF4+FGF17 shRNA double-knockdown cell lines to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). Our data suggest that FGF17 may be downstream of FGF4 and primarily required for mesoderm differentiation, while FGF8 appears to counteract FGF4. In doing this we have added a large amount of new data to the manuscript, and we have removed the heterozygous knockout data from the first version of the manuscript, which we felt added little to the new data. Further experiments are still needed to solidify our interpretation, but those are beyond the scope of the current work.

      Reviewer #1 (Recommendations for the authors): 

      (1) FGF2 is added to culture experiments (e.g. Figure 4), but the commercial source is not mentioned in Methods. For example, it could be added to "Supplementary Table 1: Cell signaling reagents." 

      We apologize for this oversight and have now added the information to Supplementary Table 1.

      (2) Line 117-118: "For example, by controlling the expression of Wnt or Nodal which are both required for PS-like differentiation". It is clear what the authors mean, but this is not a complete sentence. 

      We edited this for clarity; it now reads: “First, is FGF/ERK signaling required directly for PS-like differentiation, or does it act indirectly? These possibilities are not mutually exclusive. For example, FGF/ERK could be required directly but also act indirectly by controlling Wnt or Nodal expression, as both Wnt and Nodal signaling are required for PS-like differentiation.”

      (3) Line 246 "...found its spatial pattern to strongly resembles that of pERK..." either remove "to" or change "resembles" to "resemble" 

      Thank you for catching this. We removed “to”.

      (4) Lines 391- 393 seem to be missing a word in the last phrase: "...with FGF17 more important continued differentiation to mesoderm and endoderm." Maybe "during" after the word "important"? 

      Thank you for catching this. Indeed, the word “during” was missing and we have now added it.

      (5) Please define acronyms in Figure 3D (PS-LC was defined previously, but not others). 

      We apologize for the oversight; we have now defined the acronyms.

      (6) The three blue lines in Figure 5B (right) are hard to discern (and I'm not colorblind). I suggest also using a variety of dotted lines in a subset of these FGFs. 

      Thank you for the suggestion. We have now given all the FGFs colors that are more clearly distinct and made the TBXT and TBX6 lines dashed.

      Reviewer #2 (Recommendations for the authors): 

      (1) The reviewer acknowledges that FGF signaling is complex, particularly when dynamics and its correlation with cell fates are considered. To improve the clarity of the findings, the authors are encouraged to provide an additional schematic figure that clearly delineates the main findings of this study.  

      Thank you for the suggestion. We have now added a summary figure (Fig.6) to our discussion, which we hope helps present our findings more clearly.

      (2) The data suggest that FGF signaling may function differently in mice compared to primates, and their stem cell model aligns more closely with the latter. While the authors discuss this in the contents only based on sequencing data, it would be valuable to conduct some experiments with mouse embryos to validate the key differences. 

      It is unclear to us which experiments the reviewer has in mind. The mouse literature already contains ample data on FGF expression as well as many knockout phenotypes. Furthermore, verifying loss-of-function phenotypes (e.g. FGF17 knockout) in mouse is beyond our expertise.

      (3) Heparan sulfate proteoglycan (HSPG) is mentioned as an important component of FGF signaling; however, the only data related to HSPG is single-cell sequencing results. The authors should consider performing immunostaining or other assays to validate HSPG expression and spatial distribution, similar to the approach they used for other signaling components. 

      Our scratch experiments in Fig. 4 strongly argue against HSPGs being responsible for the spatial pattern of FGF receptor activation: after a scratch across the colony, the response is strong all along the scratch, as expected if the presence of FGF (an FGF gradient) controls the level of activity. If HSPGs were limiting, FGF flowing in from the media should not be able to uniformly activate receptors around the scratch.

      In addition, we have now included an immunostain for HS in a newly added Supp. Fig. 4; the HS distribution does not explain the observed pattern of ERK signaling.

      (4) In the scratch experiment, particularly high pERK expression is observed at the edge of the scratch. The authors should provide an explanation for why this expression is significantly higher compared to the edges of the colony. Additionally, it would be interesting to investigate the fate of the cells with super high pERK expression.  

      We have now determined that an adaptive response to FGF is the reason that the response around the scratch is initially much higher than in the ERK activity ring that overlaps with the primitive streak-like cells. We have added figures showing that although the initial response to FGF exposure after scratching is very high, the response around the scratch adapts to levels similar to those in the ERK ring over the course of 6 hours (Fig.4ij). 

      (5) For some of the key experiments, multiple cell lines should be used to ensure that the findings are reproducible and applicable across different human stem cell lines.

      We have now checked FISH stainings and knockdown phenotypes for different FGFs in two different cell lines: ESI17 (hESC, XX) and PGP1 (hiPSC, XY). These results are shown in Supplementary Figure 6. We found all results to be consistent.

      (6) Where applicable, the meaning of error bars needs to be more clearly presented, including details on the number of independent experiments or samples used. 

      Thank you for pointing this out. Where error bar definitions were missing we have now added them to the figure captions.

      Reviewer #3 (Recommendations for the authors): 

      (1) The authors only analyze the ppERK ring in micropatterns of a single size. What was the motivation for the choice of this size? Can the authors explain how the ppERK ring is expected to depend on colony size? 

      Much smaller patterns lose the interior pluripotent regions, while much larger patterns have a much larger pluripotent region, which requires larger tilings to image without providing additional insight. The colony size dependence of cell fate patterning was described in the paper that established the 2D gastruloid model (Warmflash Nat Methods 2014), and we later showed that this is due to a fixed length scale of the BMP and Nodal signaling gradients from the colony edge (Jo et al Elife 2022). We have now included data showing that the ERK pattern behaves similarly, with a fixed length scale implying that in smaller colonies the ERK ring becomes a disc and the entire center of the colony has high ERK signaling (Supp Fig 1a).

      (2) The scRNAseq is somewhat confusing - why do the two datasets not overlap in the PHATE representation? This is unexpected, because the two samples have been treated similarly, and the authors have integrated their data to iron out possible batch effects. This discrepancy should be discussed. The authors should also specify from which reference exactly the first dataset comes from.  

      The two datasets do overlap nicely: the same fates are well mixed in the same place and the gene expression profiles for the integrated data (e.g., Fig.2e) look smooth, so we believe the integration is good. However, different cell fates are represented to different degrees. In particular, sample 2 shows much more mesoderm differentiation, making the mesoderm branch mostly orange. Occasionally samples differentiate faster or slower than average, which is what we see here, and these samples were collected far apart in time. We do not believe this affects our conclusions; if anything, performing the analysis on two samples that differ this much should make the conclusions more robust.

      (3) I find it intriguing that exogenous FGF2 is important early on for primitive streak-like differentiation, although the authors show that it does not reach the center of the colony. The authors may want to discuss this conundrum. Does the FGF2 effect propagate from the outside to the inside, or does it act at an early stage when the cells have not yet formed a tight epithelium on the micropattern? 

      The cells in the experiment in Fig. 5a were given 24h to epithelialize, so we do believe it acts from the edge. We believe this may be due to FGF2 modulating the early BMP response on the edge and are working on a manuscript that further explores this pathway crosstalk.

      (4) The authors' statement that FGF4 and FGF17 have partially redundant functions is not very strong, mainly because the study lacks a full FGF17 loss-of-function cell line. If the authors wanted to improve on this point, they could knock down FGF4 in the FGF17 heterozygous line, or produce a homozygous FGF17 KO line. If there are specific reasons why FGF17 homozygous lines cannot be produced, this could be interesting to discuss, too. Finally, I noticed that the methods list experiments with an FGF17 siRNA, but these are not shown in the manuscript. 

      We agree our evidence was previously not as strong as it could be. While there is no reason we know of why homozygous knockout lines cannot be produced, we failed to produce one. To strengthen our evidence we have therefore included substantial new knockdown data. We have now performed both siRNA and shRNA knockdown of FGF4 and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made FGF4+FGF17 shRNA double-knockdown cell lines to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). These experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 knockdown alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 acts downstream of FGF4-dependent PS-differentiation, with its primary role coming later, in mesoderm differentiation. FGF17 may still feed back positively to enhance further PS-differentiation, an effect we previously interpreted as partial redundancy. Furthermore, our new data suggest that FGF8 may counteract FGF4 and limit PS-like differentiation. 

      Minor 

      (5) Line 63: Reference(s) appear to be missing. 

      This whole paragraph summarizes the results of the references given on line 55. We have now repeated the relevant references where the reviewer indicated.

      (6) Supplementary Figure 1a,b does not show ppERK, unlike stated in lines 102 - 104. 

      Indeed, the data described in lines 102-104 is shown in Fig.1a and we have removed the original Supplementary Figure 1ab since it did not provide relevant information.

      (7) Line 201: It is not clear whether this is a new sequencing dataset, or if existing datasets have been reanalyzed. 

      We agree our description was unclear. We have edited the text, which now explicitly states that our analysis is based on one dataset we collected previously and a replicate that was newly collected and deposited on GEO for this manuscript.

      (8) Figure 2f; Supplementary Figure 2b, c: The colors need to be explained in scale bars. How has this data been normalized to allow for comparison between very different sample types? 

      We have now added color bars indicating the scale for each of these figure panels. As the caption stated, the interspecies comparison was normalized within each species, so the highest FGF level for any FGF at any time within each species is normalized to one. We are thus comparing between species the relative expression of different FGFs within each species. Indeed there is no good way to compare absolute expression between species. For extra clarity we have expanded our description of the interspecies comparison analysis and normalization in the methods section.
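      The within-species normalization described above can be sketched as follows. This is a minimal illustration only, not the analysis code used for the manuscript: the function name, the dictionary layout, and all expression values are hypothetical, and it assumes each species is represented as a mapping from FGF gene to expression values over time.

```python
def normalize_within_species(expression):
    """Scale all values so the highest level of any FGF at any time equals 1.

    `expression` maps a gene name to a list of expression values over time
    for ONE species (hypothetical layout, chosen for illustration).
    """
    peak = max(value for series in expression.values() for value in series)
    if peak <= 0:
        raise ValueError("expression must contain at least one positive value")
    return {gene: [value / peak for value in series]
            for gene, series in expression.items()}


# Made-up numbers on deliberately different absolute scales.
human = {"FGF4": [2.0, 8.0, 4.0], "FGF17": [1.0, 2.0, 16.0]}
mouse = {"FGF4": [50.0, 100.0, 25.0], "FGF8": [200.0, 100.0, 50.0]}

human_rel = normalize_within_species(human)
mouse_rel = normalize_within_species(mouse)
```

      After this step only relative levels within a species remain, so the comparison between species concerns which FGFs dominate in each, not absolute expression.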

      (9) Line 232: Where is the expression of SEF shown? 

      It is shown in Fig. 2i, under the official gene name IL17RD.

      (10) Supplementary Figure 4 seems to be missing. 

      Thank you for pointing this out. We have now added a supplementary Fig.4.

      (11) Line 437: Citation needed. 

      We have included citations now.

      (12) Line 439: A similar feedback loop has been proposed to operate during mesoderm differentiation in mouse ESC (pmid: 37530863 ). The authors may consider citing this work. 

      Thank you for the suggestion, we have now included this work in the discussion. The feedback loop proposed in that work involves FGF8, while we were trying to explain why FGF4 and not FGF8 appears to be conserved across species by invoking an FGF4 feedback loop. Thus, it becomes even harder to explain differences in FGF4 and FGF8 expression between human and mouse gastrulation.

      (13) Supplementary Figure 6 is not described in the main text. 

      We have removed the original Supplementary Figure 6 and the corresponding heterozygous knockout data in the main figure, which we felt added little to the extensive knockdown data we now present. We did create a new Supplementary Figure 6 showing additional knockdown data, which is described in the main text.

      (14) Submission of sequencing data to GEO needs to be updated. 

      We have now made the GEO data public.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript explores behavioral responses of C. elegans to hydrogen sulfide, which is known to exert remarkable effects on animal physiology in a range of contexts. The possibility of genetic and precise neuronal dissection of responses to H2S motivates the study of responses in C. elegans. The revised manuscript does not seem to have significantly addressed what was lacking in the initial version. 

      The authors have added further characterization of possible ASJ sensing of H2S by calcium imaging but ASJ does not appear to be directly involved. Genetic and parallel analysis of O2 and CO2 responsive pathways do not reveal further insights regarding potential mechanisms underlying H2S sensing. Gene expression analysis extends prior work. Finally, the authors have examined how H2S-evoked locomotory behavioral responses are affected in mutants with altered stress and detoxification response to H2S, most notably hif-1 and egl-9. These data, while examining locomotion, are more suggestive that observed effects on animal locomotion are secondary to altered organismal toxicity as opposed to specific behavioral responses.

      Overall, the manuscript provides a wide range of intriguing observations, but mechanistic insight or a synthesis of disparate data is lacking. 

      We thank the reviewer for the valuable feedback. We agree that while our investigation provides broad coverage, it does not fully resolve the mechanisms of H2S perception. As both reviewers noted, the avoidance response to high levels of H2S is most likely driven by its toxicity, particularly at the level of mitochondria, rather than by direct perception of H2S. We also favor this model and have revised the results and discussion to highlight this interpretation, while acknowledging that other mechanisms cannot be excluded (main changes lines 387-402 and 535-547).

      Building on this view, our observations point toward mitochondrial ROS transients as the trigger for H2S avoidance. First, toxic levels of H2S are known to promote ROS production (1). Second, similar to acute H2S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H2S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H2S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H2S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7).

      Consistent with the activation of stress-responsive pathways, H2S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9, renders these animals insensitive to H2S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H2S, but rapidly lose activity during prolonged H2S exposure (Figure 5D) (Lines 318-319). These observations suggest that the HIF-1 pathway is dispensable for initiating the response to H2S-evoked ROS transients, but essential for protecting against ROS toxicity.

      In this context, the neurons we examined, such as ASJ, are not directly involved in H2S perception (Lines 165-169 and 448-457). Instead, they likely modulate a circuit that is responsive to ROS toxicity. This circuit is also influenced by ambient O2 levels, the state of the O2-sensing circuit, and nutrient status, in a manner reminiscent of the CO2 responses (5, 6).

      Reviewer #4 (Public review): 

      Summary: 

      The authors establish a behavioral paradigm for avoidance of H2S and conduct a large candidate screen to identify genetic requirements. They follow up by genetically dissecting a large number of implicated pathways - insulin, TGF-beta, oxygen/HIF-1, and mitochondrial ROS, which have varied effects on H2S avoidance. They additionally assay whole-animal gene expression changes induced by varying concentrations and durations of H2S exposure. 

      Strengths: 

      The implicated pathways are tested extensively through mutants of multiple pathway molecules. The authors address previous reviewer concerns by directly testing the ability of ASJ to respond to H2S via calcium imaging. This allows the authors to revise their previous conclusion and determine that ASJ does not directly respond to H2S and likely does not initiate the behavioral response. 

      We thank the reviewer for the supportive comments.

      Weaknesses: 

      Despite the authors' focus on acute perception of H2S, I don't think the experiments tell us much about perception. I think they indicate pathways that modulate the behavior when disrupted, especially because most manipulations used broadly affect physiology on long timescales. For instance, genetic manipulation of ASJ signaling, oxygen sensing, HIF-1 signaling, mitochondrial function, as well as starvation are all expected to constitutively alter animal physiology, which could indirectly modulate responses to H2S. The authors rule out effects on general locomotion in some cases, but other physiological changes could relatively specifically modulate the H2S response without being involved in its perception.

      I am actually not convinced that H2S is directly perceived by the C. elegans nervous system at all. As far as I can tell, the avoidance behavior could be a response to H2S-induced tissue damage rather than the gas itself. 

      We thank the reviewer for the valuable insights, and fully agree that H<sub>2</sub>S may not be directly perceived by C. elegans. Please see detailed responses below.

      Reviewer #4 (Recommendations for the authors): 

      The clarity of the paper is improved in this version. My main issue has to do with "perception" of H2S. At times the authors suggest that hydrogen sulfide should be perceived by a neural circuit ("we did not specifically identify the neural circuit mediating H2S signaling"), while at other times they discuss the possibility that it is not directly perceived neuronally ("Supporting the idea that acute mitochondrial ROS generation initiates avoidance of high H2S levels,"). The authors should clearly state their model for H2S perception. Do they think there is a receptor and sensory neuron for H2S (not identified in this paper)? If not, what does it mean for there to be a neural circuit mediating the response? To me, it looks more like what is being "perceived" by a neural circuit is ROS-induced toxicity, not H2S itself. 

      To drill down on direct modulation of acute perception, are any of the pathway manipulations used in this paper performed on the timescale of perception? Rotenone for 10 mins is close to that timescale, and in fact it increases speed independently of H2S, consistent with ROS-induced toxicity, not H2S being the signal that induces the behavior. Optogenetic activation of RMG could also be on the acute timescale. Can the authors clarify for how long blue light was on the worms before the start of the assay? Or was it turned on at the same time as video acquisition commenced? This could be evidence that RMG acutely modulates this behavioral response.

      I feel that the ASJ calcium imaging data should be in the main figure given its importance in revising the original model. 

      We thank the reviewer for the valuable advice.

      As suggested, ASJ calcium imaging data are displayed in the main figure (Figure 2I) (Line 167).

      As both reviewers noted, our initial presentation was not sufficiently clear regarding the mechanism underlying H<sub>2</sub>S avoidance. We agree with the reviewer that H<sub>2</sub>S avoidance is unlikely to be mediated by direct perception via an H<sub>2</sub>S-specific receptor, but likely arises from acute mitochondrial dysfunction and ROS generation.

      ROS

      In line with the reviewer’s perspective, our observations point toward mitochondrial ROS transients as the trigger for H<sub>2</sub>S avoidance. First, toxic levels of H<sub>2</sub>S are known to promote ROS production (1). Second, similar to acute H<sub>2</sub>S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H<sub>2</sub>S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H<sub>2</sub>S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H<sub>2</sub>S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7). We revised the Results and Discussion to present the model more consistently (main changes lines 387-402 and 535-547).

      Consistent with the activation of stress-responsive pathways, H<sub>2</sub>S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9 mutants, renders these animals insensitive to H<sub>2</sub>S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H<sub>2</sub>S, but rapidly lose activity during prolonged H<sub>2</sub>S exposure (Figure 5D) (Lines 318-319). These observations suggest that the HIF-1 pathway is dispensable for initiating the response to H<sub>2</sub>S-evoked ROS transients, but essential for protecting against ROS toxicity.

      ASJ neurons

      ASJ neurons and DAF-11 signaling are required for H<sub>2</sub>S-evoked behavioral responses. However, ASJ does not exhibit an H<sub>2</sub>S-evoked calcium transient. This suggests that ASJ neurons do not directly detect H<sub>2</sub>S (Lines 165-169 and 448-457), but likely modulate the circuit responsive to ROS toxicity. This circuit can also be modulated by ambient O<sub>2</sub> levels, the state of the O<sub>2</sub>-sensing circuit, and nutrient status, in a manner reminiscent of the CO<sub>2</sub> responses (5, 6).

      O<sub>2</sub> sensing circuit

      Consistent with the reviewer’s view, we favor the model that H<sub>2</sub>S avoidance is likely induced by ROS transients. We believe that the state of the O<sub>2</sub>-sensing circuit, similar to ASJ neurons, modulates the neural circuit that is responsive to H<sub>2</sub>S-evoked ROS toxicity. This circuit is inhibited as long as the O<sub>2</sub>-sensing circuit is active. In the RMG optogenetic experiment, channelrhodopsin was photo-stimulated as soon as the assay was initiated at 7% O<sub>2</sub> (Methods Lines 633-634 and Figure legend Lines 1177-1178); therefore, RMG remained active throughout the assay, including at 7% O<sub>2</sub>. Our interpretation is that RMG activation inhibits this ROS-responsive circuit and H<sub>2</sub>S avoidance. However, these observations do not resolve whether H<sub>2</sub>S is acutely and directly perceived. The modulation of the H<sub>2</sub>S response by the O<sub>2</sub>-sensing circuit is discussed in Lines 437-447.

      References

      (1) J. Jia et al., SQR mediates therapeutic effects of H(2)S by targeting mitochondrial electron transport to induce mitochondrial uncoupling. Sci Adv 6, eaaz5752 (2020).

      (2) S. J. Lee, A. B. Hwang, C. Kenyon, Inhibition of Respiration Extends C. elegans Life Span via Reactive Oxygen Species that Increase HIF-1 Activity. Current Biology 20, 2131-2136 (2010).

      (3) C. Lennicke, H. M. Cocheme, Redox metabolism: ROS as specific molecular regulators of cell signaling and function. Mol Cell 81, 3691-3707 (2021).

      (4) D. A. Patten, M. Germain, M. A. Kelly, R. S. Slack, Reactive oxygen species: stuck in the middle of neurodegeneration. J Alzheimers Dis 20 Suppl 2, S357-367 (2010).

      (5) A. J. Bretscher, K. E. Busch, M. de Bono, A carbon dioxide avoidance behavior is integrated with responses to ambient oxygen and food in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8044-8049 (2008).

      (6) E. A. Hallem, P. W. Sternberg, Acute carbon dioxide avoidance in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8038-8043 (2008).

  6. drive.google.com
    1. As the responses from chatbots are generated from information collected by the program rather than the fallible human memory, one may incorrectly assume that no errors exist. However, there is material risk in relying exclusively on AI-generated responses without verification of the generated content.

      The argument the author makes here is reasonable and sound, because AI can still make errors. There is a common belief that AI is highly trustworthy: because AI is not a person, we assume it only gives correct and accurate information. Since this is not true, AI usage still carries risk, as its information can be misleading.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This preprint from Shaowei Zhao and colleagues presents results that suggest tumorous germline stem cells (GSCs) in the Drosophila ovary mimic the ovarian stem cell niche and inhibit the differentiation of neighboring non-mutant GSC-like cells. The authors use FRT-mediated clonal analysis driven by a germline-specific gene (nos-Gal4, UASp-flp) to induce GSC-like cells mutant for bam or bam's cofactor bgcn. Bam-mutant or bgcn-mutant germ cells produce tumors in the stem cell compartment (the germarium) of the ovary (Figure 1). These tumors contain non-mutant cells - termed SGC for single-germ cells. 75% of SGCs do not exhibit signs of differentiation (as assessed by bamP-GFP) (Figure 2). The authors demonstrate that the block in differentiation in SGCs is a result of suppression of bam expression (Figure 2). They present data suggesting that in 73% of SGCs, BMP signaling is low (assessed by dad-lacZ) (Figure 3) and proliferation is less in SGCs vs GSCs. They present genetic evidence that mutations in BMP pathway receptors and transcription factors suppress some of the non-autonomous effects exhibited by SGCs within bam-mutant tumors (Figure 4). They show data that bam-mutant cells secrete Dpp, but this data is not compelling (see below) (Figure 5). They provide genetic data that loss of BMP ligands (dpp and gbb) suppresses the appearance of SGCs in bam-mutant tumors (Figure 6). Taken together, their data support a model in which bam-mutant GSC-like cells produce BMPs that act on non-mutant cells (i.e., SGCs) to prevent their differentiation, similar to what is seen in the ovarian stem cell niche.

      Strengths:

      (1) Use of an excellent and established model for tumorous cells in a stem cell microenvironment.

      (2) Powerful genetics allow them to test various factors in the tumorous vs nontumorous cells.

      (3) Appropriate use of quantification and statistics.

      We greatly appreciate these comments.

      Weaknesses:

      (1) What is the frequency of SGCs in nos>flp; bam-mutant tumors? For example, are they seen in every germarium, or in some germaria, etc, or in a few germaria?

      This is a great question. Because the SGC phenotype depends on the presence of germline tumor clones, our quantification was restricted to germaria that contained them. These quantification data ("SGCs and/or germline cysts per germarium with germline clones") will be presented in the revised Figure 1.

      (2) Does the breakdown in clonality vary when they induce hs-flp clones in adults as opposed to in larvae/pupae?

      Our initial attempts to induce ovarian hs-flp germline clones by heat-shocking adult flies were unsuccessful, with very few clones being observed. Therefore, we shifted our approach to an earlier developmental stage. Successful induction was achieved by subjecting late-L3/early-pupal animals to a twice-daily heat shock at 37°C for 6 consecutive days (2 hours per session with a 6-hour interval, see Lines 325-329) (Zhao et al., 2018).

      (3) Approximately 20-25% of SGCs are bam+, dad-LacZ+. Firstly, how do the authors explain this? Secondly, of the 70-75% of SGCs that have no/low BMP signaling, the authors should perform additional characterization using markers that are expressed in GSCs (i.e., Sex lethal and nanos).

      These 20-25% of SGCs are bamP-GFP<sup>+</sup> dad-lacZ<sup>-</sup>, not bam<sup>+</sup> dad-lacZ<sup>+</sup> (see Figure 2C and 3D). They would be cystoblast-like cells that may have initiated a differentiation program toward forming germline cysts (see Lines 109-117). The 70-75% of SGCs that have low BMP signaling exhibit GSC-like properties, including: 1) dot-like spectrosomes; 2) dad-lacZ positivity; 3) absence of bamP-GFP expression. While additional markers would be beneficial, we think that this combination of properties is sufficient to classify these cells as GSC-like.

      (4) All experiments except Figure 1I (which shows a single germarium with no quantification) were performed with nos-Gal4, UASp-flp. Have the authors performed any of the phenotypic characterizations (i.e., figures other than Figure 1) with hs-flp?

      Yes, we initially identified the SGC phenotype through hs-flp-mediated mosaic analysis of bam or bgcn mutants in ovaries. However, as noted in our response to Weakness (2), this approach was very labor-intensive. Therefore, we switched to using the more convenient nos::flp system for subsequent experiments. In our observations, there was no difference in the SGC phenotype between these two approaches, confirming that the nos::flp system is a valid and more practical alternative for its study.

      (5) Does the number of SGCs change with the age of the female? The experiments were all performed in 14-day-old adult females. What happens when they look at a young female (like 2-day-old). I assume that the nos>flp is working in larval and pupal stages, and so the phenotype should be present in young females. Why did the authors choose this later age? For example, is the phenotype more robust in older females? Or do you see more SGCs at later time points?

      These are very good questions. Such time-course analysis data will be provided in revised Figure 1. The SGC phenotype depends on the presence of bam or bgcn mutant germline clones. Germaria from 14-day-old flies contained larger and more numerous clones than those from younger flies. This age-dependent increase in clone size and frequency significantly enhanced the efficiency of our quantification (see Lines 129-131).

      (6) Can the authors distinguish one copy of GFP versus 2 copies of GFP in germ cells of the ovary? This is not possible in the Drosophila testis. I ask because this could impact the clonal analyses diagrammed in Figure 4A and 4G and in 6A and B. Additionally, in most of the figures, the GFP is saturated, so it is not possible to discern one vs two copies of GFP.

      We greatly appreciate this comment. It was also difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. In Figure 4A-F, to resolve this problem, we used a triple-color system, in which red germ cells (RFP<sup>+/+</sup> GFP<sup>-/-</sup>) are bam mutant, yellow germ cells (RFP<sup>+/-</sup> GFP<sup>+/-</sup>) are wild-type, and green germ cells (RFP<sup>-/-</sup> GFP<sup>+/+</sup>) are punt or med mutant. In Figure 4G-J, we quantified the SGC phenotype only in black germ cells (GFP<sup>-/-</sup>), which are wild-type (control) or mad mutant. In Figure 6, we quantified the SGC phenotype only in green germ cells (both GFP<sup>+/+</sup> and GFP<sup>+/-</sup>), all of which are wild-type.

      (7) More evidence is needed to support the claim of elevated Dpp levels in bam or bgcn mutant tumors. The current results with the dpp-lacZ enhancer trap in Figure 5A, B are not convincing. First, why is the dpp-lacZ so much brighter in the mosaic analysis (A) than in the no-clone analysis (B)? It is expected that the level of dpp-lacZ in cap cells should be invariant between ovaries, and yet LacZ is very faint in Figure 5B. I think that if the settings in A matched those in B, the apparent expression of dpp-lacZ in the tumor would be much lower and likely not statistically significant. Second, they should use RNA in situ hybridization with a sensitive technique like hybridization chain reactions (HCR) - an approach that has worked well in numerous Drosophila tissues, including the ovary.

      We appreciate this critical comment. The settings of immunofluorescent staining and confocal parameters in Figure 5A were the same as those in 5B. In our observations, the level of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. We will provide RNA in situ hybridization data to further strengthen the conclusion that bam or bgcn mutant germline tumors secrete BMP ligands.

      (8) In Figure 6, the authors report results obtained with the bamBG allele. Do they obtain similar data with another bam allele (i.e., bamdelta86)?

      No. Given that bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in inducing the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C), we believe that repeating these experiments with bam<sup>Δ86</sup> would be redundant and would not alter the key conclusion of our study. Thank you for your understanding!

      Reviewer #2 (Public review):

      While the study by Zhang et al. provides valuable insights into how germline tumors can non-autonomously suppress the differentiation of neighboring wild-type germline stem cells (GSCs), several conceptual and technical issues limit the strength of the conclusions.

      Major points:

      (1) Naming of SGCs is confusing. In line 68, the authors state that "many wild-type germ cells located outside the niche retained a GSC-like single-germ-cell (SGC) morphology." However, bam or bgcn mutant GSCs are also referred to as "SGCs," which creates confusion when reading the text and interpreting the figures. The authors should clarify the terminology used to distinguish between wild-type SGCs and tumor (bam/bgcn mutant) SGCs, and apply consistent naming throughout the manuscript and figure legends.

      We apologize for any confusion. In our manuscript, the term "SGC" is reserved specifically for wild-type germ cells that maintain a GSC-like morphology outside the niche. bam or bgcn mutant germ cells are referred to as GSC-like tumor cells (Lines 87-88), not SGCs.

      (a) The same confusion appears in Figure 2. It is unclear whether the analyzed SGCs are wild-type or bam mutant cells. If the SGCs analyzed are Bam mutants, then the lack of Bam expression and failure to differentiate would be expected and not informative. However, if the SGCs are wild-type GSCs located outside the niche, then the observation would suggest that Bam expression is silenced in these wild-type cells, which is a significant finding. The authors should clarify the genotype of the SGCs analyzed in Figure 2C, as this information is not currently provided.

      The SGCs analyzed in Figure 2A-C are wild-type, GSC-like cells located outside the niche. They were generated using the same genetic strategy depicted in Figures 1C and 1E (with the schematic in Figure 1B). The complete genotypes for all experiments are available in Source data 1. 

      (b) In Figures 4B and 4E, the analysis of SGC composition is confusing. In the control germaria (bam mutant mosaic), the authors label GFP⁺ SGCs as "wild-type," which makes interpretation unclear. Note, this is completely different from their earlier definition shown in line 68.

      The strategy to generate SGCs in Figure 4B-F (with the schematic in Figure 4A) is completely different from that in Figure 1C-F, H, and I (with the schematic in Figure 1B). In Figure 4B-F, we needed to distinguish punt<sup>-/-</sup> (or med<sup>-/-</sup>) from punt<sup>+/-</sup> (or med<sup>+/-</sup>) germ cells. As noted in our response to Reviewer #1’s Weakness (6), it was difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. Therefore, we chose to use the triple-color system to distinguish these germ cells in Figure 4B-F (see genotypes in Source data 1).

      (c) Additionally, bam⁺/⁻ GSCs (the first bar in Figure 4E) should appear GFP⁺ and Red⁺ (i.e., yellow). It would be helpful if the authors could indicate these bam⁺/⁻ germ cells directly in the image and clarify the corresponding color representation in the main text. In Figure 2A, although a color code is shown, the legend does not explain it clearly, nor does it specify the identity of bam⁺/⁻ cells alone. Figure 4F has the same issue, and in this graph, the color does not match Figure 4A.

      The color-to-genotype relationships for the schematics in Figures 2A and 4E are provided in Figures 1B and 4A, respectively. Due to the high density of germ cells, it is impractical to label each genotype directly in the images. In contrast to Figure 4E, the colors in Figure 4F do not represent genotypes; instead, blue denotes the percentage of SGCs, and red denotes the percentage of germline cysts, as indicated below the bar chart. 

      (2) The frequencies of bam or bgcn mutant mosaic germaria carrying [wild-type] SGCs or wild-type germ cell cysts with branched fusomes, as well as the average number of wild-type SGCs per germarium and the number of days after heat shock for the representative images, are not provided when Figure 1 is first introduced. Since this is the first time the authors describe these phenotypes, including these details is essential. Without this information, it is difficult for readers to follow and evaluate the presented observations.

      Thanks for this constructive suggestion. We will include such quantification data in the revised manuscript.

      (3) Without the information mentioned in point 2, it causes problems when reading through the section regarding [wild-type] SGCs induced by impairment of differentiation or dedifferentiation. In lines 90-97, the authors use the presence of midbodies between cystocytes as a criterion to determine whether the wild-type GSCs surrounded by tumor GSCs arise through dedifferentiation. However, the cited study (Mathieu et al., 2022) reports that midbodies can be detected between two germ cells within a cyst carrying a branched fusome upon USP8 loss.

      Unlike wild-type cystocytes, which undergo incomplete cytokinesis and lack midbodies, those with USP8 loss undergo complete cell division, with the presence of midbodies (white arrow, Figure 1F’ from Mathieu et al., 2022) as a marker of the late cytokinesis stage (Mathieu et al., 2022). 

      (a) Are wild-type germ cell cysts with branched fusomes present in the bam mutant mosaic germaria? What is the proportion of germaria containing wild-type SGCs versus those containing wild-type germ cell cysts with branched fusomes?

      (b) If all bam mutant mosaic germaria carry only wild-type GSCs outside the niche and no germaria contain wild-type germ cell cysts with branched fusomes, then examining midbodies as an indicator of dedifferentiation may not be appropriate.

      We greatly appreciate this critical comment. bam mutant mosaic germaria indeed contained wild-type germline cysts, as evidenced by an SGC frequency of ~70%, rather than 100% (see Figures 2H, 4F, 4J, 6F, 6I, and Figure 6-figure supplement 3C). Since the SGC phenotype depends on the presence of bam or bgcn mutant germline tumors, we quantified it as “the percentage of SGCs relative to the total number of SGCs and germline cysts that are surrounded by germline tumors” (see Lines 124-129). Quantifying the SGC phenotype as "the percentage of germaria with SGCs" would be imprecise. This is because the presence and number of SGCs were highly variable among germaria with bam mutant germline clones, and a small number of germaria entirely lacked these clones. We will provide the data of "SGCs and/or germline cysts per germarium with germline clones" in revised Figure 1.
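      The quantification described above can be sketched in a few lines. This is only a minimal illustration, assuming that each tumor-surrounded wild-type SGC and each tumor-surrounded wild-type germline cyst is scored as one unit; the function names and the counts in the usage line are hypothetical and are not taken from the study.

```python
def sgc_percentage(n_sgc: int, n_cysts: int) -> float:
    """Percentage of SGCs relative to the total number of SGCs and
    germline cysts that are surrounded by germline tumors."""
    total = n_sgc + n_cysts
    if total == 0:
        # No tumor-surrounded wild-type germ cells were scored,
        # so the phenotype is undefined for this sample.
        raise ValueError("no tumor-surrounded SGCs or germline cysts were scored")
    return 100.0 * n_sgc / total


def cyst_percentage(n_sgc: int, n_cysts: int) -> float:
    """Percentage of germline cysts, i.e. 100% minus the SGC percentage."""
    return 100.0 - sgc_percentage(n_sgc, n_cysts)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(sgc_percentage(70, 30), cyst_percentage(70, 30))
```

      Pooling counts over tumor-surrounded germ cells, rather than scoring each germarium as positive or negative, keeps the measure independent of how many clones a given germarium happens to contain.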

      (c) If, however, some germaria do contain wild-type germ cell cysts with branched fusomes, the authors should provide representative images and quantify their proportion.

      Such representative germaria are shown in Figure 2G, 3B, 3C, 6D, 6E, and 6H. The percentage of germline cysts can be calculated as “100% - SGC%”.

      (d) In line 95, although the authors state that 50 germ cell cysts were analyzed for the presence of midbodies, it would be more informative to specify how many germaria these cysts were derived from and how many biological replicates were examined.

      As noted in our response to points a) and b) above, the germ cells surrounded by germline tumors, rather than germarial numbers, are more precise for analyzing the phenotype. For this experiment, we examined >50 such germline cysts via confocal microscopy. As the analysis was performed on a defined cellular population, this sample size should be sufficient to support our conclusion. 

      (4) Note that both bam mutant GSCs and wild-type SGCs can undergo division to generate midbodies (double cells), as shown in Figure 4H. Therefore, the current description of the midbody analysis is confusing. The authors should clarify which cell types were examined and explain how midbodies were interpreted in distinguishing between cell division and differentiation.

      We assayed for the presence or absence of midbodies specifically within the germline cysts surrounded by bam mutant tumors, not within the tumors themselves (Lines 94-95). As detailed in Lines 88-97, the absence of midbodies was used as a key criterion to exclude the possibility of dedifferentiation.

      (5) The data in Figure 5 showing Dpp expression in bam mutant tumorous GSCs are not convincing. The Dpp-lacZ signal appears broadly distributed throughout the germarium, including in escort cells. To support the claim more clearly, the authors should present corresponding images for Figures 5D and 5E, in which dpp expression was knocked down in the germ cells of bam or bgcn mutant mosaic germaria. Showing these images would help clarify the localization and specificity of Dpp-lacZ expression relative to the tumorous GSCs.

      We greatly appreciate this comment. RNA in situ hybridization data will be provided to further strengthen the conclusion that bam or bgcn mutant germline tumors secrete BMP ligands.

      (6) While Figure 6 provides genetic evidence that bam mutant tumorous GSCs produce Dpp to inhibit the differentiation of wild-type SGCs, it should be noted that these analyses were performed in a dpp⁺/⁻ background. To strengthen the conclusion, the authors should include appropriate controls showing [dpp⁺/⁻; bam⁺/⁻] SGCs and [dpp⁺/⁻; bam⁺/⁻] germ cell cysts without heat shock (as referenced in Figures 6F and 6I).

      Schematic cartoons in Figure 6A and 6B demonstrate that these analyses were performed in a dpp<sup>+/-</sup> background. Figure 6-figure supplement 1 indicates that dpp<sup>+/-</sup> or gbb<sup>+/-</sup> does not affect GSC maintenance, germ cell differentiation, and female fly fertility. Figure 6C is the control for 6D and 6E, and 6G is the control for 6H, with quantification in 6F and 6I. We used nos::flp, not the heat shock method, to induce germline clones in these experiments (see genotypes in Source data 1).

      (7) Previous studies have reported that bam mutant germ cells cause blunted escort cell protrusions (e.g., Kirilly et al., Development, 2011), which are known to contribute to germ cell differentiation (e.g., Chen et al., Frontiers in Cell and Developmental Biology, 2022). The authors should include these findings in the Discussion to provide a broader context and to acknowledge how alterations in escort cell morphology may further influence differentiation defects in their model.

      Thank you for pointing this out. Such discussion will be included in the revised manuscript.

      (8) Since fusome morphology is an important readout of SGCs vs differentiation, all the clonal analyses should include fusome staining.

      SGCs are readily distinguishable from multicellular germline cysts based on morphology. In some clonal analysis experiments, fusome staining was not feasible due to technical limitations such as channel saturation or antibody incompatibility. Thank you for your understanding!

      (9) Figure arrangement. It is somewhat difficult to identify the figure panels cited in the text due to the current panel arrangement.

      The figure panels were arranged to optimize space while ensuring that related panels are grouped in close proximity for logical comparison. We would be happy to consider any specific suggestions for an alternative layout that could improve clarity. Thanks!

      (10) The number of biological replicates and germaria analyzed should be clearly stated somewhere in the manuscript, ideally in the Methods section or figure legends. Providing this information is essential for assessing data reliability and reproducibility.

      Thanks for this constructive suggestion. Such information will be included in figure legends in the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      Zhang et al. investigated how germline tumors influence the development of neighboring wild-type (WT) germline stem cells (GSC) in the Drosophila ovary. They report that germline tumors inhibit the differentiation of neighboring WT GSCs by arresting them in an undifferentiated state, resulting from reduced expression of the differentiation-promoting factor Bam. They find that these tumor cells produce low levels of the niche-associated signaling molecules Dpp and Gbb, which suppress bam expression and consequently inhibit the differentiation of neighboring WT GSCs non-cell-autonomously. Based on these findings, the authors propose that germline tumors mimic the niche to suppress the differentiation of the neighboring stem cells.

      Strengths:

      This study addresses an important biological question concerning the interaction between germline tumor cells and WT germline stem cells in the Drosophila ovary. If the findings are substantiated, they could provide valuable insights applicable to other stem cell systems.

      We greatly appreciate these comments.

      Weaknesses:

      Previous work from Xie's lab demonstrated that bam and bgcn mutant GSCs can outcompete WT GSCs for niche occupancy. Furthermore, a large body of literature has established that the interactions between escort cells (ECs) and GSC daughters are essential for proper and timely germline differentiation (the differentiation niche). Disruption of these interactions leads to arrest of germline cell differentiation in a status with weak BMP signaling activation and low bam expression, a phenotype virtually identical to what is reported here. Thus, it remains unclear whether the observed phenotype reflects "direct inhibition by tumor cells" or "arrested differentiation due to the loss of the differentiation niche". Because most data were collected at a very late stage (more than 10 days after clonal induction), when tumor cells already dominate the germarium, this question cannot be solved. To distinguish between these two possibilities, the authors could conduct a time-course analysis to examine the onset of the WT GSC-like single-germ-cell (SGC) phenotype and determine whether early-stage tumor clones with a few tumor cells can suppress the differentiation of neighboring WT GSCs with only a few tumor cells present. If tumor cells indeed produce Dpp and Gbb (as proposed here) to inhibit the differentiation of neighboring germline cells, a small cluster or probably even a single tumor cell generated at an early stage might prevent the differentiation of their neighboring germ cells.

      Thanks for this critical comment. Such time-course analysis data will be provided in revised Figure 1.

      The key evidence supporting the claim that tumor cells produce Dpp and Gbb comes from Figures 5 and 6, which suggest that tumor-derived dpp and gbb are required for this inhibition. However, interpretation of these data requires caution. In Figure 5, the authors use dpp-lacZ to support the claim that dpp is upregulated in tumor cells (Figure 5A and 5B). However, the background expression in somatic cells (ECs and pre-follicular cells) differs noticeably between these panels. In Figure 5A, dpp-lacZ expression in somatic cells is clearly higher than in 5B, and the expression level in tumor cells appears comparable to that in somatic cells (dpp-lacZ single channel). Similarly, in Figure 5B, dpp-lacZ expression in germline cells is also comparable to that in somatic cells. Providing clear evidence of upregulated dpp and gbb expression in tumor cells (for example, through single-molecular RNA in situ) would be essential.

      We greatly appreciate this critical comment. In our data, the expression of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. The images in Figures 5A and 5B were selected as representative examples of positive signaling. To directly address the reviewer's point and strengthen our conclusion, we will include RNA in situ hybridization data in the revised manuscript to visualize the expression of BMP ligands within the bam or bgcn mutant germline tumor cells.

      Most tumor data present in this study were collected from the bam[86] null allele, whereas the data in Figure 6 were derived from a weaker bam[BG] allele. This bam[BG] allele is not molecularly defined and shows some genetic interaction with dpp mutants. As shown in Figure 6E, removal of dpp from homozygous bam[BG] mutant leads to germline differentiation (evidenced by a branched fusome connecting several cystocytes, located at the right side of the white arrowhead). In Figure 6D, fusome is likely present in some GFP-negative bam[BG]/bam[BG] cells. To strengthen their claim that the tumor produces Dpp and Gbb to inhibit WT germline cell differentiation, the authors should repeat these experiments using the bam[86] null allele.

      Although a structure resembling a "branched fusome" is visible in Figure 6E (right of the white arrowhead), it is an artifact resulting from the cytoplasm of GFP-positive follicle cells, which also stain for α-Spectrin, projecting between germ cells of different clones (see the merged image). In both our previous (Zhang et al., 2023) and current studies, bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in its ability to block GSC differentiation and induce the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C). Given this, we believe that repeating the extensive experiments in Figure 6 with the bam<sup>Δ86</sup> allele would be scientifically redundant and would not change the key conclusion of our study. We thank the reviewer for their consideration.

      It is well established that the stem niche provides multiple functional supports for maintaining resident stem cells, including physical anchorage and signaling regulation. In Drosophila, several signaling molecules produced by the niche have been identified, each with a distinct function - some promoting stemness, while others regulate differentiation. Expression of Dpp and Gbb alone does not substantiate the claim that these tumor cells have acquired the niche-like property. To support their assertion that these tumors mimic the niche, the authors should provide additional evidence showing that these tumor cells also express other niche-associated markers. Alternatively, they could revise the manuscript title to more accurately reflect their findings.

      Dpp and Gbb are the key niche signals from cap cells for maintaining GSC stemness. Our work demonstrates that germline tumors can specifically mimic this signaling function, not the full suite of cap cell properties, to create a non-cell-autonomous differentiation block. The current title “Tumors mimic the niche to inhibit neighboring stem cell differentiation” reflects this precise concept: a partial, functional mimicry of the niche's most relevant activity in this context. We feel it is an appropriate and compelling summary of our main conclusion.

      In the Method section, the authors need to provide details on how dpp-lacZ expression levels were quantified and normalized.

      Thanks for this suggestion. Such information will be included in the revised manuscript.

    1. Those with disabilities often find ways to cope with their disability, that is, find ways to work around difficulties they encounter and seek out places and strategies that work for them (whether realizing they have a disability or not). Additionally, people with disabilities might change their behavior (whether intentionally or not) to hide the fact that they have a disability, which is called masking and may take a mental or physical toll on the person masking, which others around them won’t realize. For example, kids who are nearsighted and don’t realize their ability to see is different from other kids will often seek out seats at the front of classrooms where they can see better. As for us two authors, we both have ADHD and were drawn to PhD programs where our tendency to hyperfocus on following our curiosity was rewarded (though executive dysfunction with finishing projects created challenges)[1]. This way of managing disabilities puts the burden fully on disabled people to manage their disability in a world that was not designed for them, trying to fit in with “normal” people.

      Reading this section hit very close to home for me. I also have ADHD and tend to hyperfocus when it comes to art: I can spend hours on a single painting but then struggle to focus for more than 30 minutes on a homework assignment. I find it very true that people tend to "mask" their disabilities, as almost everyone I know has something they hide in order to fit in. I, too, often hide the fact that homework takes longer for me due to my ADHD. This is why I strongly think it's important that we as a society work toward becoming more inclusive in all aspects of life!

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We thank all the reviewers for their comments and suggestions, which have helped in revising the manuscript for a broader audience. Some of the experiments suggested by the reviewers have been performed and included in the revised manuscript. Our responses are provided below the reviewers' comments.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      MprF proteins exist in many bacteria to synthesize aminoacyl phospholipids that have diverse biological functions, e.g. in the defense against small cationic peptides. They integrate two functions, the aminoacylation of lipids, i.e. the transfer of Lys, Arg or Ala from tRNAs to the head group, and the flipping of these modified lipids to the membrane outer leaflet. The authors present structures of MprF from Pseudomonas aeruginosa and describe these structures in great detail. As MprF enzymes confer antibiotic resistance and are therefore highly important, studying them is significant and interesting. Consequently, their structures have been substantially characterized in recent years, including the publication of the dimeric full-length MprF from Rhizobium (Song et al., 2021).

      While the structural work appears to be solid and technically well carried out, one big criticism is how the data are presented in the manuscript, how they are analyzed and how they are put into relation to previous work. As structures of MprF from Rhizobium have been published, it is not required and rather distracting to explain the methodological details and the structure of Pseudomonas MprF in such great detail. Instead, the manuscript would benefit very strongly from reaching the interesting and novel parts, the comparison with the previous structures, as early as possible. Overall, the manuscript should be substantially shortened so as not to divert the reader's attention away from the novel parts by drowning them in minuscule descriptions of structural features, such as secondary structure elements or lipid molecule positions, whose relevance to the story and the message of the paper remains completely unclear. Finally, during this revision, care should be taken to improve the language and maybe involve a native speaker in doing so.

      It is true that we have described the experimental details of PaMprF, including the constructs, at length. We had reconstructed the map of dimeric PaMprF in 2020, but with the publication of the homologous structures (Song et al 2021 and the unpublished Rhizobium etli structure), we had to make sure that the PaMprF dimer is not an artefact. Hence, we attempted to rule this out using different constructs and extensive testing with various detergents, and we would like to keep this in the manuscript. We realise the importance of focusing on the novel and interesting parts and have reshuffled the sections (comparing structures and validating the dimer interface), followed by the description of the modelling of lipid molecules.

      Even more importantly, since the authors observe a dimer interface which strongly deviates from the previously presented arrangement of another species, the most important thing would be to properly characterize this interface and experimentally validate it, neither of which has been done sufficiently. When also taking into account that there were significant differences in the arrangement of the dimer between their structures in GDN and nanodisc, and that in the GDN structure, the cholesterol backbone of GDN appears to be involved in the interface (there should not be any cholesterol in native bacterial membranes!), there is a realistic chance that the observed dimer is an artefact. If the authors cannot convincingly rule out this possibility, all their conclusions are meaningless.

      The trials with cholesterol hemisuccinate (CHS) stemmed more out of curiosity (we are aware that no cholesterol is present in bacterial membranes). We had started the initial analysis of PaMprF with DDM, in which it was largely monomeric (unpublished observation, supported by the recent publication of PaMprF in DDM by Hankins et al 2025). When we observed that GDN was essential for the stability of the dimer (even LMNG was not sufficient), we asked whether a combination of CHS with DDM would keep the dimer intact. It did not, and GDN was found to be important. The use of CHS for prokaryotic membrane protein studies has now been reported in a few different systems; a recent example is Caliseki et al., 2025. We would like to keep the observation with CHS in the manuscript, and we have moved this figure to Appendix Fig. S3C.

      In addition, a recent report on MgtA, a magnesium transporter (Zeinert et al., 2025), observed that DDM/LMNG resulted in monomeric enzyme while GDN resulted in dimeric enzyme, although in that case the dimer interface was in the soluble domain. We have added this reference and the observation on MgtA to the discussion (page 13, lines 407-411).

      We like to think that the milder GDN tends to keep the membrane proteins or oligomers of membrane proteins more stable but further studies on multiple labile membrane protein systems will be required to substantiate this.

      Hence, I think that the data presented here would be worth publishing. However, a major drawback is that the authors do not sufficiently analyse, characterise and validate the dimer interface and fail to show that the dimer is biologically relevant.

      Further major points:

      • The authors always jump between their structures in detergent and nanodisc during all the descriptions, which makes following the story even more difficult. Please first describe one of the structures and then (briefly) discuss relevant similarities and differences afterwards.

      The flow and description of the structures have been modified and the figures rearranged to make them easier to follow. The panel in figure 2 showing the overlay of the GDN and nanodisc structures has been moved to Appendix Fig. S2B. Thus, figure 2 now describes only the salient features of the structures (the interacting residues between the membrane and soluble domain) and the terminal helix.

      • The difference in dimerization between Pseudomonas and Rhizobium is the most interesting and surprising feature (if true) of the new structures. However, it is not really presented as such. The authors should put more emphasis on making clear that this is a complete rotation of the monomers with respect to each other (by how many degrees?) and they should visualize it even more clearly in Figure 4 (and label the figure so that it is possible to understand it without having to read the text or the legend first).

      We thought the colouring of the TM helices should make the difference in interface more obvious (the N and C-terminal TM helices in different colours). Now, we have also labelled the TM helices, so that it is easier to follow (this was also shown in panel E). The rotation is ~180° and this is now mentioned in the figure legend.

      • P. 10: The authors insinuate that only one of the dimer interfaces, either Pseudomonas or Rhizobium could be real, but disregard the possibility that both might be the biologically relevant interfaces of the respective species and that there might have been a switch of interfaces during evolution. They should also mention and discuss this possibility.

      We did not imply that only one of the interfaces is real; we clearly mentioned that they could also represent different conformational states (page 7, lines 226-228). In the revised version, we have included a multiple sequence alignment (not included in the initial draft, as it had been presented in several previous publications). The MSA (Appendix Fig. S6) reveals that neither of the interfaces is highly conserved.

      • Fig. 5G: The authors claim that the higher molecular band that appears in the mutant is a "dimer with aberrant migration" of >250 kDa as opposed to the expected 150 kDa. They should explain how they came to this conclusion and how they can be sure that the band does not correspond to a higher oligomer (trimer or tetramer). They could show, by extraction and purification scheme similar to the wildtype using first LMNG and then GDN, followed by at least a preliminary EM analysis, that the crosslinked mutant MprF is indeed a dimer, or use other biophysical methods to do the same, otherwise this experiment does not show much. Furthermore, they should also include a cysteine mutant in the part of Pseudomonas MprF that would be involved in a Rhizobium-like interface in their crosslinking experiments to check whether they could also stabilize dimers in this case.

      The band of the double mutant after crosslinking (or even without crosslinking) migrates at a higher molecular weight than that expected for a dimer and could potentially be an oligomer larger than a dimer. We also note that in the previous publication by Song et al 2021, the crosslinking of RtMprF also resulted in a higher molecular weight band (shown also by Western blot).

      We now substantiate the dimer of PaMprF with different approaches. We employed blue-native gel and also SDS-PAGE of the purified protein. This clearly shows that the higher molecular weight band after crosslinking is a dimer (Figure 4B and Fig. EV4D). In particular, in the BN-PAGE, the treatment of mutants with crosslinkers revealed a dimeric band even in the presence of SDS. Further, we have performed cryoEM analysis of the mutants H386C/F389C and H566C. The images, classes and reconstruction show that the enzyme forms a dimer similar to the WT. Interestingly, in the H566C mutant in nanodisc we also observe a small population with an architecture similar to the Rhizobium-like interface (classes shown in Fig. EV7 and Appendix Fig. S5). This prompted us to look closely at other datasets, and it is clear that during reconstitution in nanodisc both kinds of dimer interface occur, but the PaMprF dimer is predominant. We also observe higher-order oligomers (tetramers) in GDN, but as only a few views are visible, a reconstruction could not be obtained (Appendix Fig. S5). In addition, we introduced two cysteines at the Rhizobium-like interface, and no crosslinking on the membranes was observed (Figure 4B). However, it is possible that the chosen residues are not accessible to the crosslinker. Thus, we conclude that the oligomers of PaMprF are labile and sensitive to the nature of the detergent.

      • As the question whether the observed interface is real or an artefact is very central to the value of the structural data and the drawn conclusions from it, the authors should make more effort to analyze and try to validate the interface. First, an analysis of interface properties (buried surface area, nature of the interactions, conservation) should be performed for the interface as observed in the Pseudomonas structure but also for a (hypothetical) Rhizobium-like interface of two Pseudomonas monomers (such a model of a dimer should be easily obtainable by AlphaFold using the available Rhizobium structures as models). Then, experimental methods such as FRET or crosslinking-MS would allow to draw more solid conclusions on the distances between potential interface residues. While these experiments are a certain effort, the question whether the dimer interface is real is so central to the paper that it would be worthwhile to make this effort.

      We have included the interface area and nature of interactions in the revised manuscript (page 7, lines 221-223).

      We attempted AlphaFold for predicting the dimeric structure of PaMprF (and included RtMprF also). Some of the predictions are summarised in figure 1.

      The prediction of the monomer is of high confidence, but that of the oligomer (here, the dimer) is of low confidence (from ipTM values). Even the prediction for the Rhizobium enzyme has low confidence and gives a completely different architecture (and in some trials with lipids, it gives an inverted or non-physiological dimer). Only when the monomer of PaMprF with lipids and tRNA was given as input (requested by reviewer 2 and described below) was an oligomeric structure predicted with some confidence; the rest were not informative.

      • As it seems that detergents might disrupt or modify the dimer interface, it might be an alternative to solubilize the protein in a more native environment by polymer-stabilized nanodiscs using DIBMA or similar molecules.

      We have tried to use SMALPs for extraction of PaMprF. We were able to solubilise the enzyme but are currently unable to enrich it sufficiently for structural studies; this will require further optimisation.

      • Since parts of the Discussion are mostly repetitions of the Results part, and other parts of the Discussion contain a large extent of structure analysis that one would usually expect in the Results part, the authors should consider condensing both into a combined (and overall much shorter) Results & Discussion section.

      We have rewritten much of the discussion section and removed any repetition from the results sections. We would prefer to keep the results and discussion separate.

      Minor points:

      • Explain abbreviations the first time they appear in the text, e.g. TTH

      This is now expanded at its first occurrence.

      • Figure labels are very minimalistic. This should be improved, e.g. by putting labels to important structural features that appear in the text, otherwise the figures are not an adequate support for the text.

      The font size for the labels has been increased.

      • Figure 5: Label where the different oligomers run on the gels

      Labelled.

      Reviewer #1 (Significance (Required)):

      While the structural work appears to be solid and technically well carried out, one big criticism is how the data are presented in the manuscript, how they are analyzed and how they are put into relation to previous work. As structures of MprF from Rhizobium have been published, it is not required and rather distracting to explain the methodological details and the structure of Pseudomonas MprF in such great detail. Instead, the manuscript would benefit very strongly from reaching the interesting and novel parts, the comparison with the previous structures, as early as possible. Overall, the manuscript should be substantially shortened so as not to divert the reader's attention away from the novel parts by drowning them in minuscule descriptions of structural features, such as secondary structure elements or lipid molecule positions, whose relevance to the story and the message of the paper remains completely unclear. Finally, during this revision, care should be taken to improve the language and maybe involve a native speaker in doing so.

      Even more importantly, since the authors observe a dimer interface which strongly deviates from the previously presented arrangement of another species, the most important thing would be to properly characterize this interface and experimentally validate it, neither of which has been done sufficiently. When also taking into account that there were significant differences in the arrangement of the dimer between their structures in GDN and nanodisc, and that in the GDN structure, the cholesterol backbone of GDN appears to be involved in the interface (there should not be any cholesterol in native bacterial membranes!), there is a realistic chance that the observed dimer is an artefact. If the authors cannot convincingly rule out this possibility, all their conclusions are meaningless.

      Hence, I think that the data presented here would be worth publishing. However, a major drawback is that the authors do not sufficiently analyse, characterise and validate the dimer interface and fail to show that the dimer is biologically relevant.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Shaileshanand J. et al., reported the structures of Multiple Peptide Resistance Factor, MprF, which is a bi-functional enzyme in bacteria responsible for aminoacylation of lipid head groups. The authors purified MprF from Pseudomonas aeruginosa in GDN micelles and nanodiscs, and by applying cryo-EM single particle method, they successfully reached near-atomic resolution, and built corresponding atomic models. By applying structural analysis as well as biochemistry methods, the authors demonstrated dimeric formation of MprF, exhibited the dynamic nature of the catalytic domain of this enzyme, and proposed a possible model on tRNA binding and aminoacylation.

      Major comments

      1. In the abstract, the authors stated 'Several lipid-like densities are observed in the cryoEM maps, which might indicate the path taken by the lipids and the coupling function of the two functional domains. Thus, the structure of a well characterised PaMprF lays a platform for understanding the mechanism of amino acid transfer to a lipid head group and subsequent flipping across the leaflet that changes the property of the membrane.' Firstly, those lipid-like densities were shown in Fig 3A. Since densities of lipids of purified membrane proteins often lie within regions of relatively low local resolution or low quality, I think a more detailed description is required of how the authors decided which parts of the density belong to lipid and how they modelled some of the lipids. The authors also modelled phosphatidylglycerol into the GDN MprF; I would require an additional experiment, for instance mass spectrometry of the purified sample, to demonstrate the presence of this specific lipid in the sample. Secondly, regarding the last sentence in the abstract, how these structures lay a platform for further understanding was poorly discussed in both the results and discussion sections. Since the authors clearly stated 'This cavity perhaps provides a path for holding lipids...', the statement in the next sentence, 'Taken together... the vicinity to the cavities described above indicates the possible path taken by the lipids to enter and exit the enzyme', does not have reliable evidence to support it. I would suggest that the authors move these statements into the discussion section and elaborate on this issue, since it is an important part of the abstract, or provide more solid proof using other approaches, such as molecular dynamics simulation, to substantiate these statements in the results section.

      The membranes of E. coli contain predominantly phosphatidyl ethanolamine (PE), with phosphatidyl glycerol (PG) as the next most abundant lipid; cardiolipin, though less abundant, plays an important role in the functioning of many membrane proteins. In our map, the non-protein densities are unambiguous and appear as long densities characteristic of acyl chains (note that GDN used in purification has no acyl chain), and hence we attributed these densities to lipids (Fig. EV4E/F and Figure 5A). Only in a few of these densities could a head group be modelled, and the identity of the lipid at the dimer interface as PG is based on the general requirement of negatively charged lipids for oligomerisation of membrane proteins (for example, KcsA tetramer formation requires PG; Marius et al., 2005; Valiyaveetil et al., 2002; 2004). It is true that the lipid densities are at the peripheral regions of the map, but there only acyl chains have been modelled. Within the membrane domain, one reasonably ordered lipid is observed and, by analogy with the R. tropici structure, it is possible to build a modified PG (in PaMprF, ala-PG). However, the density of the head group is ambiguous (unlike lysine in R. tropici, whose density stands out), and hence we have modelled it as PG alone. The identification and modelling of lipid densities are described in the methods (page 20, lines 649-650).

      We agree that mass spectrometry analysis of purified lipids will be useful but it will not be able to tell the position of the lipid in the map (model) and for this we still require a map at higher resolution with better ordered lipids. We have recently built/developed the workflow for native MS and we plan to initiate analysis of PaMprF in the near future, which will provide details for the lipid purified with the enzyme.

      We had initiated molecular dynamics simulation during the review process, and we had included tRNA molecules (shorter version) as we felt the connection between tRNA binding and lipid modification was important. This would have also explained the path taken by lipids (performed by Hankins et al., 2025 in their publication). However, this is likely to require more work (and computing resources) and both mass spectrometry and molecular dynamics will be part of the future work.

      We have rewritten the discussion and changed the last line of the abstract to the following

      “From the structures, the binding modes of tRNA and lipid transport can be postulated and the mobile secondary structural elements in the synthase domain might play a mechanistic role”.

      (in the abstract, lines 24-26).

      Fig 2B, it seems the H566 sidechains were overlapping in the zoom-in figure of distance measurement between H566 residues. To clarify this, the authors should either present another figure with rotation, to better demonstrate their relative locations, or swap this zoom-in figure with another figure with rotations. Also, could the authors briefly comment on why they chose H566 for distance measurement specifically?

      The side chains of residue H566 in the nanodisc model face each other at the interface; hence this residue was chosen to show the proximity.

      Related to previous comment, I see one additional green square in Fig. 2A and an additional green square in Fig. 2B, without any zoom-in images provided on these regions. Besides, they're focusing on two different domains with same color, any particular reason why they're there? If so, please provide the information in figure legends.

      The green squares in panels 2A and 2B are the regions that have been zoomed in panels 2D and 2E showing the interactions of the TTH. This is now made clear in the legend as well as in the figure.

      Related to previous comment, authors should also provide distance measurement over electrostatic interaction sites in Fig. 2A, since distance plays as an important factor in these forces.

      The electrostatic interactions have been included.

      For Fig. 2C, since the authors have already indicated the differences between the reconstructions of the GDN and nanodisc datasets in Fig. 1, the information provided here seems to be a bit redundant. I suggest either moving this panel to Fig. 1, to visualize both the electron densities and the atomic models, or moving this panel to the supplementary figures.

      We thank the reviewer for the suggestion. The panel, figure 2C is moved to Appendix Fig. S2B.

      Fig. 3B, some of the spheres of the lipids were also marked as red, any particular reason why they're red? Do they indicate they're phosphate heads? If so, could the authors provide evidences how they define these orientations of the lipid heads? If not, any particular reason why they're red?

      The non-protein densities (i.e., densities above noise that remain after modelling of protein residues and are found individually) have been modelled as lipids (these additional densities are shown in Fig. EV4E). Except for a few, all these densities have been modelled only as acyl chains. The lipids modelled with head group and phosphate (which contain oxygen) and their fit to the density are shown in both figure 3A and EV4F. Hence, red (oxygen) is seen in the space-filling model of the lipids (the density for a few lipids is shown, also in the response to the comment below).

      Fig. 3C, the fitted model of lipid and its corresponding density should be added to Fig. S4, to give more detailed view on the quality of the fitting.

      The figure 3 has now been reorganised and the new figure (fig. 5) has only 3 panels. We have provided an enlarged view of the lipids in the membrane domain along with unmodelled densities in 3A. In addition, in fig. EV4F, fit of the lipid to density (select lipids) are shown.

      Fig. 4D and 4E, could the authors also indicate the RMSD values when comparing the differences of RtMprF, PaMprF, ReMprF, this information would be helpful to understand how big of a difference within these three models.

      The RMSD values of the structural comparison are given in the text.

      Fig. 6E, the coloring used for CCA-Ala were similar to the blue part of soluble domain, could the authors change the coloring a bit? Also, for Fig. 6F, I would suggest the authors provide a prediction model, such as using AlphaFold3, of this tRNA interaction site, to further validate this proposed model.

      The colour of the CCA part is changed in the revised figure. Following the suggestion of the reviewer, we used AlphaFold3 to predict the complex formation of PaMprF with tRNA (or a shorter version) (Figure 2). As mentioned above in response to reviewer 1, the prediction of the dimeric enzyme was of low confidence, and this is also reflected when a combination of tRNA, lipids and enzyme sequence is given. If only the CCA end is provided instead of full-length tRNA, the prediction program does position it in the postulated cavity. Only with the monomeric enzyme and tRNA does one get a reasonable model. With respect to the proposed model in 6F, we currently do not have any evidence and this remains a postulate. In the revised manuscript, we have replaced this with a conservation figure, which we thought is more relevant.

      In Supplementary Figures S1 and S3, the angular distribution of maps exhibited preferred orientation to certain extent, 3D FSC estimation should also be supplied for these maps, as an indication of whether the reconstructed densities were affected or not.

      We have included the 3DFSC plots for all the data sets (including the new ones in figures EV1, 2, 5, 6, 7). It is evident that the nanodisc datasets in general are slightly anisotropic.

      For Fig S3B, could the authors switch to another image with better contrast?

      This is now replaced with an image to show the particles.

      Minor comments 1. Fig. 2E and 2F, distance measurement should also be supplied to these two panels.

      We have now included the distance measurement in both the panels, which are now Fig. 2D and 2E.

      Fig. 5D, since in Fig. 4F and 4G already mentioned the skeleton of GDN, this modeling part should be presented before exhibit it in dimer interface, the authors should rearrange the sequence over these three panels.

      The figures in the revised manuscript have been rearranged. Figure 5 (now Figure 4) has been modified to include the biochemical analysis (crosslinking studies), and panel 5D has been removed.

      In Supplementary Figure S3, which density was shown for the PaMprF local resolution estimation result? Authors should provide this information as two maps were shown in this figure.

      The local resolution is for the C2-symmetrised map, and this is now mentioned in the panel.

      CROSS-REFEREE COMMENTS Both Reviewer #1 and #3 made comments over technical issue, their evaluation over functional aspects of this protein is what I was lacking over my comments, also, their evaluation of the biological narrative, relevance toward previous research is also more insightful. Finally, they offer valuable suggestions on how to adjust the article to make it more readable, and better describing the biological story which I would suggest the authors to pay attention to.

      Reviewer #2 (Significance (Required)):

      Significance The authors mainly focused on the structure of MprF in Pseudomonas aeruginosa; this protein is essential for resistance to cationic antimicrobial peptides. A combination of structural and biochemical analysis provided evidence for the dimeric form of this enzyme, and the analysis of the differences between proteins purified using GDN and nanodiscs was particularly interesting. It provides new insight into the flexible nature of this enzyme and could benefit the membrane protein community, as it demonstrates that the choice of detergent or nanodisc can affect the assembly of the protein of interest. Still, some of the statements in the manuscript, for instance the assignment of lipids, were over-claimed and would benefit from additional approaches to support them. I would suggest some refinement in the discussion section as well as some of the figures.

      My expertise: cryo-EM single particle analysis; cryo-ET; sub-tomo averaging; cryo-FIB;

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Jha and Vinothkumar characterize the cryoEM structure of the alanyl-phosphatidylglycerol producing multiple peptide resistance factor (MprF) of Pseudomonas aeruginosa. MprF proteins mediate the transfer of amino acids from aminoacyl-tRNAs to negatively charged phospholipids resulting in reduced membrane interactions with cationic antimicrobial peptides (produced by the host and competing microorganisms). The phospholipid modifications involve in most cases the transfer of lysine or alanine to phosphatidylglycerol. MprF proteins are membrane proteins consisting of a soluble and hydrophobic domain. Multiple functional studies have shown that the soluble domain of MprF mediates the aminoacylation of phosphatidylglycerol, while the hydrophobic domain mediates the "flipping" of aminoacylated phospholipids across the membrane, a process that is crucial to repulse or prevent the interaction of antimicrobial peptides encountered at the outer leaflet of bacterial membranes. Aside from its role in conferring antimicrobial peptide resistance, other roles of MprF have been described including more physiological roles such as improving growth under acidic conditions. Interestingly, MprF proteins are also found in Gram-negative bacteria which are already protected by an additional membrane that includes LPS. However, in Pseudomonas aeruginosa, MprF confers phenotypes that are similar to those observed in Gram-positive bacteria. Importantly, crystal structures of the soluble domain have led to important insights into aminoacyl phospholipid synthesis and recent studies on the cryoEM structure of Rhizobium tropici have confirmed functional and preliminary structural studies with other MprF proteins. The cryoEM structure from R. tropici confirmed the dimeric structure of MprF and supported a role of the hydrophobic domain in flipping lysyl-phosphatidylglycerol across the membrane. 
A comparison of the structures of lysyl-phosphatidylglycerol with alanyl-phosphatidylglycerol producing MprFs could reveal new insights into the mechanism of transferring aminoacyl-phospholipids from the soluble domain to the hydrophobic domain and translocation of alanyl- vs lysyl-phosphatidylglycerol across the membrane.

      Major concerns

      1. The study by Jha and Vinothkumar provides the cryoEM structure of an alanyl-phosphatidylglycerol producing MprF protein, which is in principle an important milestone in gaining a better understanding of the mechanism of aminoacyl-phospholipid synthesis and flipping, including the potentially different requirements of accommodating different aminoacyl-tRNAs and aminoacyl-phospholipid species. However, this is not addressed. The authors present a "distinct architecture" compared to the structure of R. tropici MprF, without providing functional insights, and the focus of the study shifts to the role of detergents in determining MprF structures via cryoEM. Thus, after fundamental discoveries have been made with crystal structures of the soluble domain and the cryoEM structure of R. tropici MprF, this study, while valuable as a resource, seems to offer only an incremental advance in understanding the mode of action of MprF and the potentially different requirements for transferring alanyl-phosphatidylglycerol to the hydrophobic domain and flipping it across the membrane. The reader is left with the finding of a distinct architecture with no further explanation or hypothesis.

      We thank the reviewer for his/her comments. It is true that crystal structures of the soluble domains of MprF (from 3 species) and cryoEM structures (from two Rhizobium species) are now available. However, the cryoEM maps that we have obtained have several salient features, including the distinct dimeric interface and the position of the C-terminal helix of the soluble domain. The latter is particularly important. In a previous study, Hebecker et al 2011 reported that the terminal helix of PaMprF was important for activity and that the construct without the TM domain can also function in modifying the lipids. The full-length cryoEM map of PaMprF in GDN now provides an idea of how this occurs, with the terminal helix buried at the interface. Further, the proposed tRNA binding sites (from Hebecker et al 2015, lysine amide bound structure) face each other in the dimeric architecture of R. tropici, and it is not clear how the full-length tRNA would bind without disrupting the dimer. In contrast, the dimer architecture observed for PaMprF has the tRNA binding sites facing away from each other, so tRNA can bind to the enzyme without any constraints. We think the mobile/dynamic elements (or secondary structure) of the synthase domain play a major role in the interaction with substrates and in the mechanism. The current structures provide some evidence for this and form the basis of future studies. Instead of a cartoon description, we have now included a conservation plot of the molecule along with the surface representation in figure 6 to explain the possible mechanism.

      Differences to R.tropici MprF and other studies are difficult to follow as only a topological map of the Pseudomonas MprF is provided and conserved amino acids that have been shown to be crucial in mediating synthesis and flipping are not highlighted in the text or in the figures, specifically addressed, or discussed. Conserved amino acids in the presented cryoEM structure could provide important mechanistic insights and could address substrate specificity/requirements for aminoacyl phospholipid synthesis, transfer to the hydrophobic domain and flipping.

      The conservation of residues across MprF homologues has been presented in previously published articles and hence we had initially not included it in the manuscript. We have now included a multiple sequence alignment of select homologues of MprF highlighting conserved residues (Appendix Fig. S6) as well as a figure (Fig. 6F) colouring the molecule by conservation scores from CONSURF. In the zoomed-in version in figure 6F, we highlight many of the conserved residues in the synthase domain, as they play a role in substrate selectivity.

      Authors characterize an alanyl-phosphatidylglycerol producing MprF but do not detect the lipid in the cryoEM structure. Thus, the potential path taken by alanyl-phosphatidylglycerol remains unclear. Authors model the detected lipids as phosphatidylglycerol, which may be an interesting finding as it would indicate that MprF is generally capable of flipping phospholipids (this is however not discussed). While it is plausible that MprF flippases may be able to flip phosphatidylglycerol, it could have a different path and structural requirements. It is also difficult to follow what the suggested pathway of flipping is in the Pseudomonas-MprF flippase (compared to R. tropici). Authors could provide a similar overview figure as in Song et al. and indicate what the potential differences are.

      We modelled phosphatidylglycerol as the lipid because the current density doesn’t allow us to model ala-PG unambiguously, though it is found in the same position as the lys-PG in the R. tropici maps. The recent in-vitro assay by Hankins et al 2025 shows that PaMprF is able to flip a wide range of lipids, and we would also like to point out that PG from the outer leaflet can be flipped, its headgroup modified at the inner leaflet, and the product flipped back. As shown by Song et al 2021 and Hebecker et al 2011, the specificity for the substrates resides in the synthase domain (by mutagenesis and swapping). We don’t think there will be any difference between the lys-PG and ala-PG paths, but in our opinion the positional relation between the soluble and membrane domains is the most important aspect and, along with the dimeric architecture, has remained the focus of the manuscript. Figure 6 in the manuscript describes this and provides a summary of the structural observations from the presented structures.

      Minor concerns

      • Page 13: the following sentence should be rephrased: "Among the missing links in the current cryoEM maps is the lack of well-ordered density for lipid molecules on the inner leaflet closer to the re-entrant helices but it is reasonable to assume from the cluster of positive charge that there will be lipid molecules and are dynamic. "

      This has been rephrased.

      • Page 4: Klein et al do not show that the Pseudomonas aeruginosa MprF mediates flipping

      Corrected to reflect only the modification of lipid and not flipping.

      Reviewer #3 (Significance (Required)):

      General assessment: see review

      Advance: Minor

      Audience: Specialized

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors report the structure of the human CTF18-RFC complex bound to PCNA. Similar structures (and more) have been reported by the O'Donnell and Li labs. This study should add to our understanding of CTF18-RFC in DNA replication and clamp loaders in general. However, there are numerous major issues that I recommend the authors fix. 

      Strengths: 

      The structures reported are strong and useful for comparison with other clamp loader structures that have been reported lately. 

      Weaknesses: 

      The structures don't show how CTF18-RFC opens or loads PCNA. There are recent structures from other groups that do examine these steps in more detail, although this does not really dampen this reviewer's enthusiasm. It does mean that the authors should spend their time investigating aspects of CTF18-RFC function that were overlooked or not explored in detail in the competing papers. The paper poorly describes the interactions of CTF18-RFC with PCNA and the ATPase active sites, which are the main interest points. The nomenclature choices made by the authors make the manuscript very difficult to read. 

      Reviewer #2 (Public review): 

      Summary 

      Briola and co-authors have performed a structural analysis of the human CTF18 clamp loader bound to PCNA. The authors purified the complexes and formed a complex in solution. They used cryo-EM to determine the structure to high resolution. The complex assumed an auto-inhibited conformation, where DNA binding is blocked, which is of regulatory importance and suggests that additional factors could be required to support PCNA loading on DNA. The authors carefully analysed the structure and compared it to RFC and related structures. 

      Strength & Weakness 

      Their overall analysis is of high quality, and they identified, among other things, a human-specific beta-hairpin in Ctf18 that flexibly tethers Ctf18 to Rfc2-5. Indeed, deletion of the beta-hairpin resulted in reduced complex stability and a reduction in a primer extension assay with Pol ε. This is potentially very interesting, although some more work is needed on the quantification. Moreover, the authors argue that the Ctf18 ATP-binding domain assumes a more flexible organisation, but their visual representation could be improved. 

      The data are discussed accurately and relevantly, which provides an important framework for rationalising the results. 

      All in all, this is a high-quality manuscript that identifies a key intermediate in CTF18-dependent clamp loading. 

      Reviewer #3 (Public review): 

      Summary: 

      CTF18-RFC is an alternative eukaryotic PCNA sliding clamp loader that is thought to specialize in loading PCNA on the leading strand. Eukaryotic clamp loaders (RFC complexes) have an interchangeable large subunit that is responsible for their specialized functions. The authors show that the CTF18 large subunit has several features responsible for its weaker PCNA loading activity and that the resulting weakened stability of the complex is compensated by a novel beta hairpin backside hook. The authors show this hook is required for the optimal stability and activity of the complex. 

      Relevance: 

      The structural findings are important for understanding RFC enzymology and novel ways that the widespread class of AAA ATPases can be adapted to specialized functions. A better understanding of CTF18-RFC function will also provide clarity into aspects of DNA replication, cohesion establishment, and the DNA damage response. 

      Strengths: 

      The cryo-EM structures are of high quality enabling accurate modelling of the complex and providing a strong basis for analyzing differences and similarities with other RFC complexes. 

      Weaknesses: 

      The manuscript would have benefitted from more detailed biochemical analysis to tease apart the differences with the canonical RFC complex. 

      I'm not aware of using Mg depletion to trap active states of AAA ATPases. Perhaps the authors could provide a reference to successful examples of this and explain why they chose not to use the more standard practice in the field of using ATP analogues to increase the lifespan of reaction intermediates. 

      Overall appraisal: 

      Overall the work presented here is solid and important. The data is sufficient to support the stated conclusions and so I do not suggest any additional experiments. 

      Reviewer #1 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and for their thorough review. All raised points have been addressed below.

      Major points 

      (1) The nomenclature used in the paper is very confusing and sometimes incorrect. The authors refer to CTF18 protein as "Ctf18", and the entire CTF18-RFC complex as "CTF18". This results in massive confusion because it is hard to ascertain whether the authors are discussing the individual subunits or the entire complex. Because these are human proteins, each protein name should be fully capitalized (i.e. CTF18, RFC4 etc). The full complex should be referred to more clearly with the designation CTF18-RFC or CTF18-RLC (RFC-like complex). Also, because the yeast and human clamp loader complexes use the same nomenclature for different subunits, it would be best for the authors to use the "A, B, C, D, E subunit" nomenclature that has been standard in the field for the past 20 years. Finally, the authors try to distinguish PCNA subunits by labeling them "PCNA2" or "PCNA1" (see Page 8 lines 180,181 for an example). This is confusing because the names of the RFC subunits have similar formats (RFC2, RFC3, RFC4, etc). In the case of RFC this denotes unique genes, whereas PCNA is a homotrimer. Could the authors think of another way to denote the different subunits, such as super/subscript? PCNA-I, PCNA-II, PCNA-III? 

      We thank the reviewer for pointing out the confusing nomenclature. Following the referee’s suggestion, we now refer to the full CTF18 complex as “CTF18-RFC”. We prefer to keep the nomenclature used for the CTF18-RFC subunits as RFC2, RFC3 etc., as recently used in Yuan et al, Science, 2024. However, we followed the referee’s suggestion for the PCNA subunits, which are now referred to as PCNA-I, PCNA-II and PCNA-III.

      (2) I believe that the authors are over-interpreting their data in Figure 1. The claim that "less sharp definition" of the map corresponding to the AAA+ domain of Ctf18 supports a relatively high mobility of this subunit is largely unsubstantiated. There are several reasons why one could get varying resolution in a cryo-EM reconstruction, such as compositional heterogeneity, preferred orientation artifacts, or how the complex interacts with the air-water interface. If other data were presented that showed this subunit is flexible, this evidence would support that data, but it cannot serve alone as justification for subunit mobility. Along these lines, how was the buried surface area (2300 vs 1400 Å²) calculated? Is this the total surface area or only the buried surface area involving the AAA+ domains? It is surprising that these numbers are so different considering that the subunits and complexes look so similar (Figures 1c and 2b). 

      We respectfully disagree with the suggestion that our interpretation of local flexibility in the AAA+ domain of Ctf18 is overreaching. Several lines of evidence support this interpretation. First, compositional heterogeneity is unlikely, as the A′ domain of Ctf18 is well-resolved and forms stable interactions with RFC3, indicating that Ctf18 is consistently incorporated into the complex. Second, preferred orientation artifacts are excluded, as the particle distribution shows excellent angular coverage (Fig. S9a). Third, we now include a 3D variability analysis (3DVA; Supplementary Video 1), which reveals local conformational heterogeneity centered around the AAA+ domain of Ctf18, consistent with intrinsic flexibility.

      Regarding the buried surface area values, the reported numbers refer specifically to the interfaces between the AAA+ domain of Ctf18 and RFC2, and are derived from buried surface area calculations performed with PISA. The smaller interface (~1400 Ų) compared to RFC1–RFC2 (~2300 Ų) reflects low sequence identity (~26%) and divergent structural features, including the absence of conserved elements such as the canonical PIP-box in Ctf18. We have clarified and expanded this explanation in the revised manuscript (Page 7).

      (3) The authors very briefly discuss interactions with PCNA and how the CTF18-RFC complex differs from the RFC complex. This is amongst the most interesting results from their work, but also not well-developed. Moreover, Figure 3D describing these interactions is extremely unclear. I feel like this observation had potential to be interesting, but is largely ignored by the authors. 

      We thank the referee for pointing this out. We have expanded the section describing the interactions of CTF18-RFC and PCNA (Page 9 in the new manuscript), and made a new panel figure with further details (Fig. 3D).  

      (4) The authors make the observation that key ATP-binding residues in RFC4 are displaced and incompatible with nucleotide binding in their CTF18-RFC structure compared to the hRFC structure. This should be a main-text figure showing these displacements and how it is incompatible with ATP binding. Again, this is likely an interesting finding that is largely glossed over by the authors. 

      We now discuss this feature in detail (Page 11 in the new manuscript), and have added two figure insets (Fig. 4c) describing the incompatibility of RFC4 with nucleotide binding.

      (5) The authors claim that the work of another group (citation 50) "validate(s) our predictions regarding the significant similarities between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction." However, as far as this reviewer can tell the work in citation 50 was posted online before the first draft of this manuscript appeared on biorxiv, so it is dubious to claim that these were "predictions." 

      We agree with the referee about this claim. We have now revised the text as follows:

      “While our work was being finalized, several cryo-EM structures of human CTF18-RFC bound to PCNA and primer/template DNA were reported by another group (He et al, PNAS, 2024). These findings are consistent with the distinct features of CTF18-RFC observed in our structures and independently support the notion of significant mechanistic similarity between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction”.

      (6) The authors use a primer extension assay to test the effects of truncating the N-terminal beta hairpin of CTF18. However, this assay is only a proxy for loading efficiency and the observed effects of the mutation are rather subtle. The authors could test their hypothesis more clearly if they performed an ATPase assay or even better a clamp loading assay. 

      We thank the referee for this valuable suggestion. In response, we have performed clamp loading assays comparing the activities of human RFC, wild-type CTF18-RFC, and the β-hairpin–truncated CTF18-RFC mutant. The results, now presented in Fig. 6 and Table 1 of the revised manuscript, clearly show that truncation of the N-terminal β-hairpin results in a slower rate of PCNA loading. We propose that this reduced loading rate likely contributes to the diminished Pol ε–mediated DNA synthesis observed in the primer extension assays.

      Minor points 

      (1) Page 3 line 53 the introduction suggests that ATP hydrolysis prompts clamp closure. While this may be the case, to my knowledge all recent structural work shows that closure can occur without ATP hydrolysis. It may be better to rephrase it to highlight that under normal loading conditions, ATP hydrolysis occurs before clamp closure. 

      The text now reads (Page 3): 

      “DNA binding prompts the closure of the clamp and hydrolysis of ATP induces the concurrent disassembly of the closed clamp loader from the sliding clamp-DNA complex, completing the cycle necessary for the engagement of the replicative polymerases to start DNA synthesis.”

      (2) Page 3 line 60, I do not see how the employment of alternative loaders highlights the specificity of the loading mechanism - would it not be possible for multiple loaders to have promiscuous clamp loading? 

      We thank the referee for this comment. The text now reads (Page 3):

      “However, eukaryotes also employ alternative loaders (20), including CTF18-RFC (6, 21-24), which likely use a conserved loading mechanism but are functionally specialized through specific protein interactions and context-dependent roles in DNA replication.”

      (3) Page 4 line 75 could you please cite a study that shows Ctf8 and Dcc1 bind to the Ctf18 C-terminus and that a long linker is predicted to be flexible? 

      Two references have been added (Stokes et al, NAR, 2020 and Grabarczyk et al, Structure, 2018)

      (4) Figure 2A has the N-terminal region of Ctf18 as bound to RFC3 but should likely be labeled as bound to RFC5. This caused significant confusion while trying to parse this figure. Further, the inclusion of "X" as a sequence - does this refer to a sequence that was not buildable in the cryo-EM map? I would be surprised that density immediately after the conserved DEXX box motif is unbuildable. If this is the case, it should be clearly stated in the figure legend that "X" denotes an unbuildable sequence. For the conserved beta-hairpin in the sequence, could the authors superimpose the AlphaFold prediction onto their structure? It would be more informative than just looking at the sequence. 

      We apologize for this confusion. The error in Figure 2A has been corrected. The figure caption now explicitly says that “X” refers to amino acid residues in the sequence which were not modelled. A superposition of the cryo-EM model of the N-terminal β-hairpin in human Ctf18 and the AlphaFold predictions for this feature in Drosophila and yeast Ctf18 is now presented in Figure 2A.

      (5) Page 8 line 168, the use of the term "RFC5" here feels improper, since the "C" subunit is not RFC5 in all lower eukaryotes (see comment above about nomenclature). For instance, in S cerevisiae, the C subunit is RFC3. I would expect this interaction to be maintained in all C subunits, not all RFC5 subunits. 

      The text now reads (Page 8):

      “Therefore, lower eukaryotes may use a similar β-hairpin motif to bind the corresponding subunit of the RFC-module complex (RFC5 in human, Rfc3 in S. cerevisiae), emphasizing its importance.”

      (6) Page 10 line 228, the authors claim that hydrolysis is dispensable at the Ctf18/RFC2 interface based on evidence from RFC1/RFC2 interface, by analogy that this is the "A/B" interface in both loaders. However, the wording makes it sound as if the cited data were collected while studying Ctf18 loaders. The authors should clarify this point. 

      The text has been modified as follows (Page 11): 

      “Prior research has indicated that hydrolysis at the large subunit/RFC2 interface is not essential for clamp loading by various loaders (48-51), while the others are critical for the clamp-loading activity of eukaryotic RFCs.”

      (7) Page 11 line 243/244 the authors introduce the separation pin. Could they clarify whether Ctf18 contains any aromatic residues in this structural motif that would suggest it serves the same functional purpose? Also, the authors highlight this is similar to yeast RFC, which makes it sound like this is not conserved in human RFC, but the structural motif is also conserved in human RFC. 

      We thank the reviewer for this helpful comment. We have clarified in the revised text (Page 12) that the separation pin is conserved not only in yeast RFC but also in human RFC, and now note that human Ctf18 also harbors aromatic residues at the corresponding positions. This observation is supported by the new panel in Figure 4e.

      Minutia 

      (1) Page 2 line 37 please remove the word "and" before PCNA. 

      This has been corrected.

      (2) Please define AAA+ and update the language to clarify that not all pentameric AAA+ ATPases are clamp loaders. 

      AAA+ has been now defined (Page 3).

      (3) Page 4 line 86 Given the relatively weak interaction of Pol ε. 

      This has been corrected.

      (4) Page 8 line 204 the authors likely mean "leucine" and not "lysine". 

      We thank the reviewer for catching this. The error has been corrected.

      (5) Page 14 line 300, the authors claim that CTF18 utilizes three subunits but then list four. 

      We have corrected this.

      Reviewer #2 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and valuable suggestions. The points raised by the referee have been addressed below.

      Major point: 

      (1) Please quantify Figure 6 and S9 from 3 independent repeats and determine the standard deviation to show the variability of the Ctf18 beta hairpin deletion.  The authors suggest that a suboptimal Ctf18 complex interaction with PCNA impacts the stability of the complex, but do not test this hypothesis. Could the suboptimal PIP motif in Ctf18 be changed to an improved motif and the impact tested in the primer extension assay? Although not essential, it would be a nice way to explore the mechanism. 

      We thank the reviewer for the suggestion. However, we note that Figure 6b (now 7b) already presents the quantification of the primer extension assay from three independent replicates, with error bars showing standard deviations, and includes the calculated rate of product accumulation. These data clearly indicate a 42% reduction in primer synthesis rate upon deletion of the Ctf18 β-hairpin.
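      The kind of summary described above (a mean rate with its standard deviation from independent replicates, and the percent reduction of the mutant relative to wild type) can be sketched as follows. This is only an illustration of the arithmetic: the function name and the replicate values are hypothetical placeholders, not the measured rates behind Figure 7b.

```python
from statistics import mean, stdev

def summarize_rates(wt_rates, mutant_rates):
    """Return (wt_mean, wt_sd, mut_mean, mut_sd, percent_reduction).

    Each argument is a list of per-replicate rates of product accumulation
    (arbitrary units). percent_reduction is relative to the wild-type mean.
    """
    wt_mean, mut_mean = mean(wt_rates), mean(mutant_rates)
    wt_sd, mut_sd = stdev(wt_rates), stdev(mutant_rates)  # sample SD, n - 1
    reduction = 100.0 * (wt_mean - mut_mean) / wt_mean
    return wt_mean, wt_sd, mut_mean, mut_sd, reduction

# Hypothetical placeholder values for three replicates each (NOT real data).
wt = [2.0, 2.2, 1.8]
mutant = [1.0, 1.1, 0.9]
wt_mean, wt_sd, mut_mean, mut_sd, reduction = summarize_rates(wt, mutant)
print(f"WT {wt_mean:.2f} +/- {wt_sd:.2f}; mutant {mut_mean:.2f} +/- {mut_sd:.2f}; "
      f"reduction {reduction:.0f}%")
```

      With only three replicates the sample standard deviation is a rough estimate of variability, which is one reason to report the individual data points alongside the error bars.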

      We agree that we do not provide direct evidence of impaired complex stability upon deletion of the Ctf18 β-hairpin. However, the 2D classification of the cryo-EM dataset (Figure S9) shows a marked reduction in the number of particles corresponding to intact CTF18-RFC–PCNA complexes in the β-hairpin deletion sample, with the majority of particles corresponding to free PCNA. This contrasts with the wild-type dataset, where complex particles are predominant. These findings indirectly suggest that deletion of the β-hairpin compromises the stability or assembly of the clamp-loader–clamp complex.

      We thank the reviewer for the valuable suggestion to mutate the weak PIP-box of Ctf18. While an interesting direction, we instead sought to directly test the mechanism by performing quantitative clamp loading assays. These assays revealed a significant reduction in the rate of PCNA loading by the CTF18<sup>Δ165–194</sup>-RFC mutant (Figure 6), supporting the conclusion that the β-hairpin contributes to productive PCNA loading. This loading delay likely underlies the reduced rate of primer extension observed in the Pol ε assay (Figure 7), consistent with impaired formation of processive polymerase–clamp complexes.

      (2) I did not see the method describing how the 2D classes were quantified to evaluate the impact of the Ctf18 beta hairpin deletion on complex formation. Please add the relevant information. 

      The relevant information has been added to the Method section:

      “For quantification of complex stability, the number of particles contributing to each 2D class was extracted from the classification metadata (Datasets 1 and 3). All classes showing isolated PCNA rings were summed and compared to the total number of particles in classes representing intact CTF18-RFC–PCNA complexes. This analysis was performed for both wild-type and β-hairpin deletion mutant datasets. Notably, no 2D classes corresponding to free PCNA were observed in the wild-type dataset, whereas in the mutant dataset, a substantial fraction of particles corresponded to isolated PCNA, suggesting reduced stability of the mutant complex.”
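      The counting procedure quoted above can be sketched in a few lines. This is a minimal illustration that assumes each 2D class has already been labelled by inspection and that per-class particle counts have been read from the classification metadata; the labels, the function name and the counts are hypothetical, and no particular software's file format is implied.

```python
from collections import Counter

def summarize_2d_classes(classes):
    """Sum particles per class label and report the free-PCNA fraction.

    `classes` is an iterable of (label, n_particles) pairs, where label is
    "complex" (intact clamp loader-clamp complex), "free_pcna" (isolated
    PCNA ring) or anything else (ignored, e.g. junk classes).
    Returns (free_pcna_particles, complex_particles, fraction_free).
    """
    totals = Counter()
    for label, n_particles in classes:
        totals[label] += n_particles
    free, intact = totals["free_pcna"], totals["complex"]
    considered = free + intact
    fraction_free = free / considered if considered else float("nan")
    return free, intact, fraction_free

# Hypothetical placeholder counts (NOT taken from Datasets 1 or 3).
example = [("complex", 1200), ("free_pcna", 800),
           ("free_pcna", 1000), ("other", 500)]
free, intact, fraction_free = summarize_2d_classes(example)
print(free, intact, round(fraction_free, 2))
```

      Because the fraction depends on which classes are assigned to each label, the assignment criteria need to be applied identically to the wild-type and mutant datasets for the comparison to be meaningful.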

      Minor point: 

      (1) Page 2, line 25. Detail what type of mobility is referred to. Do you mean flexibility in the EM-map? 

      We have clarified this. The text now reads:

      “The unique RFC1 (Ctf18) large subunit of CTF18-RFC, which based on the cryo-EM map shows high relative flexibility, is anchored to PCNA through an atypical low-affinity PIP box”

      (2) Page 4, line 82. Please introduce CMGE, or at least state what the abbreviation stands for. 

      This has been addressed.

      (3) Page 4, line 89. Specify that the architecture of the HUMAN CTF18-RFC module is not known, as the yeast one has been published. 

      At the time our study was initiated, the architecture of the human CTF18-RFC module was unknown. A structure of the human complex was published by another group during the final stages of our work and is now properly acknowledged in the Discussion.

      (4) Page 6. Is it possible to illustrate why the autoinhibited state cannot bind to DNA? A visual representation would be nice. 

      We thank the reviewer for this suggestion. Figure 4b in the original manuscript already illustrates why the autoinhibited, overtwisted conformation of the CTF18-RFC pentamer cannot accommodate DNA. In this state, the inner chamber of the loader is sterically occluded, precluding the binding of duplex DNA.

      Reviewer #3 (Recommendations for the authors): 

      We thank Reviewer #3 for their constructive feedback and positive overall assessment of our work.

      We also thank the reviewer for their remarks on the use of Mg depletion to halt hydrolysis. Magnesium is an essential cofactor for ATP hydrolysis, and its depletion is expected to effectively prevent catalysis by destabilizing the transition state, possibly more completely than the use of slowly hydrolysable analogues such as ATPγS. We have recently employed Mg<sup>2+</sup> depletion to successfully trap a pre-hydrolytic intermediate in a replicative AAA+ helicase engaged in DNA unwinding (Shahid et al., Nature, 2025). This precedent supports the rationale for our choice, and the reference has now been included in the revised manuscript.

      I think the authors deposited the FSC curve for the +Mg structure in the -Mg structure PDB/EMDB entry according to the validation report. 

      We thank the reviewer for their careful inspection of the deposition materials. The discrepancy in the deposited FSC curve has now been corrected, and the appropriate FSC curves have been assigned to the correct PDB/EMDB entries.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper measures the positioning and diffusivity of RNaseE-mEos3.2 proteins in E. coli as a function of rifampicin treatment, compares RNaseE to other E. coli proteins, and measures the effect of changes in domain composition on this localization and motion. The straightforward study is thoroughly presented, including very good descriptions of the imaging parameters and the image analysis/modeling involved, which is good because the key impact of the work lies in presenting this clear methodology for determining the position and mobility of a series of proteins in living bacteria cells. 

      Thank you for the nice summary and positive feedback on the descriptions and methodology. 

      My key notes and concerns are listed below; the most important concerns are indicated with asterisks. 

      (1) The very start of the abstract mentions that the domain composition of RNase E varies among species, which leads the reader to believe that the modifications made to E. coli RNase E would be to swap in the domains from other species, but the experiment is actually to swap in domains from other E. coli proteins. The impact of this work would be increased by examining, for instance, RNase E domains from B. subtilis and C. crescentus as mentioned in the introduction. 

      Thank you for the suggestions. We agree that the sentence may convey an unintended expectation. Our original intention was to note the presence and absence of certain domains of RNase E (e.g. membrane-binding motif and CTD) vary across species, rather than the actual sequence variations. To avoid any misinterpretation, we decided to remove the sentence from the abstract. Using the domains of B. subtilis and C. crescentus RNase E in E. coli is a very interesting suggestion, but we will leave that for a future study. 

      (2) Furthermore, the introduction ends by suggesting that this work will modulate the localization, diffusion, and activity of RNase E for "various applications", but no applications are discussed in the discussion or conclusion. The impact of this work would be increased by actually indicating potential reasons why one would want to modulate the activity of RNase E. 

      Thank you for this suggestion. For example, an E. coli strain expressing membrane-bound RNase E without the CTD can help stabilize mRNAs and enhance protein expression. In fact, this idea was used in a commercial BL21 cell line (Invitrogen’s One Shot BL21 Star) to increase the yield of protein expression. We also think that environmentally modulated MB% of RNase E can be useful for controlling mRNA half-lives and protein expression levels in different conditions. We discussed these ideas at the end of the Discussion.

      (3) Lines 114 - 115: "The xNorm histogram of RNase E shows two peaks corresponding to each side edge of the membrane": "side edge" is not a helpful term. I suggest instead: "...corresponding to the membrane at each side of the cell" 

      Thank you. We made the suggested change.

      (4) A key concern of this reviewer is that, since membrane-bound proteins diffuse more slowly than cytoplasmic proteins, some significant undercounting of the % of cytoplasmic proteins is expected due to decreased detectability of the faster-moving proteins. This would not be a problem for the LacZ imaging where essentially all proteins are cytoplasmic, but would significantly affect the reported MB% for the intermediate protein constructs. How is this undercounting considered and taken into account? One could, for instance, compare LacZ vs. LacY (or RNase E) copy numbers detected in fixed cells to those detected in living cells to estimate it.  

      Thank you for raising this point and suggesting a possible way to address it. To test for undercounting of cytoplasmic molecules, we compared the number of tracks for mEos3.2-fused proteins in live versus fixed cells. For WT RNase E, about 50% fewer molecules were detected in fixed cells than in live cells, which agrees with the expectation that fluorescent proteins lose signal upon fixation. Similarly, the copy number of cytoplasmic RNase E (RNase E ΔMTS) was ~50% lower in fixed cells than in live cells. If cytoplasmic molecules were undercounted relative to membrane-bound molecules in live cells, fixation would reduce their copy number by less than 50%. The comparable ~50% reduction indicates that the undercounting issue is not significant. This control analysis is provided in Figure S1B-C, and we made the corresponding textual change in the Results section as below:

      For this analysis, we first confirmed that proteins localized on the membrane and in the cytoplasm are detected with equal probability, despite differences in their mobilities (Fig. S1B-C). 
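
      The reasoning behind this live-versus-fixed control can be made explicit with a short calculation. This is a sketch under stated assumptions: fixation is taken to remove the same fraction of fluorescent signal for both constructs, fixed molecules are taken to be detected with equal probability regardless of their original location, and all counts below are hypothetical rather than values from the study.

```python
def retained_fraction(n_live, n_fixed):
    """Fraction of the live-cell count that is still detected after fixation."""
    return n_fixed / n_live


def relative_detection_efficiency(retained_membrane, retained_cytoplasmic):
    """Live-cell detection efficiency of cytoplasmic relative to membrane molecules.

    If fast cytoplasmic molecules were missed in live cells, their live count
    would be too low, so their retained fraction after fixation would be
    higher than that of membrane-bound molecules and this ratio would drop
    below 1.
    """
    return retained_membrane / retained_cytoplasmic


# Hypothetical counts mirroring the ~50% loss reported for both constructs.
wt = retained_fraction(n_live=1000, n_fixed=500)         # membrane-bound
delta_mts = retained_fraction(n_live=800, n_fixed=400)   # cytoplasmic
efficiency = relative_detection_efficiency(wt, delta_mts)

# Hypothetical counter-example: cytoplasmic molecules undercounted in live cells.
undercounted = retained_fraction(n_live=600, n_fixed=400)
biased_efficiency = relative_detection_efficiency(wt, undercounted)

print(f"equal loss: {efficiency:.2f}, undercounting case: {biased_efficiency:.2f}")
```

      A ratio close to 1, as in the first case, is what a comparable ~50% reduction for both constructs implies.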

      (5) The rifampicin treatment study is not presented well. Firstly, it is found that LacY diffuses more rapidly upon rifampicin treatment. This change is attributed to changes in crowding at the membrane due to mRNA. Several other things change in cells after adding rif, including ATP levels, and these factors should be considered. More importantly, since the change in the diffusivity of RNaseE is similar to the change in diffusivity of LacY, then it seems that most of the change in RNaseE diffusion is NOT due to RNaseE-mRNA-ribosome binding, but rather due to whatever crowding/viscosity effects are experienced by LacY (along these lines: the error reported for D is SEM, but really should be a confidence interval, as in Figure 1, to give the reader a better sense of how different (or similar) 1.47 and 1.25 are). 

      We agree with the reviewer that upon rifampicin treatment, RNase E’s D increases to a similar extent as that of LacY. Hence, the increase likely arises from a factor common to both proteins. We have added the reviewer’s suggested interpretation as a possible explanation in the manuscript as below. 

      The similar fold change in D<sub>RNE</sub> and D<sub>LacY</sub> upon rif treatment suggests that the change in RNE diffusion may largely be attributed to physical changes in the intracellular environment (such as reduced viscosity or macromolecular crowding[41,42]), rather than a loss of RNA-RNE interactions.

      As requested by the reviewer, we have provided confidence intervals for our D values in Table S8. Because these intervals are very narrow, we chose to present the SEM as the error metric for D and have also reported the corresponding errors for the fold-change values whenever we describe the fold differences between D values. 
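
      The difference between the two error metrics discussed here can be illustrated with a bootstrap over per-track values. This is a minimal sketch, assuming that the reported D is summarized as the mean of per-track estimates and that a percentile interval is used; the estimator in the manuscript may differ, and the data below are synthetic.

```python
import random
import statistics


def bootstrap_mean(values, n_boot=2000, seed=0):
    """Bootstrap SEM and 95% percentile confidence interval of the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    sem = statistics.stdev(means)        # spread of the bootstrap means
    lower = means[int(0.025 * n_boot)]   # 2.5th percentile
    upper = means[int(0.975 * n_boot) - 1]  # 97.5th percentile
    return sem, (lower, upper)


# Synthetic per-track diffusion coefficients (hypothetical values).
track_d = [0.010 + 0.001 * (i % 11) for i in range(200)]
sem, (ci_low, ci_high) = bootstrap_mean(track_d)
print(f"mean = {statistics.fmean(track_d):.4f}, SEM = {sem:.5f}, "
      f"95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

      For an approximately normal bootstrap distribution the 95% interval spans roughly four SEMs, so with many tracks both metrics become narrow together.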

      (6) Lines 185-189: it is surprising to me that the CTD mutants both have the same change in D (5.5x and 5.3x) relative to their full-length counterparts since D for the membrane-bound WT protein should be much less sensitive to protein size than D for the cytoplasmic MTS mutant. Can the authors comment? 

      Perhaps the reviewer understood that these differences are the ratios between +/-CTD (e.g. WT RNE vs ΔCTD). However, the differences we mentioned were from membrane-bound vs cytoplasmic versions of RNase E with comparable sizes (e.g. WT RNase E vs RNase E ΔMTS). We modified text and added a summary sentence at the end of the paragraph to clarify the point.

      We found that D<sub>ΔMTS</sub> is ~5.5 times that of D<sub>RNE</sub> (Fig. 3B). [...] Together, these results suggest that the membrane binding reduces RNE mobility by a factor of 5.

      That being said, we also noticed a similar fold difference between +/-CTD. Specifically, WT RNE vs RNE ΔCTD (both membrane-bound) show a ~4.1-fold difference, and RNE ΔMTS vs RNE ΔMTS ΔCTD (both cytoplasmic) show a ~3.9-fold difference. We currently do not have a clear explanation for this pattern. Given that these two pairs have a similar change in mass, we speculate that the relationship between D and molecular mass may be comparable for membrane-bound and free-floating RNE variants. 

      (7) Lines 190-194. Again, the confidence intervals and experimental uncertainties should be considered before drawing biological conclusions. It would seem that there is "no significant change" in the rhlB and pnp mutants, and I would avoid saying "especially for ∆pnp" when the same conclusion is true for both (one shouldn't say 1.04 is "very minute" and 1.08 is just kind of small - they are pretty much the same within experiments like this). 

      Thank you for raising this point, which we fully agree with. That being said, we decided to remove results related to the degradosome proteins to improve the flow of the paper. We are preparing another paper related to the RNA degradosome complex formation. 

      (8) Lines 221-223 " This is remarkable because their molecular masses (and thus size) are expected to be larger than that of MTS" should be reconsidered: diffusion in a membrane does not follow the Einstein law (indeed lines 223-225 agree with me and disagree with lines 221-223). (Also the discussion paragraph starting at line 375). Rather, it is generally limited by the interactions with the transmembrane segments with the membrane. So Figure 3D does not contain the right data for a comparison, and what is surprising to me is that MTS doesn't diffuse considerably faster than LacY2. 

      We agree with the reviewer’s point that diffusion in a membrane does not follow the Stokes-Einstein law. That is why we introduced Saffman’s model. However, even in this model, larger (or more massive) proteins should diffuse more slowly than smaller ones (which is why we presented Figure 3D, now 4D). In other words, both the Einstein and Saffman models predict that larger particles diffuse more slowly, although the exact scaling relationship differs between the two models. Here, we assume that mass is related to size. Contrary to Saffman’s model for membrane proteins, LacY2 diffuses faster than MTS despite its larger size. Using MD simulations, we showed that this discrepancy can be explained by different interaction energies, as the reviewer mentioned. This analysis further demonstrates that size is not the only factor to consider for protein diffusion in the membrane. We edited the paragraph to clarify the expectations and our interpretations.

      According to the Stokes-Einstein relation for diffusion in simple fluids[49] and the Saffman-Delbruck diffusion model for membrane proteins, D decreases as particle size increases, albeit with different scaling behaviors. […] Thus, if size (or mass) were the primary determinant of diffusion, LacY2 and LacY6 would diffuse more slowly than the smaller MTS. The observed discrepancy instead implies that D may be governed by how each motif interacts with the membrane. For example, the way that TM domains are anchored to the membrane may facilitate faster lateral diffusion with surrounding lipids. 
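
      The "different scaling behaviors" mentioned in this passage can be compared numerically. The sketch below uses the standard Stokes-Einstein expression, D = k_B T / (6πηa), and the standard Saffman-Delbrück expression, D = k_B T / (4πμ_m h) · [ln(μ_m h / (μ_w a)) − γ]. The viscosities, membrane thickness, and radii are illustrative assumptions, not parameters taken from the manuscript.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def stokes_einstein(radius_m, viscosity_pa_s, temperature_k=310.0):
    """Diffusion coefficient of a sphere in a simple fluid (scales as 1/a)."""
    return K_B * temperature_k / (6.0 * math.pi * viscosity_pa_s * radius_m)


def saffman_delbruck(radius_m, membrane_viscosity_pa_s, water_viscosity_pa_s,
                     thickness_m, temperature_k=310.0):
    """Diffusion coefficient of a membrane inclusion (logarithmic in 1/a)."""
    ratio = (membrane_viscosity_pa_s * thickness_m) / (water_viscosity_pa_s * radius_m)
    prefactor = K_B * temperature_k / (
        4.0 * math.pi * membrane_viscosity_pa_s * thickness_m
    )
    return prefactor * (math.log(ratio) - EULER_GAMMA)


# Illustrative (assumed) parameters: doubling the radius from 1 nm to 2 nm.
small, large = 1e-9, 2e-9
se_ratio = stokes_einstein(large, 1e-3) / stokes_einstein(small, 1e-3)
sd_ratio = (saffman_delbruck(large, 0.1, 1e-3, 4e-9)
            / saffman_delbruck(small, 0.1, 1e-3, 4e-9))
print(f"Stokes-Einstein D ratio: {se_ratio:.2f}")
print(f"Saffman-Delbruck D ratio: {sd_ratio:.2f}")
```

      Both ratios are below 1, so both models predict slower diffusion for the larger particle, but the membrane model is much less sensitive to size. Neither model predicts that a larger anchor diffuses faster, which is the expectation the LacY2 and LacY6 observation departs from.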

      (9) The logical connection between the membrane-association discussion (which seems to ignore associations with other proteins in the cell) and the preceding +/- rifampicin discussion (which seeks to attribute very small changes to mRNA association) is confusing.

      Thank you for raising this point. We re-arranged the second Results section to present the effect of membrane binding on diffusion first, before the rifampicin results. Furthermore, we stated our hypothesis and expectations at the beginning of the Results section. This addition clarifies the logical flow.

      (10) Separately, the manuscript should be read through again for grammar and usage. For instance, the title should be: "Single-molecule imaging reveals the *roles* of *the* membrane-binding motif and *the* C-terminal domain of RNase E in its localization and diffusion in Escherichia coli". Also, some writing is unwieldy, for instance, "RNase E's D" would be easier to read if written as D_{RNaseE}. (underscore = subscript), and there is a lot of repetition in the sentence structures. 

      Thank you for catching grammar mistakes. We went through extensive proofreading to avoid these mistakes and also used simple notation suggested by the reviewer, such as D<sub>RNE</sub>, to make it easier to read. Thank you again for your suggestions.

      Reviewer #2 (Public review): 

      Summary: 

      Troyer and colleagues have studied the in vivo localisation and mobility of the E.coli RNaseE (a protein key for mRNA degradation in all bacteria) as well as the impact of two key protein segments (MTS and CTD) on RNase E cellular localisation and mobility. Such sequences are important to study since there is significant sequence diversity within bacteria, as well as a lack of clarity about their functional effects. Using single-molecule tracking in living bacteria, the authors confirmed that >90% of RNaseE localised on the membrane, and measured its diffusion coefficient. Via a series of mutants, they also showed that MTS leads to stronger membrane association and slower diffusion compared to a transmembrane motif (despite the latter being more embedded in the membrane), and that the CTD weakens membrane binding. The study also rationalised how the interplay of MTS and CTD modulate mRNA metabolism (and hence gene expression) in different cellular contexts. 

      Strengths: 

      The study uses powerful single-molecule tracking in living cells along with solid quantitative analysis, and provides direct measurements for the mobility and localisation of E.coli RNaseE, adding to information from complementary studies and other bacteria. The exploration of different membrane-binding motifs (both MTS and CTD) has novelty and provides insight on how sequence and membrane interactions can control function of protein-associated membranes and complexes. The methods and membrane-protein standards used contribute to the toolbox for molecular analysis in live bacteria. 

      Thank you for the nice summary of our work and positive comments about the paper’s strengths.

      Weaknesses: 

      The Results sections can be structured better to present the main hypotheses to be tested. For example, since it is well known that RNase E is membrane-localised (via its MTS), one expects its mobility to be mainly controlled by the interaction with the membrane (rather than with other molecules, such as polysomes and the degradosome). The results indeed support this expectation - however, the manuscript in its current form does not lay down the dominant hypothesis early on (see second Results chapter), and instead considers the rifampicin-addition results as "surprising"; it will be best to outline the most likely hypotheses, and then discuss the results in that light. 

      Thank you for this comment. We addressed this point by stating our main hypothesis from the beginning of the results section. We also agree with the reviewer that the membrane binding effect should be discussed first; hence, we re-arranged the result section. In the revised manuscript, we discuss the effect of membrane binding on diffusion first, followed by rif effects.

      Similarly, the authors should first discuss the different modes of interaction for a peripheral anchor vs a transmembrane anchor, outline the state of knowledge and possibilities, and then discuss their result; in its current version, the ms considers the LacY2 and LacY6 faster diffusion compared to MTS "remarkable", but considering the very different mode of interaction, there is no clear expectation prior to the experiment. In the same section, it would be good to see how the MD simulations capture the motion of LacY6 and LacY12, since this will provide a set of results consistent with the experimental set. 

      Thank you for pointing this out. In fact, there is little discussion in the literature about the different modes of interaction for a peripheral anchor vs a transmembrane anchor. To our knowledge, our work (experiments and MD simulations) is the first that directly compared the two to reveal that the peripheral anchor has higher interaction energy than the transmembrane anchor. We added a sentence “Despite the prevalence of peripheral membrane proteins, how they interact with the membrane and how this differs from TM proteins remain poorly understood”. Furthermore, we added the MD simulation result of LacY6 and LacY12 in Figure 4E-F.

      The work will benefit from further exploration of the membrane-RNase E interactions; e.g., the effect of membrane composition is explored by just using two different growth media (which on its own is not a well-controlled setting), and no attempts to change the MTS itself were made. The manuscript will benefit from considering experiments that explore the diversity of RNaseE interactions in different species; for example, the authors may want to consider the possibility of using the membrane-localisation signals of functional homologs of RNaseE in different bacteria (e.g., B. subtilis). It would be good to look at the effect of CTD deletions in a similar context (i.e., in addition to the MTS substitution by LacY2 and LacY6). 

      Thank you very much for this suggestion. During revision, we engineered point mutations in MTS and analyzed critical hydrophobic residues for membrane binding. We characterized MB% in both +/-CTD variants (Fig. 2 and Fig. S6) and their effect on lacZ mRNA degradation (Fig. 6). We will leave the use of membrane motif of B. subtilis RNase E for future study. 

      The manuscript will benefit from further discussion of the unstructured nature of the CTD, especially since the RNase CTD is well known to form condensates in Caulobacter crescentus; it is unclear how the authors excluded any roles for RNaseE phase separation in the mobility of RNaseE in E.coli cells. 

      Yes, we agree with the reviewer that the intrinsically disordered nature of the CTD might contribute to condensate formation. We explored this possibility using both epifluorescence microscopy (with a YFP fusion) and single-molecule imaging with cluster analysis (using an mEos3.2 fusion). Please see Figure S8. We did observe some weak de-clustering of RNase E upon CTD deletion. In the current study, we are unable to quantify the extent to which clustering contributes to the slow diffusion of RNase E. However, we speculate that the clustering may be linked to the low MB% of certain RNE mutants containing CTD, and we discussed this possibility in the Discussion.

      […] further supporting that the CTD decreases membrane association across RNE variants. We speculate that this effect may be related to the CTD’s role in promoting phase-separated ribonucleoprotein condensates, as observed in Caulobacter crescentus[19]. In E. coli, we also observed a modest increase in the clustering tendency of RNE compared to ΔCTD (Fig. S8). 

      Some statements in the Discussion require support with example calculations or toning down substantially. Specifically, it is not clear how the authors conclude that RNaseE interacts with its substrate for a short time (and what this time may actually be); further, the speculation about the MTS "not being an efficient membrane-binding motif for diffusion" lacks adequate support as it stands. 

      Thank you for these points. To elaborate our point on transient interaction between RNase E and RNA, we added a sentence “Specifically, if RNE interacts with mRNAs for ~20 ms or less, the slow-diffusing state would last shorter than the frame interval and remain undetected in our experiment.” Also, we added this sentence in the discussion.

      One possible explanation is that RNA-bound RNE (and RNase Y) is short-lived compared to our frame interval (~20 ms), unlike other RNA-binding proteins related to transcription and translation, interacting with RNA for ~1 min for elongation [48].

      Plus, we clarified the wording used in the second sentence that the reviewer pointed out as follows,

      Lastly, the slow diffusion of the MTS in comparison to LacY2 and LacY6 suggests that MTS is less favorable for rapid lateral motion in the membrane. 

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript by Troyer et al quantitatively measured the membrane localization and diffusion of RNase E, an essential ribonuclease for mRNA turnover as well as tRNA and rRNA processing in bacteria cells. Using single-molecule tracking in live E. coli cells, the authors investigated the impact of membrane targeting sequence (MTS) and the C-terminal domain (CTD) on the membrane localization and diffusion of RNase E under various perturbations. Finally, the authors tried to correlate the membrane localization of RNase E to its function on co- and post-transcriptional mRNA decay using lacZ mRNA as a model. 

      The major findings of the manuscripts include: 

      (1) WT RNase E is mostly membrane localized via MTS, confirming previous results. The diffusion of RNase E is increased upon removal of MTS or CTD, and more significantly increased upon removal of both regions. 

      (2) By tagging RNase E MTS and different lengths of LacY transmembrane domain (LacY2, LacY6, or LacY12) to mEos3.2, the results demonstrate that short LacY transmembrane sequence (LacY2 and LacY6) can increase the diffusion of mEos3.2 on the membrane compared to MTS, further supported by the molecular dynamics simulation. A similar trend was roughly observed in RNase E mutants with MTS switched to LacY transmembrane domains. 

      (3) The removal of RNase E MTS significantly increases the co-transcriptional degradation of lacZ mRNA, but has minimal effect on the post-transcriptional degradation of lacZ mRNA. Removal of CTD of RNase E overall decreases the mRNA decay rates, suggesting the synergistic effect of CTD on RNase E activity. 

      Strengths: 

      (1) The manuscript is clearly written with very detailed method descriptions and analysis parameters. 

      (2) The conclusions are mostly supported by the data and analysis. 

      (3) Some of the main conclusions are interesting and important for understanding the cellular behavior and function of RNase E. 

      Thank you for your thorough summary of our work and positive comments.

      Weaknesses: 

      (1) Some of the observations show inconsistent or context-dependent trends that make it hard to generalize certain conclusions. Those points are worth discussion at least. Examples include: 

      (a) The authors conclude that MTS segment exhibits reduced MB% when succinate is used as a carbon source compared to glycerol, whereas LacY2 segment maintains 100% membrane localization, suggesting that MTS can lose membrane affinity in the former growth condition (Ln 341-342). However, the opposite case was observed for the WT RNase E and RNase E-LacY2-CTD, in which RNase E-LacY2-CTD showed reduced MB% in the succinate-containing M9 media compared to the WT RNase E (Ln 264-267). This opposite trend was not discussed. In the absence of CTD, would the media-dependent membrane localization be similar to the membrane localization sequence or to the full-length RNase E? 

      This is a great point. Thank you for pointing out the discrepancy in the data. We think the weak membrane interaction of RNaseE-LacY2-CTD likely stems from structural instability in the presence of the CTD. Our data show that an RNase E variant with a cytoplasmic population under a normal growth condition exhibits a greater cytoplasmic fraction in a poor growth medium. In contrast, RNaseE-MTS and RNaseE-LacY2 lacking the CTD both showed 100% MB% under both normal and poor growth conditions. These results are presented in Figure S6 and further discussed in the Discussion section.

      The loss of MB% in LacY2-based RNE was observed only in the presence of the CTD (Fig. S6D), suggesting that the CTD negatively affects membrane binding of RNE, possibly by altering protein conformation. In fact, all ΔCTD RNE mutants we tested exhibited higher MB% than their CTD-containing counterparts (Fig. S6A-B). 

      (b) When using mEos3.2 reporter only, LacY2 and LacY6 both increase the diffusion of mEos3.2 compared to MTS. However, when inserting the LacY transmembrane sequence into RNase E or RNase E without CTD, only the LacY2 increases the diffusion of RNase E. This should also be discussed. 

      Thank you for raising this point. As the reviewer pointed out, as the membrane motifs, both LacY2 and LacY6 diffuse faster than the MTS, but when they are fused to RNE, only LacY2-based RNE diffuses faster than MTS-based RNE. We speculate that it is possibly due to a structural reason—having four (large) LacY6 in a tetrameric arrangement may cancel out the original fast-diffusing property of LacY6. We added this idea in the result section:

      This result may be due to the high TM load (24 helices) created by four LacY6 anchors in the RNE tetramer. Although all constructs are tetrameric, the 24-helix load (LacY6), compared with 8 (LacY2) and 4 (MTS), likely enlarges the membrane-embedded footprint and increases drag, thereby changing the mobility advantages assessed as standalone membrane anchors.

      (2) The authors interpret that in some cases the increase in the diffusion coefficient is related to the increase in the cytoplasm localization portion, such as for the LacY2 inserted RNase E with CTD, which is rational. However, the authors can directly measure the diffusion coefficient of the membrane and cytoplasm portion of RNase E by classifying the trajectories based on their localizations first, rather than just the ensemble calculation. 

      Thank you for this suggestion. Currently, because of the 2D projection effect in imaging, we cannot clearly distinguish, based on localization alone, which individual tracks come from the cytoplasm and which from the inner membrane. Therefore, we are unable to assign individual tracks as membrane-bound or cytoplasmic. However, we can demonstrate that the xNorm data can be separated into two different spatial populations based on the diffusion coefficient, D. That is, we can plot the xNorm of slow tracks vs the xNorm of fast tracks. This analysis showed that the slow tracks have LacY-like xNorm profiles while the fast tracks have LacZ-like xNorm profiles, quantitatively supporting our MB% fitting results. We have added this analysis to Figure S2.
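
      The slow-versus-fast split of xNorm described in this response can be sketched as below. The threshold on D, the xNorm range of -1 to 1, the bin count, and the example tracks are all assumptions for illustration; the response does not state how slow and fast tracks were defined.

```python
def split_xnorm_by_diffusivity(tracks, d_threshold):
    """Split per-track xNorm values into slow and fast groups by D.

    tracks is a list of (x_norm, d) pairs, one per track. The threshold
    separating slow from fast tracks is a hypothetical parameter.
    """
    slow = [x for x, d in tracks if d < d_threshold]
    fast = [x for x, d in tracks if d >= d_threshold]
    return slow, fast


def histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bin.
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical tracks: (xNorm, D). Slow tracks sit near the cell edges.
tracks = [(-0.9, 0.01), (0.85, 0.02), (0.1, 0.3), (-0.2, 0.25), (0.95, 0.015)]
slow, fast = split_xnorm_by_diffusivity(tracks, d_threshold=0.1)
slow_hist = histogram(slow, n_bins=4, lo=-1.0, hi=1.0)
fast_hist = histogram(fast, n_bins=4, lo=-1.0, hi=1.0)
print("slow:", slow_hist, "fast:", fast_hist)
```

      In this toy example the slow tracks populate the outer bins, resembling a membrane profile, while the fast tracks fall in the central bins, resembling a cytoplasmic profile.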

      (3) The error bars of the diffusion coefficient and MB% are all SEM from bootstrapping, which are very small. I am wondering how much of the difference is simply due to a batch effect. Were the data mixed from multiple biological replicates? The number of biological replicates should also be reported. 

      Thank you for raising this point. In the original manuscript, we reported the number of tracks analyzed and noted that all data were from at least three separate biological replicates (measurements were repeated on at least three different days). Furthermore, in the revised manuscript, we have provided the number of cells imaged in Table S6. 

      (4) Some figures lack p-values, such as Figures 4 and 5C-D. Also, adding p-values directly to the bar graphs will make it easier to read. 

      Thank you for checking these details. We added p values in the graphs showing k<sub>d1</sub> and k<sub>d2</sub> (Table S7).

      Reviewer #2 (Recommendations for the authors): 

      Minor and technical points: 

      (1) Clarity and flow will be improved if each section first highlights the objective for the experiments that are described (e.g., line 240). 

      Thank you for the suggestion. We addressed this point by editing the beginning of each subsection in the Results. 

      (2) Line 272 (and elsewhere)."1.33-times faster is wrong". The authors mean 33% faster (from 0.075 to 1, see Figure 4G), and not 133% faster. Needs fixing. 

      Thanks for pointing this out. We changed this as well as other instances where we describe fold differences. For example, this particular instance was changed to:

      Indeed, in the absence of the CTD, we found that the D of LacY2-based RNE was 1.33 ± 0.01 times as fast as the MTS-based RNE. 

      (3) The authors need to consider the fitting of two species on their D population. e.g., how will a 93% - 7% split between diffusive species would have looked for the distribution in S4B? Note also the L1 profile in Fig S4C - while it is not hugely different from Figure S4B, the analysis gives a 41% amplitude for the fast-diffusing species. The 2-species analysis can also be used on some of the samples with much higher cytoplasmic components. Further, tracks that are in the more central region can be analysed to see whether the fast-diffusing species increase in amplitude. 

      Thank you for this comment. The D histograms of L1 and RNase E show a dominant peak at around 0.015, but L1 has a residual population in the shoulder (note the difference between L1’s experimental data and the D1 fit, a yellow line in what is now Figure S3B). This residual shoulder population is absent in the D histogram of RNase E. We also performed the two-species analysis suggested by the reviewer and provided the result in Figure S3C. The analysis shows that the two-population fit (black line) is very close to the one-population fit (yellow line). While we agree with the reviewer that subpopulation analysis is helpful for other proteins that show <90% MB% (i.e., a significant cytoplasmic population of >10%), we found it more useful to divide the xNorm histogram into two populations based on diffusivity (rather than fitting two populations to the D histogram, which has no spatial information). This analysis, shown in Figure S2, supports our MB% fit results.

      (4) The authors suggest that the sequestration of RNaseE to the membrane limits its interaction with cytoplasmic mRNAs, and may increase mRNA lifetime. While this is true and supported by the authors' preprint (Ref15), it will also be good to consider (and discuss) that highly-transcribed regions are in the nucleoid periphery (and thus close to the membrane) and that ribosomes/polysomes are likewise predominantly peripheral (coregulation of transcription/translation) and membrane proximal. 

      This is an interesting point, which we appreciate very much. The lacZ gene, when induced, is shown to move to the nucleoid periphery (Yang et al. 2019, Nat Comm). Also, in our preprint (Ref 15), we engineered lacZ to be closer to the membrane by translationally fusing it to lacY. However, the degradation rate of lacZ mRNA was not enhanced by the proximity to the membrane (for both k<sub>d1</sub> and k<sub>d2</sub>). For lacZ mRNA, we mainly see the change in k<sub>d1</sub> when RNE localization changes. We think it is due to the slow diffusion of the nascent mRNA (attached to the chromosome) and the slow diffusion of membrane-bound RNE, such that regardless of the location of the nascent mRNA, the degradation by the membrane-bound RNE is inefficient. Only when RNE diffuses freely in the cytoplasm does it seem to increase k<sub>d1</sub> (the decay of nascent mRNAs).

      Reviewer #3 (Recommendations for the authors):

      (1) It will increase the clarity of the manuscript if the authors can provide better nomenclatures for different constructs, such as for different membrane targeting sequences fused to mEos3.2, full-length RNase E, or CDT truncated RNaseE. 

      Thank you for this suggestion. We agree that many constructs were discussed, and their naming can be confusing. To help with clarity, we have abbreviated RNase E as RNE throughout the text where appropriate. 

      (2) Line 342, Figure S7D should be cited instead of S6D. 

      Thank you for finding this error. We made a proper change in the revised manuscript.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors describe the results of a single study designed to investigate the extent to which horizontal orientation energy plays a key role in supporting view-invariant face recognition. The authors collected behavioral data from adult observers who were asked to complete an old/new face matching task by learning broad-spectrum faces (not orientation filtered) during a familiarization phase and subsequently trying to label filtered faces as previously seen or novel at test. This data revealed a clear bias favoring the use of horizontal orientation energy across viewpoint changes in the target images. The authors then compared different ideal observer models (cross-correlations between target and probe stimuli) to examine how this profile might be reflected in the image-level appearance of their filtered images. This revealed that a model looking for the best matching face within a viewpoint differed substantially from human data, exhibiting a vertical orientation bias for extreme profiles. However, a model forced to match targets to probes at different viewing angles exhibited a consistent horizontal bias in much the same manner as human observers.

      Strengths:

      I think the question is an important one: The horizontal orientation bias is a great example of a low-level image property being linked to high-level recognition outcomes, and understanding the nature of that connection is important. I found the old/new task to be a straightforward task that was implemented ably and that has the benefit of being simple for participants to carry out and simple to analyze. I particularly appreciated that the authors chose to describe human data via a lower-dimensional model (their Gaussian fits to individual data) for further analysis. This was a nice way to express the nature of the tuning function, favoring horizontal orientation bias in a way that makes key parameters explicit. Broadly speaking, I also thought that the model comparison they include between the view-selective and view-tolerant models was a great next step. This analysis has the potential to reveal some good insights into how this bias emerges and ask fine-grained questions about the parameters in their model fits to the behavioral data.

      We thank the reviewer for their positive appraisal of the importance of our research question as well as of the soundness of our approach to it.

      Weaknesses:

      I will start with what I think is the biggest difficulty I had with the paper. Much as I liked the model comparison analysis, I also don't quite know what to make of the view-tolerant model. As I understand the authors' description, the key feature of this model is that it does not get to compare the target and probe at the same yaw angle, but must instead pick a best match from candidates that are at different yaws. While it is interesting to see that this leads to a very different orientation profile, it also isn't obvious to me why such a comparison would be reflective of what the visual system is probably doing. I can see that the view-specific model is more or less assuming something like an exemplar representation of each face: You have the opportunity to compare a new image to a whole library of viewpoints, and presumably it isn't hard to start with some kind of first pass that identifies the best matching view first before trying to identify/match the individual in question. What I don't get about the view-tolerant model is that it seems almost like an anti-exemplar model: You specifically lack the best viewpoint in the library but have to make do with the other options. Again, this is sort of interesting and the very different behavior of the model is neat to discuss, but it doesn't seem easy to align with any theoretical perspective on face recognition. My thinking here is that it might be useful to consider an additional alternate model that doesn't specifically exclude the best-matching viewpoint, but perhaps condenses appearance across views into something like a prototype. I could even see an argument for something like the yaw-averages presented earlier in the manuscript as the basis for such a model, but this might be too much of a stretch. Overall, what I'd like to see is some kind of alternate model that incorporates the existence of the best-match viewpoint somehow, but without the explicit exemplar structure of the view-specific model.

      The view-tolerant model was designed so that identity needed to be abstracted away from variations in yaw to support face recognition. We believe this model aligns with the notion of tolerant recognition.

      The tolerance of identity recognition is presumably empowered by the internal representation of the natural statistics of identity, i.e. the stable traits and (idiosyncratic) variability of a face, which builds up through the varied encounters with a given face (Burton, Jenkins et al. 2005, Burton, Jenkins and Schweinberger 2011, Jenkins and Burton 2011, Jenkins, White et al. 2011, Burton, Kramer et al. 2016, Menon, Kemp and White 2018).

      The average of various images of a face provides its appearance distribution (i.e., variability) and central tendency (i.e., stable properties; Figure 1) and could be used as a reasonable proxy of its natural statistical properties (Burton, Jenkins et al. 2005). We thus believe that the alternate model proposed by the reviewer is relevant to existing theories of face identity recognition and agree that our current model observers do not fully capture this aspect. It is thus an excellent idea to examine the orientation tuning profile of a model observer that compares a specific view of a face to the average encompassing all views of a face identity. Since the horizontal range is proposed to carry the view-stable cues to identity, we expect that such a ‘viewpoint-average’ model observer will perform best with horizontally filtered faces and that its orientation tuning profile will significantly predict human performance across views. We expect the viewpoint-tolerant and viewpoint-average observers will behave similarly as they manifest the stability of the horizontal identity cues across variations in viewpoint.

      Besides this larger issue, I would also like to see some more details about the nature of the cross-correlation that is the basis for this model comparison. I mostly think I get what is happening, but I think the authors could expand more on the nature of their noise model to make more explicit what is happening before these cross-correlations are taken. I infer that there is a noise-addition step to get them off the ceiling, but I felt that I had to read between the lines a bit to determine this.

      The view-selective model responded correctly whenever successfully matching a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplemental with the graphic illustration of the d’ distributions of each model and human observers.
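
      The noise-addition step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes that RMS contrast is the standard deviation of pixel intensities on a 0 to 1 scale, that SNR is the ratio of face to noise RMS contrast, and that the noise is Gaussian. The function names are hypothetical.

```python
import math
import random


def rms_contrast(pixels):
    """RMS contrast, taken here as the standard deviation of pixel values."""
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def scale_to_rms(pixels, target_rms):
    """Return zero-mean pixel values rescaled to the requested RMS contrast."""
    mean = sum(pixels) / len(pixels)
    current = rms_contrast(pixels)  # must be non-zero (image is not uniform)
    return [(p - mean) * target_rms / current for p in pixels]


def noisy_stimulus(face, face_rms=0.01, noise_rms=0.08, rng=None):
    """Combine a face with one Gaussian noise pattern (noise type is assumed)."""
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in face]
    face_part = scale_to_rms(face, face_rms)
    noise_part = scale_to_rms(noise, noise_rms)
    return [f + n for f, n in zip(face_part, noise_part)]


# Values quoted in the response: face RMS .01, noise RMS .08.
snr = 0.01 / 0.08  # ratio of face to noise RMS contrast
```

      Under these assumptions the ratio .01 / .08 gives the reported SNR of .125, and calling `noisy_stimulus` ten times with different random generators would mimic the ten noise patterns per trial.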

      Another thing that I think is worth considering and commenting on is the stimuli themselves and the extent to which this may limit the outcomes of their behavioral task. The use of the 3D laser-scanned faces has some obvious advantages, but also (I think) removes the possibility for pigmentation to contribute to recognition, removes the contribution of varying illumination and expression to appearance variability, and perhaps presents observers with more homogeneous faces than one typically has to worry about. I don't think these negate the current results, but I'd like the authors to expand on their discussion of these factors, particularly pigmentation. Naively, surface color and texture seem like they could offer diagnostic cues to identity that don't rely so critically on horizontal orientations, so removing these may mean that horizontal bias is particularly evident when face shape is the critical cue for recognition.

      We indeed removed surface color by converting images to grayscale. While we acknowledge that the conversion to grayscale may have removed one potential source of surface information, it is unlikely that our stimuli fully eliminated the contribution of surface pigmentation in our study. Pigmentation refers to all surface reflectance properties (Russell, Sinha et al. 2006) and hue (color) is only one surface cue among others. The grayscaled 3D laser-scanned faces used here still contained natural variations in crucial surface cues such as skin albedo (i.e., how light or dark the surface appears) and texture (i.e., spatial variation in how light is reflected). Both color and grayscale stimuli (2D face pictures or 3D laser-scanned faces like ours) have actually been used to disentangle the role of shape and surface cues to identity recognition (e.g., Troje and Bulthoff 1996, Vuong, Peissig et al. 2005, Russell, Sinha et al. 2006, Russell, Biederman et al. 2007, Jiang, Dricot et al. 2009).

      More fundamentally, we demonstrated that the diagnosticity of the horizontal range of face information is not restricted to the transmission of shape cues. Our recent work has indeed shown that the processing of both face shape and surface most critically relies on horizontal information (Dumont, Roux-Sibilon and Goffaux 2024).

      Reviewer #2 (Public review):

      This study investigates the visual information that is used for the recognition of faces. This is an important question in vision research and is critical for social interactions more generally. The authors ask whether our ability to recognise faces, across different viewpoints, varies as a function of the orientation information available in the image. Consistent with previous findings from this group and others, they find that horizontally filtered faces were recognised better than vertically filtered faces. Next, they probe the mechanism underlying this pattern of data by designing two model observers. The first was optimised for faces at a specific viewpoint (view-selective). The second was generalised across viewpoints (view-tolerant). In contrast to the human data, the view-specific model shows that the information that is useful for identity judgements varies according to viewpoint. For example, frontal face identities are again optimally discriminated with horizontal orientation information, but profiles are optimally discriminated with more vertical orientation information. These findings show human face recognition is biased toward horizontal orientation information, even though this may be suboptimal for the recognition of profile views of the face.

      One issue in the design of this study was the lowering of the signal-to-noise ratio in the view-selective observer. This decision was taken to avoid ceiling effects. However, it is not clear how this affects the similarity with the human observers.

      The view-selective model responded correctly whenever successfully matching a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplemental with the graphic illustration of the d’ distributions of each model and human observers.

      Another issue is the decision to normalise image energy across orientations and viewpoints. I can see the logic in wanting to control for these effects, but this does reflect natural variation in image properties. So, again, I wonder what the results would look like without this step.

      Energy of natural images is disproportionately distributed across orientations (e.g., Hansen, Essock et al. 2003). Images of faces cropped from their background as used here contain most of their energy in the horizontal range (Keil 2009, Goffaux and Greenwood 2016, Goffaux 2019). If not normalized after orientation filtering, such uneven distribution of energy would boost recognition performance in the horizontal range across views. Normalization was performed across our experimental conditions merely to avoid energy from explaining the influence of viewpoint on the orientation tuning profile.

      We are not aware of any systematic natural variations of energy across face views. To address this, we measured face average energy (i.e., RMS contrast) in the original stimulus set, i.e., before the application of any image processing or manipulation. Background pixels were excluded from these image analyses. Across yaws, we found energy to range between .11 and .14 on a 0 to 1 grayscale. This is moderate compared to the range of energy variations we measured across identities (from .08 to .18). This suggests that variations in energy across viewpoints are moderate compared to variations related to identity. It is unclear whether these observations are specific to our stimulus set or whether they are generalizable to faces we encounter in everyday life. They, however, indicate that RMS contrast did not substantially vary across views in the present study and suggest that RMS normalization is unlikely to have affected the influence of viewpoint on recognition performance.
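
      The energy measurement described above can be illustrated with a short sketch. It is not the authors' analysis code: it assumes that RMS contrast is the standard deviation of the retained pixel intensities and that background pixels are excluded through a boolean mask. The function name and the mask representation are hypothetical.

```python
import math


def masked_rms_contrast(image, mask):
    """RMS contrast over face pixels only; mask[r][c] is True for face pixels."""
    values = [px for row, mrow in zip(image, mask)
              for px, keep in zip(row, mrow) if keep]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Ranges reported in the response (0 to 1 grayscale).
yaw_spread = 0.14 - 0.11        # variation across viewpoints
identity_spread = 0.18 - 0.08   # variation across identities
```

      With the reported ranges, the spread across yaws (about .03) is roughly a third of the spread across identities (about .10), which is the comparison the response relies on.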

      Nonetheless, we acknowledge the importance of this issue regarding the trade-off between experimental control and stimulus naturalness, and we will refer to it explicitly in the methods section.

      Despite the bias toward horizontal orientations in human observers, there were some differences in the orientation preference at each viewpoint. For example, frontal faces were biased to horizontal (90 degrees), but other viewpoints had biases that were slightly off horizontal (e.g., right profile: 80 degrees, left profile: 100 degrees). This does seem to show that differences in statistical information at different viewpoints (more horizontal information for frontal and more vertical information for profile) do influence human perception. It would be good to reflect on this nuance in the data.

      Indeed, human performance data indicates that while identity recognition remains tuned to horizontal information, horizontal tuning shows some variation across viewpoints. We primarily focused on the first aspect because of its direct relevance to our research objective, but also discussed the second aspect: with yaw rotation, certain non-horizontal morphological features such as the jaw line or nose bridge, etc. may increasingly contribute to identity recognition, whereas at frontal or near frontal views, features are mostly horizontally-oriented (e.g., Keil 2008, Keil 2009). We will relate this part of the discussion more explicitly to the observation of the fluctuation of the peak location as a function of yaw.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      The authors frequently refer to their predictions and theory as being causal, both in the manuscript and in their response to reviewers. However, causal inference requires careful experimental design, not just statistical prediction. For example, the claim that "algorithmic differences between those with BPD and matched healthy controls" are "causal" in my opinion is not warranted by the data, as the study does not employ experimental manipulations or interventions which might predictably affect parameter values. Even if model parameters can be seen as valid proxies to latent mechanisms, this does not automatically mean that such mechanisms cause the clinical distinction between BPD and CON, they could plausibly also refer to the effects of therapy or medication. I recommend that such causal language, also implicit to expressions like "parameter influences on explicit intentional attributions", is toned down throughout the manuscript.

      Thank you for this chance to be clearer in our language. Our models and paradigm introduce a form of temporal causality, given that latent parameter distributions are directly influenced by latent parameter estimates at a previous point in time (self-uncertainty and other uncertainty directly govern social contagion). Nevertheless, we appreciate the reviewer’s perspective and have now toned down the language to reflect this.

      Abstract:

      ‘Our model makes clear predictions about the mechanisms of social information generalisation concerning both joint and individual reward.’

      Discussion:

      ‘We can simulate this by modelling a framework that incorporates priors based on both self and a strong memory impression of a notional other (Figure S3).’

      ‘We note a strength of this work is the use of model comparison to understand algorithmic differences between those with BPD and matched healthy controls.’

      Although the authors have now outlined the study's aims much more clearly, there still is a lack of clarity with respect to the authors' specific hypotheses. I understand that their primary predictions about disruptions to self-other generalisation processes underlying BPD are embedded in the four main models that are tested, but it is still unclear what specific hypotheses the authors had about group differences with respect to the tested models. I recommend the authors specify this in the introduction rather than referring to prior work where the same hypotheses may have been mentioned.

      Thank you for this further critique, which has enabled us to refine our introduction more clearly. We have now edited our introduction to be more direct about our hypotheses, that these hypotheses are instantiated into formal models, and what our predictions were. We have also included a small section on how previous predictions from other computational assessments of BPD link to our exploratory work, and highlighted this throughout the manuscript.

      ‘This paper seeks to address this gap by testing explicitly how disruptions in self-other generalization processes may underpin interpersonal disruptions observed in BPD. Specifically, our hypotheses were: (i) healthy controls will demonstrate evidence for both self-insertion and social contagion, integrating self and other information during interpersonal learning; and (ii) individuals with BPD will exhibit diminished self-other integration, reflected in stronger evidence for observations that assume distinct self-other representations.

      We tested these hypotheses by designing a dynamic, sequential, three-phase Social Value Orientation (Murphy & Ackerman, 2014) paradigm—the Intentions Game—that would provide behavioural signatures assessing whether BPD differed from healthy controls in these generalization processes (Figure 1A). We coupled this paradigm with a lattice of models (M1-M4) that distinguish between self-insertion and social contagion (Figure 1B), and performed model comparison:

      M1. Both self-to-other (self-insertion) and other-to-self (social contagion) occur before and after learning
      M2. Self-to-other transfer only occurs
      M3. Other-to-self transfer only occurs
      M4. Neither transfer process, suggesting distinct self-other representations

      We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’
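
      The lattice of models M1 to M4 quoted above can be read as the four on/off combinations of two transfer processes. The sketch below only encodes that structure; it is not the authors' model code, and the dictionary keys and the `describe` helper are hypothetical names chosen for illustration.

```python
# The two transfer processes named in the text; each model switches them on or off.
PROCESSES = ("self_insertion", "social_contagion")

MODELS = {
    "M1": {"self_insertion": True, "social_contagion": True},
    "M2": {"self_insertion": True, "social_contagion": False},
    "M3": {"self_insertion": False, "social_contagion": True},
    "M4": {"self_insertion": False, "social_contagion": False},
}


def describe(name):
    """Return a one-line description of which transfers a model allows."""
    active = [p for p in PROCESSES if MODELS[name][p]]
    return f"{name}: " + (" + ".join(active) if active else "no transfer")
```

      Read this way, the group-level result reported in the response contrasts the fully integrated corner of the lattice (M1, best fit for CON) with the fully separated corner (M4, best fit for BPD).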

      Caveats should also be added about the exploratory nature of the many parameter group comparisons. If there are any predictions about group differences that can be made based on prior literature, the authors should make such links clear.

      Thank you for this. We have now included caveats in the text to highlight the exploratory nature of these group comparisons, and added direct links to relevant literature where able:

      Introduction

      ‘We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Model Comparison

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Exceedance Probability = 0.86; Figure 2A). This suggests CON participants are best fit by a model that fully integrates self and other when learning, whereas those with BPD are best explained as holding disintegrated and separate representations of self and other that do not transfer information back and forth.

      We first explore parameters between separate fits (see Methods). Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful (see Supplementary Materials). We refer to both types of analysis below.’

      Phase 2 analysis

      ‘Prior work predicts those with BPD should focus more intently on public social information, rather than private information that only concerns one party (Henco et al., 2020). In BPD participants, only new beliefs about the relative reward preferences – mutual outcomes for both player - of partners differed (see Fig 2E): new median priors were larger than median preferences in phase 1 (mean = -0.47; = -6.10, 95%HDI: -7.60, -4.60).’

      ‘Models of moral preference learning (Story et al., 2024) predicts that BPD vs non-BPD participants have more rigid beliefs about their partners. We found that BPD participants were equally flexible around their prior beliefs about a partner’s relative reward preferences (= -1.60, 95%HDI: -3.42, 0.23), and were less flexible around their beliefs about a partner’s absolute reward preferences (=-4.09, 95%HDI: -5.37, -2.80), versus CON (Figure 2B).’

      Phase 3 analysis

      ‘Prior work predicts that human economic preferences are shaped by observation (Panizza, et al., 2021; Suzuki et al. 2016; Yu et al, 2021), although little-to-no work has examined whether contagion differs for relative vs. absolute preferences. Associative models predict that social contagion may be exaggerated in BPD (Ereira et al., 2018).… As a whole, humans are more susceptible to changing relative preferences more than selfish, absolute reward preferences, and this is disrupted in BPD.’

      Psychometric and Intentional Attribution analysis

      ‘Childhood trauma, persecution, and poor mentalising in BPD are all predicted to disrupt one’s ability to change (Fonagy & Luyten, 2009).’

      ‘Prior work has also predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021).’

      I'm not sure I understand why the authors, after adding multiple comparison correction, now list two kinds of p-values. To me, this is misleading and precludes the point of multiple comparison corrections, I therefore recommend they report the FDR-adjusted p-values only. Likewise, if a corrected p-value is greater than 0.05 this should not be interpreted as a result.

      We have now adjusted the exploratory results to include only the FDR corrected values in the text.

      ‘We assessed conditional psychometric associations with social contagion under the assumption of M3 for all participants. We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).

      Prior work has predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021). We tested parameter influences on explicit intentional attributions in Phase 2 while controlling for group status. Attributions included the degree to which they believed their partner was motivated by harmful intent (HI) and self-interest (SI). In accordance with prior work (Barnby et al., 2022), greater disparity of absolute preferences before learning was associated on a trend level with reduced attributions of SI (<= -0.23, p[fdr]=0.08), and greater disparity of relative preferences before learning exaggerated attributions of HI (= 0.21, p[fdr]=0.08), but did not survive correction (Figure S4B). This is likely due to partners being significantly less individualistic and prosocial on average compared to participants (= -5.50, 95%HDI: -7.60, -3.60; = 12, 95%HDI: 9.70, 14.00); partners are recognised as less selfish and more competitive.’
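
      The FDR correction mentioned in the revised text can be sketched as follows. The response does not say which FDR procedure was used, so the Benjamini-Hochberg step-up adjustment shown here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

      Reporting only the adjusted values, as the reviewer requested, means a test counts as a result only if its adjusted p-value falls below the chosen threshold.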

      Can the authors please elaborate why the algorithm proposed to be employed by BPD is more 'entropic', especially given both their self-priors and posteriors about partners' preferences tended to be more precise than the ones used by CON? As far as I understand, there's nothing in the data to suggest BPD predictions should be more uncertain. In fact, this leads me to wonder, similarly to what another reviewer has already suggested, whether BPD participants generate self-referential priors over others in the same way CON participants do, they are just less favourable (i.e., in relation to oneself, but always less prosocial) - I think there is currently no model that would incorporate this possibility? It should at least be possible to explore this by checking if there is any statistical relationship between the estimated θ_ppt^m and p(θ_par | D^0).

      Thank you for this opportunity to be clearer in our wording. We believe the reviewer is referring to this line in the discussion: ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference.’

      We note in the revised Figure 2 panel E and in the results that those with BPD under M4 show insertion along absolute reward (they still expect diminished selfishness in others), but neutral priors over relative reward (around 0, suggesting expectations of neither prosocial nor competitive tendencies of others). Thus, θ_ppt^m (self preference) and θ_par^m (other preference) are tightly associated for absolute, but not relative reward.

      In our wording, we meant that whether under model M4 or M1, those with BPD either show a neutral prior over relative reward (M4) or a prior with large variance over relative reward (M1), showing expectations of difference between themselves and their partner. In both cases, expectation about a partner’s absolute reward preferences is diminished vs. CON participants. We have strengthened our language in the discussion to clarify this:

      ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in uncertainty, whether through a neutral central tendency (M4) or large variance (M1) prior over relative reward in phase 2, and emphasises a less certain and reliable expectation about others.’

      "To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired trait mentalising" - I don't understand what the authors mean by this, can they please elaborate and add some explanation to the main text?

      We have now clarified this in the text:

      ‘Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      I noted that at least some of the newly added references have not been added to the bibliography (e.g., Hitchcock et al. 2022).

      Thank you for noticing this omission. We have now ensured all cited works are in the reference list.

      Reviewer 2:

      The paper is not based on specific empirical hypotheses formulated at the outset, but, rather, it uses an exploratory approach. Indeed, the task is not chosen in order to tackle specific empirical hypotheses. This, in my view, is a limitation since the introduction reads a bit vague and it is not always clear which gaps in the literature the paper aims to fill. As a further consequence, it is not always clear how the findings speak to previous theories on the topic.

      As I wrote in the public review, however, I believe that an important limitation of this work is that it was not based on testing specific empirical hypotheses formulated at the outset, and on selecting the experimental paradigm accordingly. This is a limitation because it is not always clear which gaps in the literature the paper aims to fill. As a consequence, although it has improved substantially compared to the previous version, the introduction remains a bit vague. As a further consequence, it is not always clear how the findings speak to previous theories on the topic. Still, despite this limitation, the paper has many strengths, and I believe it is now ready for publication.

      Thank you for this further critique. We appreciate your appraisal that the work has improved substantially and is ready for publication. We have nevertheless opted to clarify our introduction and a priori predictions throughout the manuscript (please see response to Reviewer 1).

      Reviewer 3:

      Although the authors note that their approach makes "clear and transparent a priori predictions," the paper could be improved by providing a clear and consolidated statement of these predictions so that the results could be interpreted vis-a-vis any a priori hypotheses.

      In line with comments from both Reviewer 1 and 2, we have clarified our introduction to make it clear what our a priori predictions and hypotheses are about our core aims and exploratory analyses (see response to Reviewer 1).

      The approach of using a partial correlation network with bootstrapping (and permutation) was interesting, but the logic of the analysis was not clearly stated. In particular, there are large group (Table 1: CON vs. BPD) differences in the measures introduced into this network. As a result, it is hard to understand whether any partial correlations are driven primarily by mean differences in severity (correlations tend to be inflated in extreme groups designs due to the absence of observation in middle of scales forming each bivariate distribution). I would have found these exploratory analyses more revealing if group membership was controlled for.

      Thank you for this chance to be clearer in our methods. We have now written a more direct exposition of this exploratory method:

      ‘Exploratory Network Analysis

      To understand the individual differences of trait attributes (MZQ, RGPTSB, CTQ) with other-to-self information transfer () across the entire sample we performed a network analysis (Borsboom, 2021). Network analysis allows for conditional associations between variables to be estimated; each association is controlled for by all other associations in the network. It also allows for visual inspection of the conditional relationships to get an intuition for how variables are interrelated as a whole (see Fig S11). We implemented network analysis with the bootnet package in R using the ‘estimateNetwork’ function with partial correlations (Epskamp, Borsboom & Fried, 2018). To assess the stability of the partial correlations we further implemented bootstrap resampling with 5000 repetitions using the ‘bootnet’ function. We then additionally shuffled the data and refitted the network 5000 times to determine a p_permuted value; this indicates the probability that a conditional relationship in the original network was within the null distribution of each conditional relationship. We then performed False Discovery Rate correction on the resulting p-values. We additionally controlled for group status for all variables in a supplementary analysis (Table S4).’
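The logic of this exploratory analysis can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the bootnet pipeline used for the paper: it computes partial correlations from the inverse correlation matrix, obtains permutation p-values by shuffling each column independently (an assumed shuffling scheme), and applies Benjamini-Hochberg FDR correction. The simulated variables, the function names, and the reduced number of permutations are all hypothetical.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse of the correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    scale = np.sqrt(np.outer(np.diag(precision), np.diag(precision)))
    pcor = -precision / scale
    np.fill_diagonal(pcor, 1.0)
    return pcor

def permutation_pvalues(data, n_perm=200, seed=0):
    """Two-sided permutation p-value per edge (assumption: columns shuffled independently)."""
    rng = np.random.default_rng(seed)
    observed = np.abs(partial_correlations(data))
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        exceed += np.abs(partial_correlations(shuffled)) >= observed
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for a 1-D sequence of p-values."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Hypothetical simulated data: x and y are related, z is independent of both.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(size=200)
z = rng.normal(size=200)
data = np.column_stack([x, y, z])

pcor = partial_correlations(data)
upper = np.triu_indices(3, k=1)                 # edges (x,y), (x,z), (y,z)
p_perm = permutation_pvalues(data)[upper]
p_fdr = benjamini_hochberg(p_perm)
```

Controlling for group status, as in the supplementary analysis, would amount to adding a group indicator as a further column before inverting the correlation matrix; that step is omitted here.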

      We have also further corrected for group status and reported these results as a supplementary table, and also within the main text alongside the main results. We have opted to relegate Figure 4 into a supplementary figure to make the text clearer.

      ‘We explored conditional psychometric associations with social contagion under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      Discussion first para: "effected -> affected"

      Thanks for spotting this. We have now changed it.

      Add "s" to "participant": "Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participant."

      We have now changed this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. Obviously, we agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – would be found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus) (also in the same hemisphere). Nevertheless, we clearly acknowledge this limitation in the (revised) version of the manuscript (p.8: Material and Methods section).

      Note that with respect to the fMRI data, our results are not surprising, as we indicate in the manuscript: BOLD increases to faces (relative to nonface objects) are typically found in the LatFG and BOLD decreases in the medialFG (in the revised version, we have added the reference to an early neuroimaging paper that describes this dissociation clearly):

      Pelphrey, K. A., Mack, P. B., Song, A., Güzeldere, G., & McCarthy, G. Faces evoke spatially differentiated patterns of BOLD activation and deactivation. Neuroreport 14, 955–959 (2003).

      This pattern of increase/decrease in fMRI can be appreciated in both patients in Figure 2, although one has to consider both the transverse and coronal slices to appreciate it.

      Regarding electrophysiological data, in the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the revised manuscript (p.5).

      As the reviewer is certainly aware, the number and positions of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring (macro-micro hybrid) electrodes, one with microelectrodes ending up in the lateral MidFG, the other in the medial MidFG, in the same patient. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Finally, information regarding ethics approval has been provided in the paper.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and face-negative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. Where we respectfully disagree is that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. Where we agree with the reviewer is that the absolute suppression could be more documented: it wasn’t, due to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces (P2), as also requested in the recommendations to authors. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time-domain, relative to face onset, the response drops below baseline level. It means that this neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. We cannot tell with this technique if neurons showing relative vs. absolute suppression of responses are spatially segregated for instance (e.g., forming two separate sub-regions) or are intermingled. And we cannot tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain” has been rephrased in the revised manuscript: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain”.

      Reviewer #3 (Public review):

      In this paper the authors conduct two experiments: an fMRI experiment and intracranial recordings of neurons in two patients P1 and P2. In both experiments, they employ a SSVEP paradigm in which they show images at a fast rate (e.g. 6Hz) and then they show face images at a slower rate (e.g. 1.2Hz), where the rest of the images are a variety of object images. In the first patient, they record from neurons over a region in the mid fusiform gyrus that is face-selective and in the second patient, they record neurons from a region more medially that is not face selective (it responds more strongly to objects than faces). Results find similar selectivity between the electrophysiology data and the fMRI data in that the location which shows higher fMRI to faces also finds face-selective neurons and the location which finds preference to non faces also shows non face preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category selectivity measured from fMRI. The data is unique as it contains a lot of single and multiunit recordings (245 units) from the human fusiform gyrus - which the authors point out - is a hominoid-specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; Thus, more information (results and methods) is warranted; and in particular there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means paucity of data presentation, we agree and we provide more presentation below, although the methods and results information appear as complete to us. The comparison between fMRI and SEEG is there, but can only be indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance). In addition, our manuscript aims at providing a short empirical contribution to further our understanding of the relationship between neural responses and BOLD signal, not to provide a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated in Figure 2. We mention it in the manuscript, but we agree that we do not present illustrations of such a decrease in the time-domain for SU, which we did not consider as being necessary initially (please see below for such presentation).

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We respond to more specific requests for additional evidence below, but the reviewer should be aware that this is a short report, which reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data is made available at:

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above, the comparison between fMRI and SEEG is indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance) and would not allow us to build such a model.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different than the average response across neurons; This raises a concern about interpreting the average response across neurons in panel F which seems different than the single neuron responses

      The reviewer is correct, and we apologize for the confusion. This is because the average data on panel F has been notch-filtered for the 6 Hz (and harmonic responses), as indicated in the methods (p.11): ‘a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)’.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For sake of clarity, we prefer presenting the notch-filtered data in the paper, but the revised version makes it clear in the figure caption that the average data has been notch-filtered.
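The notch filtering described above can be sketched as follows in Python with NumPy. This is an illustrative sketch only, not the authors' analysis code: the 100 Hz sampling rate, the synthetic test signal, the function names, and the reading of "filter width = 0.05 Hz" as a total width centred on each notch frequency are all assumptions.

```python
import numpy as np

def fft_notch(signal, fs, notch_freqs=(6.0, 12.0, 18.0), width=0.05):
    """Zero the FFT bins within +/- width/2 of each notch frequency, then invert."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    for f0 in notch_freqs:
        spectrum[np.abs(freqs - f0) <= width / 2] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def amplitude_at(x, fs, f0):
    """Single-sided FFT amplitude at the bin closest to f0."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    amps = 2.0 * np.abs(np.fft.rfft(x)) / len(x)
    return amps[np.argmin(np.abs(freqs - f0))]

# Hypothetical 70 s time series: a general visual response at 6 Hz
# plus a smaller face-selective response at 1.2 Hz.
fs = 100.0
t = np.arange(int(fs * 70)) / fs
signal = np.sin(2 * np.pi * 6.0 * t) + 0.5 * np.sin(2 * np.pi * 1.2 * t)

filtered = fft_notch(signal, fs)
amp_6hz_after = amplitude_at(filtered, fs, 6.0)      # 6 Hz component removed
amp_face_after = amplitude_at(filtered, fs, 1.2)     # 1.2 Hz component preserved
```

With a 70 s sequence the frequency resolution is 1/70 Hz, so a 0.05 Hz notch removes only a few bins around each target frequency and leaves the 1.2 Hz response and its neighbouring bins untouched.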

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogeneous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.

      (g) Related to (e) In total the authors recorded data from 245 units (some single units and some multiunits) and they found that both in the face and nonface selective regions most of the recorded neurons exhibited face-selectivity, which this reader found confusing: They write “Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)”. Is the face selectivity in P1 an increase in response to faces and P2 a reduction in response to faces or in both it’s an increase in response to faces?

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and P2 a reduction in response to faces.

      Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also attempted to apply the FFT on spike trains that had been temporally smoothed by convolving them with a 20ms square window (Le Cam et al., 2023, cited in the paper). This did not change the outcome of the frequency analyses in the frequency range we are interested in. We have also added one sentence with information in the methods section about spike detection (p.10).
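To illustrate how a frequency analysis applies to discrete spike data, here is a minimal Python/NumPy sketch (the analysis reported above used Matlab). It treats a spike train as a sampled binary time series, computes its FFT amplitude spectrum, and repeats the computation after smoothing with a 20 ms square window. The simulated neuron, the 1 kHz sampling rate (the recordings were sampled at 30 kHz), and all function names are hypothetical.

```python
import numpy as np

fs = 1000.0                      # assumed sampling rate of the binary spike train (Hz)
duration = 70.0                  # sequence length in seconds
t = np.arange(int(fs * duration)) / fs

# Hypothetical neuron whose firing rate is modulated at the 6 Hz stimulation frequency.
rate_hz = 20.0 + 15.0 * np.cos(2 * np.pi * 6.0 * t)
rng = np.random.default_rng(0)
spikes = (rng.random(t.size) < rate_hz / fs).astype(float)   # 1 = spike in this sample

def amplitude_spectrum(train, fs):
    """Single-sided FFT amplitude spectrum of a sampled (here binary) time series."""
    amps = 2.0 * np.abs(np.fft.rfft(train - train.mean())) / train.size
    freqs = np.fft.rfftfreq(train.size, d=1.0 / fs)
    return freqs, amps

def peak_frequency(freqs, amps, fmin=1.0, fmax=30.0):
    """Frequency with the largest amplitude inside [fmin, fmax]."""
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(amps[band])]

# FFT applied directly to the spike train.
freqs, amps = amplitude_spectrum(spikes, fs)
peak_raw = peak_frequency(freqs, amps)

# Same analysis after smoothing with a 20 ms square window.
n_win = int(round(0.020 * fs))
smoothed = np.convolve(spikes, np.ones(n_win) / n_win, mode="same")
freqs_s, amps_s = amplitude_spectrum(smoothed, fs)
peak_smooth = peak_frequency(freqs_s, amps_s)
```

In this simulation both versions peak at the 6 Hz stimulation frequency: a 20 ms square window attenuates low frequencies only slightly, which is consistent with the statement that smoothing did not change the outcome in the frequency range of interest.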

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0ms, the stimulus has 0% contrast). 100% contrast is reached at half a stimulation cycle, which is 83.33ms here, but a response is likely triggered before reaching 100% contrast. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences either as a sinewave contrast (black line) modulation or as a squarewave (i.e. abrupt) contrast modulation (red line). The 33ms value is based on these LFP data obtained in response to such sinewave stimulation and squarewave stimulation of the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33ms per frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).

      Author response image 4.
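The arithmetic behind the 33 ms shift can be checked with a short Python sketch. It assumes the sinewave contrast modulation follows a raised-cosine profile, 0.5 × (1 − cos(2πft)), which starts at 0% contrast and reaches 100% at half a cycle; the exact waveform used in the experiment is not stated above, and the function and variable names are hypothetical.

```python
import math

def sinewave_contrast(t_ms, stim_rate_hz=6.0):
    """Contrast (0 to 1) of a sinusoidally modulated stimulus, t_ms after cycle onset.

    Assumes a raised-cosine profile: 0 at onset, 1 at half a stimulation cycle.
    """
    phase = 2.0 * math.pi * stim_rate_hz * t_ms / 1000.0
    return 0.5 * (1.0 - math.cos(phase))

frame_ms = 1000.0 / 120.0            # one refresh frame at 120 Hz, about 8.33 ms
shift_ms = 4 * frame_ms              # four frames, about 33.33 ms
half_cycle_ms = 1000.0 / 6.0 / 2.0   # half a 6 Hz cycle, about 83.33 ms

contrast_at_shift = sinewave_contrast(shift_ms)        # about 0.35
contrast_at_half = sinewave_contrast(half_cycle_ms)    # full contrast
```

Under this assumed waveform, four frames after onset the phase is 72 degrees, giving a contrast of roughly 35%, in line with the value quoted above.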

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions: faces and objects and has no baseline; In other words, responses to faces are measured relative to the baseline response to objects so that any region that contains neurons that have a lower firing rate to faces than objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g. blank screen, with no visual stimulation) this experimental design cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preferences (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting, it directly adapted the lab's previous work on the observational learning effect on disadvantageous inequality aversion, to test both advantageous and disadvantageous inequality aversion in the current study. Social transmission of action, emotion, and attitude have started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined the vicarious inequality aversion in conditions where feedback was never provided, is interesting and important to strengthen the reported effects. Both studies have proper justifications to determine the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be interchangeable, which should not, in this work: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback if that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. The intro and set are heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think either tone down the focus on vicarious learning, or discuss how they are different. Some of the references here may be helpful: (Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020)

      We are appreciative of the Reviewer for raising this question and providing the reference. In response to this comment we have elected to avoid, in most cases, use of the term ‘vicarious’ and instead focus the paper on learning of others’ preferences (without specific commitment to vicarious/observational learning per se). These changes are reflected throughout all sections of the revised manuscript, and in the revised title. We believe this simplified terminology has improved the clarity of our contribution.

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible to have either 70:30 or 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      We thank the Reviewer for pointing this out. The uniformly distributed noise was added to all three phases to make the proposers’ behavior more realistic. This added noise was rounded to integer numbers, constrained from -9 to 9, which means in both 70:30 and 90:10 offer types, an 80:20 split could not occur. We have made this feature of our design clear in the Method section Line 524 ~ 528:

      “In all task phases, we added uniformly distributed noise to each trial’s offer (ranging from -9 to 9, inclusive, rounding to the nearest integer) such that the random amount added (or subtracted) from the Proposer’s share was subtracted (or added) to the Receiver’s share. We adopted this manipulation to make the proposers’ behavior appear more realistic. The orders of offers participants experienced were fully randomized within each experiment phase. ”
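      The noise manipulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `noisy_offer` is hypothetical, and drawing integers directly from -9 to 9 is a simplification of "uniform noise rounded to the nearest integer" (rounding a continuous draw would make the endpoints slightly less likely).

```python
import random

def noisy_offer(proposer_share, rng, max_noise=9):
    """Return (proposer, receiver) shares after adding integer noise.

    Assumption: noise is drawn uniformly from the integers
    -max_noise..max_noise (inclusive) and applied with opposite sign
    to the two shares, so the total always stays at 100.
    """
    noise = rng.randint(-max_noise, max_noise)  # inclusive on both ends
    proposer = proposer_share + noise
    receiver = 100 - proposer
    return proposer, receiver

rng = random.Random(0)
# Collect the proposer shares that can arise from the two Dis-I offer types.
shares_from_70 = {noisy_offer(70, rng)[0] for _ in range(10_000)}
shares_from_90 = {noisy_offer(90, rng)[0] for _ in range(10_000)}
```

      Under these assumptions, a 70:30 offer can reach at most 79:21 and a 90:10 offer can fall to at least 81:19, so an 80:20 split never appears and every displayed offer maps to exactly one offer type.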

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order especially for the learning phase can largely impact the preference learning of the participants.

      We agree with the Reviewer that the order in which offers are experienced could be very important. The order of the conditions was randomized independently for each participant (i.e. each participant experienced different trial sequences). We made this point clear in the Methods part. Line 527 ~ 528:

      “The orders of offers participants experienced were fully randomized within each experiment phase.”

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, the blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      We thank the Reviewer for pointing out this feature of the results. Prompted by this comment, we compared the baseline rejection rates between the two conditions for these two offer types, finding in Experiment 1 that rejection rates in the DI-AI-averse condition were significantly higher than in the DI-averse condition (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.13, p < 0.001, Offer 70:30, β = 0.09, p < 0.034). We agree with the Reviewer that there should, in principle, be no difference, because the experiences of participants in these two conditions are identical in the Baseline phase. However, we did not observe this difference in baseline preferences in Experiment 2 (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.07, p < 0.100, Offer 70:30, β = 0.05, p < 0.193). On the basis of the inconsistency of this effect across studies, we believe this is a spurious difference in preferences stemming from chance.

      Regarding the LME results, the reason why only interaction terms are reported is due to the specification of the model and the rationale for testing.

      Taking the model reported in Table S3 as an example (a logistic model which examines Baseline phase rejection rates as a function of offer level and condition), the between-subject conditions (DI-averse and DI-AI-averse) are represented by dummy-coded variables. Similarly, offer types were also dummy-coded, such that each of the five columns (90:10, 70:30, 50:50, 30:70, and 10:90) corresponded to a particular offer type. This model specification yields ten interaction terms (i.e., fixed effects) of interest; for example, the “DI-averse × Offer 90:10” term indicates baseline rejection rates for 90:10 offers in the DI-averse condition. Thus, to compare rejection rates across specific offer types, we estimate and report linear contrasts between these resultant terms. We have clarified the nature of these reported tests in our revised Results, for example, Line 189-190: “linear contrasts; e.g. 90:10 vs 10:90, all Ps<0.001, see Table S3 for logistic regression coefficients for rejection rates).”

      Also, in response to this comment and a recommendation from Reviewer 2 (see below), we have revised our supplementary materials to make each model specification clearer, as in SI line 25:

      “RejectionRate ~ 0 + (Disl + Advl):(Offer10 + Offer30 + Offer50 + Offer70 + Offer90) + (1|Subject)”

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, the participants' behavior in the first 5 trials in the learning phase will be similar to those in the baseline; and their behavior in the last 5 trials in the learning phase would echo those at the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      Thanks for pointing this out. Also, considering the comments from Reviewer 2 concerning the interpretation of this analysis, we have elected to remove this result from our revision.

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model. This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      We agree with the Reviewer that a simplified F-S model could be used, in principle, to characterize Baseline and Transfer phase behavior, but it is our view that the rejection rates provide readers with the clearest (and simplest) picture of how participants are responding to inequity. Put another way, we believe that the added complexity of using (and explaining) a new model to characterize simple, steady-state choice behavior (within these phases) would not be justified or add appreciable insights about participants’ behavior.

      (7) I quite liked Study 2 which tests the generalization effect, and I expected to see an adapted computational modeling to directly reflect this idea. Indeed, the authors wrote, "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors advise their model to accommodate Study 2? The authors also ran simulation for the learning phase in Study 2 (Figure 6), and how did the preference update (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/Unpacking the computational modeling results for Study 2 will be very helpful for the paper.

      We are appreciative of the Reviewer’s positive impression of Experiment 2. Upon reflection, we realize that our original submission was not clear about the modeling done in Experiment 2, and we should clarify here that we did also fit the Preference Inference model to this dataset. As in Experiment 1, this model assumes that the participants have a representation of the teacher’s preference as a Fehr-Schmidt form utility function and infer the Teacher’s Envy and Guilt parameters through learning. The model indicates that, on the basis of experience with the Teacher’s preferences on moderately unfair offers (i.e., offer 70:30 and offer 30:70), participants can successfully infer these two parameters and, in turn, compute Fehr-Schmidt utility to guide their decisions in the extreme unfair offers (i.e., offer 90:10 and offer 10:90).

      In response to this comment, we have made this clearer in our Results (Line 377-382):

      “Finally, following Experiment 1, we fit a series of computational models of Learning phase choice behavior, comparing the goodness-of-fit of the four best-fitting models from Experiment 1 (see Methods). As before, we found that the Preference Inference model provided the best fit of participants’ Learning Phase behavior (Figure S1a, Table S12). Given that this model is able to infer the Teacher’s underlying inequity-averse preferences (rather than learns offer-specific rejection preferences), it is unsurprising that this model best describes the generalization behavior observed in Experiment 2.”

      and in our revised Methods (Line 551-553):

      “We considered 6 computational models of Learning Phase choice behavior, which we fit to individual participants’ observed sequences of choices, in both Experiments 1 and 2, via Maximum Likelihood Estimation”
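      For readers unfamiliar with the Fehr-Schmidt form referred to in this response, a minimal sketch of the standard two-player version follows. It is an illustration rather than the manuscript's own equation: the exact parameterization used there (equation 7) is not reproduced in this response and may scale inequity differently, and the function name `fehr_schmidt_utility` is hypothetical.

```python
def fehr_schmidt_utility(own, other, envy, guilt):
    """Standard two-player Fehr-Schmidt utility of accepting a split.

    envy  (alpha) penalizes disadvantageous inequity (other > own).
    guilt (beta)  penalizes advantageous inequity (own > other).
    Assumption: this is the textbook form, not necessarily the exact
    equation used in the manuscript.
    """
    disadvantage = max(other - own, 0)
    advantage = max(own - other, 0)
    return own - envy * disadvantage - guilt * advantage

# Receiver's utility for a disadvantageous (70:30) and an
# advantageous (30:70) split, with illustrative parameter values.
u_dis = fehr_schmidt_utility(own=30, other=70, envy=0.5, guilt=0.3)
u_adv = fehr_schmidt_utility(own=70, other=30, envy=0.5, guilt=0.3)
```

      Because the same two parameters determine utility for every split, a learner who infers them from moderately unfair offers obtains predictions for extreme offers without additional feedback, which is the generalization property discussed above.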

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments demonstrate that individuals can learn to reject such offers through vicarious learning - by observing and acting in line with a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles a critical topic relevant to social norms, morality, and justice. The findings, which show that individuals can adopt just and fair norms even at a personal cost, are promising. The study is well-situated in the literature, with clever experimental design and a computational approach that may offer insights into latent cognitive processes. Findings have potential implications for policymakers.

      Weaknesses:

      Note: in the text below, the "teacher" will refer to the agent from which a participant presumably receives feedback during the learning phase.

      (1) Focus on Disadvantageous Inequity (DI): A significant portion of the paper focuses on responses to Disadvantageous Inequitable (DI) offers, which is confusing given the study's primary aim is to examine learning in response to Advantageous Inequitable (AI) offers. The inclusion of DI offers is not well-justified and distracts from the main focus. Furthermore, the experimental design seems, in principle, inadequate to test for the learning effects of DI offers. Because both teaching regimes considered were identical for DI offers the paradigm lacks a control condition to test for learning effects related to these offers. I can't see how an increase in rejection of DI offers (e.g., between baseline and generalization) can be interpreted as speaking to learning. There are various other potential reasons for an increase in rejection of DI offers even if individuals learn nothing from learning (e.g. if envy builds up during the experiment as one encounters more instances of disadvantageous fairness).

      We are appreciative of the Reviewer’s insight here and for the opportunity to clarify our experimental logic. We included DI offers in order 1) to expose participants to the full spectrum of offer types, and avoid focusing participants exclusively upon AI offers, which might result in a demand characteristic, and 2) to afford exploration of how learning dynamics might differ in DI contexts (which was, to some extent, examined in our previous study; FeldmanHall, Otto, & Phelps, 2018) versus AI contexts. Furthermore, as this work builds critically on our previous study, we reasoned that replicating these original findings (in the DI context) would be important for demonstrating the generality of the learning effects in the DI context across experimental settings. We now remark on this point in our revised Introduction Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences are acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      (2) Statistical Analysis: The analysis of the learning effects of AI offers is not fully convincing. The authors analyse changes in rejection rates within each learning condition rather than directly comparing the two. Finding a significant effect in one condition but not the other does not demonstrate that the learning regime is driving the effect. A direct comparison between conditions is necessary for establishing that there is a causal role for the learning regime.

      We agree with the Reviewer and upon reflection, believe that direct comparisons between conditions would be helpful to support the claim that the different learning conditions are responsible for the observed learning effects. In brief, these specific tests buttress the idea that exposure to AI-averse preferences result in increases in AI punishment rates in the Transfer phase (over and above the rates observed for participants who were only exposed to DI-averse preferences).

      Accordingly, our revision now reports statistics concerning the differences between conditions for AI offers in Experiment 1 (Line 198~ 207):

      “Importantly, when comparing these changes between the two learning conditions, we observed significant differences in rejection rates for Adv-I offers: compared to exposure to a Teacher who rejected only Dis-I offers, participants exposed to a Teacher who rejected both Dis-I and Adv-I offers were more likely to reject Adv-I offers and rated these offers as more unfair. This difference between conditions was evident in both 30:70 offers (Rejection rates: β(SE) = 0.10(0.04), p = 0.013; Fairness ratings: β(SE) = -0.86(0.17), p < 0.001) and 10:90 offers (Rejection rates: β(SE) = 0.15(0.04), p < 0.001, Fairness ratings: β(SE) = -1.04(0.17), p < 0.001). As a control, we also compared rejection rates and fairness rating changes between conditions in Dis-I offers (90:10 and 70:30) and Fair offers (i.e., 50:50) but observed no significant difference (all ps > 0.217), suggesting that observing an Adv-I-averse Teacher’s preferences did not influence participants’ behavior in response to Dis-I offers.”

      Line 222 ~ 230:

      “A mixed-effects logistic regression revealed a significantly larger (positive) effect of trial number on rejection rates of Adv-I offers for the Adv-Dis-I-Averse condition compared to the Dis-I-Averse condition. This relative rejection rate increase was evident both in 30:70 offers (Table S7; β(SE) = -0.77(0.24), p < 0.001) and in 10:90 offers (β(SE) = -1.10(0.33), p < 0.001). In contrast, comparing Dis-I and Fairness offers when the Teacher showed the same tendency to reject, we found no significant difference between the two conditions (90:10 splits: β(SE) = -0.48(0.21), p = 0.593; 70:30 splits: β(SE) = -0.01(0.14), p = 0.150; 50:50 splits: β(SE) = -0.00(0.21), p = 0.086). In other words, participants by and large appeared to adjust their rejection choices in accordance with the Teacher’s feedback in an incremental fashion.”

      And in Experiment 2 Line 333 ~ 345:

      “Similar to what we observed in Experiment 1 (Figure 4a), compared to the participants in the Dis-I-Averse Condition, participants in the Adv-I-Averse Condition increased their rates of rejection of extreme Adv-I offers (i.e., 10:90) in the Transfer Phase, relative to the Baseline phase (β(SE) = -0.12(0.04), p < 0.004; Table S9), suggesting that participants’ learned (and adopted) Adv-I-averse preferences generalized from one specific offer type (30:70) to an offer type for which they received no Teacher feedback (10:90). Examining extreme Dis-I offers where the Teacher exhibited identical preferences across the two learning conditions, we found no difference in the changes of rejection rates from Baseline to Transfer phase between conditions (β(SE) = -0.05(0.04), p < 0.259). Mirroring the observed rejection rates (Figure 4b), participants’ fairness ratings for extreme Adv-I offers increased more from the Baseline to Transfer phase in the Adv-Dis-I-Averse Condition than in the Dis-I-Averse condition (β(SE) = -0.97(0.18), p < 0.001), but, importantly, changes in fairness ratings for extreme Dis-I offers did not differ significantly between learning conditions (β(SE) = -0.06(0.18), p < 0.723).”

      Line 361 ~ 368:

      “Examining the time course of rejection rates in Adv-I-contexts during the Learning phase (Figure 5) revealed that participants learned over time to punish mildly unfair 30:70 offers, and these punishment preferences generalized to more extreme offers (10:90). Specifically, compared to the Dis-I-Averse Condition, in the Adv-Dis-I-Averse condition we observed a significantly larger increasing trend in rejection rates for 10:90 (Adv-I) offers (Figure 5, β(SE) = -0.81(0.26), p < 0.002, mixed-effects logistic regression, see Table S10). Again, when comparing the rejection rate increase in the extremely Dis-I offers (90:10), we did not find a significant difference between conditions (β(SE) = -0.25(0.19), p < 0.707).”

      (3) Correlation Between Learning and Contagion Effects:

      The authors argue that correlations between learning effects (changes in rejection rates during the learning phase) and contagion effects (changes between the generalization and baseline phases) support the idea that individuals who are better aligning their preferences with the teacher also give more consideration to the teacher's preferences later during generalization phase. This interpretation is not convincing. Such correlations could emerge even in the absence of learning, driven by temporal trends like increasing guilt or envy (or even by slow temporal fluctuations in these processes) on behalf of self or others. The reason is that the baseline phase is temporally closer to the beginning of the learning phase whereas the generalization phase is temporally closer to the end of the learning phase. Additionally, the interpretation of these effects seems flawed, as changes in rejection rates do not necessarily indicate closer alignment with the teacher's preferences. For example, if the teacher rejects an offer 75% of the time then a positive 5% learning effect may imply better matching the teacher if it reflects an increase in rejection rate from 65% to 70%, but it implies divergence from the teacher if it reflects an increase from 85% to 90%. For similar reasons, it is not clear that the contagion effects reflect how much a teacher's preferences are taken into account during generalization.

      This comment is very similar to a previous comment made by Reviewer 1, who also called into question the interpretability of these correlations. In response to both of these comments we have elected to remove these analyses from our revision.

      (4) Modeling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided, and fits for the losing models are not shown, leaving readers in the dark about where these models fail. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related (for example, the feedback that the teacher would have rejected an offer provides feedback that rejection is "correct" but also that acceptance is "an error", and the latter is not incorporated into the modelling). It is unclear if and to what extent this limits current RL formulations. There are also potentially important missing details about the models. Can the authors justify/explain the reasoning behind including these variants they consider? What are the initial Q-values? If these are not free parameters what are their values?

      We are appreciative of the Reviewer for identifying these potentially unaddressed questions.

      The RL models we consider in the present study are naïve models which, in our previous study (FeldmanHall, Otto, & Phelps, 2018), we found to capture important aspects of learning. While simplistic, we believed these models serve as a reasonable baseline for evaluating more complex models, such as the Preference Inference model. We have made this point more explicit in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences may be acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis-I context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      Again, following from our previous modeling of observational learning (FeldmanHall et al., 2018), we believe that the feedback the Teacher provides here is ideally suited to the RL formalism. In particular, when the teacher indicates that the participant’s choice is what they would have preferred, the model receives a reward of ‘1’ (e.g., the participant rejects and the Teacher indicates they would have preferred rejection, resulting in a positive prediction error); otherwise, the model receives a reward of ‘0’ (e.g., the participant accepts and the Teacher indicates they would have preferred rejection, resulting in a negative prediction error), indicating that the participant did not choose in accordance with the Teacher’s preferences. Through an error-driven learning process, these models provide a naïve way of learning to act in accordance with the Teacher’s preferences.

      Regarding the requested model details: When treating the initial values as free parameters (model 5), we set Q(reject, offertype) as free values in [0,1] and Q(accept,offertype) as 0.5. This setting can capture participants' initial tendency to reject or accept offers from this offer type. When the initial values are fixed, for all offer types we set Q(reject, offertype) = Q(accept,offertype) = 0.5. In practice, when the initial values are fixed, setting them to 0.5 or 0 doesn’t make much difference. We have clarified these points in our revised Methods, Line 275 ~ 576:

      “We kept the initial values fixed in this model, that is Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90)”

      And Line 582 ~ 584:

      “Formally, this model treats Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90) as free parameters with values between 0 and 1.”
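      The reward coding and fixed initial values described in this response can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the learning rate, and the logistic choice rule with an inverse temperature are assumptions made for illustration; only the 1/0 reward coding and the 0.5 starting values come from the text above.

```python
import math

OFFER_TYPES = ["90:10", "70:30", "50:50", "30:70", "10:90"]

def init_q(q0=0.5):
    """Fixed initial values: Q0(reject) = Q0(accept) = 0.5 for every offer type."""
    return {offer: {"reject": q0, "accept": q0} for offer in OFFER_TYPES}

def p_reject(q, offer, inv_temp=5.0):
    """Probability of rejecting (assumed logistic/softmax choice rule)."""
    diff = q[offer]["reject"] - q[offer]["accept"]
    return 1.0 / (1.0 + math.exp(-inv_temp * diff))

def update(q, offer, action, teacher_prefers, lr=0.2):
    """Delta-rule update of the chosen action's value only.

    Reward is 1 if the participant's action matches what the Teacher
    would have preferred, and 0 otherwise.
    """
    reward = 1.0 if action == teacher_prefers else 0.0
    prediction_error = reward - q[offer][action]
    q[offer][action] += lr * prediction_error
    return prediction_error

q = init_q()
pe_match = update(q, "30:70", "reject", teacher_prefers="reject")     # positive error
pe_mismatch = update(q, "30:70", "accept", teacher_prefers="reject")  # negative error
```

      In this sketch each offer type has its own pair of action values, so feedback on 30:70 offers leaves the values for 10:90 offers untouched, which is why models of this kind cannot by themselves produce generalization across offer types.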

      (5) Conceptual Leap in Modeling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner learns a functional form that allows to predict the teacher's feedback based on said offer features (e.g., linear or quadratic form). Because feedback for an offer modulates the parameters of this function (feature weights) generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even show superiority over the preference learning model, casting doubt on the authors' conclusions. Of note: even the behaviourists knew that as Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals he used to infer how they behave.

      We are appreciative of the Reviewer for their suggestion of an alternative explanation for the observed generalization effects. Our understanding of the suggestion, put simply, is that an RL model could capture the observed generalization effects if the model were to learn and update a functional form of the Teacher’s rejection preferences using an RL-like algorithm. This idea is similar, conceptually, to our account of preference learning whereby the learner has a representation of the teacher’s preferences. In our experiment the offer is in the range of [0-100], and the crux of this idea is why the participants should adopt a functional form (either v-shaped or quadratic) with its minimum at 50. This is important because, at the beginning of the learning phase, the rejection rates are already v-shaped with 50 as the minimum. The participants do not need to adjust the minimum of this functional form. Thus, if we assume that the participants represent the teacher’s rejection rate as a v-shaped function with a minimum at [50,50], then this very likely implies that the participants have a representation that the teacher has a preference for fairness. Above all, we agree that with a suitable setup of the functional form, one could implement an RL model to capture the generalization effects, without presupposing an internal “model” of the teacher’s preferences.

      However, there is another way of modeling the generalization effect, by truly “model-free” similarity-based reinforcement learning. In this approach, we do not assume any particular functional form of the teacher’s preferences, but rather assume that experience acquired in one offer type can be generalized to offers that are close (i.e., similar) to the original offer. Accordingly, we implemented this idea using a simple RL model in which the action values for each offer type are updated by a learning rate that is scaled by the distance between that offer and the experienced offer (i.e., the offer that generated the prediction error). This learning rate is governed by a Gaussian distribution, similar to the case in Gaussian process regression (cf. Schulz, Speekenbrink, & Krause, 2018). The initial value of the ‘Reject’ action, for each offer, is set to a free parameter between 0 and 1, and the initial value for the ‘Accept’ action was set to 0.5. The results show that even though this model exhibits the trend of increasing rejection rates observed in the AI-DI punish condition, the initial preferences (i.e., starting point of learning) diverge markedly from the Learning phase behavior we observed in Experiment 1:

      Author response image 1.
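      The similarity-based updating described above can be sketched as follows. This is an illustrative reading of the verbal description, not the fitted model: the kernel width, learning rate, function names, and the choice to track only the value of the ‘Reject’ action are assumptions.

```python
import math

OFFERS = [90, 70, 50, 30, 10]  # proposer's share for each offer type

def gaussian_similarity(offer_a, offer_b, width):
    """Similarity between two offers; equals 1 when they are identical."""
    return math.exp(-((offer_a - offer_b) ** 2) / (2.0 * width ** 2))

def generalized_update(q_reject, experienced, reward, lr=0.3, width=20.0):
    """Spread one prediction error across all offer types.

    The prediction error is computed on the experienced offer, and the
    effective learning rate for every other offer is scaled by its
    Gaussian similarity to the experienced offer.
    """
    prediction_error = reward - q_reject[experienced]
    for offer in q_reject:
        scale = gaussian_similarity(offer, experienced, width)
        q_reject[offer] += lr * scale * prediction_error
    return prediction_error

q_reject = {offer: 0.5 for offer in OFFERS}
# The Teacher endorses rejecting a 30:70 offer (reward = 1).
generalized_update(q_reject, experienced=30, reward=1.0)
```

      After this single update the value of rejecting rises most for 30:70, somewhat for the neighboring 10:90 and 50:50 offers, and only marginally for 90:10, which shows how generalization can arise from similarity alone without any representation of the Teacher's preferences.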

      This demonstrated that the participant at least maintains a representation of the teacher’s preference at the beginning. That is, they have prior knowledge about the shape of this preference. We incorporated this property into the model, that is, we considered a new model that assumes v-shaped starting values for rejection with two parameters, alpha and beta, governing the slope of this v-shaped function (this starting value actually mimics the shape of the preference functions of the Fehr-Schmidt model). We found that this new model (which we term the “Model RL Sim Vstart”) provided a satisfactory qualitative fit of the Transfer phase learning curves in Experiment 1 (see below).

      Author response image 2.

      However, we did not adopt this model as the best model, for the following reasons. This model yielded a larger AIC value (indicating worse quantitative fit) compared to our Preference Inference model in both Experiments 1 and 2, likely owing to its increased complexity (5 free parameters versus 4 in the Preference Inference model). Accordingly, we believe that inclusion of this model in our revised submission would be more distracting than helpful on account of the added complexity of explaining and justifying these assumptions, and of course its comparatively poor goodness of fit (relative to the Preference Inference model).
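      The role of the parameter count in this comparison follows from the definition of the Akaike Information Criterion, AIC = 2k - 2 ln(L), where k is the number of free parameters and L is the maximized likelihood. The short sketch below uses made-up log-likelihood values purely for illustration; they are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods, chosen only to illustrate the penalty.
aic_five_params = aic(log_likelihood=-100.0, n_params=5)
aic_four_params = aic(log_likelihood=-100.0, n_params=4)
penalty_difference = aic_five_params - aic_four_params
```

      With equal likelihoods, the five-parameter model is penalized by two additional AIC points, so it must improve the log-likelihood by more than one unit to be preferred over the four-parameter model.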

      (6) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g. Figure 3A, AI+DI blue group). This is puzzling.

      Thinking about this I realized the model makes quite strong unintuitive predictions that are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then over learning the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact the teacher rejects 75% of the offers. In other words, as learning continues learners will diverge from the teacher. On the other hand, if a participant begins learning to tend to accept this offer (guilt < 1.5) then during learning they can increase their rejection rate but never above 50%. Thus one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In response to this comment we explain our efforts to build a new model that might be able to conceptually resolve the issue identified by the Reviewer.

      The key intuition guiding the Preference inference model is a Bayesian account of learning which we aimed to further simplify. In this setting, a Bayesian learner maintains a representation of the teacher’s inequity aversion parameters and updates it according to the teacher’s (observed) behavior. Intuitively, the posterior distribution shifts to the likelihood of the teacher’s action. On this view, when the teacher rejects, for instance, an AI offer, the learner should assign a higher probability to larger values of the Guilt parameter, and in turn the learner should change their posterior estimate to better capture the teacher’s preferences.

      In the current study, we simplified this idea, implementing this sort of learning using incremental “delta rule” updating (e.g. Equation 8 of the main text). Then the key question is to define the “teaching signal”. Assuming that the teacher rejects an offer 70:30, based on Bayesian reasoning, the teacher’s envy parameter (α) is more likely to exceed 1.5 (computed as 30/(50-30), per equation 7) than to be smaller than 1.5. Thus, 1.5, which is then used in equation 8 to update α, can be thought of as a teaching signal. We simply assumed that if the initial estimate is already greater than 1.5, which means the prior is consistent with the likelihood, no updating would occur. This assumption raises the question of how to set the learning rate range. In principle, an envy parameter that is larger than 1.5 should be the target of learning (i.e., the teaching signal), and thus our model definition allows the learning rate to be greater than 1, incorporating this possibility.

      Our simplified preference inference model has already successfully captured some key aspects of the participants’ learning behavior. However, it may fail in the following case: assume that the participant has an initial estimate of 1.51 for the envy parameter (β), and that this corresponds to a rejection rate of 60%. No matter how many times the teacher rejects the 70:30 offer, the participant’s estimate of the envy parameter remains the same, but observing a single acceptance would decrease this estimate and, in turn, the model’s predicted rejection rate. We believe this is the anomalous behavior in 70:30 offers identified by the Reviewer, and it is why the model does not appear able to recreate participants’ behavior for these offers.
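
      The case described above can be illustrated with a minimal sketch. It assumes a one-sided delta rule in which the estimate moves toward the teaching signal only when the teacher's choice contradicts it; the function names, the learning rate of 0.5, and the exact update form are illustrative assumptions, not the equations of the manuscript.

```python
# Teaching signal for a 70:30 offer, computed as 30 / (50 - 30) = 1.5.
SIGNAL = 30 / (50 - 30)

def update(estimate: float, teacher_rejects: bool, lr: float) -> float:
    """One-sided delta rule (assumed form): move toward SIGNAL only when
    the teacher's choice contradicts the current estimate."""
    if teacher_rejects and estimate < SIGNAL:
        return estimate + lr * (SIGNAL - estimate)
    if not teacher_rejects and estimate > SIGNAL:
        return estimate + lr * (SIGNAL - estimate)
    return estimate  # estimate already consistent with the observed choice

def run(start: float, choices: list[bool], lr: float = 0.5) -> float:
    """Apply the update over a sequence of teacher choices (True = reject)."""
    estimate = start
    for rejected in choices:
        estimate = update(estimate, rejected, lr)
    return estimate

# Starting just above the signal: rejections never move the estimate ...
after_rejections = run(1.51, [True] * 20)
# ... but a single acceptance pulls it down toward 1.5.
after_one_accept = run(1.51, [True] * 20 + [False])
# Starting below the signal: rejections raise the estimate, never past 1.5.
from_below = run(0.5, [True] * 20)
```

      Under these assumptions a learner who starts at 1.51 cannot move closer to a teacher who rejects most offers, which is the divergence the Reviewer points out.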

      This issue touches the core of our model specification, namely the choice of the teaching signal. Because we chose 1.5 as the teaching signal (i.e., the bound implied whenever the teacher rejects or accepts a 70:30 offer), an estimate that deviates even slightly from 1.5 disables one direction of updating. One way to mitigate this problem would be to choose a lower bound for α greater than 1.5, such that when the Teacher rejects a 70:30 offer, we assign a number greater than 1.5 (by ‘hard-coding’ this into the model via modification of equation 7). One sensible candidate value could be the midpoint between 1.5 and 10 (the maximum value of α per our model definition). Intuitively, under this setting the model could still increase an α of 1.51 when the teacher rejects 70:30, thus alleviating (but not completely eliminating) the anomaly.
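
      A minimal sketch of this modification, assuming the rejection target is simply replaced by the midpoint between 1.5 and 10 while the acceptance target stays at 1.5. The names and the learning rate are hypothetical, and the modified equation 7 itself is not reproduced here.

```python
LOWER = 30 / (50 - 30)                # 1.5, bound implied by a 70:30 offer
UPPER = 10.0                          # maximum parameter value in the model
REJECT_SIGNAL = (LOWER + UPPER) / 2   # 5.75, assumed midpoint target

def update_modified(estimate: float, teacher_rejects: bool, lr: float) -> float:
    """One-sided delta rule with a larger target after rejections (assumed form)."""
    target = REJECT_SIGNAL if teacher_rejects else LOWER
    if teacher_rejects and estimate < target:
        return estimate + lr * (target - estimate)
    if not teacher_rejects and estimate > target:
        return estimate + lr * (target - estimate)
    return estimate

# An estimate of 1.51 now increases when the teacher keeps rejecting.
estimate = 1.51
for _ in range(5):
    estimate = update_modified(estimate, True, lr=0.1)
```

      The anomaly is moved rather than removed in this sketch: an estimate already above 5.75 is again insensitive to further rejections.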

      We fitted this modified Preference Inference model to the data from Experiment 1 (see Author response image 3 below) and found that even though this model has a smaller AIC (and thus better quantitative fit than the original Preference Inference model), it still doesn’t fully capture the participants’ behavior for 70:30 offers.

      Author response image 3.

      Accordingly, rather than revising our model to include an unprincipled ‘kludge’ to account for this minor anomaly in the model behavior, we have opted to report our original model in our revision as we still believe it parsimoniously captures our intuitions about preference learning and provides a better fit to the observed behavior than the other RL models considered in the present study.

      Reviewer #1 (Recommendations for the authors):

      (1) I do not particularly prefer the acronyms AI and DI for disadvantageous inequity and advantageous inequity. Although they have been used in the literature, not every single paper uses them. More importantly, AI these days has such a strong meaning of artificial intelligence, so when I was reading this, I'd need to very actively inhibit this interpretation. I believe for the readability for a wider readership of eLife, I would advise not to use AI/DI here, but rather use the full terms.

      We thank the Reviewer for this suggestion. As the full spellings of the two terms are somewhat lengthy and appear frequently in the figures, we have elected to change the abbreviations for disadvantageous inequity and advantageous inequity to Dis-I and Adv-I, respectively, in the main text and the supplementary information. We still use AI/DI in the response letter to keep the terminology consistent.

      (2) Do "punishment rate" and "rejection rate" mean the same? If so, it would be helpful to stick with one single term, eg, rejection rate.

      We thank the Reviewer for this suggestion. As these terms have the same meaning, we have opted to use the term “rejection rate” throughout the main text.

      (3) For the linear mixed effect models, were other random effect structures also considered (eg, random slopes of experimental conditions)? It might be worth considering a few model specifications and selecting the best one to explain the data.

      Thanks for this comment. Following established best practices (Barr, Levy, Scheepers, & Tily, 2013) we have elected to use a maximal random effects structure, whereby all possible predictor variables in the fixed effects structure also appear in the random effects structure.

      (4) For equation (4), the softmax temperature is denoted as tau, but later in the text, it is called gamma. Please make it consistent.

      We are appreciative of the Reviewer’s attention to detail. We have corrected this error.

      Reviewer #2 (Recommendations for the authors):

      (1) Several Tables in SI are unclear. I wasn't clear if these report raw probabilities of coefficients of mixed models. For any mixed models, it would help to give the model specification (e.g., Walkins form) and explain how variables were coded.

      We are appreciative of the Reviewer’s attention to detail. We have clarified, in the captions accompanying our supplemental regression tables, that these coefficients represent log-odds. Regretfully we are unaware of the “Walkins form” the Reviewer references (even after extensive searching of the scientific literature). However, in our new revision we do include lme4 model syntax in our supplemental information, which we believe will be helpful for readers seeking to replicate our model specification.

      (2) In one of the models it was said that the guilt and envy parameters were bounded between 0-1 but this doesn't make sense and I think values outside this range were later reported.

      We are again appreciative of the Reviewer’s attention to detail. This was an error we have corrected— the actual range is [0,10].

      (3) It is unclear if the model parameters are recoverable.

      In response to this comment our revision now reports a basic parameter recovery analysis for the winning Preference Inference model. This is reported in our revised Methods:

      “Finally, to verify whether the free parameters of the winning model (Preference Inference) are recoverable, we simulated 200 artificial subjects, based on the Learning Phase of Experiment 1, with free parameters randomly chosen (uniformly) from their defined ranges. We then employed the same model-fitting procedure as described above to estimate these parameter values. We found that all parameters of the model can be recovered (see Figure S2).”
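
      The general shape of such a recovery analysis can be sketched as follows. This is not the Preference Inference model: a toy one-parameter logistic choice rule and a grid-search maximum-likelihood fit stand in for the actual model and fitting procedure, and every name and constant below is an assumption made for illustration.

```python
import numpy as np

def reject_prob(theta, x):
    """Toy choice rule (an assumption): rejection is logistic in theta * inequity."""
    return 1.0 / (1.0 + np.exp(-(theta * x - 2.0)))

def fit_theta(x, choices, grid):
    """Maximum-likelihood estimate of theta by grid search."""
    p = reject_prob(grid[:, None], x[None, :])        # shape (n_grid, n_trials)
    p = np.clip(p, 1e-9, 1 - 1e-9)                    # avoid log(0)
    loglik = (choices * np.log(p) + (1 - choices) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

def recovery(n_subjects=200, n_trials=200, seed=0):
    """Simulate subjects with known parameters, refit them, return both sets."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 10.0, 201)
    x = np.tile(np.linspace(0.0, 1.0, 5), n_trials // 5)   # inequity levels
    true = rng.uniform(0.0, 10.0, size=n_subjects)         # uniform over range
    est = np.empty(n_subjects)
    for s in range(n_subjects):
        choices = (rng.random(x.size) < reject_prob(true[s], x)).astype(float)
        est[s] = fit_theta(x, choices, grid)
    return true, est

true, est = recovery()
r = np.corrcoef(true, est)[0, 1]   # simulated vs. recovered correlation
```

      A high correlation between simulated and recovered values, together with a scatter plot of the two, is the usual evidence that a parameter is recoverable.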

      Scatter plots depicting these simulated (versus recovered) parameters are given in Figure S2 of our revised Supplementary Information.

      (4) I was confused about what Figure S2 shows. The text says this is about correlating contagious effects for different offers but the captions speak about learning effects. This is an important aspect which is unclear.

      We have removed this figure in response to both Reviewers’ comments about the limited insights that can be drawn on the basis of these correlations.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. Based on our last response and revision, we are confused by the two limitations noted in the eLife assessment. 

      (1) benchmarking against comparable methods is limited.

      In our last revision, we added the comparison experiments with TNDM, as the reviewers requested. Additionally, it is crucial to emphasize that our evaluation of decoding capabilities of behaviorally relevant signals has been benchmarked against the performance of the ANN on raw signals, which, as Reviewer #1 previously noted, nearly represents the upper limit of performance. Consequently, we believe that our benchmarking methods are sufficiently strong.

      (2) some observations may be a byproduct of their method, and may not constitute new scientific observations.

      We believe that our experimental results are sufficient to demonstrate that our conclusions are not byproducts of d-VAE based on three reasons:

      (1) The d-VAE, as a latent variable model, adheres to the population doctrine, which posits that latent variables are responsible for generating the activities of individual neurons. The goal of such models is to maximize the explanation of the raw signals. At the signal level, the only criterion we can rely on is neural reconstruction performance, in which we have achieved unparalleled results. Thus, it is inappropriate to focus on the mixing process during the model's inference stage while overlooking the crucial de-mixing process during the generation stage and dismissing the significance of our neural reconstruction results. For more details, please refer to the first point in our response to Q4 from Reviewer #4.

      (2) The criterion that irrelevant signals should contain minimal information can effectively demonstrate that our conclusions are not by-products of d-VAE. Unfortunately, the reviewers seem to have overlooked this criterion. For more details, please refer to the third point in our response to Q4 from Reviewer #4.

      (3) Our synthetic experimental results also substantiate that our conclusions are not byproducts of d-VAE. However, it appears the reviewers did not give these results adequate consideration. For more details, please refer to the fourth point in our response to Q4 from Reviewer #4.

      Furthermore, our work presents not just "a useful method" but a comprehensive framework. Our study proposes, for the first time, a framework for defining, extracting, and validating behaviorally relevant signals. In our current revision, to clearly distinguish between d-VAE and other methods, we have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem. To our knowledge, current methods have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenges of extracting relevant signals. Similarly, existing research has not yet defined and validated behaviorally relevant signals. For more details, please refer to our response to Q1 from Reviewer #4.

      Based on these considerations, we respectfully request that you reconsider the eLife assessment of our work. We greatly appreciate your time and attention to this matter.

      The main revisions made to the manuscript are as follows:

      (1) We have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem, enabling a clearer distinction between d-VAE and other models.

      (2) We have moderated the assertion about linear readout to highlight its conjectural nature and have broadened the discussion regarding this conclusion. 

      (3) We have elaborated on the model details of d-VAE and have removed the identifiability claim.

      To Reviewer #1

      Q1: “As reviewer 3 also points out, I would, however, caution to interpret this as evidence for linear read-out of the motor system - your model performs a non-linear transformation, and while this is indeed linearly decodable, the motor system would need to do something similar first to achieve the same. In fact to me it seems to show the opposite, that behaviour-related information may not be generally accessible to linear decoders (including to down-stream brain areas).”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by linear decoders or downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our opinion, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      Q2: “As in my initial review, I would also caution against making strong claims about identifiability although this work and TNDM seem to show that in practise such methods work quite well. CEBRA, in contrast, offers some theoretical guarantees, but it is not a generative model, so would not allow the type of analysis done in this paper. In your model there is a parameter \alpha to balance between neural and behaviour reconstruction. This seems very similar to TNDM and has to be optimised - if this is correct, then there is manual intervention required to identify a good model.”

      Thank you for your comments. 

      Considering your concerns about our identifiability claims and the fact that identifiability is not directly relevant to the core of our paper, we have removed content related to identifiability.

      Firstly, our model is based on the pi-VAE, which also has theoretical guarantees. However, it is important to note that all such theoretical guarantees (including pi-VAE and CEBRA) are based on certain assumptions that cannot be validated as the true distribution of latent variables remains unknown.

      Secondly, it is important to clarify that the identifiability of latent variables does not impact the conclusions of this paper, nor does this paper make specific conclusions about the model's latent variables. Identifiability means that distinct latent variables correspond to distinct observations. If multiple latent variables can generate the same observation, it becomes impossible to determine which one is correct given the observation, which leads to the issue of nonidentifiability. Notably, our analysis focuses on the generated signals, not the latent variables themselves, and thus the identifiability of these variables does not affect our findings. 

      Our approach, dedicated to extracting these signals, distinctly differs from methods such as TNDM, which focuses on extracting behaviorally relevant latent dynamics. To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      $$\min_{\mathbf{x}_r} \; E(\mathbf{x}_r, \mathbf{x}) + R(\mathbf{x}_r)$$

      where $\mathbf{x}_r$ denotes generated behaviorally-relevant signals, $\mathbf{x}$ denotes raw noisy signals, $E(\cdot,\cdot)$ denotes reconstruction loss, and $R(\cdot)$ denotes regularization loss. It is important to note that while both d-VAE and TNDM employ reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal similarity degree, encapsulated by $R(\mathbf{x}_r)$. Other studies have not explicitly proposed extracting behaviorally-relevant signals, nor have they identified and addressed the key challenges involved in extracting relevant signals. Consequently, our approach is distinct from other methods.

      Thank you for your valuable feedback.

      Q3: “Somewhat related, I also found that the now comprehensive comparison with related models shows that the using decoding performance (R2) as a metric for model comparison may be problematic: the R2 values reported in Figure 2 (e.g. the MC_RTT dataset) should be compared to the values reported in the neural latent benchmark, which represent well-tuned models (e.g. AutoLFADS). The numbers (difficult to see, a table with numbers in the appendix would be useful, see: https://eval.ai/web/challenges/challenge-page/1256/leaderboard) seem lower than what can be obtained with models without latent space disentanglement. While this does not necessarily invalidate the conclusions drawn here, it shows that decoding performance can depend on a variety of model choices, and may not be ideal to discriminate between models. I'm also surprised by the low neural R2 for LFADS I assume this is condition-averaged) - LFADS tends to perform very well on this metric.”

      Thank you for your comments. The dataset we utilized is not from the same day as the neural latent benchmark dataset. Notably, there is considerable variation in the length of trials within the RTT paradigm, and the dataset lacks explicit trial information, rendering trial-averaging unsuitable. Furthermore, behaviorally relevant signals are not static averages devoid of variability; even behavioral data exhibits variability. We computed the neural R2 using individual trials rather than condition-averaged responses. 

      Thank you for your valuable feedback.

      Q4: “One statement I still cannot follow is how the prior of the variational distribution is modelled. You say you depart from the usual Gaussian prior, but equation 7 seems to suggest there is a normal prior. Are the parameters of this distribution learned? As I pointed out earlier, I however suspect this may not matter much as you give the prior a very low weight. I also still am not sure how you generate a sample from the variational distribution, do you just draw one for each pass?”

      Thank you for your questions.

      The conditional distribution of prior latent variables $p_m(\mathbf{z}|\mathbf{y})$ is a Gaussian distribution, but the distribution of prior latent variables $p(\mathbf{z})$ is a Gaussian mixture distribution. The distribution of prior latent variables $p(\mathbf{z})$ is:

      $$p(\mathbf{z}) = \int p_m(\mathbf{z}|\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \frac{1}{N}\sum_{i=1}^{N} p_m(\mathbf{z}|\mathbf{y}^{(i)})$$

      where $p(\mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N} \delta(\mathbf{y}-\mathbf{y}^{(i)})$ denotes the empirical distribution of behavioral variables $\mathbf{y}$, $N$ denotes the number of samples, $\mathbf{y}^{(i)}$ denotes the $i$th sample, $\delta(\cdot)$ denotes the Dirac delta function, and $p_m(\mathbf{z}|\mathbf{y})$ denotes the conditional distribution of prior latent variables given the behavioral variables, parameterized by network $m$. Based on the above equation, we can see that $p(\mathbf{z})$ is not a Gaussian distribution; it is a Gaussian mixture model with $N$ components, which is theoretically a universal approximator of continuous probability densities.
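
      To make the mixture form concrete, the following sketch evaluates a one-dimensional prior of this kind as the average of N Gaussian components, one per observed behavioral sample. The function `prior_network` is a hypothetical stand-in for network m, and the toy behavioral data are invented for illustration.

```python
import numpy as np

def prior_network(y):
    """Hypothetical stand-in for network m: maps behavior y to (mean, std)."""
    return 3.0 * np.tanh(2.0 * y), 0.5 + 0.1 * np.abs(y)

def gaussian_pdf(z, mean, std):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_prior_pdf(z, y_samples):
    """p(z) = (1/N) * sum_i p_m(z | y_i): one Gaussian component per sample."""
    mean, std = prior_network(y_samples)                       # each shape (N,)
    return gaussian_pdf(z[:, None], mean[None, :], std[None, :]).mean(axis=1)

rng = np.random.default_rng(0)
y_samples = rng.uniform(-1.0, 1.0, size=500)    # toy behavioral variable
z_grid = np.linspace(-8.0, 8.0, 3201)
density = mixture_prior_pdf(z_grid, y_samples)

# Sanity check: the mixture is a proper density (Riemann sum close to 1).
area = np.sum(density) * (z_grid[1] - z_grid[0])
```

      Because the component means follow the behavioral samples, the resulting density need not be unimodal, which is why such a prior is more flexible than a single N(0,1).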

      Learning this prior is important, as illustrated by our latent variable visualizations, which do not follow a Gaussian distribution. Upon conducting hypothesis testing for both latent variables and behavioral variables, we found that neither conforms to a Gaussian distribution (Lilliefors test and Kolmogorov-Smirnov test). Consequently, constraining the latent variables towards N(0,1) is expected to affect performance adversely.

      Regarding sampling, during the training process we draw only one sample from the approximate posterior distribution. It is worth noting that drawing multiple samples or one sample for each pass does not affect the experimental results. After training, we can generate a sample from the prior by providing input behavioral data $\mathbf{y}^{(i)}$ and then generating corresponding samples via the prior network and the generative network. To extract behaviorally-relevant signals from raw signals, we use the inference network and the generative network.

      Thank you for your valuable feedback.

      Q5: “(1) I found the figures good and useful, but the text is, in places, not easy to follow. I think the manuscript could be shortened somewhat, and in some places more concise focussed explanations would improve readability.

      (2) I would not call the encoding "complex non-linear" - non-linear is a clear term, but complex can mean many things (e.g. is a quadratic function complex?) ”

      Thank you for your recommendation. We have revised the manuscript for enhanced clarity.  We call the encoding “complex nonlinear” because neurons encode information with varying degrees of nonlinearity, as illustrated in Fig. 3b, f, and Fig. S3b.

      Thank you for your valuable feedback.

      To Reviewer #2

      Q1: “I still remain unconvinced that the core findings of the paper are "unexpected". In the response to my previous Specific Comment #1, they say "We use the term 'unexpected' due to the disparity between our findings and the prior understanding concerning neural encoding and decoding." However, they provide no citations or grounding for why they make those claims. What prior understanding makes it unexpected that encoding is more complex than decoding given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding")?” 

      Thank you for your comments. We believe that both the complexity of neural encoding and the simplicity of neural decoding in motor cortex are unexpected.

      The Complexity of Neural Encoding: As noted in the Introduction, neurons with small R2 values were traditionally considered noise and consequently disregarded, as detailed in references [1-3]. However, after filtering out irrelevant signals, we discovered that these neurons actually contain substantial amounts of behavioral information, previously unrecognized. Similarly, in population-level analyses, neural signals composed of small principal components (PCs) are often dismissed as noise, with analyses typically utilizing only between 6 and 18 PCs [4-10]. Yet, the discarded PC signals nonlinearly encode significant amounts of information, with practically useful dimensions found to range between 30 and 40—far exceeding the usual number analyzed. These findings underscore the complexity of neural encoding and are unexpected.

      The Simplicity of Neural Decoding: In the motor cortex, nonlinear decoding of raw signals has been shown to significantly outperform linear decoding, as evidenced in references [11,12]. Interestingly, after separating behaviorally relevant and irrelevant signals, we observed that the linear decoding performance of behaviorally relevant signals is nearly equivalent to that of nonlinear decoding—a phenomenon previously undocumented in the motor cortex. This discovery is also unexpected.

      Thank you for your valuable feedback.

      (1) Georgopoulos, Apostolos P., Andrew B. Schwartz, and Ronald E. Kettner. "Neuronal population coding of movement direction." Science 233.4771 (1986): 1416-1419.

      (2) Hochberg, Leigh R., et al. "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm." Nature 485.7398 (2012): 372-375. 

      (3) Inoue, Yoh, et al. "Decoding arm speed during reaching." Nature communications 9.1 (2018): 5243.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (7) Sadtler, Patrick T., et al. "Neural constraints on learning." Nature 512.7515 (2014): 423-426.

      (8) Golub, Matthew D., et al. "Learning by neural reassociation." Nature neuroscience 21.4 (2018): 607-616.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      (11) Glaser, Joshua I., et al. "Machine learning for neural decoding." Eneuro 7.4 (2020).

      (12) Willsey, Matthew S., et al. "Real-time brain-machine interface in non-human primates achieves high-velocity prosthetic finger movements using a shallow feedforward neural network decoder." Nature Communications 13.1 (2022): 6899.

      Q2: “I still take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature hand-chosen by the experimenter. In the response to my previous review, the authors say "we employ terms like 'behaviorally-relevant' and 'behaviorally-irrelevant' only regarding behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task.". This is just a restatement of their definition, not a response to my concern, and does not address my concern that the method requires a fixed temporal lag and continual decoding/encoding. My example of reward signals remains. There is a huge body of literature dating back to the 70s on the linear relationships between neural activity and arm kinematics; in a sense, the authors have chosen the "variable of interest" that proves their point. This all ties back to the previous comment: this is mostly expected, not unexpected, when relating apparently-stochastic, discrete action potential events to smoothly varying limb kinematics.”

      Thank you for your comments. 

      Regarding the experimenter's specification of behavioral variables of interest, we followed common practice in existing studies [1, 2]. Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [3-5]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [6-8].

      Concerning the issue of rewards, in the paper you mentioned [9], the impact of rewards occurs after the reaching phase. It's important to note that in our experiments, we analyze only the reaching phase, without any post-movement phase. 

      If the impact of rewards can be stably reflected in the signals in the reaching phase of the subsequent trial, and if the reward-induced signals do not interfere with decoding—since these signals are harmless for decoding and beneficial for reconstruction—our model is likely to capture these signals. If the signals induced by rewards during the reaching phase are randomly unstable, our model will likely be unable to capture them.

      If the goal is to extract post-movement neural activity from both rewarded and unrewarded trials, and if the neural patterns differ between these conditions, one could replace the d-VAE's regression loss, used for continuous kinematics decoding, with a classification loss tailored to distinguish between rewarded and unrewarded conditions.

      To clarify the definition, we have revised it in the manuscript. Specifically, before a specific definition, we briefly introduce the relevant signals and irrelevant signals. Behaviorally irrelevant signals refer to those not directly associated with the behavioral variables of interest and may include noise or signals from variables of no interest. In contrast, behaviorally relevant signals refer to those directly related to the behavioral variables of interest. For instance, rewards in the post-movement phase are not directly related to behavioral variables (kinematics) in the reaching movement phase.

      It is important to note that our definition of behaviorally relevant signals includes not only decoding capabilities but also specific requirements at the signal level. It is based on two key requirements:

      (1) they should closely resemble raw signals to preserve the underlying neuronal properties, without becoming so similar that they include irrelevant signals (encoding requirement); and (2) they should contain as much behavioral information as possible (decoding requirement). Signals that meet both requirements are considered effective behaviorally relevant signals. In our study, we assume raw signals are additively composed of behaviorally-relevant and irrelevant signals. We define irrelevant signals as those remaining after subtracting relevant signals from raw signals. Therefore, we believe our definition is clearly articulated.

      Thank you for your valuable feedback.

      (1) Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      (2) Buetfering, Christina, et al. "Behaviorally relevant decision coding in primary somatosensory cortex neurons." Nature neuroscience 25.9 (2022): 1225-1236.

      (3) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain–machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (4) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (5) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (6) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (7) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (8) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (9) Ramkumar, Pavan, et al. "Premotor and motor cortices encode reward." PloS one 11.8 (2016): e0160851.

      Q3: “The authors seem to have missed the spirit of my critique: to say "linear readout is performed in motor cortex" is an over-interpretation of what their model can show.”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our view, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Regarding the question of whether the brain employs linear readout, given the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is challenging to ascertain whether the brain employs a linear readout. In many cortical areas, linear decoders have proven to be sufficiently accurate. Consequently, numerous studies [4, 5, 6], including the one you referenced [4], directly employ linear decoders to extract information and formulate conclusions based on the decoding results. Contrary to these approaches, our research has compared the performance of linear and nonlinear decoders on behaviorally relevant signals and found their decoding performance is comparable. Considering both the decoding accuracy and model complexity, our results suggest that the motor cortex may utilize linear readout to decode information from relevant signals. Given the current technological limitations, we consider it reasonable to analyze collected data to speculate on the potential workings of the brain, an approach that many studies have also embraced [7-10]. For instance, a study [7] deduces strategies the brain might employ to overcome noise by analyzing the structure of recorded data and decoding outcomes for new stimuli.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      (4) Jurewicz, Katarzyna, et al. "Irrational choices via a curvilinear representational geometry for value." bioRxiv (2022): 2022-03.

      (5) Hong, Ha, et al. "Explicit information for category-orthogonal object properties increases along the ventral stream." Nature neuroscience 19.4 (2016): 613-622.

      (6) Chang, Le, and Doris Y. Tsao. "The code for facial identity in the primate brain." Cell 169.6 (2017): 1013-1028.

      (7) Ganmor, Elad, Ronen Segev, and Elad Schneidman. "A thesaurus for a neural population code." Elife 4 (2015): e06134.

      (8) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      Q4: “Agreeing with my critique is not sufficient; please provide the data or simulations that provides the context for the reference in the fano factor. I believe my critique is still valid.”

      Thank you for your comments. As we previously replied, Churchland's research examines the variability of neural signals across different stages, including the preparation and execution phases, as well as before and after the target appears. Our study, however, focuses exclusively on the movement execution phase. Consequently, we are unable to produce comparative displays similar to those in that research. Intuitively, one might expect the variability of behaviorally relevant signals to be lower; however, since no prior studies have accurately extracted such signals, the specific Fano factor (FF) values of behaviorally relevant signals remain unknown. Therefore, presenting these values is meaningful and can provide a reference for future research. While we cannot compare FF across different stages, we can numerically compare the values to those of a Poisson count process. An FF of 1 indicates a Poisson firing process, and our experimental data reveal that most neurons have an FF less than 1, indicating that the variance in firing counts is below the mean. Thank you for your valuable feedback.
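
      As a minimal illustration of the comparison to a Poisson count process described above, the sketch below computes a per-neuron Fano factor from trial-by-trial spike counts. The function name `fano_factor`, the simulated counts, and the trial number are assumptions made for illustration only; they are not taken from the study's data or code.

```python
import numpy as np


def fano_factor(counts):
    """Per-neuron Fano factor: across-trial variance of spike counts over their mean.

    counts: array of shape (n_trials, n_neurons) with counts from a fixed window.
    Neurons with zero mean count get NaN.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)           # unbiased across-trial variance
    safe_mean = np.where(mean > 0, mean, 1.0)  # avoid division by zero
    return np.where(mean > 0, var / safe_mean, np.nan)


# Hypothetical simulated neurons (not real recordings).
rng = np.random.default_rng(0)
n_trials = 5000
poisson_counts = rng.poisson(lam=10.0, size=n_trials)      # variance equals mean
regular_counts = rng.binomial(n=20, p=0.5, size=n_trials)  # variance below mean
ff = fano_factor(np.column_stack([poisson_counts, regular_counts]))
print(ff)  # first entry close to 1, second entry close to 0.5
```

      In this toy case the Poisson neuron serves as the FF = 1 reference, while the binomial neuron stands in for a neuron whose count variance is below its mean.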

      To Reviewer #4

      Q1: “Overall, studying neural computations that are behaviorally relevant or not is an important problem, which several previous studies have explored (for example PSID in (Sani et al. 2021), TNDM in (Hurwitz et al. 2021), TAME-GP in (Balzani et al. 2023), pi-VAE in (Zhou and Wei 2020), and dPCA in (Kobak et al. 2016), etc). However, this manuscript does not properly put their work in the context of such prior works. For example, the abstract states "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive", which is not the case given that these prior works have done that. The same is true for various claims in the main text, for example "Furthermore, we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that using raw signals to estimate the neural dimensionality of behaviors leads to an overestimation" (line 321). This finding was presented in (Sani et al. 2021) and (Hurwitz et al. 2021), which is not clarified here. This issue of putting the work in context has been brought up by other reviewers previously but seems to remain largely unaddressed. The introduction is inaccurate also in that it mixes up methods that were designed for separation of behaviorally relevant information with those that are unsupervised and do not aim to do so (e.g., LFADS). The introduction should be significantly revised to explicitly discuss prior models/works that specifically formulated this behavior separation and what these prior studies found, and how this study differs.”  

      Thank you for your comments. Our statement that “One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive” is accurate. To the best of our knowledge, no prior work has done this, that is, accurately separated behaviorally relevant neural signals at both single-neuron and single-trial resolution. The works you mentioned have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenge of extracting relevant signals, namely determining the optimal degree of similarity between the generated relevant signals and the raw signals. Those works focus on latent neural dynamics rather than on the signal level.

      To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      $$\min_{\boldsymbol{x}_r} \; E(\boldsymbol{x}_r, \boldsymbol{x}) + R(\boldsymbol{x}_r)$$

      where 𝒙𝒓 denotes the generated behaviorally-relevant signals, 𝒙 denotes the raw noisy signals, 𝐸(⋅,⋅) denotes the reconstruction loss, and 𝑅(⋅) denotes the regularization loss. It is important to note that while both d-VAE and TNDM employ a reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal degree of similarity, encapsulated by 𝑅(𝒙𝒓). None of the works you mentioned includes the key term 𝑅(𝒙𝒓).

      Regarding the dimensionality estimation, the dimensionality of neural manifolds quantifies the degrees of freedom required to describe population activity without significant information loss.

      There are two differences between our work and PSID and TNDM. 

      First, the dimensions they refer to are fundamentally different from ours. The dimensionality we describe pertains to a linear subspace, where a neural dimension (also called a neural mode or principal component basis) is a vector in $\mathbb{R}^{N}$, with N representing the number of neurons. However, the vector length of a neural mode differs between PSID and our approach: PSID requires concatenating multiple time steps T, essentially making the neural mode a vector in $\mathbb{R}^{NT}$. TNDM, on the other hand, involves nonlinear dimensionality reduction, which is different from linear dimensionality reduction.

      Second, we estimate neural dimensionality by explaining the variance of neural signals, whereas PSID and TNDM determine dimensionality through decoding performance saturation. It is important to note that the dimensionality at which decoding performance saturates may not accurately reflect the true dimensionality of neural manifolds, as some dimensions may contain redundant information that does not enhance decoding performance.
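
      The variance-explained criterion mentioned above can be sketched as follows. This is only an illustrative example on simulated data: the 90% threshold, the function name `dimensionality_by_variance`, and the simulated low-rank signals are assumptions and do not reproduce the analysis in the manuscript.

```python
import numpy as np


def dimensionality_by_variance(signals, threshold=0.9):
    """Smallest number of principal components whose cumulative explained
    variance reaches `threshold`.

    signals: array of shape (n_samples, n_neurons).
    """
    centered = signals - signals.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance along each PC.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(variance_ratio)
    # First index where the cumulative ratio reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Hypothetical data: 3 latent dimensions mixed into 30 "neurons".
rng = np.random.default_rng(0)
latents = rng.standard_normal((2000, 3))
mixing = rng.standard_normal((3, 30))
low_rank = latents @ mixing
noisy = low_rank + 2.0 * rng.standard_normal(low_rank.shape)

dims_clean = dimensionality_by_variance(low_rank)
dims_noisy = dimensionality_by_variance(noisy)
print(dims_clean, dims_noisy)  # the noisy estimate is larger than the clean one
```

      In this toy setting, additive noise inflates the number of components needed to reach the variance threshold, which is the kind of overestimation the variance-based estimate on raw signals is meant to expose.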

      We acknowledge that while LFADS can generate signals that contain some behavioral information, it was not specifically designed to do so. Following your suggestion, we have removed this reference from the Introduction.

      Thank you for your valuable feedback.

      Q2: “Claims about linearity of "motor cortex" readout are not supported by results yet stated even in the abstract. Instead, what the results support is that for decoding behavior from the output of the dVAE model -- that is trained specifically to have a linear behavior readout from its embedding -- a nonlinear readout does not help. This result can be biased by the very construction of the dVAE's loss that encourages a linear readout/decoding from embeddings, and thus does not imply a finding about motor cortex.”

      Thank you for your comments. We respectfully disagree with the notion that the linear decodability of the relevant signals results from constraints that make the embedding linearly decodable. An embedding reorganizes or transforms the structure of the original signals, and the fact that an embedding can be linearly decoded does not mean that the corresponding signals can be linearly decoded.

      Let's clarify this with three intuitive examples:

      Example 1: Image denoising is a well-established field. Whether employing supervised or blind denoising methods [1, 2], both can effectively recover the original image. This denoising process closely resembles the extraction of behaviorally relevant signals from raw signals. Consider if noisy images are not amenable to linear decoding (classification); would removing the noise enable linear decoding? The answer is no. Typically, the noise in images captured under normal conditions is minimal, yet even the clear images remain challenging to decode linearly.

      Example 2: Consider the task of face recognition, where face images are set against various backgrounds. In this context, the pixels representing the face correspond to relevant signals, while the background pixels are considered irrelevant. Suppose a network is capable of extracting the face pixels and the resulting embedding can be linearly decoded. Can the face pixels themselves be linearly decoded? The answer is no. If linear decoding of face pixels were feasible, the challenging task of face recognition could be easily resolved by merely extracting the face from the background and training a linear classifier.

      Example 3: In the MNIST dataset, the background is uniformly black, and its impact is minimal. However, linear SVM classifiers used directly on the original pixels significantly underperform compared to non-linear SVMs.

      In summary, embedding involves reorganizing the structure of the original signals through a feature transformation function. However, the reconstruction process can recover the structure of the original signals from the embedding. The fact that the structure of the embedding can be linearly decoded does not imply that the structure of the original signals can be linearly decoded in the same way. It is inappropriate to focus on the compression process without equally considering the reconstruction process.
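
      The distinction drawn in the summary above can be illustrated with a deliberately simple toy example. It is not the d-VAE model: the XOR-like labels, the hand-written encoder and decoder, and the helper `linear_decode_accuracy` are all assumptions chosen only to show that a linearly decodable embedding and a perfect reconstruction can coexist with signals that are not linearly decodable.

```python
import numpy as np


def linear_decode_accuracy(features, labels):
    """Fit a least-squares linear classifier (with bias) and return its accuracy
    on the same data."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return float(np.mean(np.sign(design @ weights) == labels))


rng = np.random.default_rng(0)
signals = rng.choice([-1.0, 1.0], size=(1000, 2))  # toy "original signals"
labels = signals[:, 0] * signals[:, 1]             # XOR-like label in {-1, +1}

# Hypothetical nonlinear encoder: append the product feature to the signals.
embedding = np.column_stack([signals, signals[:, 0] * signals[:, 1]])
# Hypothetical decoder: the first two embedding coordinates recover the signals.
reconstruction = embedding[:, :2]

acc_embedding = linear_decode_accuracy(embedding, labels)
acc_signals = linear_decode_accuracy(reconstruction, labels)
print(acc_embedding, acc_signals)
```

      Here the embedding is perfectly linearly decodable and the reconstruction is exact, yet a linear decoder applied to the reconstructed signals stays far from perfect accuracy.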

      Thank you for your valuable feedback.

      (1) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (2) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      Q3: “Related to the above, it is unclear what the manuscript means by readout from motor cortex. A clearer definition of "readout" (a mapping from what to what?) in general is needed. The mapping that the linearity/nonlinearity claims refer to is from the *inferred* behaviorally relevant neural signals, which themselves are inferred nonlinearly using the VAE. This should be explicitly clarified in all claims, i.e., that only the mapping from distilled signals to behavior is linear, not the whole mapping from neural data to behavior. Again, to say the readout from motor cortex is linear is not supported, including in the abstract.” 

      Thank you for your comments. We have revised the manuscript to make this clearer. Thank you for your valuable feedback.

      Q4: “Claims about individual neurons are also confounded. The d-VAE distilling processing is a population level embedding so the individual distilled neurons are not obtainable on their own without using the population data. This population level approach also raises the possibility that information can leak from one neuron to another during distillation, which is indeed what the authors hope would recover true information about individual neurons that wasn't there in the recording (the pixel denoising example). The authors acknowledge the possibility that information could leak to a neuron that didn't truly have that information and try to rule it out to some extent with some simulations and by comparing the distilled behaviorally relevant signals to the original neural signals. But ultimately, the distilled signals are different enough from the original signals to substantially improve decoding of low information neurons, and one cannot be sure if all of the information in distilled signals from any individual neuron truly belongs to that neuron. It is still quite likely that some of the improved behavior prediction of the distilled version of low-information neurons is due to leakage of behaviorally relevant information from other neurons, not the former's inherent behavioral information. This should be explicitly acknowledged in the manuscript.”

      Thank you for your comments. We value your insights regarding the mixing process. However, we are confident in the robustness of our conclusions. We respectfully disagree with the notion that the small R2 values containing significant information are primarily due to leakage, and we base our disagreement on four key reasons.

      (1) Neural reconstruction performance is a reliable and valid criterion.

      The purpose of latent variable models is to explain neuronal activity as much as possible. Given that the ground truth of the behaviorally-relevant signals, the latent variables, and the generative model are unknown, the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [1]. Consequently, we maintain that if the generated signals resemble the raw signals as closely as possible, then, in accordance with an equivalence principle, we can claim that the obtained signals faithfully retain the inherent properties of single neurons.

      Reviewer #4 appears to focus on the compression (mixing) process without giving equal consideration to the reconstruction (de-mixing) process. Numerous studies have demonstrated that deep autoencoders can reconstruct the original signal very effectively. For example, in the field of image denoising, autoencoders are capable of accurately restoring the original image [2, 3]. If one focuses only on the fact of mixing and ignores the reconstruction (de-mixing) process, then even a high score on the only criterion available at the signal level would not be accepted as evidence. If this were the case, many problems would become unsolvable. For instance, a fundamental criterion for latent variable models is their ability to explain the original data. If the ground truth of the latent variables remains unknown and the reconstruction criterion is disregarded, how can we validate the effectiveness of the model, the validity of the latent variables, or ensure that findings related to latent variables are not merely by-products of the model? Therefore, we disagree with the aforementioned notion. We believe that as long as the reconstruction performance is satisfactory, the extracted signals have successfully retained the characteristics of individual neurons.

      In our paper, we have shown in various ways that our generated signals sufficiently resemble the raw signals, including visualizing neuronal activity (Fig. 2m, Fig. 3i, and Fig. S5), achieving the highest performance among competitors (Fig. 2d, h, l), and conducting control analyses. Therefore, we believe our results are reliable. 

      (1) Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      (2) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (3) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      (2) There is no reason for d-VAE to add signals that do not exist in the original signals.

      (1) Adding signals that do not exist in the small R2 neurons would decrease the reconstruction performance. This is because if the added signals contain significant information, they will not resemble the irrelevant signals, which contain no information, and thus the generated signals will not resemble the raw signals. The model optimizes towards reducing the reconstruction loss, and this scenario deviates from the model's optimization direction. It is worth mentioning that when the model has only the reconstruction loss, without the interference of the decoding loss, we believe that information leakage does not happen, because the model can only be optimized in a direction that makes the generated signals similar to the raw signals; adding non-existent signals to the generated signals would increase the reconstruction loss, which is contrary to the objective of optimization.

      (2) Information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population, which does not benefit the decoding loss.

      Based on these two points, we believe the model would not perform such counterproductive and harmful operations.

      (3) The criterion that irrelevant signals should contain minimal information can effectively rule out the leakage scenario.

      The criterion that irrelevant signals should contain minimal information is very important, but it seems that Reviewer #4 has continuously overlooked its significance. If the model's reconstruction is insufficient, or if additional information is added (which we do not believe will happen), a large amount of information could be decoded from the residuals, and this criterion would exclude such signals from selection. To clarify, assume that x, y, and z denote the raw, relevant, and irrelevant signals of a smaller R2 neuron, with x=y+z. If the extracted relevant signals become y+m, the irrelevant signals become z-m. Consequently, the irrelevant signals contain a significant amount of information.
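
      The x = y + z argument above can be checked numerically on simulated data. The sketch below is a toy with one hypothetical neuron and a one-dimensional behavior; the signal amplitudes, the leaked component `m`, and the helper `decoding_r2` are illustrative assumptions, not quantities from the real datasets.

```python
import numpy as np


def decoding_r2(signal, behavior):
    """In-sample R^2 of least-squares linear decoding of `behavior` from a
    one-dimensional `signal` (with intercept)."""
    design = np.column_stack([signal, np.ones_like(signal)])
    coef, *_ = np.linalg.lstsq(design, behavior, rcond=None)
    residual = behavior - design @ coef
    return 1.0 - residual.var() / behavior.var()


rng = np.random.default_rng(0)
n_samples = 5000
behavior = rng.standard_normal(n_samples)

y = 0.1 * behavior                  # weak relevant signal of a small R2 neuron
z = rng.standard_normal(n_samples)  # irrelevant signal, independent of behavior
x = y + z                           # raw signal

m = 1.0 * behavior                  # hypothetical information leaked from other neurons
residual_faithful = x - y           # equals z
residual_leaky = x - (y + m)        # equals z - m

r2_faithful = decoding_r2(residual_faithful, behavior)
r2_leaky = decoding_r2(residual_leaky, behavior)
print(r2_faithful, r2_leaky)  # near 0 without leakage, clearly above 0 with leakage
```

      In this toy case, any added component m reappears with opposite sign in the residual, so behavior becomes decodable from the supposedly irrelevant signals.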

      We presented the decoding R2 for irrelevant signals in real datasets under three distillation scenarios: a bias towards reconstruction (alpha=0, an extreme case where the model only has reconstruction loss without decoding loss), a balanced trade-off, and a bias towards decoding (alpha=0.9), as detailed in Table 1. If significant information from small R2 neurons leaks from large R2 neurons, the irrelevant signals should contain a large amount of information. However, our results indicate that the irrelevant signals contain only minimal information, and their performance closely resembles that of the model training solely with reconstruction loss, showing no significant differences (P > 0.05, Wilcoxon rank-sum test). When the model leans towards decoding, some useful information will be left in the residuals, and irrelevant signals will contain a substantial amount of information, as observed in Table 1, alpha=0.9. Therefore, we will not choose these signals for analysis.

      In conclusion, the criterion that irrelevant signals should contain minimal information is a very effective measure to exclude undesirable signals.

      Author response table 1.

      Decoding R2 of irrelevant signals

      (4) Synthetic experiments can effectively rule out the leakage scenario.

      In the absence of ground truth data, synthetic experiments serve as an effective method for validating models and are commonly employed [1-3]. 

      Our experimental results demonstrate that d-VAE can effectively extract neural signals that more closely resemble actual behaviorally relevant signals (Fig. S2g).  If there were information leakage, it would decrease the similarity to the ground truth signals, hence we have ruled out this possibility. Moreover, in synthetic experiments with small R2 neurons (Fig. S10), results also demonstrate that our model could make these neurons more closely resemble ground truth relevant signals and recover their information. 

      In summary, synthetic experiments strongly demonstrate that our model can recover obscured neuronal information, rather than adding signals that do not exist.

      (1) Pnevmatikakis, Eftychios A., et al. "Simultaneous denoising, deconvolution, and demixing of calcium imaging data." Neuron 89.2 (2016): 285-299.

      (2) Schneider, Steffen, Jin Hwa Lee, and Mackenzie Weygandt Mathis. "Learnable latent embeddings for joint behavioural and neural analysis." Nature 617.7960 (2023): 360-368.

      (3) Zhou, Ding, and Xue-Xin Wei. "Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE." Advances in Neural Information Processing Systems 33 (2020): 7234-7247.

      Based on these four points, we are confident in the reliability of our results. If Reviewer #4 considers these points insufficient, we would highly appreciate it if specific concerns regarding any of these aspects could be detailed.

      Thank you for your valuable feedback.

      Q5: “Given the nuances involved in appropriate comparisons across methods and since two of the datasets are public, the authors should provide their complete code (not just the dVAE method code), including the code for data loading, data preprocessing, model fitting and model evaluation for all methods and public datasets. This will alleviate concerns and allow readers to confirm conclusions (e.g., figure 2) for themselves down the line.”

      Thanks for your suggestion.

      Our codes are now available on GitHub at https://github.com/eric0li/d-VAE. Thank you for your valuable feedback.

      Q6: “Related to 1) above, the authors should explore the results if the affine network h(.) (from embedding to behavior) was replaced with a nonlinear ANN. Perhaps linear decoders would no longer be as close to nonlinear decoders. Regardless, the claim of linearity should be revised as described in 1) and 2) above, and all caveats should be discussed.”

      Thank you for your suggestion. We appreciate your feasible proposal that can be empirically tested. Following your suggestion, we have replaced the decoding of the latent variable z to behavior y with a nonlinear neural network, specifically a neural network with a single hidden layer. The modified model is termed d-VAE2. We applied d-VAE2 to the real data and selected the optimal alpha through the validation set. As shown in Author response table 2, the results demonstrate that the performance of KF and ANN remains comparable. Therefore, the capacity to linearly decode behaviorally relevant signals does not stem from the linear decoding of embeddings.

      Author response table 2.

      Decoding R2 of behaviorally relevant signals obtained by d-VAE2

      Additionally, it is worth noting that this approach is uncommon and is considered somewhat inappropriate according to the Information Bottleneck theory [1]. According to the Information Bottleneck theory, information is progressively compressed in multilayer neural networks, discarding what is irrelevant to the output and retaining what is relevant. This means that as the number of layers increases, the mutual information between each layer's embedding and the model input gradually decreases, while the mutual information between each layer's embedding and the model output gradually increases. For the decoding part, if embeddings that are not closest to the output (behaviors) are used, then these embeddings might contain behaviorally irrelevant signals. Using these embeddings to generate behaviorally relevant signals could lead to the inclusion of irrelevant signals in the behaviorally relevant signals.

      To demonstrate the above statement, we conducted experiments on the synthetic data. As shown in Author response table 3, we present the performance (neural R2 between the generated signals and the ground truth signals) of both models at several alpha values around the optimal alpha of d-VAE (alpha=0.9) selected by the validation set. The experimental results show that, at the same alpha value, the performance of d-VAE2 is consistently inferior to that of d-VAE. In addition, d-VAE2 requires a higher alpha value to achieve performance comparable to d-VAE, and the best performance of d-VAE2 is inferior to that of d-VAE.

      Author response table 3.

      Neural R2 between generated signals and real behaviorally relevant signals

      Thank you for your valuable feedback.

      (1) Shwartz-Ziv, Ravid, and Naftali Tishby. "Opening the black box of deep neural networks via information." arXiv preprint arXiv:1703.00810 (2017).

      Q7: “The beginning of the section on the "smaller R2 neurons" should clearly define what R2 is being discussed. Based on the response to previous reviewers, this R2 "signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals". This should be mentioned and made clear in the main text whenever this R2 is referred to.”

      Thank you for your suggestion. We have made the modifications in the main text. Thank you for your valuable feedback.

      Q8: “Various terms require clear definitions. The authors sometimes use vague terminology (e.g., "useless") without a clear definition. Similarly, discussions regarding dimensionality could benefit from more precise definitions. How is neural dimensionality defined? For example, how is "neural dimensionality of specific behaviors" (line 590) defined? Related to this, I agree with Reviewer 2 that a clear definition of irrelevant should be mentioned that clarifies that relevance is roughly taken as "correlated or predictive with a fixed time lag". The analyses do not explore relevance with arbitrary time lags between neural and behavior data.”

      Thanks for your suggestion. We have removed the “useless” statements and have revised the statement of “the neural dimensionality of specific behaviors” in our revised manuscript.

      Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [1-3]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [4-6]. To clarify the definition, we have revised the definition in our manuscript. For details, please refer to the response to Q2 of reviewer #2 and our revised manuscript. We believe our definition is clearly articulated.

      Thank you for your valuable feedback.

      (1) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain– machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (2) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (3) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239. 

      Q9: “CEBRA itself doesn't provide a neural reconstruction from its embeddings, but one could obtain one via a regression from extracted CEBRA embeddings to neural data. In addition to decoding results of CEBRA (figure S3), the neural reconstruction of CEBRA should be computed and CEBRA should be added to Figure 2 to see how the behaviorally relevant and irrelevant signals from CEBRA compare to other methods.”

      Thank you for your question. Modifying CEBRA is beyond the scope of our work. As CEBRA is not a generative model, it cannot obtain behaviorally relevant and irrelevant signals, and therefore it lacks the results presented in Fig. 2. To avoid the same confusion encountered by reviewers #3 and #4 among our readers, we have opted to exclude the comparison with CEBRA. It is crucial to note, as previously stated, that our assessment of decoding capabilities has been benchmarked against the performance of the ANN on raw signals, which almost represents the upper limit of performance. Consequently, omitting CEBRA does not affect our conclusions.

      Thank you for your valuable feedback.

      Q10: “Line 923: "The optimal hyperparameter is selected based on the lowest averaged loss of five-fold training data." => why is this explained specifically under CEBRA? Isn't the same criteria used for hyperparameters of other methods? If so, clarify.”

      Thank you for your question. The hyperparameter selection for CEBRA follows the practice of the original CEBRA paper. The hyperparameter selection for generative models is detailed in the Section “The strategy for selecting effective behaviorally-relevant signals”.  Thank you for your valuable feedback.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.  

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to Reviewer #1 Public Review #3 below

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below. 

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below). 

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study. 

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. That is, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition,  we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.” 
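The unique and common effects discussed above can be illustrated with a minimal sketch. This is not the code used in the study: the data are synthetic, and the helper names (`r2`, `unique_effects`) are hypothetical. It assumes the usual commonality-analysis definition, in which the unique effect of a regressor is the full-model R2 minus the R2 of the model without that regressor, and the total common effect is whatever remains of the full-model R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def r2(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on the columns of X."""
    return LinearRegression().fit(X, y).score(X, y)


def unique_effects(predictors, y):
    """Return the full-model R^2 and each predictor's unique effect.

    Unique effect = R^2(full model) - R^2(model omitting that predictor).
    """
    names = list(predictors)
    full = r2(np.column_stack([predictors[k] for k in names]), y)
    unique = {}
    for k in names:
        rest = np.column_stack([predictors[m] for m in names if m != k])
        unique[k] = full - r2(rest, y)
    return full, unique


# Synthetic data, for illustration only (not HCP-A data).
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
signal = rng.normal(size=n)  # age-independent brain signal related to cognition
brain_age = age + rng.normal(scale=0.5, size=n)
brain_cognition = -0.6 * age + signal + rng.normal(scale=0.5, size=n)
fluid = -0.6 * age + 0.5 * signal + rng.normal(scale=0.7, size=n)

full_r2, unique = unique_effects(
    {"age": age, "brain_age": brain_age, "brain_cognition": brain_cognition},
    fluid,
)
# Everything not attributable to a single regressor is shared (common) variance.
common_total = full_r2 - sum(unique.values())
```

In this toy example, Brain Age adds almost nothing once chronological age is in the model, whereas Brain Cognition retains a unique effect. The "one-third" figure quoted above follows from the reported numbers: 0.11 / 0.32 ≈ 0.34.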

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation 

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally. 

      Thank you for your comments on this issue. 

      We now discuss the following broader considerations in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders, 

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and 

      (3) suggested solutions we and others made to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance, combining different MRI modalities into the prediction models, similar to our stacked models, often leads to the highest performance of age prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore underfitted models when applying prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age prediction models from MRI data of largely healthy participants and apply the built age prediction models to participants who are also largely healthy. Accordingly, the age prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study into just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder. 

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest.

      Accordingly, this calls for a change in research practice, as recently pointed out by Jirsaraie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to better understand the stacked regression models used to ensure that these models are not overfit. 

      Thank you for allowing us an opportunity to clarify our stacked model. We made additional clarification to make this clearer (see below). We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models.  

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features),  “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. As in the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into five new inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. As in the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values. 

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
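The three performance metrics named above can be sketched as follows. This is an illustration only: the function name `evaluate` and the toy numbers are hypothetical. The example uses predictions that are perfectly correlated with the observed values but offset by a constant, to show why the sum-of-squares R2 is not the same thing as squared Pearson’s r.

```python
import numpy as np


def evaluate(observed, predicted):
    """Pearson's r, sum-of-squares R^2 and mean absolute error (MAE)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]
    ss_res = np.sum((observed - predicted) ** 2)        # sum of squares residuals
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(observed - predicted))
    return {"r": r, "r2": r2, "mae": mae}


obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = obs + 10.0  # perfectly correlated with obs, but biased by +10
scores = evaluate(obs, pred)
# r is 1, yet R^2 = 1 - 400/5 = -79 and MAE = 10: R^2 penalises the bias, r does not.
```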

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models. 

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
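The three steps described above can be sketched in a few lines. This is a simplified illustration rather than our actual pipeline: it uses synthetic data, two hypothetical sets of features (`set_a`, `set_b`), a single stacked model and a deliberately tiny hyperparameter grid so that it runs quickly. It follows our reading of the procedure, in which the stacked model is trained on predictions that the tuned first-level models make for the same outer-fold training set.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Tiny illustrative grid (an assumption; far smaller than the grid described in Methods).
GRID = {"alpha": [0.1, 1.0, 10.0], "l1_ratio": [0.1, 0.5, 1.0]}


def tune(X, y, seed):
    """Tune an Elastic Net with five-fold inner CV and return the refitted best model."""
    inner = KFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(ElasticNet(max_iter=10000), GRID, cv=inner, scoring="r2")
    return search.fit(X, y).best_estimator_


def nested_stacked_cv(feature_sets, y, seed=0):
    """Return out-of-sample stacked predictions for every participant."""
    names = list(feature_sets)
    stacked_pred = np.full(len(y), np.nan)
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train, test in outer.split(np.arange(len(y))):
        # Step 1: tune one model per set of features, using the outer-fold training set only.
        base = {k: tune(feature_sets[k][train], y[train], seed) for k in names}
        # Step 2: predicted values on the same training set become the stacked features;
        # a different seed re-divides the training set into new inner folds.
        z_train = np.column_stack([base[k].predict(feature_sets[k][train]) for k in names])
        stack = tune(z_train, y[train], seed + 1)
        # Step 3: apply the already tuned models to the untouched outer-fold test set.
        z_test = np.column_stack([base[k].predict(feature_sets[k][test]) for k in names])
        stacked_pred[test] = stack.predict(z_test)
    return stacked_pred


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 200
feature_sets = {"set_a": rng.normal(size=(n, 10)), "set_b": rng.normal(size=(n, 10))}
y = feature_sets["set_a"][:, 0] + feature_sets["set_b"][:, 0] + rng.normal(scale=0.5, size=n)
brain_pred = nested_stacked_cv(feature_sets, y)
```

The outer-fold test indices never enter `tune`, which is the property that keeps the test sets out of model building at both levels.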

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits? 

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found that Spearman’s ρ varied dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 principal components, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model.  The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.  
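The rank-stability computation can be sketched as below. This is a minimal illustration with synthetic coefficient vectors standing in for the Elastic Net coefficients of the five outer folds; the function name `rank_stability` is hypothetical. Five folds give 5 × 4 / 2 = 10 fold pairs, hence 10 values of Spearman’s ρ per prediction model.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def rank_stability(coefs_by_fold):
    """Spearman's rho between the coefficient vectors of every pair of folds."""
    rhos = []
    for a, b in combinations(range(len(coefs_by_fold)), 2):
        rho, _ = spearmanr(coefs_by_fold[a], coefs_by_fold[b])
        rhos.append(rho)
    return np.array(rhos)


# Synthetic stand-in: one shared coefficient pattern plus fold-specific noise.
rng = np.random.default_rng(0)
true_coef = rng.normal(size=50)
coefs = [true_coef + rng.normal(scale=0.3, size=50) for _ in range(5)]
rhos = rank_stability(coefs)
mean_rho = rhos.mean()  # analogous to the number printed to the right of each plot
```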

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.  

      Thank you for the opportunity to provide more methodological details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009). 

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a button to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go]. 

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the left or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.” 

      Second, for MRI processing procedures, we included the following statements.

      From Methods:

      “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 18 sets of features.”

      “Sets of Features 1-10: Task fMRI contrast (Task Contrast)

      Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linearly detrended (see https://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned from potential artifacts using ICA-FIX (Glasser et al., 2016). 

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high pass cutoff at 200s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and the Freesurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features.”

      “Sets of Features 11-13: Task fMRI functional connectivity (Task FC)

      Task FC reflects functional connectivity (FC) among the brain regions during each task, which is considered an important source of individual differences (Elliott et al., 2019; Fair et al., 2007; Gratton et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors on FSL, and regressed them in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliott et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note that, to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task. 

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC)

      Similar to Task FC, Rest FC reflects functional connectivity (FC) among the brain regions, except that Rest FC occurred during the resting (as opposed to task-performing) period. HCP-A collected Rest FC from four 6.42-min (488 frames) runs across two days, leading to 26-min long data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was preprocessed and concatenated across the four runs. We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) as for Task FC. 

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established preprocessing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey matter volume, “FS_Tot_WM_Vol” or total cortical white matter volume, “FS_SubCort_GM_Vol” or total subcortical grey matter volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”
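The functional connectivity features described above for Task FC and Rest FC (Pearson’s correlations between region pairs, r-to-z transformation, then PCA fitted on the training set only) can be sketched as follows. This is an illustration under stated assumptions, not our processing code: it uses random time series, a small number of regions and components so that it runs quickly, and a hypothetical helper `fc_vector`; the clipping before the Fisher transformation is added here purely for numerical safety.

```python
import numpy as np
from sklearn.decomposition import PCA

# With 379 regions, the number of non-overlapping region pairs is 379 * 378 / 2.
n_regions_full = 379
n_pairs_full = n_regions_full * (n_regions_full - 1) // 2


def fc_vector(timeseries):
    """(n_timepoints, n_regions) -> Fisher z-transformed upper-triangle correlations."""
    corr = np.corrcoef(timeseries, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)        # each region pair once, no diagonal
    r = np.clip(corr[upper], -0.999999, 0.999999)  # guard against arctanh(+/-1)
    return np.arctanh(r)                           # r-to-z transformation


# Small synthetic example: 40 participants, 100 time points, 20 regions.
rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_regions_demo = 40, 100, 20
fc = np.stack(
    [fc_vector(rng.normal(size=(n_timepoints, n_regions_demo))) for _ in range(n_subjects)]
)

# To avoid data leakage, the PCA is defined on the training set only
# and then applied, unchanged, to the test set.
train, test = np.arange(30), np.arange(30, 40)
pca = PCA(n_components=5).fit(fc[train])  # 75 components in the text; 5 in this sketch
fc_train_pcs = pca.transform(fc[train])
fc_test_pcs = pca.transform(fc[test])
```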

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par with, if not better than, many non-linear and more complicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us the ability to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below). 

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the features’ coefficients is determined by a shrinkage hyperparameter ‘α’: the greater the α, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘ℓ1 ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; ℓ1 ratio=0) or absolute (known as ‘Lasso’; ℓ1 ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

      $$\min_{\beta}\; \frac{1}{2\,n_{\text{samples}}}\,\lVert y - X\beta \rVert_2^2 \;+\; \alpha \cdot \ell_1\text{ratio} \cdot \lVert \beta \rVert_1 \;+\; \frac{1}{2}\,\alpha \cdot (1 - \ell_1\text{ratio}) \cdot \lVert \beta \rVert_2^2$$

      where X is the features, y is the target, β is the coefficient and n_samples is the number of observations. In our grid search, we tuned two Elastic Net hyperparameters: α using 70 numbers in log space, ranging from .1 to 100, and ℓ1 ratio using 25 numbers in linear space, ranging from 0 to 1.
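The hyperparameter grid and the objective function can be sketched as below. This is an illustration, not our analysis code: the data are synthetic, the helper `elastic_net_objective` is hypothetical, and the search runs on a thinned version of the grid so that it finishes quickly. The objective is written following the scikit-learn documentation for `ElasticNet`, with its `alpha` as the shrinkage hyperparameter and its `l1_ratio` as the ℓ1 ratio; the pure Ridge end of the grid (`l1_ratio=0`) is left out of the demo search because coordinate descent can be slow to converge there.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Grid as described in the text: 70 values in log space and 25 values in linear space.
alphas = np.logspace(-1, 2, 70)    # .1 to 100
l1_ratios = np.linspace(0, 1, 25)  # 0 (Ridge) to 1 (Lasso)


def elastic_net_objective(X, y, coef, intercept, alpha, l1_ratio):
    """Elastic Net objective as documented for sklearn.linear_model.ElasticNet."""
    resid = y - X @ coef - intercept
    return (
        (resid @ resid) / (2 * X.shape[0])
        + alpha * l1_ratio * np.abs(coef).sum()
        + 0.5 * alpha * (1 - l1_ratio) * (coef @ coef)
    )


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# Thinned grid for the demo: every 10th alpha, and l1_ratio in {.25, .5, .75, 1}.
param_grid = {"alpha": alphas[::10], "l1_ratio": l1_ratios[6::6]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=cv, scoring="r2").fit(X, y)
best = search.best_estimator_

fitted_value = elastic_net_objective(X, y, best.coef_, best.intercept_, best.alpha, best.l1_ratio)
null_value = elastic_net_objective(X, y, np.zeros(10), y.mean(), best.alpha, best.l1_ratio)
```

The fitted coefficients should give an objective value no larger than that of the intercept-only model, which is a quick sanity check that the formula matches what the estimator minimises.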

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weighs relatively more strongly in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross validation, different outer folds may have different degrees of ‘α’ and ‘ℓ1 ratio’, making the final coefficients differ across folds. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘α’ leads to similar predictive performance), resulting in different ‘α’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without splitting them into five folds and visualised the coefficients on brain images using Brainspace (Vos De Wael et al., 2020) and Nilearn (Abraham et al., 2014) packages. Note, unlike other sets of features, Task FC and Rest FC were modelled after data reduction via PCA. Thus, for Task FC and Rest FC, we, first, multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and, then, summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices.
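The projection of Elastic Net coefficients from PCA space back to ROI pairs can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the sizes are small stand-ins (190 features and 5 components instead of 71,631 and 75), the hyperparameter values are arbitrary, and "absolute PCA scores" is read here as the absolute values of the rows of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet

# Synthetic stand-in for an FC matrix: 60 participants x 190 ROI pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 190))
y = X[:, 0] + rng.normal(scale=0.5, size=60)

# Model fitted to the full dataset after PCA, as done for visualisation only.
pca = PCA(n_components=5).fit(X)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(pca.transform(X), y)

# components_ has shape (n_components, n_features); coef_ has shape (n_components,).
# Multiply each component's absolute loadings by its coefficient, then sum over components.
weighted = np.abs(pca.components_) * enet.coef_[:, None]
roi_pair_importance = weighted.sum(axis=0)  # one value per ROI pair
```

The result has one importance value per original ROI pair, which is what allows the coefficients of PCA-reduced models to be drawn on brain images alongside those of the other sets of features.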

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikitlearn. Frontiers in Neuroinformatics, 8, 14. hdps://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. hdps://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. https://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Elliott, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffitt, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. https://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. https://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. https://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175–1187. https://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. https://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. https://doi.org/10.1093/cercor/bhu239

      Gratton, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. https://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapretto, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. https://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. Patterns, 4(4), 100712. https://doi.org/10.1016/j.patter.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain-based predictive models mediate the relationships between childhood cognition and socio-demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. https://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Hutter, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. https://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapretto, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. https://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. https://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. https://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain-cognition relationship: Integrating task-based fMRI across tasks markedly boosts prediction and test-retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. https://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. https://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

    1. Author response:

      The following is the authors’ response to the original reviews.

      Responses to Reviewer’s Comments:  

      To Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C. 

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>C-modified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified to unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as m<sup>5</sup>C reader; however, none of the experiments have been repeated by another laboratory. The other manuscript is reporting on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PAR-CLIP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.

      Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi:10.1038/cr.2017.55, and supports the conclusion that wild-type ALYREF is not a specific m<sup>5</sup>C binder. The necessity of always including an overexpression of ALYREF-mut in parallel DRAM experiments makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins etc.).

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55 (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C-modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C-modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C-modified RNAs. However, our detection system has certain limitations. Despite mutating the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discuss these limitations in lines 109-114.

      In addition, while the m<sup>5</sup>C assay can be performed using the DRAM system alone, comparing it with the DRAM<sup>mut</sup> control enhances the accuracy of m<sup>5</sup>C region detection. To minimize variation in transfection efficiency across experimental groups, it is recommended to use the same batch of transfections. This approach not only ensures more consistent results but also improves the standardization of the DRAM assay, as discussed in the section added in lines 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN-mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in reduced DRAM editing.

      This confounds the interpretation of our experimental data.

      As demonstrated in Author response image 1A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of fusion proteins. In addition, we attempted in vitro fusion protein purification using multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Author response image 1B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Author response image 1C), indicating that further optimization is required. This issue is further discussed in lines 314-315.

      Taken together, the above evidence indicates that the sodium arsenite experiment was confounded, and we decided to remove the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in a way other than Excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening high-confidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

      Responses to Reviewer’s Comments:  

      To Reviewer #3:

      The authors have again tried to address the former concern by this reviewer who questioned the specificity of both m<sup>5</sup>C reader proteins towards modified RNA rather than unmodified RNA. The authors chose to do RNA pull down experiments which serve as a proxy for proving the specificity of ALYREF and YBX1 for m<sup>5</sup>C modified RNAs. Even though this reviewer asked for determining the enrichment factor of the reader-base editor fusion proteins (as wildtype or mutant for the identified m<sup>5</sup>C specificity motif) when presented with m<sup>5</sup>C-modified RNAs, the authors chose to use both reader proteins alone (without the fusion to an editor) as wildtype and as respective m<sup>5</sup>C-binding mutant in RNA in vitro pull-down experiments along with unmodified and m<sup>5</sup>C-modified RNA oligomers as binding substrates. The quantification of these pull-down experiments (n=2) have now been added, and are revealing that (according to SFigure 1 E and G) YBX1 enriches an RNA containing a single m<sup>5</sup>C by a factor of 1.3 over its unmodified counterpart, while ALYREF enriches by a factor of 4x. This is an acceptable approach for educated readers to question the specificity of the reader proteins, even though the quantification should be performed differently (see below).

      Given that there is no specific sequence motif embedding those cytosines identified in the vicinity of the DRAM-edits (Figure 3J and K), even though it has been accepted by now that most of the m<sup>5</sup>C sites in mRNA are mediated by NSUN2 and NSUN6 proteins, which target tRNA like substrate structures with a particular sequence enrichment, one can conclude that DRAM-Seq is uncovering a huge number of false positives. This must be so not only because of the RNA bisulfite seq data that have been extensively studied by others, but also by the following calculations: Given that the m<sup>5</sup>C/C ratio in human mRNA is 0.02-0.09% (measured by mass spec) and assuming that 1/4 of the nucleotides in an average mRNA are cytosines, an mRNA of 1,000 nucleotides would contain 250 Cs. 0.02-0.09% m<sup>5</sup>C/C would then translate into 0.05-0.225 methylated cytosines per 250 Cs in a 1,000 nt mRNA. YBX1 would bind every C in such an mRNA since there is no m<sup>5</sup>C to be expected, which it could bind with 1.3x higher affinity. Even if the mRNAs would be 10,000 nt long, YBX1 would bind to half a methylated cytosine or 2.25 methylated cytosines with 1.3x higher affinity than to all the remaining cytosines (2499.5 to 2497.75 of 2,500 cytosines in 10,000 nt, respectively). These numbers indicate a 4999x to 1110x excess of cytosine over m<sup>5</sup>C in any substrate RNA, which the "reader" can bind as shown in the RNA pull-downs on unmodified RNAs. This reviewer spares the reader of this review the calculations for ALYREF specificity, which is slightly higher than YBX1. Hence, it is up to the capable reader of these calculations to follow the claim that this minor affinity difference allows the unambiguous detection of the few m<sup>5</sup>C sites in mRNA be it in the endogenous scenario of a cell or as fusion-protein with a base editor attached? 
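
      The arithmetic in the comment above can be reproduced in a few lines. This is only a sketch of the reviewer's back-of-the-envelope estimate: the function name `expected_m5c` is hypothetical, and the cytosine fraction of 1/4 is the assumption stated in the comment.

```python
def expected_m5c(length_nt, ratio_percent, c_fraction=0.25):
    """Return (cytosines, expected m5C count, unmodified-C : m5C excess).

    length_nt:      transcript length in nucleotides
    ratio_percent:  m5C/C ratio in percent (0.02 to 0.09 in the comment)
    c_fraction:     assumed share of cytosines among all nucleotides
    """
    cytosines = length_nt * c_fraction
    methylated = cytosines * ratio_percent / 100
    excess = (cytosines - methylated) / methylated
    return cytosines, methylated, excess


# The four scenarios discussed in the comment.
for length in (1_000, 10_000):
    for ratio in (0.02, 0.09):
        c, m, x = expected_m5c(length, ratio)
        print(f"{length} nt, {ratio}% m5C/C: {c:.0f} Cs, "
              f"{m:.3f} m5C expected, {x:.0f}x excess of unmodified C")
```

      Under these assumptions the excess of unmodified cytosine over m<sup>5</sup>C depends only on the m<sup>5</sup>C/C ratio, not on transcript length.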

      We sincerely appreciate the reviewer’s rigorous analysis. We would like to clarify that in our RNA pulldown assays, we indeed utilized the full DRAM system (reader protein fused to the base editor) to reflect the specificity of m<sup>5</sup>C recognition. As previously suggested by the reviewer, to independently validate the m<sup>5</sup>C-binding specificity of ALYREF and YBX1, we performed separate pulldown experiments with wild-type and mutant reader proteins (without the base editor fusion) using both unmodified and m<sup>5</sup>C-modified RNA substrates. This approach aligns with established methodologies in the field (doi:10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y). We have revised the Methods section (line 230) to explicitly describe this experimental design.

      Although the m<sup>5</sup>C/C ratios in LC/MS-assayed mRNA are relatively low (ranging from 0.02% to 0.09%), as noted by the reviewer, both our data and previous studies have demonstrated that ALYREF and YBX1 preferentially bind to m<sup>5</sup>C-modified RNAs over unmodified RNAs, exhibiting 4-fold and 1.3-fold enrichment, respectively (Supplementary Figure 1E–1G). Importantly, this specificity is further enhanced in the DRAM system through two key mechanisms: first, the fusion of reader proteins to the deaminase restricts editing to regions near m<sup>5</sup>C sites, thereby minimizing off-target effects; second, background editing observed in reader-mutant or deaminase controls (e.g., DRAM<sup>mut</sup>-CBE in Figure 2D) is systematically corrected for during data analysis.

      We agree that the vast excess of unmodified cytosines poses a theoretical challenge. However, our approach includes stringent controls to alleviate this issue. Specifically, sites identified in NSUN2/NSUN6 knockout cells or reader-mutant controls are excluded (Figure 3F), which significantly reduces the number of false-positive detections. Additionally, we have observed deamination changes near high-confidence m<sup>5</sup>C methylation sites detected by RNA bisulfite sequencing, both in first-generation and high-throughput sequencing data. This observation further substantiates the validity of DRAM-Seq in accurately identifying m<sup>5</sup>C sites.

      We fully acknowledge that residual false positives may persist due to the inherent limitations of reader protein specificity, as discussed in lines 299-301 of our manuscript. To address this, we plan to optimize reader domains with enhanced m<sup>5</sup>C binding (e.g., through structure-guided engineering), as already noted in the Discussion of the manuscript.

      The reviewer supports the attempt to visualize the data. However, the usefulness of this Figure addition as a readable presentation of the data included in the supplement is up to debate.

      Thank you for your kind suggestion. We understand the reviewer's concern regarding data visualization. However, due to the large volume of DRAM-seq data, it is challenging to present each mutation site and its characteristics clearly in a single figure. Therefore, we chose to categorize the data by chromosome, which not only allows for a more organized presentation of the DRAM-seq data but also facilitates comparison with other database entries. Additionally, we have updated Supplementary Tables 2 and 3 to provide comprehensive information on the mutation sites. We hope that both the reviewer and editors will understand this approach. We will, of course, continue to carefully consider the reviewer's suggestions and explore better ways to present these results in the future.

      (3) A set of private Recommendations for the Authors that outline how you think the science and its presentation could be strengthened

      NEW COMMENTS to TEXT:

      Abstract:

      "5-Methylcytosine (m<sup>5</sup>C) is one of the major post-transcriptional modifications in mRNA and is highly involved in the pathogenesis of various diseases."

      In light of the increasing use of AI-based writing, and the proof that neither DeepSeek nor ChatGPT write truthful statements if they collect metadata from scientific abstracts, this sentence is utterly misleading.

      m<sup>5</sup>C is not one of the major post-transcriptional modifications in mRNA as it is only present with a m<sup>5</sup>C/C ratio of 0.02-0.09% as measured by mass-spec. Also, if m<sup>5</sup>C is involved in the pathogenesis of various diseases, it is not through mRNA but tRNA. No single published work has shown that a single m<sup>5</sup>C on an mRNA has anything to do with disease. Every conclusion that is perpetuated by copying the false statements given in the many reviews on the subject is based on knock-out phenotypes of the involved writer proteins. This reviewer wishes that the authors would abstain from the common practice that is currently flooding any scientific field through relentless repetitions in the increasing volume of literature which perpetuate alternative facts.

      We sincerely appreciate the reviewer’s insightful comments. While we acknowledge that m<sup>5</sup>C is not the most abundant post-transcriptional modification in mRNA, we believe that research into m<sup>5</sup>C modification holds considerable value. Numerous studies have highlighted its role in regulating gene expression and its potential contribution to disease progression. For example, recent publications have demonstrated that m<sup>5</sup>C modifications in mRNA can influence cancer progression, lipid metabolism, and other pathological processes (e.g., PMID: 37845385; 39013911; 39924557; 38042059; 37870216).

      We fully agree with the reviewer on the importance of maintaining scientific rigor in academic writing. While m<sup>5</sup>C is not the most abundant RNA modification, we cannot simply draw a conclusion that the level of modification should be the sole criterion for assessing its biological significance. However, to avoid potential confusion, we have removed the word “major”.

      COMMENTS ON FIGURE PRESENTATION:

      Figure 2D:

      The main text states: "DRAM-CBE induced C to U editing in the vicinity of the m<sup>5</sup>C site in AP5Z1 mRNA, with 13.6% C-to-U editing, while this effect was significantly reduced with APOBEC1 or DRAM<sup>mut</sup>-CBE (Fig.2D)." The Figure does not fit this statement. The seq trace shows a U signal of about 1/3 of that of C (about 30%), while the quantification shows 20+ percent

      Thank you for your kind suggestion. Upon visual evaluation, the sequencing trace in the figure appears to suggest a mutation rate closer to 30% rather than 22%. However, relying solely on the visual interpretation of sequencing peaks is not a rigorous approach. The trace on the left represents the visualization of Sanger sequencing results using SnapGene, while the quantification on the right is derived from EditR 1.0.10 software analysis of three independent biological replicates. The C-to-U mutation rates calculated were 22.91667%, 23.23232%, and 21.05263%, respectively. To further validate this, we have included the original EditR analysis of the Sanger sequencing results for the DRAM-CBE group used in the left panel of Figure 2D (see Author response image 2). This analysis confirms an m<sup>5</sup>C fraction (%) of 22/(22+74) = 22.91667, and the sequencing trace aligns well with the mutation rate we reported in Figure 2D. In conclusion, the data and conclusions presented in Figure 2D are consistent and supported by the quantitative analysis.
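
      The percentage quoted above follows from a simple ratio of base-call signal at the target position. The sketch below only reproduces that arithmetic; it is not EditR's code, the function name `edit_fraction` is hypothetical, and the values 22 and 74 are taken from the response as the U (read as T) and C signals.

```python
def edit_fraction(edited_signal, unedited_signal):
    """Percentage of signal at a position that comes from the edited base."""
    return 100 * edited_signal / (edited_signal + unedited_signal)


# Replicate shown in the left panel of Figure 2D: 22 / (22 + 74).
rate = edit_fraction(22, 74)

# The three replicate rates reported in the response, and their mean.
replicates = [22.91667, 23.23232, 21.05263]
mean_rate = sum(replicates) / len(replicates)

print(f"single replicate: {rate:.5f}%")
print(f"mean of three replicates: {mean_rate:.5f}%")
```

      Averaging the three reported replicates gives a value close to 22%, consistent with the bar shown in the quantification panel.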

      Author response image 2.

      Figure 4B: shows now different numbers in Venn-diagrams than in the same depiction, formerly Figure 4A

      We sincerely thank the reviewer for pointing out this issue, and we apologize for not clearly indicating the changes in the previous version of the manuscript. In response to the initial round of reviewer comments, we implemented a more stringent data filtering process (as described in Figure 3F and the Methods section): "For high-confidence filtering, we further adjusted the parameters of Find_edit_site.pl to include an edit ratio of 10%–60%, a requirement that the edit ratio in control samples be at least 2-fold higher than in NSUN2 or NSUN6 knockout samples, and at least 4 editing events at a given site." As a result, we made minor adjustments to the Venn diagram data in Figure 4A, reducing the total number of DRAM-edited mRNAs from 11,977 to 10,835. These changes were consistently applied throughout the manuscript, and the modifications have been highlighted for clarity. Importantly, these adjustments do not affect any of the conclusions presented in the manuscript.
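
      The three filtering criteria quoted above can be expressed as a small predicate. This is a minimal sketch only: the `Site` record, its field names, and the inclusive boundaries are assumptions made for illustration, and it does not reproduce the interface or internals of Find_edit_site.pl.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical per-site record; field names are illustrative."""
    edit_ratio: float      # edit ratio in the control sample (0 to 1)
    ko_edit_ratio: float   # edit ratio in the NSUN2 or NSUN6 knockout sample
    edit_events: int       # number of editing events observed at the site


def is_high_confidence(site: Site,
                       min_ratio: float = 0.10,
                       max_ratio: float = 0.60,
                       min_fold: float = 2.0,
                       min_events: int = 4) -> bool:
    """Apply the three quoted criteria; inclusive bounds are an assumption."""
    in_window = min_ratio <= site.edit_ratio <= max_ratio
    enriched = site.edit_ratio >= min_fold * site.ko_edit_ratio
    supported = site.edit_events >= min_events
    return in_window and enriched and supported


sites = [Site(0.25, 0.05, 6), Site(0.25, 0.20, 6), Site(0.70, 0.05, 6)]
kept = [s for s in sites if is_high_confidence(s)]
```

      Only the first of the three example sites passes: the second fails the 2-fold requirement and the third falls outside the 10%–60% window.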

      Figure 4B and D: while the overlap of the DRAM-Seq data with RNA bisulfite data might be 80% or 92%, it is obvious that the remaining data DRAM seq suggests a detection of additional sites of around 97% or 81.83%. It would be advised to mention this large number of additional sites as potential false positives, unless these data were normalized to the sites that can be allocated to NSUN2 and NSUN6 activity (NSUN mutant data sets could be subtracted).

      Thank you for pointing this out. The Venn diagrams presented in Figure 4B and D already reflect the exclusion of potential false-positive sites identified in methyltransferase-deficient datasets, as described in our experimental filtering process, and they represent the remaining sites after this stringent filtering. However, we acknowledge that YBX1 and ALYREF, while preferentially binding to m<sup>5</sup>C-modified RNA, also exhibit some affinity for unmodified RNA. Although we employed rigorous controls, including DRAM<sup>mut</sup> and deaminase groups, to minimize false positives, the possibility of residual false positives cannot be entirely ruled out. Addressing this limitation would require even more stringent filtering methods, as discussed in lines 299–301 of the manuscript. We are committed to further optimizing the DRAM system to enhance the accuracy of transcriptome-wide m<sup>5</sup>C analysis in future studies.

      SFigure 1: It is clear that the wild type version of both reader proteins are robustly binding to RNA that does not contain m<sup>5</sup>C. As for the calculations of x-fold affinity loss of RNA binding using both ALYREF -mut or YBX1 -mut, this reviewer asks the authors to determine how much less the mutated versions of the proteins bind to a m<sup>5</sup>C-modified RNAs. Hence, a comparison of YBX1 versus YBX1 -mut (ALYREF versus ALYREF -mut) on the same substrate RNA with the same m<sup>5</sup>C-modified position would allow determining the contribution of the so-called modification binding pocket in the respective proteins to their RNA binding. The way the authors chose to show the data presently is misleading because what is compared is the binding of either the wild type or the mutant protein to different RNAs.

      We appreciate the reviewer’s valuable feedback and apologize for any confusion caused by the presentation of our data. We would like to clarify the rationale behind our approach. The decision to present the wild-type and mutant reader proteins in separate panels, rather than together, was made in response to comments from Reviewer 2. Below, we provide a detailed explanation of our experimental design and its justification.

      First, we confirmed that YBX1 and ALYREF exhibit stronger binding affinity to m<sup>5</sup>C-modified RNA compared to unmodified RNA, establishing their role as m<sup>5</sup>C reader proteins. Next, to validate the functional significance of the DRAM<sup>mut</sup> group, we demonstrated that mutating key amino acids in the m<sup>5</sup>C-binding pocket significantly reduces the binding affinity of YBX1<sup>mut</sup> and ALYREF<sup>mut</sup> to m<sup>5</sup>C-modified RNA. This confirms that the DRAM<sup>mut</sup> group effectively minimizes false-positive results by disrupting specific m<sup>5</sup>C interactions.

      Crucially, in our pull-down experiments, both the wild-type and mutant proteins (YBX1/YBX1<sup>mut</sup> and ALYREF/ALYREF<sup>mut</sup>) were incubated with the same RNA sequences. To avoid any ambiguity, we have included the specific RNA sequence information in the Methods section (lines 463–468). This ensures a direct assessment of the reduced binding affinity of the mutant versions relative to the wild-type proteins, even though they are presented in separate panels.

      We hope this explanation clarifies our approach and demonstrates the robustness of our findings. We sincerely appreciate the reviewer’s understanding and hope this addresses their concerns.

      SFigure 2C: first two panels are duplicates of the same image.

      Thank you for pointing this out. We sincerely apologize for incorrectly duplicating the images. We have now updated Supplementary Figure 2C with the correct panels and have provided the original flow cytometry data for the first two images. It is important to note that, as demonstrated by the original data analysis, the EGFP-positive quantification values (59.78% and 59.74%) remain accurate. Therefore, this correction does not affect the conclusions of our study. Thank you again for bringing this to our attention.

      Author response image 3.

      SFigure 4B: how would the PCR product for NSUN6 be indicative of a mutation? The used primers seem to amplify the wildtype sequence.

      Thank you for your kind suggestion. In our NSUN6<sup>-/-</sup> cell line, the NSUN6 gene is missing only a single base pair (1 bp) compared to the wild type, which results in a frameshift mutation and a reduction in NSUN6 protein expression. We fully agree with the reviewer that the current PCR gel electrophoresis does not provide a clear distinction of this 1 bp mutation. To better illustrate our experimental design, we have included a schematic representation of the knockout sequence in SFigure 4B. Additionally, we have provided the original sequencing data, and the corresponding details have been added to lines 151-153 of the manuscript for further clarification.

      Author response image 4.

      SFigure 4C: the Figure legend is insufficient to understand the subfigure.

      Thank you for your valuable suggestion. To improve clarity, we have revised the figure legend for SFigure 4C, as well as the corresponding text in lines 178-179. We have additionally updated the title of SFigure 4 for better clarity. The updated SFigure 4C now demonstrates that the DRAM-edited mRNAs exhibit a high degree of overlap across the three biological replicates.

      SFigure 4D: the Figure legend is insufficient to understand the subfigure.

      Thank you for your kind suggestion. We have revised the figure legend to provide a clearer explanation of the subfigure. Specifically, this figure illustrates the motif analysis derived from sequences spanning 10 nucleotides upstream and downstream of DRAM-edited sites mediated by loci associated with NSUN2 or NSUN6. To enhance clarity, we have also rephrased the relevant results section (lines 169-175) and the corresponding discussion (lines 304-307).

      SFigure 7: There is something off with all 6 panels. This reviewer can find data points in each panel that do not show up on the other two panels even though this is a pairwise comparison of three data sets (file was sent to the Editor) Available at https://elife-rp.msubmit.net/elife-rp_files/2025/01/22/00130809/02/130809_2_attach_27_15153.pdf

      Response: We thank the reviewer for pointing this out. We would like to clarify the methodology behind this analysis. In this study, we conducted pairwise comparisons of the number of DRAM-edited sites per gene across three biological replicates of DRAM-ABE or DRAM-CBE, visualized as scatterplots. Each data point in the plots corresponds to a gene, and while the same gene is represented in all three panels, its position may vary vertically or horizontally across the panels. This variation arises because the number of mutation sites typically differs between replicates, making it unlikely for a data point to occupy the exact same position in all panels. A similar analytical approach has been used in previous studies on m6A (PMID: 31548708). To address the reviewer’s concern, we have annotated the corresponding positions of the questioned data points with arrows in Author response image 5.
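
      The point about data points moving between panels can be illustrated with a minimal sketch. The gene names and site counts below are invented placeholders, not data from the study; the sketch only shows how one gene maps to a different coordinate in each pairwise panel.

```python
from itertools import combinations

# Hypothetical number of DRAM-edited sites per gene in replicates 1, 2 and 3.
site_counts = {
    "geneA": (5, 7, 4),
    "geneB": (2, 2, 3),
}


def pairwise_points(counts):
    """Map each replicate pair to the (x, y) point of every gene."""
    panels = {}
    for i, j in combinations(range(3), 2):
        panels[(i + 1, j + 1)] = {
            gene: (c[i], c[j]) for gene, c in counts.items()
        }
    return panels


panels = pairwise_points(site_counts)
for pair, points in panels.items():
    print(f"replicate {pair[0]} vs {pair[1]}: {points}")
```

      In this toy example, geneA appears at (5, 7), (5, 4) and (7, 4) in the three panels, so the same gene never occupies the same position twice unless its counts agree across replicates.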

      Author response image 5.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      The authors investigated the role of the C. elegans Flower protein, FLWR-1, in synaptic transmission, vesicle recycling, and neuronal excitability. They confirmed that FLWR-1 localizes to synaptic vesicles and the plasma membrane and facilitates synaptic vesicle recycling at neuromuscular junctions. They observed that hyperstimulation results in endosome accumulation in flwr-1 mutant synapses, suggesting that FLWR-1 facilitates the breakdown of endocytic endosomes. Using tissue-specific rescue experiments, the authors showed that expressing FLWR-1 in GABAergic neurons restored the aldicarb-resistant phenotype of flwr-1 mutants to wild-type levels. By contrast, cholinergic neuron expression did not rescue aldicarb sensitivity at all. They also showed that FLWR-1 removal leads to increased Ca<sup>2+</sup> signaling in motor neurons upon photo-stimulation. From these findings, the authors conclude that FLWR-1 helps maintain the balance between excitation and inhibition (E/I) by preferentially regulating GABAergic neuronal excitability in a cell-autonomous manner. 

      Overall, the work presents solid data and interesting findings, however the proposed cell-autonomous model of GABAergic FLWR-1 function may be overly simplified in my opinion. 

      Most of my previous comments have been addressed; however, two issues remain. 

      (1) I appreciate the authors' efforts conducting additional aldicarb sensitivity assays that combine muscle-specific rescue with either cholinergic or GABAergic neuron-specific expression of FLWR-1. In the revised manuscript, they conclude, "This did not show any additive effects to the pure neuronal rescues, thus FLWR-1 effects on muscle cell responses to cholinergic agonists must be cell-autonomous." However, I find this interpretation confusing for the reasons outlined below. 

      Figure 1 - Figure Supplement 3B shows that muscle-specific FLWR-1 expression in flwr-1 mutants significantly restores aldicarb sensitivity. However, when FLWR-1 is co-expressed in both cholinergic neurons and muscle, the worms behave like flwr-1 mutants and no rescue is observed. Similarly, cholinergic FLWR-1 alone fails to restore aldicarb sensitivity (shown in the previous manuscript).

      These data are still shown in the manuscript (Fig. 3D). We interpreted the result of the muscle/cholinergic co-rescue experiment to mean that FLWR-1 in cholinergic neurons over-compensates, so that worms should be resistant and the rescuing effect of muscle FLWR-1 is therefore cancelled. However, the reviewer is right: if this were the case, why does the pure cholinergic rescue not show over-compensation? We added a sentence to acknowledge this inconsistency, and a further sentence in the Discussion (see also comment (1) of Reviewer #2 below).

      These observations indicate a non-cell-autonomous interaction between cholinergic neurons and muscle, rather than a strictly muscle cell-autonomous mechanism. In other words, FLWR-1 expressed in cholinergic neurons appears to negate or block the rescue effect of muscle-expressed FLWR-1. Therefore, FLWR-1 could play a more complex role in coordinating physiology across different tissues. This complexity may affect interpretations of Ca<sup>2+</sup> dynamics and/or functional data, particularly in relation to E/I balance, and thus warrants careful discussion or further investigation. 

      For the Ca<sup>2+</sup> dynamics, we think the effects of flwr-1 are likely very immediate, as the imaging assay relies on a sensor expressed directly in the neurons or muscles under study, and not on indirect phenotypes such as muscle contraction and behavior, which depend on an interplay of several cell types influencing each other.

      (2) The revised manuscript includes new GCaMP analyses restricted to synaptic puncta. The authors mention that "we compared Ca<sup>2+</sup> signals in synaptic puncta versus axon shafts, and did not find any differences," concluding that "FLWR-1's impact is local, in synaptic boutons." This is puzzling: the similarity of Ca<sup>2+</sup> signals in synaptic regions and axon shafts seems to indicate a more global effect on Ca<sup>2+</sup> dynamics or may simply reflect limited temporal resolution in distinguishing local from global signals due to rapid Ca<sup>2+</sup> diffusion. The authors should clarify how they reached the conclusion that FLWR-1 has a localized impact at synaptic boutons, given that synaptic and axonal signals appear similar. Based on the presented data, the evidence supporting a local effect of FLWR-1 on Ca<sup>2+</sup> dynamics appears limited.

      We apologize; we simply overlooked this misleading wording in our rebuttal letter. The data we mentioned, showing no obvious difference between axon and bouton, are shown below, including time constants for the onset and the offset of the stimulus (data are peak-normalized for better visualization):

      Author response image 1.

      One can see that axonal Ca<sup>2+</sup> signals may rise a bit more slowly than synaptic Ca<sup>2+</sup> signals, as expected for Ca<sup>2+</sup> entering the boutons and then diffusing out into the axon. The loss of FLWR-1 does not affect this. However, the GCaMP6f sensor used takes ca. 200 ms to reach peak, and its decay time (to t1/2) is ca. 400 ms (PMID: 23868258). Thus, it would be difficult to see effects based on Ca<sup>2+</sup> diffusion using this assay. The decay is similar for axon and synapse, while flwr-1 mutants do not reduce Ca<sup>2+</sup> as much as wt. In the axon, the reduction appears slightly slower in flwr-1 mutants; however, given the kinetics of the sensor, this is likely not a meaningful difference. Therefore, we wrote that we did not find differences. The interpretation should not have been that the impact of FLWR-1 is local. It may be true if one could image this at faster time scales, i.e. if there is more FLWR-1 localized in boutons (as indicated by our data showing FLWR-1 enrichment in boutons; Fig. 3), and when considering its possible effect on MCA-3 localization (and assuming that MCA-3 is the active player in Ca<sup>2+</sup> removal), i.e. FLWR-1 recruiting MCA-3 to boutons (Fig. 9C, D).

      Reviewer #2 (Public review): 

      Summary: 

      The Flower protein is expressed in various cell types, including neurons. Previous studies in flies have proposed that Flower plays a role in neuronal endocytosis by functioning as a Ca<sup>2+</sup> channel. However, its precise physiological roles and molecular mechanisms in neurons remain largely unclear. This study employs C. elegans as a model to explore the function and mechanism of FLWR-1, the C. elegans homolog of Flower. This study offers intriguing observations that could potentially challenge or expand our current understanding of the Flower protein. Nevertheless, further clarification or additional experiments are required to substantiate the study's conclusions. 

      Strengths: 

      A range of approaches was employed, including the use of a flwr-1 knockout strain, assessment of cholinergic synaptic activity via analyzing aldicarb (a cholinesterase inhibitor) sensitivity, imaging Ca<sup>2+</sup> dynamics with GCaMP3, analyzing pHluorin fluorescence, examination of presynaptic ultrastructure by EM, and recording postsynaptic currents at the neuromuscular junction. The findings include notable observations on the effects of flwr-1 knockout, such as increased Ca<sup>2+</sup> levels in motor neurons, changes in endosome numbers in motor neurons, altered aldicarb sensitivity, and potential involvement of a Ca<sup>2+</sup>-ATPase and PIP2 binding in FLWR-1's function. 

      The authors have adequately addressed most of my previous concerns, however, I recommend minor revisions to further strengthen the study's rigor and interpretation: 

      Major suggestions 

      (1) This study relies heavily on aldicarb assays to support its conclusions. While these assays are valuable, their results may not fully align with direct assessment of neurotransmitter release from motor neurons. For instance, prior work has shown that two presynaptic modulators identified through aldicarb sensitivity assays exhibited no corresponding electrophysiological defects at the neuromuscular junction (Liu et al., J Neurosci 27: 10404-10413, 2007). Similarly, at least one study from the Kaplan lab has noted discrepancies between aldicarb assays and electrophysiological analyses. The authors should consider adding a few sentences in the Discussion to acknowledge this limitation and the potential caveats of using aldicarb assays, especially since some of the aldicarb assay results in this study are not easily interpretable. 

      Aldicarb assays have been used very successfully in identifying mutants with defects in chemical synaptic transmission, and entire genetic screens have been conducted this way. The reviewer is right that one needs to realize that it is the balance of excitation and inhibition at the NMJ of C. elegans that underlies the effects on the rate of aldicarb-induced paralysis, not just cholinergic transmission. That is, if a given mutant affects cholinergic and GABAergic transmission differently, things become difficult to interpret, particularly if muscle physiology is also affected. Therefore, we combined mutant analyses with cell-type specific rescue. We acknowledge that results are nonetheless difficult to interpret. We thus added a sentence in the first paragraph of the discussion.

      (2) The manuscript states, "Elevated Ca<sup>2+</sup> levels were not further enhanced in a flwr-1;mca-3 double mutant." (lines 549-550). However, Figure 7C does not include statistical comparisons between the single and double mutants of flwr-1 and mca-3. Please add the necessary statistical analysis to support this statement. 

      We only marked significant differences in that figure; n.s. was not shown. This was stated in the figure legend.

      (3) The term "Ca<sup>2+</sup> influx" should be avoided, as this study does not provide direct evidence (e.g. voltage-clamp recordings of Ca<sup>2+</sup> inward currents in motor neurons) for an effect of the flwr-1 mutation on Ca<sup>2+</sup> influx. The observed increase in neuronal GCaMP signals in response to optogenetic activation of ChR2 may result from, or be influenced by, Ca<sup>2+</sup> mobilization from intracellular stores. For example, optogenetic stimulation could trigger ryanodine receptor-mediated Ca<sup>2+</sup> release from the ER via calcium-induced calcium release (CICR) or depolarization-induced calcium release (DICR). It would be more appropriate to describe the observed increase in Ca<sup>2+</sup> signal as "Ca<sup>2+</sup> elevation" rather than increased "Ca<sup>2+</sup> influx". 

      Yes, we can do this. By ‘influx’ we referred to Ca<sup>2+</sup> that fluxes into the cytosol, be it from the internal stores or from the extracellular space. To our understanding, extracellular influx will more or less inevitably trigger further influx from internal stores. We changed this to “elevated Ca<sup>2+</sup> levels” or “Ca<sup>2+</sup> level rise” or “Ca<sup>2+</sup> level increase”.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      A thorough discussion on the impact of cell-autonomous versus non-cell-autonomous effects is necessary. 

      Revise and clarify the distinction between local and global Ca²⁺ changes. 

      see above.

      Reviewer #2 (Recommendations for the authors): 

      Minor suggestions 

      (1) In "Fwe-Ubi was shown to facilitate recovery of neurons following intense synaptic activity (Yao et al.,....." (lines 283-284), please specify which aspects of neuronal recovery are influenced by the Flower protein. 

      We added “refilling of SV pools”.

      (2) The abbreviation "Fwe-Ubi" is used for the Drosophila Flower protein (e.g., line 283, Figure 1A, and Figure 8A). Please clarify what "Ubi" stands for and verify whether its inclusion in the protein name is appropriate.

      This is inconsistent across the literature, sometimes Fwe-Ubi is also referred to as FweA. We now added this term. Ubi refers to ubiquitous (“Therefore, we named this isoform fweubi because it is expressed ubiquitously in imaginal discs“) (Rhiner 2010)

      (3) The manuscript uses "pflwr-1" (line 303 and elsewhere) to denote the flwr-1 promoter. This notation could be misleading, as it may be interpreted as a gene name. Please consider using either "flwr-1p" or "Pflwr-1" instead. Additionally, ensure proper italicization of gene names throughout the manuscript. 

      We changed this throughout. We will italicize gene names at proof stage; it would be too time-consuming to spot these instances now.

      (4) The authors tagged the C-terminus of FLWR-1 with GFP (line 321). The fusion protein is referred to as "GFP::FLWR-1" throughout the manuscript. Please verify whether "FLWR-1::GFP" would be the more appropriate designation.

      Thank you, yes, we changed this in the text, GFP is indeed N-terminal.

      (5) In "This did not show any additive effects...." (line 363), please clarify what "This" refers to. 

      Altered to “The combined rescues did not show any additive effects…”

      (6) In "..., supporting our previous finding of increased neurotransmitter release in GABAergic neurons" (lines 412-413), please provide a citation for the referenced previous study.

      This refers to our aldicarb data within this paper, just further up in the text. We removed “previous”.

      (7) Figure 4C, D examines the effect of flwr-1 mutation on body length in the genetic background of the unc-29 mutation, which selectively disrupts the levamisole-sensitive acetylcholine receptor. Please comment on the rationale for implicating only the levamisole receptor rather than the nicotinic acetylcholine receptor in muscle cells. 

      This was because we used a behavioral assay. Although the homopentameric ACR-16/N-AChR mediates about 2/3 of the peak currents in response to acute ACh application to the NMJ (e.g. Almedom et al., EMBO J, 2009), the acr-16 mutant has virtually no behavioral / locomotion phenotype. Likely, this is because the heteropentameric, UNC-29-containing L-AChR, while contributing only 1/3 of the peak current, desensitizes much more slowly, and thus unc-29 mutants show a severe behavioral phenotype (uncoordinated locomotion, etc.). We therefore did not expect a major effect when performing the behavioral assay in acr-16 mutants and chose the unc-29 mutant background.

      (8) In "we found no evidence ....insertion into the PM (Yao et al., 2009)", it appears that the cited paper was not authored by any of the authors of the current manuscript. Please confirm whether this citation is correctly attributed. 

      This sentence was arranged in a misleading way; we did not mean that we authored this paper. It was changed in the text: “While a facilitating role of Flower in endocytosis appears to be conserved in C. elegans, in contrast to previous findings from Drosophila (Yao et al., 2009), we found no evidence that FLWR-1 conducts Ca<sup>2+</sup> upon insertion into the PM.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      As to the exceptionally minor issue, namely, correction for multiple statistical tests (minor because the data and the error are presented in the text), we have now conducted one-way ANOVA to back the data displayed in Fig. 4A and Supp. Figs 19 and 21. In each case ANOVA revealed a highly significant difference among means; Dunnett’s post hoc test was then used to test each result against SBW25, with the multiple comparisons corrected for in the analysis.

      This resulted in changes to the description of the statistical analysis in the following captions:

      To Figure 4.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 8.19, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that five genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 19.

      Where we previously referred to paired t-tests we now state: ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 16.74, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that three genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 21.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,89</sub> = 9.97, p < 0.0001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that SBW25 ∆mreB and SBW25 ∆PFLU4921-4925 are significantly different (*) from SBW25 (p < 0.05).
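
      For readers who want to see where the reported degrees of freedom come from, the following is a minimal pure-Python sketch of the one-way ANOVA F statistic. The input values are invented placeholders rather than the study's measurements, and Dunnett's post-hoc test is not implemented here. The layout of eight genotypes with three replicates each is an inference from the reported F<sub>7,16</sub> and the three biological replicates mentioned for Fig. 4A, not something the captions state outright.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Small worked example: group means 2, 3 and 6.
f_small, df1_small, df2_small = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Placeholder layout: 8 genotypes x 3 replicates gives df = (7, 16).
placeholder = [[g + 0.1 * r for r in range(3)] for g in range(8)]
_, df1, df2 = one_way_anova(placeholder)
print(df1, df2)
```

      With 97 observations across eight groups, the same formula gives the degrees of freedom (7, 89) reported for Supplementary Figure 21. Recent SciPy releases provide `scipy.stats.f_oneway` and `scipy.stats.dunnett` for the full analysis, including the comparison of each genotype against a control.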


      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors performed experimental evolution of MreB mutants that have a slow-growing round phenotype and studied the subsequent evolutionary trajectory using analysis tools from molecular biology. It was remarkable and interesting that they found that the original phenotype was not restored (most common in these studies) but that the round phenotype was maintained. 

      Strengths: 

      The finding that the round phenotype was maintained during evolution rather than that the original phenotype, rod-shaped cells, was recovered is interesting. The paper extensively investigates what happens during adaptation with various different techniques. Also, the extensive discussion of the findings at the end of the paper is well thought through and insightful. 

      Weaknesses: 

      I find there are three general weaknesses: 

      (1) Although the paper states in the abstract that it emphasizes "new knowledge to be gained" it remains unclear what this concretely is. On page 4 they state three research questions; these could be more extensively discussed in the abstract. Also, these questions read more like genetics questions while the paper is a lot about cell biological findings. 

      Thank you for drawing attention to the unnecessary and gratuitous nature of the last sentence of the Abstract. We are in agreement. It has been modified, and we have taken  advantage of additional word space to draw attention to the importance of the two competing (testable) hypotheses laid out in the Discussion. 

      As to new knowledge, please see the Results and particularly the Discussion. But beyond this, and as recognised by others, there is real value for cell biology in seeing how (and whether) selection can compensate for effects that are deleterious to fitness. The results will very often depart from those delivered from, for example, suppressor analyses or bottom-up engineering. 

      In the work recounted in our paper, we chose to focus – by way of proof-of-principle – on the most commonly observed mutations, namely, those within pbp1A. But beyond this gene, we detected mutations in other components of the cell shape / division machinery whose connections are not yet understood and which are the focus of on-going investigation.

      As to the three questions posed at the end of the Introduction, the first concerns whether selection can compensate for deleterious effects of deleting mreB (a question that pertains to evolutionary aspects); the second seeks understanding of genetic factors; the third aims to shed light on the genotype-to-phenotype map (which is where the cell biology comes into play). Given space restrictions, we cannot see how we could usefully expand, let alone discuss, the three questions raised at the end of the Introduction in the restricted space available in the Abstract.

      (2) It is not clear to me from the text what we already know about the restoration of MreB loss from suppressors studies (in the literature). Are there suppressor screens in the literature and which part of the findings is consistent with suppressor screens and which parts are new knowledge?  

      As stated in the Introduction, in a previous study with B. subtilis (which harbours three MreB isoforms and where the isoform named “MreB” is essential for growth under normal conditions), suppressors of MreB lethality were found to occur in ponA, a class A penicillin binding protein (Kawai et al., 2009). This led to recognition that MreB plays a role in recruiting Pbp1A to the lateral cell wall. On the other hand, Patel et al. (2020) have shown that deletion of class A PBPs leads to an up-regulation of rod complex activity. Although there is a connection between rod complex and class A PBPs, a further study has shown that the two systems work semi-autonomously (Cho et al., 2016). 

      Our work confirms a connection between MreB and Pbp1A, and has shed new light on how this interaction is established by means of natural selection, which targets the integrity of the cell wall. Indeed, the Rod complex and class A PBPs have complementary activities in the building of the cell wall, with each of the two systems able to compensate for the other in order to maintain cell wall integrity. Please see the major part of the Discussion. In terms of specifics, the connection between mreB and pbp1A (shown by Kawai et al. (2009)) is indirect because it is based on extragenic transposon insertions. In our study, the genetic connection is mechanistically demonstrated. In addition, we show that the evolutionary dynamics are rapid, and we have enriched understanding of the genotype-to-phenotype map.

      (3) The clarity of the figures, captions, and data quantification need to be improved.  

      Modifications have been implemented. Please see responses to specific queries listed below.

      Reviewer #2 (Public Review): 

      Yulo et al. show that deletion of MreB causes reduced fitness in P. fluorescens SBW25 and that this reduction in fitness may be primarily caused by alterations in cell volume. To understand the effect of cell volume on proliferation, they performed an evolution experiment through which they predominantly obtained mutations in pbp1A that decreased cell volume and increased viability. Furthermore, they provide evidence to propose that the pbp1A mutants may have decreased PG cross-linking which might have helped in restoring the fitness by rectifying the disorganised PG synthesis caused by the absence of MreB. Overall this is an interesting study. 

      Queries: 

      Do the small cells of mreB null background indeed have no DNA? It is not apparent from the DAPI images presented in Supplementary Figure 17. A more detailed analysis will help to support this claim. 

      It is entirely possible that small cells have no DNA, because if cell division is aberrant then division can occur prior to DNA segregation resulting in cells with no DNA. It is clear from microscopic observation that both small and large cells do not divide. It is, however, true, that we are unable to state – given our measures of DNA content – that small cells have no DNA. We have made this clear on page 13, paragraph 2.

      What happens to viability and cell morphology when pbp1A is removed in the mreB null background? If it is actually a decrease in pbp1A activity that leads to the rescue, then pbp1A- mreB- cells should have better viability, reduced cell volume and organised PG synthesis. Especially as the PG cross-linking is almost at the same level as the T362 or D484 mutant.  

      Please see fitness data in Supp. Fig. 13. Fitness of ∆mreBpbp1A is no different to that caused by a point mutation. Cells remain round.  

      What is the status of PG cross-linking in ΔmreB Δpflu4921-4925 (Line 7)? 

      This was not analysed as the focus of this experiment was PBPs. A priori, there is no obvious reason to suspect that ∆4921-25 (which lacks oprD) would be affected in PBP activity.

      What is the morphology of the cells in Line 2 and Line 5? It may be interesting to see if PG cross-linking and cell wall synthesis is also altered in the cells from these lines. 

      The focus of investigation was restricted to L1, L4 and L7. Indeed, it would be interesting to look at the mutants harbouring mutations in ftsZ, but this is beyond the scope of the present investigation (although this work is ongoing). The morphology of L2 and L5 is shown in Supp. Fig. 9.

      The data presented in 4B should be quantified with appropriate input controls. 

      Band intensity has now been quantified (see new Supp. Fig. 20). The controls are SBW25, SBW25 ∆pbp1A, SBW25 ∆mreB and SBW25 ∆mreBpbp1A as explained in the paper.

      What are the statistical analyses used in 4A and what is the significance value? 

      Our oversight. These were reported in Supp. Fig. 19, but should also have been presented in Fig. 4A. Data are means of three biological replicates. Each mutant was compared with SBW25 using a paired t-test.

      A more rigorous statistical analysis indicating the number of replicates should be done throughout. 

      We have checked and made additions where necessary and where previously lacking. In particular, details are provided in Fig. 1E, Fig. 4A and Fig. 4B. For Fig. 4C we have produced quantitative measures of heterogeneity in new cell wall insertion. These are reported in Supp. Fig. 21 (and referred to in the text and figure caption) and show that patterns of cell wall insertion in ∆mreB are highly heterogeneous.

      Reviewer #3 (Public Review): 

      This paper addresses an understudied problem in microbiology: the evolution of bacterial cell shape. Bacterial cells can take a range of forms, among the most common being rods and spheres. The consensus view is that rods are the ancestral form and spheres the derived form. The molecular machinery governing these different shapes is fairly well understood but the evolutionary drivers responsible for the transition between rods and spheres are not. Enter Yulo et al.'s work. The authors start by noting that deletion of a highly conserved gene called MreB in the Gram-negative bacterium Pseudomonas fluorescens reduces fitness but does not kill the cell (as happens in other species like E. coli and B. subtilis) and causes cells to become spherical rather than their normal rod shape. They then ask whether evolution for 1000 generations restores the rod shape of these cells when propagated in a rich, benign medium. 

      The answer is no. The evolved lineages recovered fitness by the end of the experiment, growing just as well as the unevolved rod-shaped ancestor, but remained spherical. The authors provide an impressively detailed investigation of the genetic and molecular changes that evolved. Their leading results are: 

      (1) The loss of fitness associated with MreB deletion causes high variation in cell volume among sibling cells after cell division. 

      (2) Fitness recovery is largely driven by a single, loss-of-function point mutation that evolves within the first ~250 generations that reduces the variability in cell volume among siblings. 

      (3) The main route to restoring fitness and reducing variability involves loss of function mutations causing a reduction of TPase and peptidoglycan cross-linking, leading to a disorganized cell wall architecture characteristic of spherical cells. 

      The inferences made in this paper are on the whole well supported by the data. The authors provide a uniquely comprehensive account of how a key genetic change leads to gains in fitness and the spectrum of phenotypes that are impacted and provide insight into the molecular mechanisms underlying models of cell shape. 

      Suggested improvements and clarifications include: 

      (1) A schematic of the molecular interactions governing cell wall formation could be useful in the introduction to help orient readers less familiar with the current state of knowledge and key molecular players. 

      We understand that this would be desirable, but there are numerous recent reviews with detailed schematics that the interested reader would be better served by consulting. These are referenced in the text.

      (2) More detail on the bioinformatics approaches to assembling genomes and identifying the key compensatory mutations are needed, particularly in the methods section. This whole subject remains something of an art, with many different tools used. Specifying these tools, and the parameter settings used, will improve transparency and reproducibility, should it be needed. 

      We overlooked providing this detail, which has now been corrected by provision of more information in the Materials and Methods. In short, we used breseq, with the clonal option and default parameters. Additional analyses were conducted using Geneious. The breseq output files are provided at https://doi.org/10.17617/3.CU5SX1 (which include all read data).

      (3) Corrections for multiple comparisons should be used and reported whenever more than one construct or strain is compared to the common ancestor, as in Supplementary Figure 19A (relative PG density of different constructs versus the SBW25 ancestor). 

      The data presented in Supp Fig 19A (and Fig 4A) do not involve multiple comparisons. In each instance the comparison is between SBW25 and each of the different mutants. A paired t-test is thus appropriate.

      (4) The authors refrain from making strong claims about the nature of selection on cell shape, perhaps because their main interest is the molecular mechanisms responsible. However, I think more can be said on the evolutionary side, along two lines. First, they have good evidence that cell volume is a trait under strong stabilizing selection, with cells of intermediate volume having the highest fitness. This is notable because there are rather few examples of stabilizing selection where the underlying mechanisms responsible are so well characterized. Second, this paper succeeds in providing an explanation for how spherical cells can readily evolve from a rod-shaped ancestor but leaves open how rods evolved in the first place. Can the authors speculate as to how the complex, coordinated system leading to rods first evolved? Or why not all cells have lost rod shape and become spherical, if it is so easy to achieve? These are important evolutionary questions that remain unaddressed. The manuscript could be improved by at least flagging these as unanswered questions deserving of further attention. 

      These are interesting points, but our capacity to comment is entirely speculative. Nonetheless, we have added an additional paragraph to the Discussion that expresses an opinion that has yet to receive attention:

      “Given the complexity of the cell wall synthesis machinery that defines rod-shape in bacteria, it is hard to imagine how rods could have evolved prior to cocci. However, the cylindrical shape offers a number of advantages. For a given biomass (or cell volume), shape determines the surface area of the cell envelope, with the spherical shape having the smallest surface area. As shape sets the surface/volume ratio, it also determines the ratio between supply (proportional to the surface) and demand (proportional to cell volume). From this point of view, it is more efficient to be cylindrical (Young 2006). This also holds for surface attachment and biofilm formation (Young 2006). But above all, for growing cells, the ratio between supply and demand is constant in rod shaped bacteria, whereas it decreases for cocci. This requires that spherical cells evolve complex regulatory networks capable of maintaining the correct concentration of cellular proteins despite changes in surface/volume ratio. From this point of view, rod-shaped bacteria offer opportunities to develop unsophisticated regulatory networks.”
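
      The surface/volume argument in the quoted paragraph can be made concrete with an idealized calculation. This is a sketch and is not taken from the manuscript: it assumes a rod of fixed radius r that grows only in length L (end caps neglected) and a coccus that grows by increasing its radius R. The symbols r, L and R are introduced here purely for illustration.

```latex
\[
\left.\frac{S}{V}\right|_{\text{rod}} \approx \frac{2\pi r L}{\pi r^{2} L} = \frac{2}{r},
\qquad
\left.\frac{S}{V}\right|_{\text{sphere}} = \frac{4\pi R^{2}}{\tfrac{4}{3}\pi R^{3}} = \frac{3}{R}.
\]
```

      Under these assumptions the ratio is independent of L for the elongating rod, whereas it falls as 1/R for the growing sphere.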

      Regarding why not all cells have lost rod shape and become spherical: please see Kevin Young’s 2006 review on the adaptive significance of cell shape.

      The value of this paper stems both from the insight it provides on the underlying molecular model for cell shape and from what it reveals about some key features of the evolutionary process. The paper, as it currently stands, provides more on which to chew for the molecular side than the evolutionary side. It provides valuable insights into the molecular architecture of how cells grow and what governs their shape. The evolutionary phenomena emphasized by the authors - the importance of loss-of-function mutations in driving rapid compensatory fitness gains and that multiple genetic and molecular routes to high fitness are often available, even in the relatively short time frame of a few hundred generations - are well understood phenomena and so arguably of less broad interest. The more compelling evolutionary questions concern the nature and cause of stabilizing selection (in this case cell volume) and the evolution of complexity. The paper misses an opportunity to highlight the former and, while claiming to shed light on the latter, provides rather little useful insight. 

      Thank you for these thoughts and comments. However, we disagree that the experimental results are an overlooked opportunity to discuss stabilising selection. Stabilising selection occurs when selection favours a particular phenotype causing a reduction in underpinning population-level genetic diversity. This is not happening when selection acts on SBW25 ∆mreB leading to a restoration of fitness. Driving the response are biophysical factors, primarily the critical need to balance elongation rate with rate of septation. This occurs without any change in underlying genetic diversity.  

      Recommendations for the authors:  

      Reviewer 1 (Recommendations for the Authors): 

      Hereby my suggestion for improvement of the quantification of the data, the figures, and the text. 

      -  p 14, what is the unit of elongation rate?  

      At first mention, we have made clear that the unit is min^-1.

      -  p 14, please give an error bar for both p=0.85 and f=0.77, to be able to conclude they are different 

      Error on the probability p is estimated at the 95% confidence level by the formula $1.96\sqrt{p(1-p)/N}$, where N is the total number of cells. This has been added to the paragraph on the probability p in the Image Analysis section of the Materials and Methods.

      We also added errors on p measurement in the main text.
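
      For illustration, the error estimate on p can be sketched as below. The sketch assumes the interval is the standard normal-approximation (Wald) interval for a binomial proportion, 1.96·sqrt(p(1−p)/N); the function name and the example counts are hypothetical and are not taken from the manuscript.

```python
import math


def wald_ci_95(successes, n_cells):
    """Return (p, half_width) for a binomial proportion.

    Assumes the normal-approximation (Wald) interval:
    half_width = 1.96 * sqrt(p * (1 - p) / N).
    """
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    if not 0 <= successes <= n_cells:
        raise ValueError("successes must lie between 0 and n_cells")
    p = successes / n_cells
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n_cells)
    return p, half_width


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call.
    p, err = wald_ci_95(successes=850, n_cells=1000)
    print(f"p = {p:.3f} +/- {err:.3f}")
```

      Two estimates of p can then be compared by checking whether their intervals overlap. The approximation becomes unreliable when N is small or p is close to 0 or 1.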

      -  p 14, all the % differences need an errorbar 

      The error bars and means are given in Fig 3C and 3D.

      -  Figure 1B adds units to compactness, and what does it represent? Is the cell size the estimated volume (that is mentioned in the caption)? Shouldn't the datapoints have error bars? 

      Compactness is defined in the “Image Analysis” section of the Materials and Methods. It is a dimensionless parameter. The distribution of individual cell shapes / sizes is depicted in Fig 1B. Error does arise from segmentation, but the degree of variance (a few pixels) is much smaller than the representations of individual cells shown.

      -  Figure 1C caption, are the 50.000 cells? 

      Correct. Figure caption has been altered.

      -  Figure 1D, first the elongation rate is described as a volume per minute, but now, looking at the units it is a rate, how is it normalized? 

      Elongation rate is explained in the Materials and Methods (see the image analysis section) and is not volume per minute. It is dV/dt = r*V (the unit of r is min^-1). Page 9 includes specific mention of the unit of r.
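
      As a minimal sketch of what a rate with unit min^-1 means here: if dV/dt = r*V, then V(t) = V0*exp(r*t), so r is the slope of ln(V) against time. The code below estimates r by ordinary least squares on synthetic data. It is an illustration under that assumption, not the fitting procedure used in the manuscript, and the function name and values are hypothetical.

```python
import math


def elongation_rate(times_min, volumes):
    """Least-squares slope of ln(V) versus t.

    With times in minutes the result has unit min^-1,
    consistent with dV/dt = r * V.
    """
    if len(times_min) != len(volumes) or len(times_min) < 2:
        raise ValueError("need at least two matching (time, volume) pairs")
    log_v = [math.log(v) for v in volumes]
    n = len(times_min)
    t_mean = sum(times_min) / n
    y_mean = sum(log_v) / n
    sxx = sum((t - t_mean) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times_min, log_v))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic, noise-free growth curve with a hypothetical rate.
    true_r = 0.02
    times = [0.0, 10.0, 20.0, 30.0, 40.0]
    vols = [3.0 * math.exp(true_r * t) for t in times]
    print(f"estimated r = {elongation_rate(times, vols):.4f} min^-1")
```

      Because r multiplies V, it is a relative rate and does not itself carry a volume unit.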

      -  Figure 1E, how many cells (n) per replicate? 

      Our apologies. We have corrected the figure caption that now reads:

      “Proportion of live cells in ancestral SBW25 (black bar) and ΔmreB (grey bar) based on LIVE/DEAD BacLight Bacterial Viability Kit protocol. Cells were pelleted at 2,000 x g for 2 minutes to preserve ΔmreB cell integrity. Error bars are means and standard deviation of three biological replicates (n>100).”

      -  Figure 1G, how does this compare to the wildtype 

      The volume for wild type SBW25 is 3.27µm^3 (within the “white zone”). This is mentioned in the text.

      -  Figure 2B, is this really volume, not size? And can you add microscopy images? 

      The x-axis is volume (see Materials and Methods, subsection image analysis). Images are available in Supp. Fig. 9.

      -  Figure 3A what does L1, L4 and L7 refer to? Is it correct that these same lines are picked for WT and delta_mreB 

      Thank you for pointing this out. This was an earlier nomenclature. It was shorthand for the mutants that are specified everywhere else by genotype and has now been corrected. 

      -  Figure 3c: either way write out p, so which probability, or you need a simple cartoon that is plotted. 

      The value p is the probability of proceeding to the next generation and is explained in the Materials and Methods (subsection Image Analysis). We feel this is intuitive and does not require a cartoon. We nonetheless added a sentence to the Materials and Methods to aid clarity.

      -  Figure 4B can you add a ladder to the gel? 

      No ladder was included, but the controls provide all the necessary information. The band corresponding to PBP1A is defined by presence in SBW25, but absence in SBW25 ∆pbp1A.

      -  Figure 4c, can you improve the quantification of these images? How were these selected and how well do they represent the community? 

      We apologise for the lack of quantitative description for data presented in Fig 4C. This has now been corrected. In brief, we measured the intensity of fluorescent signal from between 10 and 14 cells and computed the mean and standard deviation of pixel intensity for each cell. To rule out possible artifacts associated with variation of the mean intensity, we calculated the ratio of the standard deviation divided by the square root of the mean. These data reveal heterogeneity in cell wall synthesis and provide strong statistical support for the claim that cell wall synthesis in ∆mreB is significantly more heterogeneous than the control. The data are provided in new Supp. Fig. 21. 
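
      The per-cell heterogeneity measure described above (standard deviation of pixel intensity divided by the square root of the mean) can be sketched as follows. This is a minimal illustration: the use of the population standard deviation, the function name and the example pixel values are assumptions, not details from the manuscript.

```python
import math
import statistics


def heterogeneity_index(pixel_intensities):
    """Return std / sqrt(mean) of one cell's pixel intensities.

    Uses the population standard deviation (an assumption; the
    source does not say which estimator was used).
    """
    mean = statistics.fmean(pixel_intensities)
    if mean <= 0:
        raise ValueError("mean intensity must be positive")
    sd = statistics.pstdev(pixel_intensities)
    return sd / math.sqrt(mean)


if __name__ == "__main__":
    # Hypothetical pixel values for two cells with the same mean signal.
    uniform_cell = [100.0, 100.0, 100.0, 100.0]
    patchy_cell = [50.0, 150.0, 50.0, 150.0]
    print(heterogeneity_index(uniform_cell))
    print(heterogeneity_index(patchy_cell))
```

      Dividing by the square root of the mean is the scaling expected if intensity noise were Poisson-like, which is one way to keep differences in overall brightness from being read as heterogeneity.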

      Minor comments: 

      -  It would be interesting if the findings of this experimental evolution study could be related to comparative studies (if these have ever been executed).  

      Little is possible, but Hendrickson and Yulo published a portion of the originally posted preprint separately. We include a citation to that paper. 

      -  p 13, halfway through the page, the second paragraph lacks a conclusion, why do we care about DNA content? 

      It is a minor observation that was included by way of providing a complete description of cell phenotype.  

      -  p 17, "suggesting that ... loss-of-function", I do not understand what this is based upon.

      We show that the fitness of a pbp1A deletion is indistinguishable from the fitness of one of the pbp1A point mutants. This fact establishes that the point mutation had the same effects as a gene deletion thus supporting the claim that the point mutations identified during the course of the selection experiment decrease (or destroy) PBP1A function.

      -  p 25, at the top of the page: do you have a reference for the statement that a disorganized cell wall architecture is suited to the topology of spherical cells? 

      The statement is a conclusion that comes from our reasoning. It stems from the fact that it is impossible to entirely map the surface of a sphere with parallel strands.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. After carefully reviewing and considering the comments, we have addressed the key concerns raised by the reviewers and made appropriate modifications to the article in the revised manuscript.

      The main revisions made to the manuscript are as follows:

      1) We have added comparison experiments with TNDM (see Fig. 2 and Fig. S2).

      2) We conducted new synthetic experiments to demonstrate that our conclusions are not a by-product of d-VAE (see Fig. S2 and Fig. S11).

      3) We have provided a detailed explanation of how our proposed criteria, especially the second criterion, can effectively exclude the selection of unsuitable signals.

      4) We have included a semantic overview figure of d-VAE (Fig. S1) and a visualization plot of latent variables (Fig. S13).

      5) We have elaborated on the model details of d-VAE, as well as the hyperparameter selection and experimental settings of other comparison models.

      We believe these revisions have significantly improved the clarity and comprehensibility of the manuscript. Thank you for the opportunity to address these important points.

      Reviewer #1

      Q1: “First, the model in the paper is almost identical to an existing VAE model (TNDM) that makes use of weak supervision with behaviour in the same way [1]. This paper should at least be referenced. If the authors wish they could compare their model to TNDM, which combines a state space model with smoothing similar to LFADS. Given that TNDM achieves very good behaviour reconstructions, it may be on par with this model without the need for a Kalman filter (and hence may achieve better separation of behaviour-related and unrelated dynamics).”

      Our model significantly differs from TNDM in several aspects. While TNDM also constrains latent variables to decode behavioral information, it does not impose constraints to maximize behavioral information in the generated relevant signals. The trade-off between the decoding and reconstruction capabilities of generated relevant signals is the most significant contribution of our approach, which is not reflected in TNDM. In addition, the backbone network of signal extraction and the prior distribution of the two models are also different.

      It's worth noting that our method does not require a Kalman filter. The Kalman filter is used for post hoc assessment of the linear decoding ability of the generated signals. Please note that extracting and evaluating relevant signals are two distinct stages.

      Heeding your suggestion, we have incorporated comparison experiments involving TNDM into the revised manuscript. Detailed information on model hyperparameters and training settings can be found in the Methods section of the revised manuscript.

      Thank you for your valuable feedback.

      Q2: “Second, in my opinion, the claims regarding identifiability are overstated - this matters as the results depend on this to some extent. Recent work shows that VAEs generally suffer from identifiability problems due to the Gaussian latent space [2]. This paper also hints that weak supervision may help to resolve such issues, so this model as well as TNDM and CEBRA may indeed benefit from this. In addition however, it appears that the relative weight of the KL Divergence in the VAE objective is chosen very small compared to the likelihood (0.1%), so the influence of the prior is weak and the model may essentially learn the average neural trajectories while underestimating the noise in the latent variables. This, in turn, could mean that the model will not autoencode neural activity as well as it should, note that an average R2 in this case will still be high (I could not see how this is actually computed). At the same time, the behaviour R2 will be large simply because the different movement trajectories are very distinct. Since the paper makes claims about the roles of different neurons, it would be important to understand how well their single trial activities are reconstructed, which can perhaps best be investigated by comparing the Poisson likelihood (LFADS is a good baseline model). Taken together, while it certainly makes sense that well-tuned neurons contribute more to behaviour decoding, I worry that the very interesting claim that neurons with weak tuning contain behavioural signals is not well supported.”

      We do not think our distilled signals are average neural trajectories without variability. The quality of single-trial reconstruction can be seen in Fig. 3i and Fig. S4, where the neural trajectories show that our distilled signals are not average neural trajectories. Furthermore, if each trial's activity closely matched the average neural trajectory, the Fano factor (FF) should theoretically approach 0. However, our distilled signals depart notably from this expectation, as is evident in Figure 3c, d, g, and f.

      Regarding the diminished influence of the KL divergence: Given that the ground truth of the latent variable distribution is unknown, even a learned prior distribution might not accurately reflect the true distribution. We found that a strongly weighted KL divergence term is detrimental to decoding and reconstruction performance. As a result, we opted to reduce the weight of the KL divergence term. Even so, the KL divergence can still effectively align the distribution of latent variables with the prior distribution, as illustrated in Fig. S13. Notably, our goal is to extract behaviorally-relevant signals from given raw signals rather than to generate diverse samples from the prior distribution. When aiming to separate relevant signals, we recommend reducing the influence of the KL divergence.

      Regarding comparing the Poisson likelihood: We compared the Poisson log-likelihood among different methods (except PSID, since its signals have negative values), and the results show that d-VAE outperforms the other methods.

      Author response image 1.

      Regarding how R2 is computed: $R^2 = 1 - \frac{\sum_i (x_i - \hat{x}_i)^2}{\sum_i (x_i - \bar{x})^2}$, where $x_i$, $\hat{x}_i$, and $\bar{x}$ denote the ith sample of raw signals, the ith sample of distilled relevant signals, and the mean of raw signals, respectively. If the distilled signals exactly match the raw signals, the sum of squared errors is zero, and thus R2 = 1. If the distilled signals always equal $\bar{x}$, R2 = 0. If the distilled signals are worse than the mean estimate, R2 is negative; negative R2 values are set to zero.
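
      The three cases described above (perfect match, mean prediction, worse than the mean) can be checked with a short sketch. It assumes the usual coefficient of determination, one minus the residual sum of squares over the total sum of squares about the mean of the raw signal, with negative values clipped to zero as stated. The function name and the toy signals are hypothetical.

```python
def clipped_r2(raw, distilled):
    """R2 of distilled signals against raw signals, floored at zero.

    R2 = 1 - sum((x_i - xhat_i)^2) / sum((x_i - mean(x))^2)
    """
    if len(raw) != len(distilled) or not raw:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_raw = sum(raw) / len(raw)
    ss_res = sum((x - y) ** 2 for x, y in zip(raw, distilled))
    ss_tot = sum((x - mean_raw) ** 2 for x in raw)
    if ss_tot == 0:
        raise ValueError("raw signal has zero variance; R2 is undefined")
    return max(0.0, 1.0 - ss_res / ss_tot)


if __name__ == "__main__":
    raw = [1.0, 2.0, 3.0]  # toy signal for illustration only
    print(clipped_r2(raw, [1.0, 2.0, 3.0]))  # exact match
    print(clipped_r2(raw, [2.0, 2.0, 2.0]))  # always the mean
    print(clipped_r2(raw, [3.0, 2.0, 1.0]))  # worse than the mean, clipped
```

      Clipping at zero means that any reconstruction no better than the mean is reported identically, so differences among poor reconstructions are not visible in this score.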

      Thank you for your valuable feedback.

      Q3: “Third, and relating to this issue, I could not entirely follow the reasoning in the section arguing that behavioural information can be inferred from neurons with weak selectivity, but that it is not linearly decodable. It is right to test if weak supervision signals bleed into the irrelevant subspace, but I could not follow the explanations. Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the relevant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling? Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding? This could be a result of limited identifiability or model specifics that bias reconstruction to averages (a well-known problem of VAEs). I, therefore, think this analysis should be complemented with tests that do not depend on the model.”

      Regarding “Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the relevant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling?”: In fact, the decoding performance of raw signals with ANN is quite close to the ceiling. However, due to the presence of significant irrelevant signals in raw signals, decoding models like deep neural networks are more prone to overfitting when trained on noisy raw signals compared to behaviorally-relevant signals. Consequently, we anticipate that the distilled signals will demonstrate superior decoding generalization. This phenomenon is evident in Fig. 2 and Fig. S1, where the decoding performance of the distilled signals surpasses that of the raw signals, albeit not by a substantial margin.

      Regarding “Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding?”: Distilled signals (involving all neurons) are obtained by d-VAE. Subsequently, we use ANN to evaluate the performance of smaller and larger R2 neurons. Please note that separating and evaluating relevant signals are two distinct stages.

      Regarding the reasoning in the section arguing that smaller R2 neurons encode rich information, we would like to provide a detailed explanation:

      1) After extracting relevant signals through d-VAE, we specifically selected neurons characterized by smaller R2 values (Here, R2 signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals). Subsequently, we employed both KF and ANN to assess the decoding performance of these neurons. Remarkably, our findings revealed that smaller R2 neurons, previously believed to carry limited behavioral information, indeed encode rich information.

      2) In a subsequent step, we employed d-VAE to exclusively distill the raw signals of these smaller R2 neurons (distinct from the earlier experiment where d-VAE processed signals from all neurons). We then employed KF and ANN to evaluate the distilled smaller R2 neurons. Interestingly, we observed that we could not attain the same richness of information solely through the use of these smaller R2 neurons.

      3) Consequently, we put forth and tested two hypotheses: First, that larger R2 neurons introduce additional signals into the smaller R2 neurons that do not exist in the real smaller R2 neurons. Second, that larger R2 neurons aid in restoring the original appearance of impaired smaller R2 neurons. Our proposed criteria and synthetic experiments substantiate the latter scenario.

      Thank you for your valuable feedback.

      Q4: “Finally, a more technical issue to note is related to the choice to learn a non-parametric prior instead of using a conventional Gaussian prior. How is this implemented? Is just a single sample taken during a forward pass? I worry this may be insufficient as this would not sample the prior well, and some other strategy such as importance sampling may be required (unless the prior is not relevant as it weakly contributed to the ELBO, in which case this choice seems not very relevant). Generally, it would be useful to see visualisations of the latent variables to see how information about behaviour is represented by the model.”

      Regarding "how to implement the prior?": Please refer to Equation 7 in the revised manuscript; we have added detailed descriptions in the revised manuscript.

      Regarding "Generally, it would be useful to see visualizations of the latent variables to see how information about behavior is represented by the model.": Note that our focus is not on latent variables but on distilled relevant signals. Nonetheless, at your request, we have added the visualization of latent variables in the revised manuscript. Please see Fig. S13 for details.

      Thank you for your valuable feedback.

      Recommendations: “A minor point: the word 'distill' in the name of the model may be a little misleading - in machine learning the term refers to the construction of smaller models with the same capabilities.

      It should be useful to add a schematic picture of the model to ease comparison with related approaches.”

      In the context of our model's functions, it operates as a distillation process, eliminating irrelevant signals and retaining the relevant ones. Although the name of our model may be a little misleading, it faithfully reflects what our model does.

      We have added a schematic picture of d-VAE in the revised manuscript. Please see Fig. S1 for details.

      Thank you for your valuable feedback.

      Reviewer #2

      Q1: “Is the apparently increased complexity of encoding vs decoding so unexpected given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding") recorded in neuroscience experiments? This is the title of the paper so it seems to be the main result on which the authors expect readers to focus. ”

      We use the term "unexpected" due to the disparity between our findings and the prior understanding concerning neural encoding and decoding. For neural encoding, as we said in the Introduction, in previous studies, weakly-tuned neurons are considered useless, and smaller variance PCs are considered noise, but we found they encode rich behavioral information. For neural decoding, the nonlinear decoding performance of raw signals is significantly superior to linear decoding. However, after eliminating the interference of irrelevant signals, we found the linear decoding performance is comparable to nonlinear decoding. Rooted in these findings, which counter previous thought, we employ the term "unexpected" to characterize our observations.

      Thank you for your valuable feedback.

      Q2: “I take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature hand-chosen by the experimenter. As an example, the presence of a reward signal in motor cortex [1] after the movement is likely to be of little use from the perspective of predicting kinematics from time-bin to time-bin using a fixed model across trials (the apparent definition of "relevant" for behaviour here), but an entire sub-field of neuroscience is dedicated to understanding the impact of these reward-related signals on future behaviour. Is there a method sophisticated enough to see the behavioural "relevance" of this brief, transient, post-movement signal? This may just be an issue of semantics, and perhaps I read too much into the choice of words here. Perhaps the authors truly treat "irrelevant" and "without a fixed temporal correlation" as synonymous phrases and the issue is easily resolved with a clarifying parenthetical the first time the word "irrelevant" is used. But I remain troubled by some claims in the paper which lead me to believe that they read more deeply into the "irrelevancy" of these components.”

      In this paper, we employ terms like ‘behaviorally-relevant’ and ‘behaviorally-irrelevant’ only with regard to behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task. A similar definition can be found in the PSID paper [1].

      Thank you for your valuable feedback.

      [1] Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      Q3: “The authors claim the "irrelevant" responses underpin an unprecedented neuronal redundancy and reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought." Perhaps I just missed the logic, but I fail to see the evidence for this. The neural space is a fixed dimensionality based on the number of neurons. A more sparse and nonlinear distribution across this set of neurons may mean that linear methods such as PCA are not effective ways to approximate the dimensionality. But ultimately the behaviourally relevant signals seem quite low-dimensional in this paper even if they show some nonlinearity may help.”

      The evidence that the “useless” responses underpin an unprecedented neuronal redundancy is shown in Fig. 5a, d and Fig. S9a. Specifically, for relevant signals (red bar), the sum of the decoding performance of smaller R2 neurons and larger R2 neurons is significantly greater than the decoding performance of all neurons, demonstrating that movement parameters are encoded very redundantly in the neuronal population. In contrast, we cannot find this degree of neural redundancy in raw signals (purple bar).

      The evidence that the “useless” responses reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought is shown in the left plot (involving KF decoding) of Fig. 6c, f and Fig. S9f. Specifically, the improvement in KF decoding obtained with secondary signals is significantly higher than that obtained with raw signals composed of the same number of dimensions as the secondary signals. These results demonstrate that these dimensions, spanning roughly from ten to thirty, encode much information, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals.

      Thank you for your valuable feedback.

      Q5: “there is an apparent logical fallacy that begins in the abstract and persists in the paper: "Surprisingly, when incorporating often-ignored neural dimensions, behavioral information can be decoded linearly as accurately as nonlinear decoding, suggesting linear readout is performed in motor cortex." Don't get me wrong: the equivalency of linear and nonlinear decoding approaches on this dataset is interesting, and useful for neuroscientists in a practical sense. However, the paper expends much effort trying to make fundamental scientific claims that do not feel very strongly supported. This reviewer fails to see what we can learn about a set of neurons in the brain which are presumed to "read out" from motor cortex. These neurons will not have access to the data analyzed here. That a linear model can be conceived by an experimenter does not imply that the brain must use a linear model. The claim may be true, and it may well be that a linear readout is implemented in the brain. Other work [2,3] has shown that linear readouts of nonlinear neural activity patterns can explain some behavioural features. The claim in this paper, however, is not given enough”

      Due to the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is indeed challenging to ascertain the specific data the brain acquires to generate behavior and whether it employs a linear readout. Conventionally, the neural data recorded in the motor cortex do encode movement behaviors and can be used to analyze neural encoding and decoding. Based on these data, we found that the linear decoder KF achieves comparable performance to that of the nonlinear decoder ANN on distilled relevant signals. This finding has undergone validation across three widely used datasets, providing substantial evidence. Furthermore, we conducted experiments on synthetic data to show that this conclusion is not a by-product of our model. In the revised manuscript, we added a more detailed description of this conclusion.

      Thank you for your valuable feedback.

      Q6: “Relatedly, I would like to note that the exercise of arbitrarily dividing a continuous distribution of a statistic (the "R2") based on an arbitrary threshold is a conceptually flawed exercise. The authors read too much into the fact that neurons which have a low R2 w.r.t. PDs have behavioural information w.r.t. other methods. To this reviewer, it speaks more about the irrelevance, so to speak, of the preferred direction metric than anything fundamental about the brain.”

      We chose the R2 threshold in accordance with the guidelines provided in reference [1]. It's worth mentioning that this threshold does not exert any significant influence on the overall conclusions.

      Thank you for your valuable feedback.

      [1] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.

      Q7: “I am afraid I may be missing something, as I did not understand the fano factor analysis of Figure 3. In a sense the behaviourally relevant signals must have lower FF given they are in effect tied to the temporally smooth (and consistent on average across trials) behavioural covariates. The point of the original Churchland paper was to show that producing a behaviour squelches the variance; naturally these must appear in the behaviourally relevant components. A control distribution or reference of some type would possibly help here.”

      We agree that including reference signals could provide more context. The Churchland paper said stimulus onset can lead to a reduction in neural variability. However, our experiment focuses specifically on the reaching process, and thus, we don't have comparative experiments involving different types of signals.

      Thank you for your valuable feedback.

      Q8: “The authors compare the method to LFADS. While this is a reasonable benchmark as a prominent method in the field, LFADS does not attempt to solve the same problem as d-VAE. A better and much more fair comparison would be TNDM [4], an extension of LFADS which is designed to identify behaviourally relevant dimensions.”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section of the revised manuscript.

      Thank you for your valuable feedback.

      Reviewer #3

      Q1.1: “TNDM: LFADS is not the best baseline for comparison. The authors should have compared with TNDM (Hurwitz et al. 2021), which is an extension of LFADS that (unlike LFADS) actually attempts to extract behaviorally relevant factors by adding a behavior term to the loss. The code for TNDM is also available on Github. LFADS is not even supervised by behavior and does not aim to address the problem that d-VAE aims to address, so it is not the most appropriate comparison. ”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section of the revised manuscript.

      Thank you for your valuable feedback.

      Q1.2: “LFADS: LFADS is a sequential autoencoder that processes sections of data (e.g. trials). No explanation is given in Methods for how the data was passed to LFADS. Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss? What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not? These are all critical details that are not explained. The same details would also be needed for a TNDM comparison (comment 1.1) since it has largely the same architecture as LFADS.

      It is also critical to briefly discuss these fundamental differences between the inputs of methods in the main text. LFADS uses a segment of data whereas VAE methods just use one sample at a time. What does this imply in the results? I guess as long as VAEs outperform LFADS it is ok, but if LFADS outperforms VAEs in a given metric, could it be because it received more data as input (a whole segment)? Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?

      I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much? This is important to discuss and show examples of. These are all critical nuances that need to be discussed to validate the results and interpret them.”

      Regarding “Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss?”: All models received data that underwent the same preprocessing procedure, namely moving-average smoothing over three bins with a bin size of 100 ms. For all models except PSID, we used a Poisson loss.
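      The smoothing step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `moving_average` is hypothetical, and the window is taken to be causal (the current bin plus the two preceding bins), which the response does not specify.

```python
import numpy as np

def moving_average(counts, window=3):
    """Smooth binned spike counts with a trailing moving average.

    counts: array of shape (n_bins, n_neurons), e.g. spike counts in 100 ms bins.
    window: number of bins averaged (three in the response).
    Assumption: the window is causal; early bins average over fewer samples.
    """
    counts = np.asarray(counts, dtype=float)
    smoothed = np.empty_like(counts)
    for t in range(counts.shape[0]):
        start = max(0, t - window + 1)  # clip the window at the first bin
        smoothed[t] = counts[start:t + 1].mean(axis=0)
    return smoothed

# Toy example: one neuron, four 100 ms bins.
example = moving_average(np.array([[0.0], [3.0], [6.0], [9.0]]))
```

      With three 100 ms bins, each smoothed sample summarizes 300 ms of activity, which is the time scale the reviewer refers to when discussing PSID's input.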

      Regarding “What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not?”:

      For datasets A and B, we set the trial length to eighteen bins. Trials shorter than this threshold are zero-padded, while longer trials are truncated to the threshold length, counting from their starting point. In dataset A, several trials are considerably longer than most others. We found that zero-padding all trials to the maximum length (32) led to poor performance. Consequently, we chose a trial length of eighteen, which covers the durations of most trials and removes approximately 9% of samples. For dataset B (center-out), the trial lengths vary little, and the maximum length across all trials is eighteen. For dataset C, we set the trial length to ten because we watched the video of this paradigm and found that completing a single trial took approximately one second. The segments do not overlap.
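      The padding and truncation rule described above can be sketched as follows. The function name `fix_trial_length` is hypothetical, and appending the zeros at the end of a short trial is an assumption; the response only states that short trials are zero-padded and long trials are truncated from their starting point.

```python
import numpy as np

def fix_trial_length(trial, length=18):
    """Return a trial with exactly `length` time bins.

    trial: array of shape (n_bins, n_neurons).
    Longer trials keep their first `length` bins; shorter trials are
    zero-padded (assumption: zeros are appended after the last bin).
    """
    trial = np.asarray(trial, dtype=float)
    n_bins, n_neurons = trial.shape
    if n_bins >= length:
        return trial[:length]
    padding = np.zeros((length - n_bins, n_neurons))
    return np.vstack([trial, padding])

short_trial = np.ones((5, 2))                        # shorter than the threshold
long_trial = np.arange(64, dtype=float).reshape(32, 2)  # the maximum length in dataset A
fixed_short = fix_trial_length(short_trial)
fixed_long = fix_trial_length(long_trial)
```

      Both outputs have shape (18, 2), so trials of different durations can be stacked into one array for sequence models such as LFADS.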

      Regarding “Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?”: We performed a grid search over latent dimensions in {10,20,50} and found that 50 performed best.

      Regarding “I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much?”: As you pointed out, we found that LFADS tends to produce excessively smooth and consistent data, which can lead to a reduction in neural similarity.

      Thank you for your valuable feedback.

      Q1.3: “PSID: PSID is linear and uses past input samples to predict the next sample in the output. Again, some setup choices are not well justified, and some details are left out in the 1-line explanation given in Methods.

      Why was a latent dimension of 6 chosen? Is this the behaviorally relevant latent dimension or the total latent dimension (for the use case here it would make sense to set all latent states to be behaviorally relevant)? Why was a horizon hyperparameter of 3 chosen? First, it is important to mention fundamental parameters such as latent dimension for each method in the main text (not just in methods) to make the results interpretable. Second, these hyperparameters should be chosen with a grid search in each dataset (within the training data, based on performance on the validation part of the training data), just as the authors do for their method (line 779). Given that PSID isn't a deep learning method, doing a thorough grid search in each fold should be quite feasible. It is important that high values for latent dimension and a wider range of other hyperparmeters are included in the search, because based on how well the residuals (x_i) for this method are shown predict behavior in Fig 2, the method seems to not have been used appropriately. I would expect ANN to improve decoding for PSID versus its KF decoding since PSID is fully linear, but I don't expect KF to be able to decode so well using the residuals of PSID if the method is used correctly to extract all behaviorally relevant information from neural data. The low neural reconstruction in Fid 2d could also partly be due to using too small of a latent dimension.

      Again, another import nuance is the input to this method and how differs with the input to VAE methods. The learned PSID model is a filter that operates on all past samples of input to predict the output in the "next" time step. To enable a fair comparison with VAE methods, the authors should make sure that the last sample "seen" by PSID is the same as then input sample seen by VAE methods. This is absolutely critical given how large the time steps are, otherwise PSID might underperform simply because it stopped receiving input 300ms earlier than the input received by VAE methods. To fix this, I think the authors can just shift the training and testing neural time series of PSID by 1 sample into the past (relative to the behavior), so that PSID's input would include the input of VAE methods. Otherwise, VAEs outperforming PSID is confounded by PSID's input not including the time step that was provided to VAE.”

      Thank you for the suggestion to let PSID see the current neural observations. We implemented it and then performed a grid search over PSID's hyperparameters. Specifically, we searched the horizon hyperparameter over {2,3,4,5,6,7}. Since the relevant latent dimension cannot exceed the horizon times the dimension of the behavior variables (two-dimensional velocity in this paper), and performance saturates as the dimension increases, we directly set the relevant latent dimension to this maximum. The selected horizons for datasets A, B, C, and the synthetic dataset are 7, 6, 6, and 5, respectively.

      Thus, the latent dimensions for datasets A, B, C, and the synthetic dataset are 14, 12, 12, and 10, respectively.
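      The arithmetic behind these latent dimensions can be written out as a short sketch. The variable names are hypothetical; the horizons and the two-dimensional velocity are the values stated in this response, and no PSID library call is shown.

```python
# Behavior is two-dimensional velocity, as stated in the response.
BEHAVIOR_DIM = 2

# Horizon selected by grid search over {2, ..., 7} for each dataset.
selected_horizon = {"A": 7, "B": 6, "C": 6, "synthetic": 5}

# The relevant latent dimension is set to its maximum: horizon x behavior dimension.
latent_dim = {name: horizon * BEHAVIOR_DIM
              for name, horizon in selected_horizon.items()}

for name, dim in latent_dim.items():
    print(f"dataset {name}: horizon={selected_horizon[name]}, latent dimension={dim}")
```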

      Our experiments show that KF can decode information from irrelevant signals obtained by PSID. Although PSID extracts the linear part of raw signals, KF can still use the linear part of the residuals for decoding. The low reconstruction performance of PSID may be because the relationship between latent variables and neural signals is linear, and the relationship between latent variables and behaviors is also linear; this is equivalent to the linear relationship between behaviors and neural signals, and linear models can only explain a small fraction of neural signals.

      Thank you for your valuable feedback.

      Q1.4: “CEBRA: results for CEBRA are incomplete. Similarity to raw signals is not shown. Decoding of behaviorally irrelevant residuals for CEBRA is not shown. Per Fig. S2, CEBRA does better or similar ANN decoding in datasets A and C, is only slightly worse in Dataset B, so it is important to show the other key metrics otherwise it is unclear whether d-VAE has some tangible advantage over CEBRA in those 2 datasets or if they are similar in every metric. Finally, it would be better if the authors show the results for CEBRA on Fig. 2, just as is done for other methods because otherwise it is hard to compare all methods.”

      CEBRA is a non-generative model and therefore cannot generate behaviorally-relevant signals. For this reason, we only compared the decoding performance of the latent embeddings of CEBRA with that of the signals of d-VAE.

      Thank you for your valuable feedback.

      Q2: “Given the fact that d-VAE infers the latent (z) based on the population activity (x), claims about properties of the inferred behaviorally relevant signals (x_r) that attribute properties to individual neurons are confounded.

      The authors contrast their approach to population level approaches in that it infers behaviorally relevant signals for individual neurons. However, d-VAE is also a population method as it aggregates population information to infer the latent (z), from which behaviorally relevant part of the activity of each neuron (x_r) is inferred. The authors note this population level aggregation of information as a benefit of d-VAE, but only acknowledge it as a confound briefly in the context of one of their analyses (line 340): "The first is that the larger R2 neurons leak their information to the smaller R2 neurons, causing them contain too much behavioral information". They go on to dismiss this confounding possibility by showing that the inferred behaviorally relevant signal of each neuron is often most similar to its own raw signals (line 348-352) compared with all other neurons. They also provide another argument specific to that result section (i.e., residuals are not very behavior predictive), which is not general so I won't discuss it in depth here. These arguments however do not change the basic fact that d-VAE aggregates information from other neurons when extracting the behaviorally relevant activity of any given neuron, something that the authors note as a benefit of d-VAE in many instances. The fact that d-VAE aggregates population level info to give the inferred behaviorally relevant signal for each neuron confounds several key conclusions. For example, because information is aggregated across neurons, when trial to trial variability looks smoother after applying d-VAE (Fig 3i), or reveals better cosine tuning (Fig 3b), or when neurons that were not very predictive of behavior become more predictive of behavior (Fig 5), one cannot really attribute the new smoother single trial activity or the improved decoding to the same single neurons; rather these new signals/performances include information from other neurons. 
Unless the connections of the encoder network (z=f(x)) is zero for all other neurons, one cannot claim that the inferred rates for the neuron are truly solely associated with that neuron. I believe this a fundamental property of a population level VAE, and simply makes the architecture unsuitable for claims regarding inherent properties of single neurons. This confound is partly why the first claim in the abstract are not supported by data: observing that neurons that don't predict behavior very well would predict it much better after applying d-VAE does not prove that these neurons themselves "encode rich[er] behavioral information in complex nonlinear ways" (i.e., the first conclusion highlighted in the abstract) because information was also aggregated from other neurons. The other reason why this claim is not supported by data is the characterization of the encoding for smaller R2 neurons as "complex nonlinear", which the method is not well equipped to tease apart from linear mappings as I explain in my comment 3.”

      We acknowledge that we cannot obtain the exact single neuronal activity that does not contain any information from other neurons. However, we believe our model can extract accurate approximation signals of the ground truth relevant signals. These signals preserve the inherent properties of single neuronal activity to some extent and can be used for analysis at the single-neuron level.

      We believe d-VAE is a reasonable approach to extract effective relevant signals that preserve inherent properties of single neuronal activity for the following key reasons:

      1) d-VAE is a latent variable model that adheres to the neural population doctrine. The neural population doctrine posits that information is encoded within interconnected groups of neurons, with latent variables (neural modes) responsible for generating observable neuronal activity [1, 2]. If we could perfectly obtain the true generative model from latent variables to neuronal activity, we could generate the activity of each neuron from the latent variables without including any information from other neurons. However, without a complete understanding of the brain’s encoding strategies (or generative model), we can only obtain approximations of the ground truth signals.

      2) After the generative model is established, we need to infer the parameters of the generative model and the distribution of latent variables. This inference relies on algorithms such as variational inference or the EM algorithm. Generally, the obtained latent variables are also approximations of the real latent variables. When inferring the latent variables, aggregating information across the neural population is inevitable, and latent variables are derived through weighted combinations of neuronal populations [3].

      This inference process is consistent with that of d-VAE (or VAE-based models).

      3) Latent variables are derived from raw neural signals and used to explain raw neural signals. Considering the unknown ground truth of latent variables and behaviorally-relevant signals, it becomes evident that the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [3]. Consequently, we firmly maintain the belief that if the generated signals closely resemble the raw signals to the greatest extent possible, in accordance with an equivalence principle, we can claim that these obtained signals faithfully retain the inherent properties of single neurons. d-VAE explicitly constrains the generated signal to closely resemble the raw signals. These results demonstrate that d-VAE can extract effective relevant signals that preserve inherent properties of single neuronal activity.

      Based on the above reasons, we hold that generating single neuronal activities with the VAE framework is a reasonable approach. The remaining question is whether our model can obtain accurate relevant signals in the absence of ground truth. To our knowledge, in cases where the ground truth of relevant signals is unknown, there are typically two approaches to verifying the reliability of extracted signals:

      1) Conducting synthetic experiments where the ground truth is known.

      2) Validation based on expert knowledge (three criteria were proposed in this paper). Both our extracted signals and key conclusions have been validated using these two approaches.

      Next, we will provide a detailed response to the concerns regarding our first key conclusion that smaller R2 neurons encode rich information.

      We acknowledge that larger R2 neurons play a role in aiding the reconstruction of signals in smaller R2 neurons through their neural activity. However, considering that neurons are correlated rather than independent entities, we maintain the belief that larger R2 neurons assist damaged smaller R2 neurons in restoring their original appearance. Taking image denoising as an example, when restoring noisy pixels to their original appearance, relying solely on the noisy pixels themselves is often impractical. Assistance from their correlated, clean neighboring pixels becomes necessary.

      The case we need to be cautious of is that the larger R2 neurons introduce additional signals (m) that contain substantial information to smaller R2 neurons, which they do not inherently possess. We believe this case does not hold for two reasons. Firstly, logically, adding extra signals decreases the reconstruction performance, and the information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population. Therefore, it seems unlikely and unnecessary for neural networks to engage in such counterproductive actions. Secondly, even if this occurs, our second criterion can effectively exclude the selection of these signals. To clarify, if we assume that x, y, and z denote the raw, relevant, and irrelevant signals of smaller R2 neurons, with x=y+z, and the extracted relevant signals become y+m, the irrelevant signals become z-m in this case. Consequently, the irrelevant signals contain a significant amount of information. It's essential to emphasize that this criterion holds significant importance in excluding undesirable signals.

      Furthermore, we conducted a synthetic experiment to show that d-VAE can indeed restore the damaged information of smaller R2 neurons with the help of larger R2 neurons, and the restored neuronal activities are more similar to ground truth compared to damaged raw signals. Please see Fig. S11a,b for details.

      Thank you for your valuable feedback.

      [1] Saxena, S. and Cunningham, J.P., 2019. Towards the neural population doctrine. Current opinion in neurobiology, 55, pp.103-111.

      [2] Gallego, J.A., Perich, M.G., Miller, L.E. and Solla, S.A., 2017. Neural manifolds for the control of movement. Neuron, 94(5), pp.978-984.

      [3] Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      Q3: “Given the nonlinear architecture of the VAE, claims about the linearity or nonlinearity of cortical readout are confounded and not supported by the results.

      The inference of behaviorally relevant signals from raw signals is a nonlinear operation, that is x_r=g(f(x)) is nonlinear function of x. So even when a linear KF is used to decode behavior from the inferred behaviorally relevant signals, the overall decoding from raw signals to predicted behavior (i.e., KF applied to g(f(x))) is nonlinear. Thus, the result that decoding of behavior from inferred behaviorally relevant signals (x_r) using a linear KF and a nonlinear ANN reaches similar accuracy (Fig 2), does not suggest that a "linear readout is performed in the motor cortex", as the authors claim (line 471). The authors acknowledge this confound (line 472) but fail to address it adequately. They perform a simulation analysis where the decoding gap between KF and ANN remains unchanged even when d-VAE is used to infer behaviorally relevant signals in the simulation. However, this analysis is not enough for "eliminating the doubt" regarding the confound. I'm sure the authors can also design simulations where the opposite happens and just like in the data, d-VAE can improve linear decoding to match ANN decoding. An adequate way to address this concern would be to use a fully linear version of the autoencoder where the f(.) and g(.) mappings are fully linear. They can simply replace these two networks in their model with affine mappings, redo the modeling and see if the model still helps the KF decoding accuracy reach that of the ANN decoding. In such a scenario, because the overall KF decoding from original raw signals to predicted behavior (linear d-VAE + KF) is linear, then they could move toward the claim that the readout is linear. Even though such a conclusion would still be impaired by the nonlinear reference (d-VAE + ANN decoding) because the achieved nonlinear decoding performance could always be limited by network design and fitting issues. 
Overall, the third conclusion highlighted in the abstract is a very difficult claim to prove and is unfortunately not supported by the results.”

      We aim to explore the readout mechanism of behaviorally-relevant signals, rather than raw signals. Theoretically, the process of removing irrelevant signals should not be considered part of the inherent decoding mechanisms of the relevant signals. Assuming that the relevant signals we extracted are accurate, the conclusion of linear readout is established. On the synthetic data where the ground truth is known, our distilled signals show a significant improvement in neural similarity to the ground truth when compared to raw signals (refer to Fig. S2l). This observation demonstrates that our distilled signals are accurate approximations of the ground truth. Furthermore, on the three widely-used real datasets, our distilled signals meet the stringent criteria we have proposed (see Fig. 2), also providing strong evidence for their accuracy.

      Regarding the assertion that we could create simulations in which d-VAE can make signals that are inherently nonlinearly decodable into linearly decodable ones: In reality, we cannot achieve this, as the second criterion can rule out the selection of such signals. Specifically, let z=x+y=n^2+y, where z, x, y, and n denote the raw signals, relevant signals, irrelevant signals, and latent variables, respectively. If the relevant signals obtained by d-VAE are n, then these signals can be linearly decoded accurately. However, the corresponding irrelevant signals are z-n=n^2-n+y; thus, the irrelevant signals will contain much information, and these extracted relevant signals will not be selected. Furthermore, our synthetic experiments offer additional evidence supporting the conclusion that d-VAE does not make inherently nonlinearly decodable signals become linearly decodable ones. As depicted in Fig. S11c, there exists a significant performance gap between KF and ANN when decoding the ground truth signals of smaller R2 neurons. KF exhibits notably low performance, leaving substantial room for compensation by d-VAE. However, following processing by d-VAE, KF's performance on distilled signals fails to surpass its already low ground truth performance and remains significantly inferior to ANN's performance. These results collectively confirm that our approach does not convert signals that are inherently nonlinearly decodable into linearly decodable ones, and the conclusion of linear readout is not a by-product of d-VAE.
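      The argument about the second criterion can be illustrated with a small numerical sketch. Everything here is a toy assumption rather than the experiment reported in the manuscript: a single Gaussian latent n is also used as the behavior to be decoded, the irrelevant signal y is small independent noise, and a squared correlation stands in for linear decoding performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(size=20000)         # latent variable, also used as the behavior (assumption)
y = 0.1 * rng.normal(size=20000)   # true irrelevant signal, independent of n
z = n**2 + y                       # raw signal; the true relevant part is n**2

def r2_linear(signal, target):
    """R^2 of a one-variable least-squares fit of target on signal."""
    return np.corrcoef(signal, target)[0, 1] ** 2

# Correct split: the residual is exactly the irrelevant signal y.
true_residual = z - n**2
# Hypothetical "linearized" split: the model outputs n as the relevant signal,
# so the residual z - n still carries the latent.
leaky_residual = z - n

r2_true = r2_linear(true_residual, n)
r2_leaky = r2_linear(leaky_residual, n)
print(f"R^2 of behavior from true residual:  {r2_true:.3f}")
print(f"R^2 of behavior from leaky residual: {r2_leaky:.3f}")
```

      In this toy setting the residual of the correct split is essentially uninformative about the behavior, whereas the residual z - n of the linearized split remains clearly predictive, which is the situation the second criterion is meant to reject.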

      Regarding the suggestion of using linear d-VAE + KF, as discussed in the Discussion section, removing the irrelevant signals requires a nonlinear operation, and a linear d-VAE cannot effectively separate relevant and irrelevant signals.

      Thank you for your valuable feedback.

      Q4: “The authors interpret several results as indications that "behavioral information is distributed in a higher-dimensional subspace than expected from raw signals", which is the second main conclusion highlighted in the abstract. However, several of these arguments do not convincingly support that conclusion.

      4.1) The authors observe that behaviorally relevant signals for neurons with small principal components (referred to as secondary) have worse decoding with KF but better decoding with ANN (Fig. 6b,e), which also outperforms ANN decoding from raw signals. This observation is taken to suggest that these secondary behaviorally relevant signals encode behavior information in highly nonlinear ways and in a higher dimensions neural space than expected (lines 424 and 428). These conclusions however are confounded by the fact that A) d-VAE uses nonlinear encoding, so one cannot conclude from ANN outperforming KF that behavior is encoded nonlinearly in the motor cortex (see comment 3 above), and B) d-VAE aggregates information across the population so one cannot conclude that these secondary neurons themselves had as much behavior information (see comment 2 above).

      4.2) The authors observe that the addition of the inferred behaviorally relevant signals for neurons with small principal components (referred to as secondary) improves the decoding of KF more than it improves the decoding of ANN (red curves in Fig 6c,f). This again is interpreted similarly as in 4.1, and is confounded for similar reasons (line 439): "These results demonstrate that irrelevant signals conceal the smaller variance PC signals, making their encoded information difficult to be linearly decoded, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals". This is confounded because of the two reasons explained in 4.1. To conclude nonlinear encoding based on the difference in KF and ANN decoding, the authors would need to make the encoding/decoding in their VAE linear to have a fully linear decoder on one hand (with linear d-VAE + KF) and a nonlinear decoder on the other hand (with linear d-VAE + ANN), as explained in comment 3.

      4.3) From S Fig 8, where the authors compare cumulative variance of PCs for raw and inferred behaviorally relevant signals, the authors conclude that (line 554): "behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses (Supplementary Fig. S8)." However, this analysis does not really say anything about overestimation of "behaviorally relevant" neural dimensionality since the comparison is done with the dimensionality of "raw" signals. The next sentence is ok though: "These findings highlight the need to filter out relevant signals when estimating the neural dimensionality.", because they use the phrase "neural dimensionality" not "neural dimensionality of behaviorally-relevant responses".”

      Questions 4.1 and 4.2 are a combination of Q2 and Q3. Please refer to our responses to Q2 and Q3.

      Regarding question 4.3 about “behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses”: Previous studies usually used raw signals to estimate the neural dimensionality of specific behaviors. We mean that using raw signals, which include many irrelevant signals, will cause an overestimation of this neural dimensionality. We have modified this sentence in the revised manuscript.

      Thank you for your valuable feedback.

      Q5: “Imprecise use of language in many places leads to inaccurate statements. I will list some of these statements:

      5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve.

      5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.

      5.3) "... recent studies (Glaser et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?

      5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.

      5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.

      5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with the worst raw decoding are more likely to benefit from a change in decoder than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and in Fig 3.

      5.7) Line 346 "...it is impossible for our model to add the activity of larger R2 neurons to that of smaller R2 neurons" => Is it really impossible? The optimization can definitely add small-scale copies of behaviorally relevant information to all neurons with minimal increase in the overall optimization loss, so this statement seems inaccurate.

      5.8) Line 490: "we found that linear decoders can achieve comparable performance to that of nonlinear decoders, providing compelling evidence for the presence of linear readout in the motor cortex." => inaccurate because no d-VAE decoding is really linear, as explained in comment 3 above.

      5.9) Line 578: ". However, our results challenge this idea by showing that signals composed of smaller variance PCs nonlinearly encode a significant amount of behavioral information." => inaccurate as results are confounded by nonlinearity of d-VAE as explained in comment 3 above.

      5.10) Line 592: "By filtering out behaviorally-irrelevant signals, our study found that accurate decoding performance can be achieved through linear readout, suggesting that the motor cortex may perform linear readout to generate movement behaviors." => inaccurate because it is confounded by the nonlinearity of d-VAE as explained in comment 3 above.”

      Regarding “5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve”:

      We believe our statement is accurate. Our primary objective is to extract accurate behaviorally-relevant signals that closely approximate the ground truth relevant signals. To achieve this, we strike a balance between the reconstruction and decoding performance of the generated signals, aiming to effectively capture the relevant signals. This crucial aspect of our approach sets it apart from other methods. In contrast, other methods tend to emphasize the extraction of valuable latent neural dynamics. We have provided elaboration on the distinctions between d-VAE and other approaches in the Introduction and Discussion sections.

      Thank you for your valuable feedback.

      Regarding “5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.”:

      In the analysis of neural signals, smaller variance PC signals are typically seen as noise and are often discarded. Similarly, smaller R2 neurons are commonly thought to be dominated by noise and are not further analyzed. Given these considerations, we believe that the term "considered useless" is appropriate in this context. Thank you for your valuable feedback.

      Regarding “5.3) "... recent studies (Glaser et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?”:

      In this paper, we consider the two statements to be equivalent. Thank you for your valuable feedback.

      Regarding “5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.”:

      We mean the latter. As stated in the section “Framework for defining, extracting, and separating behaviorally-relevant signals”, raw signals contain many behaviorally-irrelevant signals, so deep neural networks are more prone to overfit raw signals than relevant signals. Therefore, the decoding performance of relevant signals should surpass that of raw signals. Thank you for your valuable feedback.

      Regarding “5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.”: In practice, researchers usually used raw signals to estimate the neural dimensionality. We mean that using raw signals to do this would overestimate the neural dimensionality. Thank you for your valuable feedback.

      Regarding “5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with the worst raw decoding are more likely to benefit from a change in decoder than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and in Fig 3.”:

      When employing R2 to characterize neurons, it indicates the extent to which neuronal activity is explained by the linear encoding model [1-3]. Smaller R2 neurons have a lower capacity for linearly tuning (encoding) behaviors, while larger R2 neurons have a higher capacity for linearly tuning (encoding) behaviors. Specifically, the approach involves first establishing an encoding relationship from velocity to neural signal using a linear model, i.e., y=f(x), where f represents a linear regression model, x denotes velocity, and y denotes the neural signal. Subsequently, R2 is utilized to quantify the effectiveness of the linear encoding model in explaining neural activity. We have provided a comprehensive explanation in the revised manuscript. Thank you for your valuable feedback.
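
      The per-neuron R2 of a linear encoding model described above can be sketched as follows. This is an illustrative example on synthetic data, not the code used in the manuscript; the function name `encoding_r2`, the two simulated neurons, and the noise level are assumptions. One neuron is linearly tuned to velocity and the other is tuned to speed, a nonlinear function of velocity.

```python
import numpy as np


def encoding_r2(velocity, neural):
    """R^2 of the linear encoding model y = f(x) for each neuron.

    velocity: (T, d) behavior; neural: (T, N) activity. Returns (N,) R^2 values.
    """
    X = np.column_stack([velocity, np.ones(len(velocity))])  # add intercept term
    coef, *_ = np.linalg.lstsq(X, neural, rcond=None)         # least-squares fit
    residual = neural - X @ coef
    ss_res = np.sum(residual**2, axis=0)
    ss_tot = np.sum((neural - neural.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
velocity = rng.normal(size=(2000, 2))  # synthetic 2-D velocity

# Neuron 0: linear tuning to velocity. Neuron 1: tuning to speed (nonlinear).
linear_neuron = 2.0 * velocity[:, 0] - velocity[:, 1] + 0.3 * rng.normal(size=2000)
speed_neuron = np.linalg.norm(velocity, axis=1) + 0.3 * rng.normal(size=2000)
neural = np.column_stack([linear_neuron, speed_neuron])

r2 = encoding_r2(velocity, neural)
print(r2)  # neuron 0 has a large R^2; neuron 1 has an R^2 near zero
```

      In this toy setting the speed-tuned neuron would be labeled a smaller R2 neuron even though its activity depends strongly on behavior, which is why the choice of a linear encoding model matters for the grouping.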

      [1] Collinger, J.L., Wodlinger, B., Downey, J.E., Wang, W., Tyler-Kabara, E.C., Weber, D.J., McMorland, A.J., Velliste, M., Boninger, M.L. and Schwartz, A.B., 2013. High-performance neuroprosthetic control by an individual with tetraplegia. The Lancet, 381(9866), pp.557-564.

      [2] Wodlinger, B., et al., 2014. Ten-dimensional anthropomorphic arm control in a human brain-machine interface: difficulties, solutions, and limitations. Journal of neural engineering, 12(1), p.016011.

      [3] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.

      Regarding Questions 5.7, 5.8, 5.9, and 5.10:

      We believe our conclusions are solid. The reasons can be found in our replies to Q2 and Q3. Thank you for your valuable feedback.

      Q6: “Imprecise use of language also sometimes is not inaccurate but just makes the text hard to follow.

      6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in lines 77-79 are also not clear.

      6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.

      6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.

      6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.

      6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?”

      Regarding “6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in lines 77-79 are also not clear.”:

      We would like to provide a detailed explanation of neural encoding and decoding. Neural encoding describes how neuronal activity encodes behaviors, that is, y=f(x), where y denotes neural activity, x denotes behaviors, and f is the encoding model. Neural decoding describes how the brain decodes behaviors from neural activity, that is, x=g(y), where g is the decoding model. For further elaboration, please refer to [1]. We have included references that discuss the concepts of encoding and decoding in the revised manuscript. Thank you for your valuable feedback.

      [1] Kriegeskorte, Nikolaus, and Pamela K. Douglas. "Interpreting encoding and decoding models." Current opinion in neurobiology 55 (2019): 167-179.

      Regarding “6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.”:

      This question is the same as Q5.6. Please refer to the response to Q5.6. Thank you for your valuable feedback.

      Regarding “6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.”:

      We have revised this statement in the revised manuscript. Thanks for your recommendation.

      Regarding “6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.”:

      We mean that removing the interference of irrelevant signals and decoding the relevant signals should logically be two stages. We have revised this statement in the revised manuscript. Thank you for your valuable feedback.

      Regarding “6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?”:

      We have replaced “generating performance” with “reconstruction performance” in the revised manuscript. Thanks for your recommendation.

      Q7: “In the analysis presented starting in line 449, the authors compare improvement gained for decoding various speed ranges by adding secondary (small PC) neurons to the KF decoder (Fig S11). Why is this done using the KF decoder, when earlier results suggest an ANN decoder is needed for accurate decoding from these small PC neurons? It makes sense to use the more accurate nonlinear ANN decoder to support the fundamental claim made here, that smaller variance PCs are involved in regulating precise control”

      We used KF because its performance improves substantially when the secondary signal is superimposed on the primary signal, and we wanted to explore which aspect of the behavior mainly reflects this improvement. In comparison, the improvement of ANN by the secondary signal is very small, so the same analysis with ANN would be uninformative. Thank you for your valuable feedback.

      Q8: “A key limitation of the VAE architecture is that it doesn't aggregate information over multiple time samples. This may be why the authors decided to use a very large bin size of 100ms and beyond that smooth the data with a moving average. This limitation should be clearly stated somewhere in contrast with methods that can aggregate information over time (e.g., TNDM, LFADS, PSID)”

      We have added this limitation in the Discussion in the revised manuscript. Thanks for your recommendation.

      Q9: “Fig 5c and parts of the text explore the decoding when some neurons are dropped. These results should come with a reminder that dropping neurons from behaviorally relevant signals is not technically possible since the extraction of behaviorally relevant signals with d-VAE is a population level aggregation that requires the raw signal from all neurons as an input. This is also important to remind in some places in the text for example:

      • Line 498: "...when one of the neurons is destroyed."

      • Line 572: "In contrast, our results show that decoders maintain high performance on distilled signals even when many neurons drop out."”

      We want to explore the robustness of real relevant signals in the face of neuron drop-out. The signals our model extracted are an approximation of the ground truth relevant signals and thus serve as a substitute for ground truth to study this problem. Thank you for your valuable feedback.

      Q10: “Besides the confounded conclusions regarding the readout being linear (see comment 3 and items related to it in comment 5), the authors also don't adequately discuss prior works that suggest nonlinearity helps decoding of behavior from the motor cortex. Around line 594, a few works are discussed as support for the idea of a linear readout. This should be accompanied by a discussion of works that support a nonlinear encoding of behavior in the motor cortex, for example (Naufel et al. 2019; Glaser et al. 2020), some of which the authors cite elsewhere but don't discuss here.”

      We have added this discussion in the revised manuscript. Thanks for your recommendation.

      Q11: “Selection of hyperparameters is not clearly explained. Starting line 791, the authors give some explanation for one hyperparameter, but not others. How are the other hyperparameters determined? What is the search space for the grid search of each hyperparameter? Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits? That seems unlikely.”

      We performed a grid search over {0.001, 0.01, 0.1, 1} for the hyperparameter beta and found that 0.001 is the best value for all datasets. As for the model parameters, such as the number of hidden neurons, the current model capacity already reaches saturated decoding performance, so these parameters do not influence the results.

      Regarding “Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits”: We selected the hyperparameter based on the average performance on the validation sets across the 5 folds. The selected value is the one that yields the highest average validation performance across the 5 folds.
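
      The selection rule described above (one value per dataset, chosen by the highest validation performance averaged over 5 folds) can be sketched as follows. This is only an illustration of the procedure: ridge regression stands in for d-VAE, which is not reproduced here, and the function names, synthetic data, and use of R2 as the validation score are assumptions.

```python
import numpy as np


def ridge_fit(X, Y, beta):
    """Closed-form ridge regression; a stand-in for training the real model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + beta * np.eye(d), X.T @ Y)


def r2_score(Y, Y_hat):
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def select_beta(X, Y, grid, n_folds=5):
    """Return the grid value with the highest validation score averaged over folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    mean_scores = {}
    for beta in grid:
        scores = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            W = ridge_fit(X[train], Y[train], beta)
            scores.append(r2_score(Y[val], X[val] @ W))
        mean_scores[beta] = float(np.mean(scores))  # average over the folds
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # synthetic "neural" features
Y = X @ rng.normal(size=(20, 2)) + 0.5 * rng.normal(size=(500, 2))  # synthetic "behavior"
best_beta, mean_scores = select_beta(X, Y, grid=[0.001, 0.01, 0.1, 1])
print(best_beta, mean_scores)
```

      Because the scores are averaged before the maximum is taken, a single value is reported per dataset even if individual folds would have preferred different values.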

      Thank you for your valuable feedback.

      Q12: “d-VAE itself should also be explained more clearly in the main text. Currently, only the high-level idea of the objective is explained. The explanation should be more precise and include the idea of encoding to latent state, explain the relation to pip-VAE, explain inputs and outputs, linearity/nonlinearity of various mappings, etc. Also see comment 1 above, where I suggest adding more details about other methods in the main text.”

      Our primary objective is to delve into the encoding and decoding mechanisms using the separated relevant signals. Therefore, providing an excessive amount of model details could potentially distract from the main focus of the paper. In response to your suggestion, we have included a visual representation of d-VAE's structure, input, and output (see Fig. S1) in the revised manuscript, which offers a comprehensive and intuitive overview. Additionally, we have expanded on the details of d-VAE and other methods in the Methods section.

      Thank you for your valuable feedback.

      Q13: “In Fig 1f and g, shouldn't the performance plots be swapped? The current plots seem counterintuitive. If there is bias toward decoding (panel g), why is the irrelevant residual so good at decoding?”

      The placement of the performance plots in Fig. 1f and 1g is accurate. When the model exhibits a bias toward decoding, it prioritizes extracting the most relevant features (latent variables) for decoding purposes. As a consequence, the model predominantly generates signals that are closely associated with these extracted features. This selective signal extraction and generation process may result in the exclusion of other potentially useful information, which will be left in the residuals. To illustrate this concept, consider the example of face recognition: if a model can accurately identify an individual using only the person's eyes (assuming these are the most useful features), other valuable information, such as details of the nose or mouth, will be left in the residuals, which could also be used to identify the individual.

      Thank you for your valuable feedback.

    1. Author Response:

      The following is the authors’ response to the previous reviews.

      We carefully read through the second-round reviews and the additional reviews. To us, the review process is somewhat unusual and very much dominated by referee 2, who aggressively insists that we mixed up the trigeminal nucleus and inferior olive and that as a consequence our results are meaningless. We think the stance of referee 2 and the focus on one single issue (the alleged mix-up of trigeminal nucleus and inferior olive) is somewhat unfortunate and leaves out much of our findings, and we debated at length how to deal with further revisions. In the end, we decided to again give priority to addressing the criticism of referee 2, because it is hard to go on with a heavily attacked paper without resolving the matter at stake. The following is a summary of what we did:

      Additional experimental work:

      (1) We checked if the peripherin-antibody indeed reliably identifies climbing fibers.

      To this end, we sectioned the elephant cerebellum and stained sections with the peripherin-antibody. We find: (i) The cerebellar white matter is strongly reactive for peripherin-antibodies. (ii) Cerebellar peripherin-antibody staining has an axonal appearance. (iii) Cerebellar Purkinje cell somata appear to be ensheathed by peripherin-antibody staining. (iv) The peripherin-antibody reactivity gradually decreases from Purkinje cell somata to the pia in the cerebellar molecular layer. This work is shown in our revised Figure 2. All four features align with the distribution of climbing fibers (which arrive through the white matter, are axons, ensheath Purkinje cell somata, and innervate Purkinje cells proximally without reaching the pia). In line with previous work, which showed similar cerebellar staining patterns in several species (Errante et al. 1998), we conclude that elephant climbing fibers are strongly reactive for peripherin-antibodies.

      (2) We delineated the elephant olivo-cerebellar tract.

      The strong peripherin-antibody reactivity of elephant climbing fibers enabled us to delineate the elephant olivo-cerebellar tract. We find the elephant olivo-cerebellar tract is a strongly peripherin-antibody reactive, well-delineated fiber tract several millimeters wide and about a centimeter in height. The unstained olivo-cerebellar tract has a greyish appearance. In the anterior regions of the olivo-cerebellar tract, we find that peripherin-antibody reactive fibers run in the dorsolateral brainstem and approach the cerebellar peduncle, where the tract gradually diminishes in size, presumably because climbing fibers discharge into the peduncle. Indeed, peripherin-antibody reactive fibers can be seen entering the cerebellar peduncle. Towards the posterior end of the peduncle, the olivo-cerebellar tract disappears in the dorsal brainstem directly below the peduncle. We note that the olivo-cerebellar tract was referred to as the spinal trigeminal tract by Maseko et al. 2013. We think the tract in question cannot be the spinal trigeminal tract for two reasons: (i) This tract is the sole brainstem source of peripherin-positive climbing fibers entering the peduncle/the cerebellum; this is the defining characteristic of the olivo-cerebellar tract. (ii) The tract in question is much smaller than the trigeminal nerve, disappears posterior to where the trigeminal nerve enters the brainstem (see below), and has no continuity with the trigeminal nerve; continuity with the trigeminal nerve, however, is the defining characteristic of the spinal trigeminal tract.

      The anterior regions of the elephant olivo-cerebellar tract are similar to the anterior regions of olivo-cerebellar tract of other mammals in its dorsolateral position and the relation to the cerebellar peduncle. In its more posterior parts, the elephant olivo-cerebellar tract continues for a long distance (~1.5 cm) in roughly the same dorsolateral position and enters the serrated nucleus that we previously identified as the elephant inferior olive. The more posterior parts of the elephant olivo-cerebellar tract therefore differ from the more posterior parts of the olivo-cerebellar tract of other mammals, which follows a ventromedial trajectory towards a ventromedially situated inferior olive. The implication of our delineation of the elephant olivo-cerebellar tract is that we correctly identified the elephant inferior olive.

      (3) An in-depth analysis of peripherin-antibody reactivity also indicates that the trigeminal nucleus receives no climbing fiber input.

      We also studied the peripherin-antibody reactivity in and around the trigeminal nucleus. We had already noted in the previous submission that the trigeminal nucleus is weakly positive for peripherin, but that the staining pattern is uniform and not the type of axon bundle pattern that is seen in the inferior olive of other mammals. To us, this observation already argued against the presence of climbing fibers in the trigeminal nucleus. We also noted that the myelin stripes of the trigeminal nucleus were peripherin-antibody-negative. In the context of our olivo-cerebellar tract tracing we now also scrutinized the surroundings of the trigeminal nucleus for peripherin-antibody reactivity. We find that the ventral brainstem surrounding the trigeminal nucleus is devoid of peripherin-antibody reactivity. Accordingly, no climbing fibers (which we have shown to be strongly peripherin-antibody-positive, see our point 1) arrive at the trigeminal nucleus. The absence of climbing fiber input indicates that previous work that identified the (trigeminal) nucleus as the inferior olive (Maseko et al. 2013) is unlikely to be correct.

      (4) We characterized the entry of the trigeminal nerve into the elephant brain.

      To better understand how trigeminal information enters the elephant’s brain, we characterized the entry of the trigeminal nerve. This analysis indicated to us that the trigeminal nerve is not continuous with the olivo-cerebellar tract (the spinal trigeminal tract of Maseko et al. 2013) as previously claimed by Maseko et al. 2013. We show some of this evidence in Referee-Figure 1 below. The reason we think the trigeminal nerve is discontinuous with the olivo-cerebellar tract is the size discrepancy between the two structures. We first show this for the tracing data of Maseko et al. 2013. In the Maseko et al. 2013 data the trigeminal nerve (Referee-Figure 1A, their plate Y) has 3-4 times the diameter of the olivo-cerebellar tract (the alleged spinal trigeminal tract, Referee-Figure 1B, their plate Z). Note that most if not all trigeminal fibers are thought to continue from the nerve into the trigeminal tract (see our rat data below). We plotted the diameter of the trigeminal nerve and the diameter of the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) from the Maseko et al. 2013 data (Referee-Figure 1C) and found that the olivo-cerebellar tract has a fairly consistent diameter (46 ± 9 mm2, mean ± SD). Statistical considerations and anatomical evidence suggest that the tracing of the trigeminal nerve into the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) is almost certainly wrong. The most anterior point of the alleged spinal trigeminal tract has a diameter of 51 mm2, which is more than 15 standard deviations different from the most posterior diameter (194 mm2) of the trigeminal tract. For this assignment to be correct, three-quarters of trigeminal nerve fibers would have to spontaneously disappear, something that does not happen in the brain. 
We also made similar observations in the African elephant Bibi, where the trigeminal nerve (Referee-Figure 1D) is much larger in diameter than the olivo-cerebellar tract (Referee-Figure 1E). We could also show that the olivo-cerebellar tract disappears into the peduncle posterior to where the trigeminal nerve enters (Referee-Figure 1F). Our data are very similar to those of Maseko et al., indicating that their outlining of structures was done correctly. What appears to have been oversimplified is the assignment of structures as continuous. We also quantified the diameter of the trigeminal nerve and the spinal trigeminal tract in rats (from the Paxinos & Watson atlas; Referee-Figure 1D); as expected, we found that the trigeminal nerve and spinal trigeminal tract diameters are essentially continuous.
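
      The two numerical statements in the size-mismatch argument (a gap of more than 15 standard deviations, and a loss of about three-quarters of the fibers) follow directly from the reported values. The short check below uses only the numbers quoted above; the variable names are illustrative, and the fiber-loss estimate additionally assumes that the number of fibers scales with the reported size measure.

```python
# Values as reported in the text (units as given there: mm2).
tract_sd = 9.0            # SD of the olivo-cerebellar tract measurements
tract_anterior = 51.0     # most anterior point of the alleged spinal trigeminal tract
nerve_posterior = 194.0   # most posterior trigeminal measurement

# Gap between the two measurements, expressed in tract standard deviations.
gap_in_sd = (nerve_posterior - tract_anterior) / tract_sd

# Fraction of fibers that would have to vanish for the two to be continuous,
# assuming fiber number scales with the reported size measure.
fraction_lost = 1.0 - tract_anterior / nerve_posterior

print(f"gap: {gap_in_sd:.1f} SD, fraction lost: {fraction_lost:.2f}")
```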

      In our hands, the trigeminal nerve does not continue into a well-defined tract that could be traced after its entry. In this regard, it differs from both the olivo-cerebellar tract of the elephant and the spinal trigeminal tract of the rodent, both of which are well delineated. We think the absence of a well-delineated spinal trigeminal tract in elephants might have contributed to the putative tracing error highlighted in our Referee-Figure 1A-C.

      We conclude that a size mismatch indicates trigeminal fibers do not run in the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013).
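
As a rough check of the arithmetic behind this size argument, the sketch below recomputes the two figures from the values quoted in the text (46 ± 9 for the olivo-cerebellar tract, 51 for its most anterior point, 194 for the most posterior trigeminal measurement). The variable and function names are illustrative only, and reading the size ratio as a fraction of fibers lost assumes that the measured size scales with fiber number, which is a simplification.

```python
# Values as quoted in the response (replotted from Maseko et al. 2013).
tract_mean = 46.0            # olivo-cerebellar tract, mean size
tract_sd = 9.0               # olivo-cerebellar tract, standard deviation
tract_anterior = 51.0        # most anterior point of the alleged spinal trigeminal tract
trigeminal_posterior = 194.0 # most posterior trigeminal measurement


def distance_in_sd(value, reference, sd):
    """Return how many standard deviations `value` lies from `reference`."""
    return (value - reference) / sd


# Gap between the trigeminal measurement and the anterior end of the tract.
gap_vs_anterior = distance_in_sd(trigeminal_posterior, tract_anterior, tract_sd)

# Same gap measured against the tract mean instead of its anterior end.
gap_vs_mean = distance_in_sd(trigeminal_posterior, tract_mean, tract_sd)

# Fraction of the trigeminal size that would have to vanish for continuity
# (assumption: size is proportional to the number of fibers).
fraction_lost = 1.0 - tract_anterior / trigeminal_posterior

print(f"gap vs anterior tract point: {gap_vs_anterior:.1f} SD")
print(f"gap vs tract mean:           {gap_vs_mean:.1f} SD")
print(f"fraction that would vanish:  {fraction_lost:.0%}")
```

Under these assumptions the gap is just under 16 standard deviations and the implied loss is roughly 74%, consistent with the "more than 15 standard deviations" and "three-quarters" statements above.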

      Author response image 1.

      The trigeminal nerve is discontinuous with the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). A, Trigeminal nerve (orange) in the brain of African elephant LAX as delineated by Maseko et al. 2013 (coronal section; their plate Y). B, Most anterior appearance of the spinal trigeminal tract of Maseko et al. 2013 (blue; coronal section; their plate Z). Note the much smaller diameter of the spinal trigeminal tract compared to the trigeminal nerve shown in A, which argues against the continuity of the two structures. Indeed, our peripherin-antibody staining showed that the spinal trigeminal tract of Maseko corresponds to the olivo-cerebellar tract and is discontinuous with the trigeminal nerve. C, Plot of the trigeminal nerve and olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) diameters along the anterior-posterior axis. The trigeminal nerve is much larger in diameter than the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). A, B measurements, for which sections are shown in panels A and B respectively. The olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) has a consistent diameter; data replotted from Maseko et al. 2013. At mm 25 the inferior olive appears. D, Trigeminal nerve entry in the brain of African elephant Bibi; our data, coronal section, the trigeminal nerve is outlined in orange, note the large diameter. E, Most anterior appearance of the olivo-cerebellar tract in the brain of African elephant Bibi; our data, coronal section, approximately 3 mm posterior to the section shown in D, the olivo-cerebellar tract is outlined in blue. Note the smaller diameter of the olivo-cerebellar tract compared to the trigeminal nerve, which argues against the continuity of the two structures. F, Plot of the trigeminal nerve and olivo-cerebellar tract diameter along the anterior-posterior axis. 
The nerve and olivo-cerebellar tract are discontinuous and the trigeminal nerve is much larger in diameter than the olivocerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013); our data. D, E measurements, for which sections are shown in panels D and E respectively. At mm 27 the inferior olive appears. G, In the rat the trigeminal nerve is continuous in size with the spinal trigeminal tract. Data replotted from Paxinos and Watson.

      Reviewer 2 (Public Review):

      As indicated in my previous review of this manuscript (see above), it is my opinion that the authors have misidentified, and indeed switched, the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex (Vsens). It is this specific point only that I will address in this second review, as this is the crucial aspect of this paper - if the identification of these nuclear complexes in the elephant brainstem by the authors is incorrect, the remainder of the paper does not have any scientific validity.

      Comment: We agree with the referee that it is most important to sort out the identity of the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex.

      Change: We did additional experimental work to resolve this matter as detailed at the beginning of our response. Specifically, we ascertained that elephant climbing fibers are strongly peripherin-positive. Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar tract connects the structure we refer to as inferior olive to the cerebellum (the referee refers to this structure as the trigeminal nuclear complex). We also found that the trigeminal nucleus (the structure the referee refers to as inferior olive) appears to receive no climbing fibers. We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2013 was erroneous (Author response image 1). These novel findings support our ideas but are very difficult to reconcile with the referee’s partitioning scheme.

      The authors, in their response to my initial review, claim that I "bend" the comparative evidence against them. They further claim that as all other mammalian species exhibit a "serrated" appearance of the inferior olive, and as the elephant does not exhibit this appearance, that what was previously identified as the inferior olive is actually the trigeminal nucleus and vice versa. 

      For convenience, I will refer to IOM and VsensM as the identification of these structures according to Maseko et al (2013) and other authors and will use IOR and VsensR to refer to the identification forwarded in the study under review. The IOM/VsensR certainly does not have a serrated appearance in elephants. Indeed, from the plates supplied by the authors in response (Referee Fig. 2), the cytochrome oxidase image supplied and the image from Maseko et al (2013) show a very similar appearance. There is no doubt that the authors are identifying structures that closely correspond to those provided by Maseko et al (2013). It is solely a contrast in what these nuclear complexes are called and the functional sequelae of the identification of these complexes (are they related to the trunk sensation or movement controlled by the cerebellum?) that is under debate.

      Elephants are part of the Afrotheria, thus the most relevant comparative data to resolve this issue will be the identification of these nuclei in other Afrotherian species. Below I provide images of these nuclear complexes, labelled in the standard nomenclature, across several Afrotherian species. 

      (A) Lesser hedgehog tenrec (Echinops telfairi) 

      Tenrec brains are the most intensively studied of the Afrotherian brains, these extensive neuroanatomical studies having been undertaken primarily by Heinz Künzle. Below I append images (coronal sections stained with cresyl violet) of the IO and Vsens (labelled in the standard mammalian manner) in the lesser hedgehog tenrec. It should be clear that the inferior olive is located in the ventral midline of the rostral medulla oblongata (just like the rat) and that this nucleus is not distinctly serrated. The Vsens is located in the lateral aspect of the medulla skirted laterally by the spinal trigeminal tract (Sp5). These images and the labels indicating structures correlate precisely with those provided by Künzle (1997, 10.1016, see his Figure 1K,L). Thus, in the first case of a related species, there is no serrated appearance of the inferior olive, the location of the inferior olive is confirmed through connectivity with the superior colliculus (a standard connection in mammals) by Künzle (1997), and the location of Vsens is what is considered to be typical for mammals. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (B) Giant otter shrew (Potomogale velox) 

      The otter shrews are close relatives of the Tenrecs. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see hints of the serration of the IO as defined by the authors, but we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      (C) Four-toed sengi (Petrodromus tetradactylus) 

      The sengis are close relatives of the Tenrecs and otter shrews, these three groups being part of the Afroinsectiphilia, a distinct branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see vague hints of the serration of the IO (as defined by the authors), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (D) Rock hyrax (Procavia capensis) 

      The hyraxes, along with the sirens and elephants form the Paenungulata branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per the standard mammalian anatomy. Here we see hints of the serration of the IO (as defined by the authors), but we also see evidence of a more "bulbous" appearance of subnuclei of the IO (particularly the principal nucleus), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (E) West Indian manatee (Trichechus manatus) 

      The sirens are the closest extant relatives of the elephants in the Afrotheria. Below I append images of cresyl violet (top) and myelin (bottom) stained coronal sections (taken from the University of Wisconsin-Madison Brain Collection, https://brainmuseum.org, and while quite low in magnification they do reveal the structures under debate) through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see the serration of the IO (as defined by the authors). Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      These comparisons and the structural identification, with which the authors agree as they only distinguish the elephants from the other Afrotheria, demonstrate that the appearance of the IO can be quite variable across mammalian species, including those with a close phylogenetic affinity to the elephants. Not all mammal species possess a "serrated" appearance of the IO. Thus, it is more than just theoretically possible that the IO of the elephant appears as described prior to this study. 

      So what about elephants? Below I append a series of images from coronal sections through the African elephant brainstem stained for Nissl, myelin, and immunostained for calretinin. These sections are labelled according to standard mammalian nomenclature. In these complete sections of the elephant brainstem, we do not see a serrated appearance of the IOM (as described previously and in the current study by the authors). Rather the principal nucleus of the IOM appears to be bulbous in nature. In the current study, no image of myelin staining in the IOM/VsensR is provided by the authors. However, in the images I provide, we do see the reported myelin stripes in all stains - agreement between the authors and reviewer on this point. The higher magnification image to the bottom left of the plate shows one of the IOM/VsensR myelin stripes immunostained for calretinin, and within the myelin stripes axons immunopositive for calretinin are seen (labelled with an arrow). The climbing fibres of the elephant cerebellar cortex are similarly calretinin immunopositive (10.1159/000345565). In contrast, although not shown at high magnification, the fibres forming the Sp5 in the elephant (in the Maseko description, unnamed in the description of the authors) show no immunoreactivity to calretinin. 

      Comment: We appreciate the referee’s additional comments. We concede the possibility that some relatives of elephants have a less serrated inferior olive than most other mammals. We maintain, however, that the elephant inferior olive (our Figure 1J) has the serrated appearance seen in the vast majority of mammals.

      Change: None.

      Peripherin Immunostaining 

      In their revised manuscript the authors present immunostaining of peripherin in the elephant brainstem. This is an important addition (although it does replace the only staining of myelin provided by the authors which is unusual as the word myelin is in the title of the paper) as peripherin is known to specifically label peripheral nerves. In addition, as pointed out by the authors, peripherin also immunostains climbing fibres (Errante et al., 1998). The understanding of this staining is important in determining the identification of the IO and Vsens in the elephant, although it is not ideal for this task as there is some ambiguity. Errante and colleagues (1998; Fig. 1) show that climbing fibres are peripherin-immunopositive in the rat. But what the authors do not evaluate is the extensive peripherin staining in the rat Sp5 in the same paper (Errante et al, 1998, Fig. 2). The image provided by the authors of their peripherin immunostaining (their new Figure 2) shows what I would call the Sp5 of the elephant to be strongly peripherin immunoreactive, just like the rat shown in Errante et al (1998), and moreover in the precise position of the rat Sp5! This makes sense as this is where the axons subserving the "extraordinary" tactile sensitivity of the elephant trunk would be found (in the standard model of mammalian brainstem anatomy). Interestingly, the peripherin immunostaining in the elephant is clearly lamellated...this coincides precisely with the description of the trigeminal sensory nuclei in the elephant by Maseko et al (2013) as pointed out by the authors in their rebuttal. Errante et al (1998) also point out peripherin immunostaining in the inferior olive, but according to the authors this is only "weakly present" in the elephant IOM/VsensR. This latter point is crucial. 
Surely if the elephant has an extraordinary sensory innervation from the trunk, with 400 000 axons entering the brain, the VsensR/IOM should be highly peripherin-immunopositive, including the myelinated axon bundles?! In this sense, the authors argue against their own interpretation - either the elephant trunk is not a highly sensitive tactile organ, or the VsensR is not the trigeminal nuclei it is supposed to be. 

      Comment: We made sure that elephant climbing fibers are strongly peripherin-positive (our revised Figure 2). As we already noted in our previous ms, we see weak diffuse peripherin-reactivity in the trigeminal nucleus (the inferior olive according to the referee), but no peripherin-reactive axon bundles (i.e. climbing fibers) of the kind seen in the inferior olive of other species. We also see no peripherin-reactive axon bundles (i.e. the olivo-cerebellar tract) arriving in the trigeminal nucleus, as the tissue surrounding the trigeminal nucleus is devoid of peripherin-reactivity. Again, this finding is incompatible with the referee’s ideas. As far as we can tell, the trigeminal fibers are not reactive for peripherin in the elephant, i.e. we did not observe peripherin-reactivity very close to the nerve entry, but unfortunately, we did not stain for peripherin-reactivity into the nerve. As the referee alludes to, the absence of peripherin-reactivity in the trigeminal tract is a difference between rodents and elephants.

      Change: Our novel Figure 2.

      Summary: 

      (1) Comparative data of species closely related to elephants (Afrotherians) demonstrates that not all mammals exhibit the "serrated" appearance of the principal nucleus of the inferior olive. 

      (2) The location of the IO and Vsens as reported in the current study (IOR and VsensR) would require a significant, and unprecedented, rearrangement of the brainstem in the elephants independently. I argue that the underlying molecular and genetic changes required to achieve this would be so extreme that it would lead to lethal phenotypes. Arguing that the "switcheroo" of the IO and Vsens does occur in the elephant (and no other mammals) and thus doesn't lead to lethal phenotypes is a circular argument that cannot be substantiated. 

      (3) Myelin stripes in the subnuclei of the inferior olivary nuclear complex are seen across all related mammals as shown above. Thus, the observation made in the elephant by the authors in what they call the VsensR, is similar to that seen in the IO of related mammals, especially when the IO takes on a more bulbous appearance. These myelin stripes are the origin of the olivocerebellar pathway, and are indeed calretinin immunopositive in the elephant as I show. 

      (4) What the authors see aligns perfectly with what has been described previously, the only difference being the names that nuclear complexes are being called. But identifying these nuclei is important, as any functional sequelae, as extensively discussed by the authors, is entirely dependent upon accurately identifying these nuclei. 

      (4) The peripherin immunostaining scores an own goal - if peripherin is marking peripheral nerves (as the authors and I believe it is), then why is the VsensR/IOM only "weakly positive" for this stain? This either means that the "extraordinary" tactile sensitivity of the elephant trunk is non-existent, or that the authors have misinterpreted this staining. That there is extensive staining in the fibre pathway dorsal and lateral to the IOR (which I call the spinal trigeminal tract), supports the idea that the authors have misinterpreted their peripherin immunostaining.

      (5) Evolutionary expediency. The authors argue that what they report is an expedient way in which to modify the organisation of the brainstem in the elephant to accommodate the "extraordinary" tactile sensitivity. I disagree. As pointed out in my first review, the elephant cerebellum is very large and comprised of huge numbers of morphologically complex neurons. The inferior olivary nuclei in all mammals studied in detail to date, give rise to the climbing fibres that terminate on the Purkinje cells of the cerebellar cortex. It is more parsimonious to argue that, in alignment with the expansion of the elephant cerebellum (for motor control of the trunk), the inferior olivary nuclei (specifically the principal nucleus) have had additional neurons added to accommodate this cerebellar expansion. Such an addition of neurons to the principal nucleus of the inferior olive could readily lead to the loss of the serrated appearance of the principal nucleus of the inferior olive, and would require far less modifications in the developmental genetic program that forms these nuclei. This type of quantitative change appears to be the primary way in which structures are altered in the mammalian brainstem. 

      Comment: We still disagree with the referee. We note that our conclusions rest on the analysis of 8 elephant brainstems, which we sectioned in three planes and stained with a variety of metabolic and antibody stains and in which we assigned two structures (the inferior olive and the trigeminal nucleus). Most of the evidence cited by the referee stems from a single paper, in which 147 structures were identified based on the analysis of a single brainstem sectioned in one plane and stained with a limited set of antibodies. Our synopsis of the evidence is the following.

      (1) We agree with the referee that concerning brainstem position our scheme of a ventromedial trigeminal nucleus and a dorsolateral inferior olive deviates from the usual mammalian position of these nuclei (i.e. a dorsolateral trigeminal nucleus and a ventromedial inferior olive).

      (2) Cytoarchitectonics support our partitioning scheme. The compact cellular appearance of our ventromedial trigeminal nucleus is characteristic of trigeminal nuclei. The serrated appearance of our dorsolateral inferior olive is characteristic of the mammalian inferior olive; we acknowledge that the referee claims exceptions here. To our knowledge, nobody has described a mammalian trigeminal nucleus with a serrated appearance (which would apply to the elephant in case the trigeminal nucleus is situated dorsolaterally).

      (3) Metabolic staining (cytochrome oxidase reactivity) supports our partitioning scheme. Specifically, our ventromedial trigeminal nucleus shows intense cytochrome oxidase reactivity as it is seen in the trigeminal nuclei of trigeminal tactile experts.

      (4) Isomorphism. The myelin stripes on our ventromedial trigeminal nucleus are isomorphic to trunk wrinkles. Isomorphism is a characteristic of somatosensory brain structures (barrels, barrelettes, nose-stripes, etc) and we know of no case where such isomorphism was misleading.

      (5) The large-scale organization of our ventromedial trigeminal nuclei in anterior-posterior repeats is characteristic of the mammalian trigeminal nuclei. To our knowledge, no such organization has ever been reported for the inferior olive.

      (6) Connectivity analysis supports our partitioning scheme. According to our delineation of the elephant olivo-cerebellar tract, our dorsolateral inferior olive is connected via peripherin-positive climbing fibers to the cerebellum. In contrast, our ventromedial trigeminal nucleus (the referee’s inferior olive) is not connected via climbing fibers to the cerebellum.

      Change: As discussed, we advanced further evidence in this revision. Our partitioning scheme (a ventromedial trigeminal nucleus and a dorsolateral inferior olive) is better supported by data and makes more sense than the referee’s suggestion (a dorsolateral trigeminal nucleus and a ventromedial inferior olive). It should be published.

      Reviewer #3 (Public Review):

      Summary: 

      The study claims to investigate trunk representations in elephant trigeminal nuclei located in the brainstem. The researchers identify large protrusions visible from the ventral surface of the brainstem, which they examined using a range of histological methods. However, this ventral location is usually where the inferior olivary complex is found, which challenges the authors' assertions about the nucleus under analysis. They find that this brainstem nucleus of elephants contains repeating modules, with a focus on the anterior and largest unit which they define as the putative nucleus principalis trunk module of the trigeminal. The nucleus exhibits low neuron density, with glia outnumbering neurons significantly. The study also utilizes synchrotron X-ray phase contrast tomography to suggest that myelin-stripe-axons traverse this module. The analysis maps myelin-rich stripes in several specimens and concludes, based on their number and patterning, that they likely correspond with trunk folds; however, this conclusion is not well supported if the nucleus has been misidentified. 

      Comment: The referee provides a summary of our work. The referee also notes that the correct identification of the trigeminal nucleus is critical to the message of our paper.

      Change: In line with these assessments we focused our revision efforts on the issue of trigeminal nucleus identification, please see our introductory comments and our response to Referee 2.

      Strengths: 

      The strength of this research lies in its comprehensive use of various anatomical methods, including Nissl staining, myelin staining, Golgi staining, cytochrome oxidase labeling, and synchrotron X-ray phase contrast tomography. The inclusion of quantitative data on cell numbers and sizes, dendritic orientation and morphology, and blood vessel density across the nucleus adds a quantitative dimension. Furthermore, the research is commendable for its high-quality and abundant images and figures, effectively illustrating the anatomy under investigation.

      Comment: We appreciate this positive assessment.

      Change: None

      Weaknesses: 

      While the research provides potentially valuable insights if revised to focus on the structure that appears to be inferior olivary nucleus, there are certain additional weaknesses that warrant further consideration. First, the suggestion that myelin stripes solely serve to separate sensory or motor modules rather than functioning as an "axonal supply system" lacks substantial support due to the absence of information about the neuronal origins and the termination targets of the axons. Postmortem fixed brain tissue limits the ability to trace full axon projections. While the study acknowledges these limitations, it is important to exercise caution in drawing conclusions about the precise role of myelin stripes without a more comprehensive understanding of their neural connections. 

      Comment: We understand these criticisms and the need for cautious interpretation. As we noted previously, we think that the eLife publishing scheme, where critical referee commentary is published along with our ms, will make this contribution particularly valuable.

      Change: Our additional efforts to secure the correct identification of the trigeminal nucleus.

      Second, the quantification presented in the study lacks comparison to other species or other relevant variables within the elephant specimens (i.e., whole brain or brainstem volume). The absence of comparative data to different species limits the ability to fully evaluate the significance of the findings. Comparative analyses could provide a broader context for understanding whether the observed features are unique to elephants or more common across species. This limitation in comparative data hinders a more comprehensive assessment of the implications of the research within the broader field of neuroanatomy. Furthermore, the quantitative comparisons between African and Asian elephant specimens should include some measure of overall brain size as a covariate in the analyses. Addressing these weaknesses would enable a richer interpretation of the study's findings. 

      Comment: We understand why the referee asks for additional comparative data, which would make our study more meaningful. We note that we already published a quantitative comparison of African and Asian elephant facial nuclei (Kaufmann et al. 2022). The quantitative differences between African and Asian elephant facial nuclei are similar in magnitude to what we observed here for the trigeminal nucleus, i.e. African elephants have about 10-15% more facial nucleus neurons than Asian elephants. The referee also notes that data on overall elephant brain size might be important for interpreting our data. We agree with this sentiment and we are preparing a ms on African and Asian elephant brain size. We find, unexpectedly given the larger body size of African elephants, that African elephants have smaller brains than Asian elephants. The finding might imply that African elephants, which have more facial nucleus neurons and more trigeminal nucleus trunk module neurons, are neurally more specialized in trunk control than Asian elephants.

      Change: We are preparing a further ms on African and Asian elephant brain size, a first version of this work has been submitted.

      Reviewer #4 (Public Review): 

      Summary: 

      The authors report a novel isomorphism in which the folds of the elephant trunk are recognizably mapped onto the principal sensory trigeminal nucleus in the brainstem. Further, they identify the enlarged nucleus as being situated in this species in an unusual ventral midline position. 

      Comment: The referee summarizes our work.

      Change: None.

      Strengths: 

      The identity of the purported trigeminal nucleus and the isomorphic mapping with the trunk folds is supported by multiple lines of evidence: enhanced staining for cytochrome oxidase, an enzyme associated with high metabolic activity; dense vascularization, consistent with high metabolic activity; prominent myelinated bundles that partition the nucleus in a 1:1 mapping of the cutaneous folds in the trunk periphery; near absence of labeling for the anti-peripherin antibody, specific for climbing fibers, which can be seen as expected in the inferior olive; and a high density of glia.

      Comment: The referee again reviews some of our key findings.

      Change: None. 

      Weaknesses: 

      Despite the supporting evidence listed above, the identification of the gross anatomical bumps, conspicuous in the ventral midline, is problematic. This would be the standard location of the inferior olive, with the principal trigeminal nucleus occupying a more dorsal position. This presents an apparent contradiction which at a minimum needs further discussion. Major species-specific specializations and positional shifts are well-documented for cortical areas, but nuclear layouts in the brainstem have been considered as less malleable. 

      Comment: The referee notes that our discrepancy with referee 2 needs to be addressed with further evidence and discussion, given the unusual position of both inferior olive and trigeminal nucleus in the partitioning scheme and that the mammalian brainstem tends to be positionally conservative. We agree with the referee. We note that, based on the immense size of the elephant trigeminal ganglion (50 g, half the size of a monkey brain), it was expected that the elephant trigeminal nucleus ought to be exceptionally large.

      Change: We did additional experimental work to resolve this matter: (i) We ascertained that elephant climbing fibers are strongly peripherin-positive. (ii) Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar tract connects the structure we refer to as inferior olive to the cerebellum. (iii) We also found that the trigeminal nucleus (the structure the referee refers to as inferior olive) appears to receive no climbing fibers. (iv) We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2013 was erroneous (Referee-Figure 1). These novel findings support our ideas.

      Reviewer #5 (Public Review): 

      After reading the manuscript and the concerns raised by reviewer 2 I see both sides of the argument - the relative location of trigeminal nucleus versus the inferior olive is quite different in elephants (and different from previous studies in elephants), but when there is a large disproportionate magnification of a behaviorally relevant body part at most levels of the nervous system (certainly in the cortex and thalamus), you can get major shifting in location of different structures. In the case of the elephant, it looks like there may be a lot of shifting. Something that is compelling is that the number of modules separated by the myelin bands corresponds to the number of trunk folds, which is different in the different elephants. This sort of modular division based on body parts is a general principle of mammalian brain organization (demonstrated beautifully for the cuneate and gracile nucleus in primates, VP in most species, S1 in a variety of mammals such as the star-nosed mole and duck-billed platypus). I don't think these relative changes in the brainstem would require major genetic programming - although some surely exists. Rodents and elephants have been independently evolving for over 60 million years so there is a substantial amount of time for changes in each lineage to occur.

      I agree that the authors have identified the trigeminal nucleus correctly, although comparisons with more outgroups would be needed to confirm this (although I'm not suggesting that the authors do this). I also think the new figure (which shows previous divisions of the brainstem versus their own) allows the reader to consider these issues for themselves. When reviewing this paper, I actually took the time to go through atlases of other species and even look at some of my own data from highly derived species. Establishing homology across groups based only on relative location is tough, especially when there appear to be large shifts in relative location of structures. My thoughts are that the authors did an extraordinary amount of work on obtaining, processing and analyzing this extremely valuable tissue. They document their work with images of the tissue and their arguments for their divisions are solid. I feel that they have earned the right to speculate - with qualifications - which they provide. 

      Comment: The referee summarizes our work and appears to be convinced by the line of our arguments. We are most grateful for this assessment. We add, again, that the skeptical assessment of referee 2 will be published as well and will give the interested reader the possibility to view another perspective on our work.

      Change: None. 

      Recommendations for the authors: 

      Reviewer #1 (Recommendations For The Authors):

      With this manuscript being virtually identical to the previous version, it is possible that some of the definitive conclusions about having identified the elephant trigeminal nucleus and trunk representation should be moderated in a more nuanced manner, especially given the careful and experienced perspective from reviewers with first-hand knowledge of elephant neuroanatomy.

      Comment: We agree that both our first and second revisions were very much centered on the debate of the correct identification of the trigeminal nucleus and that our ms did not evolve as much in other regards. This being said we agree with Referee 2 that we needed to have this debate. We also think we advanced important novel data in this context (the delineation of elephant olivo-cerebellar tract through the peripherin-antibody).

      Changes: Our revised Figure 2. 

      The peripherin staining adds another level of argument to the authors having identified the trigeminal brainstem instead of the inferior olive, if differential expression of peripherin is strong enough to distinguish one structure from the other.

      Comment: We think we showed too little peripherin-antibody staining in our previous revision. We have now addressed this problem.

      Changes: Our revised Figure 2, i.e. the delineation of the elephant olivo-cerebellar tract through the peripherin-antibody.

      There are some minor corrections to be made with the addition of Fig. 2., including renumbering the figures in the manuscript (e.g., 406, 521). 

      I continue to appreciate this novel investigation of the elephant brainstem and find it an interesting and thorough study, with the use of classical and modern neuroanatomical methods.

      Comment: We are thankful for this positive assessment.

      Reviewer #2 (Recommendations For The Authors):

      I do realise the authors are very unhappy with me and the reviews I have submitted. I do apologise if feelings have been hurt, and I do understand the authors put in a lot of hard work and thought to develop what they have; however, it is unfortunate that the work and thoughts are not correct. Science is about the search for the truth and sometimes we get it wrong. This is part of the scientific process and why most journals adhere to strict review processes of scientific manuscripts. As I said previously, the authors can use their data to write a paper describing and quantifying Golgi staining of neurons in the principal olivary nucleus of the elephant that should be published in a specialised journal and contextualised in terms of the motor control of the trunk and the large cerebellum of the elephant. 

      Comment: We appreciate the referee’s kind words. Also, no hard feelings from our side, this is just a scientific debate. In our experience, neuroanatomical debates are resolved by evidence and we note that we provide evidence strengthening our identification of the trigeminal nucleus and inferior olive. As far as we can tell from this effort and the substantial evidence accumulated, the referee is wrong.

      Reviewer #4 (Recommendations For The Authors):

      As a new reviewer, I have benefited from reading the previous reviews and Author response, even while having several new comments to add. 

      (1) The identification of the inferior olive and trigeminal nuclei is obviously center stage. An enlargement of the trigeminal nuclei is not necessarily problematic, given the published reports on the dramatic enlargement of the trigeminal nerve (Purkart et al., 2022). At issue is the conspicuous relocation of the trigeminal nuclei that is being promoted by Reveyaz et al. Conspicuous rearrangements are not uncommon; for example, primary sensory cortical fields in different species (fig. 1 in H.H.A. Oelschlager for dolphins; S. De Vreese et al. (2023) for cetaceans, L. Krubitzer on various species, in the context of evolution). The difficult point here concerns what looks like a rather conspicuous gross anatomical rearrangement, in BRAINSTEM - the assumption being that the brainstem bauplan is going to be specifically conservative and refractory to gross anatomical rearrangement. 

      Comment: We agree with the referee that the brainstem rearrangements are unexpected. We also think that the correct identification of nuclei needs to be at the center of our revision efforts.

      Change: Our revision provided further evidence (delineation of the olivo-cerebellar tract, characterization of the trigeminal nerve entry) about the identity of the nuclei we studied.

      Why would a major nucleus shift to such a different location? and how? Can ex vivo DTI provide further support of the correct identification? Is there other "disruption" in the brainstem? What occupies the traditional position of the trigeminal nuclei? An atlas-equivalent coronal view of the entire brainstem would be informative. The Authors have assembled multiple criteria to support their argument that the ventral "bumps" are in fact a translocated trigeminal principal nucleus: enhanced CO staining, enhanced vascularization, enhanced myelination (via Golgi stains and tomography), very scant labeling for a climbing fiber specific antibody (anti-peripherin), vs. dense staining of this in the alternative structure that they identify as IO; and a high density of glia. Admittedly, this should be sufficient, but the proposed translocation (in the BRAINSTEM) is sufficiently startling that this is arguably NOT sufficient. The terminology of "putative" is helpful, but a more cogent presentation of the results and more careful discussion might succeed in winning over at least some of a skeptical readership.

      Comment: We do not know what led to the elephant brainstem rearrangements we propose. If the trigeminal nuclei had expanded isometrically in elephants from the ancestral pattern, one would have expected a brain with big lateral bumps, not the elephant brain with its big ventromedial bumps. We note, however, that very likely the expansion of the elephant trigeminal nuclei did not occur isometrically. Instead, the neural representation of the elephant nose expanded dramatically and in rodents the nose is represented ventromedially in the brainstem face representation. Thus, we propose a ‘ventromedial outgrowth model’ according to which the elephant ventromedial trigeminal bumps result from a ventromedially directed outgrowth of the ancestral ventromedial nose representation.

      We advanced substantially more evidence to support our partitioning scheme, including the delineation of the olivo-cerebellar tract based on peripherin-reactivity. We also identified problems in previous partitioning schemes, such as the claim that the trigeminal nerve continues into the ~4x smaller olivocerebellar tract (Referee-Figure 1C, D); we think such a flow of fibers (which is also at odds with peripherin-antibody-reactivity and the appearance of nerve and olivocerebellar tract) is highly unlikely if not physically impossible. With all that, we do not think that we overstate our case in our cautiously presented ms.

      Change: We added evidence on the identification of elephant trigeminal nuclei and inferior olive.

      (2) Role of myelin. While the photos of myelin are convincing, it would be nice to have further documentation. Gallyas? Would antibodies to MBP work? What is the myelin distribution in the "standard" trigeminal nuclei (human? macaque or chimpanzee?). What are alternative sources of the bundles? Regardless, I think it would be beneficial to de-emphasize this point about the role of myelin in demarcating compartments. I would in fact suggest an alternative (more neutral) title that might highlight instead the isomorphic feature; for example, "An isomorphic representation of Trunk folds in the Elephant Trigeminal Nucleus." The present title stresses myelin, but figure 1 already focuses on CO. Additionally, the folds are actually mentioned almost in passing until later in the manuscript. I recommend a short section on these at the beginning of the Results to serve as a useful framework.

      Here I'm inclined to agree with the Reviewer, that the Authors' contention that the myelin stripes serve PRIMARILY to separate trunk-fold domains is not particularly compelling and arguably a distraction. The point can be made, but perhaps with less emphasis. After all, the fact that myelin has multiple roles is well-established, even if frequently overlooked. In addition, the Authors might make better use of an extensive relevant literature related to myelin as a compartmental marker; for example, results and discussion in D. Haenelt....N. Weiskopf (eLife, 2023), among others. Another example is the heavily myelinated stria of Gennari in primate visual cortex, consisting of intrinsic pyramidal cell axons, but where the role of the myelination has still not been elucidated.

      Comment: (1) Documentation of myelin. We note that we show further identification of myelinated fibers by the fluorescent dye fluomyelin in Figure 4B. We also performed additional myelin stains, such as the gold-myelin stain after the protocol of Schmued (Referee-Figure 2). In the end, nothing worked quite as well to visualize myelin-stripes as the bright-field images shown in Figure 4A, and it is only these images that allowed us to match myelin-stripes to trunk folds. Hence, we focus our presentation on these images.

      (2) Title: We get why the referee envisions an alternative title. This being said, we would like to stick with our current title, because we feel it highlights the major novelty we discovered.

      (3) We agree with many of the other comments of the referee on myelin phenomenology. We missed the Haenelt reference pointed out by the referee and think it is highly relevant to our paper.

      Change: 1. Review image 2. Inclusion of the Haenelt-reference.

      Author response image 2.

      Myelin stripes of the elephant trunk module visualized by Gold-chloride staining according to Schmued. A, Low magnification micrograph of the trunk module of African elephant Indra stained with AuCl according to Schmued. The putative finger is to the left, proximal is to the right. Myelin stripes can easily be recognized. The white box indicates the area shown in B. B, high magnification micrograph of two myelin stripes. Individual gold-stained (black) axons organized in myelin stripes can be recognized.

      Schmued, L. C. (1990). A rapid, sensitive histochemical stain for myelin in frozen brain sections. Journal of Histochemistry & Cytochemistry,38(5), 717-720.

      Are the "bumps" in any way "analogous" to the "brain warts" seen in entorhinal areas of some human brains (G. W. van Hoesen and A. Solodkin, 1993)?

      Comment: We think this is a similar phenomenon.

      Change: We included the van Hoesen and Solodkin (1993) reference in our discussion.

      At least slightly more background (ie, a separate section or, if necessary, supplement) would be helpful, going into more detail on the several subdivisions of the ION and if these undergo major alterations in the elephant.

      Comment: The strength of the paper is the detailed delineation of the trunk module, based on myelin stripes and isomorphism. We don’t think we have strong evidence on ION subdivisions, because it appears the trigeminal tract cannot be easily traced in elephants. Accordingly, we find it difficult to add information here.

      Change: None.

      Is there evidence from the literature of other conspicuous gross anatomical translocations, in any species, especially in subcortical regions? 

      Comment: The best example that comes to mind is the star-nosed mole brainstem. There is a beautiful paper comparing the star-nosed mole brainstem to the normal mole brainstem (Catania et al 2011). The principal trigeminal nucleus in the star-nosed mole is far more rostral and also more medial than in the mole; still, such rearrangements are minor compared to what we propose in elephants.

      Catania, Kenneth C., Duncan B. Leitch, and Danielle Gauthier. "A star in the brainstem reveals the first step of cortical magnification." PloS one 6.7 (2011): e22406.

      Change: None.

      (3) A major point concerns the isomorphism between the putative trigeminal nuclei and the trunk specialization. I think this can be much better presented, at least with more discussion and other examples. The Authors mention about the rodent "barrels," but it seemed strange to me that they do not refer to their own results in pig (C. Ritter et al., 2023) nor the work from Ken Catania, 2002 (star-nosed mole; "fingerprints in the brain") or other that might be appropriate. I concur with the Reviewer that there should be more comparative data. 

      Comment: We agree.

      Change: We added a discussion of other isomorphisms, including the star-nosed mole, to our paper.

      (4) Textual organization could be improved. 

      The Abstract all-important Introduction is a longish, semi "run-on" paragraph. At a minimum this should be broken up. The last paragraph of the Introduction puts forth five issues, but these are only loosely followed in the Results section. I think clarity and good organization are of the utmost importance in this manuscript. I recommend that the Authors begin the Results with a section on the trunk folds (currently figure 5, and discussion), continue with the several points related to the identification of the trigeminal nuclei, and continue with a parallel description of ION with more parallel data on the putative trigeminal and IO structures (currently referee Table 1, but incorporate into the text and add higher magnification of nucleus-specific cell types in the IO and trigeminal nuclei). Relevant comparative data should be included in the Discussion.

      Comment: 1. We agree with the referee that our abstract needed to be revised. 2. We also think that our ms was heavily altered by the insertion of the new Figure 2, which complemented Figure 1 from our first submission and is concerned with the identification of the inferior olive. From a standpoint of textual flow such changes were not ideal, but the revisions massively added to the certainty with which we identify the trigeminal nuclei. Thus, although we are not as content as we were with the flow, we think the ms advanced in the revision process and we would like to keep the Figure sequence as is. 3. We already noted above that we included additional comparative evidence.

      Change: 1. We revised our abstract. 2. We added comparative evidence.

      Reviewer #5 (Recommendations For The Authors): 

      The data is invaluable and provides insights into some of the largest mammals on the planet. 

      Comment: We are incredibly thankful for this positive assessment.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife Assessment

      This neuroimaging and electrophysiology study in a small cohort of congenital cataract patients with sight recovery aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in visual cortex. While contrasting sight-recovery with visually intact controls suggested the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, it provided only incomplete evidence supporting claims about the effects of early deprivation itself. The reported data were considered valuable, given the rare study population. However, the small sample sizes, lack of a specific control cohort and multiple methodological limitations will likely restrict usefulness to scientists working in this particular subfield.

      We thank the reviewing editors for their consideration and updated assessment of our manuscript after its first revision.

      In order to assess the effects of early deprivation, we included an age-matched, normally sighted control group recruited from the same community, measured in the same scanner and laboratory. This study design is analogous to numerous studies in permanently congenitally blind humans, which typically recruited sighted controls, but hardly ever individuals with a different, e.g. late blindness history. In order to improve the specificity of our conclusions, we used a frontal cortex voxel in addition to a visual cortex voxel (MRS). Analogously, we separately analyzed occipital and frontal electrodes (EEG).

      Moreover, we relate our findings in congenital cataract reversal individuals to findings in the literature on permanent congenital blindness. Note, there are, to the best of our knowledge, neither MRS nor resting-state EEG studies in individuals with permanent late blindness.

      Our participants necessarily have nystagmus and low visual acuity due to their congenital deprivation phase, and the existence of nystagmus is a recruitment criterion to diagnose congenital cataracts.

      It might be interesting for future studies to investigate individuals with transient late blindness. However, such a study would be ill-motivated had we not found differences between the most “extreme” of congenital visual deprivation conditions and normally sighted individuals (analogous to why earlier research on permanent blindness investigated permanent congenitally blind humans first, rather than permanently late blind humans, or both in the same study). Any result of such future work would need to refer to our study, and no result in these additional groups would invalidate our findings.

      Since all our congenital cataract reversal individuals by definition had visual impairments, we included an eyes closed condition, both in the MRS and EEG assessment. Any group effect during the eyes closed condition cannot be due to visual acuity deficits changing the bottom-up driven visual activation.

      As we detail in response to review 3, our EEG analyses followed the standards in the field.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      In this human neuroimaging and electrophysiology study, the authors aimed to characterise effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects, because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then perform multiple exploratory correlations between MRS measures and visual acuity, and report a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected two electrodes placed in the visual cortex for analysis and report a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. Control electrodes in the frontal region did not present with the same pattern. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel. Nevertheless, the study provides a rare and valuable insight into experience-dependent plasticity in the human brain.

      Strengths of study

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well written.

      Limitations

      Low sample size. Ten for CC and ten for SC, and further two SC participants were rejected due to lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      In the updated manuscript, the authors have provided justification for their sample size by pointing to prior studies and the inherent difficulties in recruiting individuals with bilateral congenital cataracts. Importantly, this highlights the value the study brings to the field while also acknowledging the need to replicate the effects in a larger cohort.

      Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from more specific populations, such as patients whose cataracts have not been removed, with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      In the updated version, the authors have indicated that future studies can pursue comparisons between congenital cataract participants and cohorts with later sight loss.

      MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      In the updated version, the authors have added more information that informs the reader of the MRS quality differences between voxel locations. This increases the transparency of their reporting and enhances the assessment of the results.

      Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drive the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised due to congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      The updated manuscript contains key references from non-human work to justify their interpretation.

      Heterogeneity in patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The updated document has addressed this caveat.

      Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      This has now been done throughout the document and increases the transparency of the reporting.

      P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlate with age.

      This caveat has been addressed in the revised manuscript.

      Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Fig.4. yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      This has been done throughout the document and increases the transparency of the reporting.

      The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      This caveat has been addressed. The authors have added frontal electrodes to their analysis, providing an essential regional control for the visual cortex location.

      Comments on the latest version:

      The authors have made reasonable adjustments to their manuscript that addressed most of my comments by adding further justification for their methodology, essential literature support, pointing out exploratory analyses, limitations and adding key control analyses. Their revised manuscript has overall improved, providing valuable information, though the evidence that supports their claims is still incomplete.

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      The study examined 10 congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts, measuring neural activity and neurochemical profiles from the visual cortex. The declared aim is to test whether restoring visual function after years of complete blindness impacts excitation/inhibition balance in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways in which this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      The main methodological limitation is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). Few previous studies suggested that Excitation/Inhibition ratio in the visual cortex is increased in congenitally blind patients; the present study reports that E/I ratio decreases instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Since we have not been able to acquire longitudinal data with the experimental design of the present study in congenital cataract reversal individuals, we compared the MRS and EEG results of congenital cataract reversal individuals to published work in permanently congenitally blind individuals. We consider this a resource-saving approach. We think that the results of our cross-sectional study now justify the costs and enormous efforts (and time for the patients who often have to travel long distances) associated with longitudinal studies in this rare population.

      There are also more technical limitations related to the correlation analyses, which are partly acknowledged in the manuscript. A bland correlation between GLX/GABA and the visual impairment is reported, but this is specific to the patients group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patients group.

      Given the exploratory nature of the correlations, we do not base the majority of our conclusions on this analysis. There are no doubts that the reported correlations need replication; however, replication is only possible after a first report. Thus, we hope to motivate corresponding analyses in further studies.

      It has to be noted that in the present study significance testing for correlations was corrected for multiple comparisons, and that some findings replicate earlier reports (e.g. effects on EEG aperiodic slope, alpha power, and correlations with chronological age).

      Conclusions:

      The main claim of the study is that sight recovery impacts the excitation/inhibition balance in the visual cortex, estimated with MRS or through indirect EEG indices. However, due to the weaknesses outlined above, the study cannot distinguish the effects of sight recovery from those of visual deprivation. Moreover, many aspects of the results are interesting but their validation and interpretation require additional experimental work.

      We interpret the group differences between individuals tested years after congenital visual deprivation and normally sighted individuals as supportive of the E/I ratio being impacted by congenital visual deprivation. In the absence of a sensitive period for the development of an E/I ratio, individuals with a transient phase of congenital blindness might have developed a visual system indistinguishable from that of normally sighted individuals. As we demonstrate, this is not so. Comparing the results of congenital cataract reversal individuals with those of permanently congenitally blind humans (from previous studies) allowed us to identify changes in the E/I ratio that add to those found for congenital blindness.

      We thank the reviewer for the helpful comments and suggestions related to the first submission and first revision of our manuscript. We are keen to translate some of them into future studies.

      Reviewer #3 (Public review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance measures, their interrelationship, and their relation to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept and slope of the aperiodic signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      First of all, I would like to disclose that I am not an expert in congenital visual deprivation, nor in MRS. My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods.

      Although the authors addressed some of the concerns of the previous version, major concerns and flaws remain in terms of methodological and statistical approaches along with the (over)interpretation of the results. Specific concerns include:

      (1 3.1) Response to Variability in Visual Deprivation

      Rather than listing the advantages and disadvantages of visual deprivation, I recommend providing at least a descriptive analysis of how the duration of visual deprivation influenced the measures of interest. This would enhance the depth and relevance of the discussion.

      Although Reviewer 2 and Reviewer 3 (see below) pointed out problems in interpreting multiple correlational analyses in small samples, we addressed this request by reporting such correlations between visual deprivation history and measured EEG/MRS outcomes.

      Calculating the correlation between duration of visual deprivation and behavioral or brain measures is, in fact, a common suggestion. The existence of sensitive periods, which are typically assumed not to follow a linear gradual decline of neuroplasticity, does not necessarily allow predicting a correlation with duration of blindness. Daphne Maurer has additionally worked on the concept of “sleeper effects” (Maurer et al., 2007), that is, effects of early deprivation on the brain and behavior which are observed only later in life, when the function/neural circuits mature.

      In accordance with this reasoning, we did not observe a significant correlation between duration of visual deprivation and any of our dependent variables.

      (2 3.2) Small Sample Size

      The issue of small sample size remains problematic. The justification that previous studies employed similar sample sizes does not adequately address the limitation in the current study. I strongly suggest that the correlation analyses should not feature prominently in the main manuscript or the abstract, especially if the discussion does not substantially rely on these correlations. Please also revisit the recommendations made in the section on statistical concerns.

      In the revised manuscript, we explicitly mention that our sample size is not atypical for the special group investigated, but that a replication of our results in larger samples would foster their impact. We only explicitly mention correlations that survived stringent testing for multiple comparisons in the main manuscript.

      Given the exploratory nature of the correlations, we have not based the majority of our claims on this analysis.

      (3 3.3) Statistical Concerns

      While I appreciate the effort of conducting an independent statistical check, it merely validates whether the reported statistical parameters, degrees of freedom (df), and p-values are consistent. However, this does not address the appropriateness of the chosen statistical methods.

      We did not intend for the statcheck report to justify the methods used for statistics, which we have done in a separate section with normality and homogeneity testing (Supplementary Material S9), and references to it in the descriptions of the statistical analyses (Methods, Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Several points require clarification or improvement:

      (4) Correlation Methods: The manuscript does not specify whether the reported correlation analyses are based on Pearson or Spearman correlation.

      The depicted correlations are Pearson correlations. We will add this information to the Methods.

      (5) Confidence Intervals: Include confidence intervals for correlations to represent the uncertainty associated with these estimates.

      We have added the confidence intervals for all measured correlations to the second revision of our manuscript.
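
      To illustrate what such an interval conveys at this sample size, the following is a minimal sketch of a Pearson correlation with an approximate confidence interval based on the Fisher z-transform. The choice of the Fisher method, the function names and the ten data points are assumptions made for illustration; they are not taken from the manuscript, which does not state how its intervals were computed.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r via the Fisher z-transform."""
    z = math.atanh(r)                # variance-stabilising transform
    se = 1.0 / math.sqrt(n - 3)      # standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical values for ten participants (not data from the study).
acuity = [0.2, 0.4, 0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.4, 1.6]
ratio = [2.1, 2.6, 2.2, 2.9, 2.5, 3.1, 2.8, 3.0, 3.4, 3.2]

r = pearson_r(acuity, ratio)
ci_low, ci_high = fisher_ci(r, len(acuity))
print(f"r = {r:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

      With n = 10, the interval is wide: under this approximation, calling `fisher_ci(0.6, 10)` gives roughly -0.05 to 0.89, so even a moderately large correlation is compatible with no association.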

      (6) Permutation Statistics: Given the small sample size, I recommend using permutation statistics, as these are exact tests and more appropriate for small datasets.

      Our study focuses on a rare population, with a sample size limited by the availability of participants. Our findings provide exploratory insights rather than make strong inferential claims. To this end, we have ensured that our analysis adheres to key statistical assumptions (Shapiro-Wilk as well as Levene’s tests, Supplementary Material S9), and reported our findings with effect sizes, appropriate caution and context.
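
      For readers unfamiliar with the approach recommended in point (6), the following is a minimal sketch of a permutation test for a Pearson correlation. It is not an analysis reported in the manuscript. The data, function names, number of permutations and random seed are hypothetical, and the sketch uses random (Monte Carlo) permutations; an exact test would instead enumerate every possible reordering.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a Pearson correlation."""
    rng = random.Random(seed)
    r_obs = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)        # break the pairing between x and y
        if abs(pearson_r(x, shuffled)) >= r_obs - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical paired measurements for ten participants (not study data).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_related = [2.1, 2.6, 2.2, 2.9, 2.5, 3.1, 2.8, 3.0, 3.4, 3.2]
y_unrelated = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

p_related = permutation_p(x, y_related)
p_unrelated = permutation_p(x, y_unrelated)
print(f"related: p = {p_related:.4f}; unrelated: p = {p_unrelated:.4f}")
```

      The p-value is the share of shuffled pairings whose correlation is at least as extreme as the observed one, so it does not rely on the normality assumption that the Shapiro-Wilk test checks.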

      (7) Adjusted P-Values: Ensure that reported Bonferroni corrected p-values (e.g., p > 0.999) are clearly labeled as adjusted p-values where applicable.

      In the revised manuscript, we have changed Figure 4 to say ‘adjusted p,’ which we indeed reported.
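
      As a brief illustration of why Bonferroni-adjusted values can read "p > 0.999", the sketch below applies the standard adjustment (multiply each raw p-value by the number of tests, then cap at 1). The raw p-values and the number of tests are hypothetical and do not come from the manuscript.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from four tests (not values from the study).
raw = [0.004, 0.03, 0.30, 0.80]
adjusted = bonferroni_adjust(raw)
for p_raw, p_adj in zip(raw, adjusted):
    print(f"raw p = {p_raw:.3f} -> adjusted p = {p_adj:.3f}")
```

      Any raw p-value above 1/m is mapped to exactly 1 after adjustment, which is why labeling the reported values as adjusted avoids misreading them as raw probabilities.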

      (8) Figure 2C

      Figure 2C still lacks crucial information that the correlation between Glx/GABA ratio and visual acuity was computed solely in the control group (as described in the rebuttal letter). Why was this analysis restricted to the control group? Please provide a rationale.

      Figure 2C depicts the correlation between Glx/GABA+ ratio and visual acuity in the congenital cataract reversal group, not the control group. This is mentioned in the Figure 2 legend, as well as in the main text where the figure is referred to (Page 18, Line 475).

      The correlation analyses between visual acuity and MRS/EEG measures were only performed in the congenital cataract reversal group, since the sighted control group comprised individuals with vision in the normal range; thus, this analysis would not make sense for controls. Table 1, with the individual visual acuities for all participants, including the normally sighted controls, shows the low variance in the latter group.

      For variables in which no a priori group differences in variance were predicted, we performed the correlation analyses across groups (see Supplementary Material S12, S15).

      We have now highlighted these motivations more clearly in the Methods of the revised manuscript (Page 16, Lines 405-410).

      (9 3.4) Interpretation of Aperiodic Signal

      Relying on previous studies to interpret the aperiodic slope as a proxy for excitation/inhibition (E/I) does not make the interpretation more robust.

      How to interpret aperiodic EEG activity has been the subject of extensive investigation. We cite studies which provide evidence from multiple species (monkeys, humans) and measurements (EEG, MEG, ECoG), including studies which pharmacologically manipulated E/I balance.

      Whether our findings are robust, in fact, requires a replication study. Importantly, we analyzed the intercept of the aperiodic activity fit as well, and discuss results related to the intercept.

      Quote:

      “(3.4) Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Apart from the modeling work from Gao et al., multiple papers that we also cited used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      - The authors further wrote: We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018). This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation by what the 1/f slope in EEG measurements is actually influenced.

      Note that Muthukumaraswamy & Liley (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in humans, in addition to monkey ECoG. Further, Medel et al. (now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results is in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG from humans.

      In the introduction of the revised manuscript, we have made more explicit that this metric is indirect (Page 3, Line 91), (additionally see Discussion, Page 24, Lines 644-645, Page 25, Lines 650-657).

      While a full understanding of aperiodic activity still needs to be provided, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first to assess both MRS-measured neurotransmitter levels and EEG aperiodic activity.”

      (10) Additionally, the authors state:

      "We cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness."

      (11) This could be addressed directly by including skull thickness as a covariate or visualizing it in scatterplots, for instance, by representing skull thickness as the size of the dots.

      We are not aware of any study that would justify such an analysis.

      Our analyses were based on previous findings in the literature.

      Since, to the best of our knowledge, no evidence exists that congenital cataracts go together with changes in skull thickness, or that skull thickness might selectively modulate visual cortex Glx/GABA+ but not NAA measures, we decided against following this suggestion.

      Notably, the neurotransmitter concentration reported here is after tissue segmentation of the voxel region. The tissue fraction was shown to not differ between groups in the MRS voxels (Supplementary Material S4). The EEG electrode impedance was lowered to <10 kOhm in every participant (Methods, Page 13, Line 344), and preparation was identical across groups.

      (12 3.5) Problems with EEG Preprocessing and Analysis

      Downsampling: The decision to downsample the data to 60 Hz "to match the stimulation rate" is problematic. This choice conflates subsequent spectral analyses due to aliasing issues, as explained by the Nyquist theorem. While the authors cite prior studies (Schwenk et al., 2020; VanRullen & MacDonald, 2012) to justify this decision, these studies focused on alpha (8-12 Hz), where aliasing is less of a concern than when analyzing the aperiodic signal. Furthermore, in contrast, the current study analyzes the frequency range from 1-20 Hz, which is too narrow for interpreting the aperiodic signal as E/I. Typically, this analysis should include higher frequencies, spanning at least 1-30 Hz or even 1-45 Hz (not 20-40 Hz).

      As previously mentioned in the Methods (Page 15, Line 376) and in the previous response, the pop_resample function used by EEGLAB applies an anti-aliasing filter at half the resampling frequency (as per the Nyquist theorem; see https://eeglab.org/tutorials/05_Preprocess/resampling.html). The upper cutoff of the low-pass filter set by EEGLAB prior to downsampling (30 Hz) is still far above the frequency range of interest in the current study (1-20 Hz), thus allowing us to derive valid results.
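
      The role of the anti-aliasing filter can be illustrated with a small, self-contained sketch. It is not EEGLAB's pop_resample implementation: the original sampling rate (600 Hz), the filter design (a Hamming-windowed sinc with 101 taps and a 25 Hz cutoff) and the pure 50 Hz test signal are all assumptions chosen for illustration. The sketch compares decimation to 60 Hz without and with a low-pass filter applied beforehand.

```python
import cmath
import math

FS_IN = 600          # hypothetical original sampling rate (Hz)
FACTOR = 10          # decimation factor, giving 60 Hz
FS_OUT = FS_IN // FACTOR
N_TAPS = 101         # length of the illustrative FIR low-pass
CUTOFF = 25.0        # hypothetical cutoff (Hz), below the new Nyquist of 30 Hz


def lowpass_taps(n_taps, cutoff, fs):
    """Hamming-windowed sinc low-pass filter, normalised to unit DC gain."""
    mid = (n_taps - 1) / 2
    fc = cutoff / fs                 # cutoff in cycles per sample
    taps = []
    for i in range(n_taps):
        t = i - mid
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [h / total for h in taps]


def fir_valid(x, taps):
    """Convolve, keeping only outputs computed from complete input windows."""
    n = len(taps)
    return [sum(taps[k] * x[i + n - 1 - k] for k in range(n))
            for i in range(len(x) - n + 1)]


def amplitude_at(x, freq, fs):
    """Amplitude of the sinusoid at `freq` via a single-bin DFT projection."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2 * abs(acc) / len(x)


# A pure 50 Hz "line noise" sinusoid, long enough to leave 2 s after filtering.
n_in = 2 * FS_IN + N_TAPS - 1
signal = [math.sin(2 * math.pi * 50 * i / FS_IN) for i in range(n_in)]

naive = signal[:2 * FS_IN:FACTOR]    # decimate without filtering
filtered = fir_valid(signal, lowpass_taps(N_TAPS, CUTOFF, FS_IN))[::FACTOR]

alias_naive = amplitude_at(naive, 10, FS_OUT)        # 60 - 50 = 10 Hz alias
alias_filtered = amplitude_at(filtered, 10, FS_OUT)
print(f"10 Hz alias: naive {alias_naive:.3f}, filtered {alias_filtered:.4f}")
```

      Without the filter, the 50 Hz component reappears at 10 Hz at essentially full amplitude; with the filter applied before decimation, it is strongly attenuated. This is the distinction at stake in the exchange above.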

      Quote:

      “- The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white noise, which ranged in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).”

      Moreover, the resting-state data were not resampled to 60 Hz. We have made this clearer in the Methods of the second revision (Page 15, Line 367).

      Our consistent results of group differences across all three EEG conditions, thus, exclude any possibility that they were driven by aliasing artifacts.

      The expected effects of this anti-aliasing filter can be seen in the attached Author response image 1, showing an example participant’s spectrum in the 1-30 Hz range (as opposed to the 1-20 Hz plotted in the manuscript), clearly showing a 30-40 dB drop at 30 Hz. Any aliasing due to, for example, remaining line noise, would additionally be visible in this figure (as well as Figure 3) as a peak.

      Author response image 1.

      Power spectral density of one congenital cataract-reversal (CC) participant in the visual stimulation condition across all channels. The reduced power at 30 Hz shows the effects of the anti-aliasing filter applied by EEGLAB’s pop_resample function.

      As we stated in the manuscript, and in previous reviews, so far there has been no consensus on the exact range of measuring aperiodic activity. We made a principled decision based on the literature (showing a knee in aperiodic fits of this dataset at 20 Hz) (Medel et al., 2023; Ossandón et al., 2023), data quality (possible contamination by line noise at higher frequencies) and the purpose of the visual stimulation experiment (to look at the lower frequency range by stimulating up to 60 Hz, thereby limiting us to quantifying below 30 Hz), that 1-20 Hz would be the fit range in this dataset.
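
      The kind of fit discussed here can be sketched as follows: a straight line fitted to the spectrum in log-log space over 1-20 Hz, with the alpha range (8-14 Hz) left out, as described elsewhere in this response. The sketch is an illustration under stated assumptions, not the authors' pipeline: it uses ordinary least squares, and the synthetic spectrum (exponent 1.5, a narrow Gaussian peak at 10 Hz, 1 Hz frequency resolution) is invented for the example.

```python
import math

FIT_RANGE = (1.0, 20.0)      # fit range named in the response (Hz)
ALPHA_RANGE = (8.0, 14.0)    # alpha band left out of the fit (Hz)


def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def aperiodic_fit(freqs, power, exclude=ALPHA_RANGE):
    """Fit log10(power) against log10(freq), skipping the excluded band."""
    keep = [(f, p) for f, p in zip(freqs, power)
            if FIT_RANGE[0] <= f <= FIT_RANGE[1]
            and not (exclude and exclude[0] <= f <= exclude[1])]
    log_f = [math.log10(f) for f, _ in keep]
    log_p = [math.log10(p) for _, p in keep]
    intercept, slope = fit_line(log_f, log_p)
    # R^2 of the fit on the points that entered it.
    mean_p = sum(log_p) / len(log_p)
    ss_res = sum((p - (intercept + slope * f)) ** 2 for f, p in zip(log_f, log_p))
    ss_tot = sum((p - mean_p) ** 2 for p in log_p)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Synthetic spectrum: 1/f^1.5 background plus a Gaussian alpha peak at 10 Hz.
TRUE_SLOPE = -1.5
freqs = [float(f) for f in range(1, 21)]
power = [10.0 * f ** TRUE_SLOPE
         * (1.0 + 5.0 * math.exp(-0.5 * ((f - 10.0) / 0.5) ** 2))
         for f in freqs]

_, slope_excl, r2_excl = aperiodic_fit(freqs, power)
_, slope_incl, r2_incl = aperiodic_fit(freqs, power, exclude=None)
print(f"slope without alpha band: {slope_excl:.3f} (R^2 = {r2_excl:.3f})")
print(f"slope with alpha band:    {slope_incl:.3f} (R^2 = {r2_incl:.3f})")
```

      In this synthetic case, leaving the alpha band in the fit shifts the estimated slope away from the true value, whereas excluding it recovers the background exponent. Whether the same holds for real spectra depends on the width and size of the alpha peak.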

      Quote:

      “(3) What's the underlying idea of analyzing two separate aperiodic slopes (20-40Hz and 1-19Hz). This is very unusual to compute the slope between 20-40 Hz, where the SNR is rather low.

      "Ossandón et al. (2023), however, observed that in addition to the flatter slope of the aperiodic power spectrum in the high frequency range (20-40 Hz), the slope of the low frequency range (1-19 Hz) was steeper in both, congenital cataract-reversal individuals, as well as in permanently congenitally blind humans."

      The present manuscript computed the slope between 1-20 Hz. Ossandón et al. as well as Medel et al. (2023) found a “knee” of the 1/f distribution at 20 Hz and further describe the motivations for computing both slope ranges. For example, Ossandón et al. used a data-driven approach and compared single vs. dual fits and found that the latter fitted the data better. Additionally, they found the best fit if a knee at 20 Hz was used. We would like to point out that no standard range exists for the fitting of the 1/f component across the literature and, in fact, very different ranges have been used (Gao et al., 2017; Medel et al., 2023; Muthukumaraswamy & Liley, 2018).”

      (13) Baseline Removal: Subtracting the mean activity across an epoch as a baseline removal step is inappropriate for resting-state EEG data. This preprocessing step undermines the validity of the analysis. The EEG dataset has fundamental flaws, many of which were pointed out in the previous review round but remain unaddressed. In its current form, the manuscript falls short of standards for robust EEG analysis. If I were reviewing for another journal, I would recommend rejection based on these flaws.

      The baseline removal step from each epoch serves to remove the DC component of the recording and detrend the data. This is a standard preprocessing step (included as an option in preprocessing pipelines recommended by the EEGLAB toolbox, FieldTrip toolbox and MNE toolbox), additionally necessary to improve the efficacy of ICA decomposition (Groppe et al., 2009).

      In the previous review round, a clarification of the baseline timing was requested, which we added. Beyond this request, there was no mention of the appropriateness of the baseline removal and/or a request to provide reasons for why it might not undermine the validity of the analysis.

      Quote:

      “- "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has been explicitly stated in the revised manuscript (Page 13, Line 354).”

      Prior work in the time (not frequency) domain on event-related potential (ERP) analysis has suggested that the baselining step might cause spurious effects (Delorme, 2023; although see Tanner et al., 2016). We did not perform ERP analysis at any stage. One recent study suggests spurious group differences in the 1/f signal might be driven by an inappropriate dB division baselining method (Gyurkovics et al., 2021), which we did not perform.

      Any effect of our baselining procedure on the FFT spectrum would be below the 1 Hz range, which we did not analyze.  
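
      This point can be checked with a small sketch: subtracting the epoch mean changes only the 0 Hz bin of a plain, unwindowed discrete Fourier transform of that same epoch. The epoch below (64 samples, a constant offset plus 3 Hz and 10 Hz components) is hypothetical. If a window or taper is applied before the transform, some of the offset can leak into neighboring bins, so the result is stated for the unwindowed case only.

```python
import cmath
import math


def dft(x):
    """Plain (unwindowed) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]


# Hypothetical 1 s epoch at 64 Hz: DC offset plus 3 Hz and 10 Hz components.
FS = 64
epoch = [40.0 + 5.0 * math.sin(2 * math.pi * 3 * i / FS)
         + 2.0 * math.cos(2 * math.pi * 10 * i / FS) for i in range(FS)]

mean = sum(epoch) / len(epoch)
demeaned = [v - mean for v in epoch]     # baseline removal as described

raw_spec = dft(epoch)
demeaned_spec = dft(demeaned)

dc_change = abs(raw_spec[0] - demeaned_spec[0])
max_other_change = max(abs(a - b) for a, b in zip(raw_spec[1:], demeaned_spec[1:]))
print(f"change at 0 Hz: {dc_change:.1f}; largest change elsewhere: {max_other_change:.2e}")
```

      For a 1 s epoch the frequency bins are spaced 1 Hz apart, so the only bin altered by mean subtraction sits at 0 Hz, below the 1-20 Hz range analyzed.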

      Each of the preprocessing steps in the manuscript matches pipelines described and published in extensive prior work. We document how multiple aspects of our EEG results replicate prior findings (Supplementary Material S15, S18, S19), including reports from other experimenters, groups and locations, validating that our results are robust.

      We therefore reject the claim of methodological flaws in our EEG analyses in the strongest possible terms.

      Quote:

      “(3.5) Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      As pointed out in the Methods and Figure 1, we only analyzed data from two occipital channels, O1 and O2, neither of which was rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023). As control sites, we added the frontal channels Fp1 and Fp2 (see Supplementary Material S14).

      Neither Ossandón et al. (2023) nor Pant et al. (2023) considered frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations (Methods, Page 14, Lines 365-367). The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used spectrum interpolation to remove line noise; the group differences remained stable (Ossandón et al., 2023). We have reported this analysis in the revised manuscript (Page 14, Lines 364-357).

      Further, both groups were measured in the same lab, making line noise (~ 50 Hz) as an account for the observed group effects in the 1-20 Hz frequency range highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters were contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      The mean percentage of 1-second segments rejected for each resting state condition and the percentage of 6.25-second segments rejected in each group for the visual stimulation condition have been added to the revised manuscript (Supplementary Material S10) and are referred to in the Methods (Page 14, Lines 372-373).

      - The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white noise, which changed in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has now been explicitly stated in the revised manuscript (Page 14, Lines 379-380).

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the Methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values. Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former, as it outperformed the FOOOF algorithm for their data.

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group.

      In the revised manuscript, we added the fit quality metrics (average R² values > 0.91 for each group and condition) (Methods, Page 15, Lines 395-396; Supplementary Material S11) and additionally show individual subjects’ fits (Supplementary Material S11).”

      (14) The authors mention:

      "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38)" is a circular argument and should be avoided.

      The authors addressed this comment and adjusted the statement. However, I do not understand why the full sample published earlier (Ossandón et al., 2023) was not used in the current study.

      The recording of EEG resting state data started in 2013, while MRS testing could only be set up by the second half of 2019. Moreover, not all subjects who qualify for EEG recording qualify for being scanned (e.g. due to MRI safety, claustrophobia).

      References

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189, 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Delorme, A. (2023). EEG is better left alone. Scientific Reports, 13(1), 2372. https://doi.org/10.1038/s41598-023-27528-0

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, R., Peterson, E. J., & Voytek, B. (2017). Inferring synaptic excitation/inhibition balance from field potentials. NeuroImage, 158, 70–78. https://doi.org/10.1016/j.neuroimage.2017.06.078

      Groppe, D. M., Makeig, S., & Kutas, M. (2009). Identifying reliable independent components via split-half comparisons. NeuroImage, 45(4), 1199–1211. https://doi.org/10.1016/j.neuroimage.2008.12.038

      Gyurkovics, M., Clements, G. M., Low, K. A., Fabiani, M., & Gratton, G. (2021). The impact of 1/f activity and baseline correction on the results and interpretation of time-frequency analyses of EEG/MEG data: A cautionary tale. NeuroImage, 237. https://doi.org/10.1016/j.neuroimage.2021.118192

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Maurer, D., Mondloch, C. J., & Lewis, T. L. (2007). Sleeper effects. Developmental Science. https://doi.org/10.1111/j.1467-7687.2007.00562.x

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/f electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179, 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Tanner, D., Norton, J. J. S., Morgan-Short, K., & Luck, S. J. (2016). On high-pass filter artifacts (they’re real) and baseline correction (it’s a good idea) in ERP/ERMF analysis. Journal of Neuroscience Methods, 266, 166–170. https://doi.org/10.1016/j.jneumeth.2016.01.002

      VanRullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4

    1. Author response:

      The following is the authors’ response to the previous reviews

      Response to the reviewer #2 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Regarding the in vivo Treg homing assay, we did not exclude doublets and dead cells from the analysis of Kaede-expressing Tregs that migrated to the aorta, which may affect the results. We described this issue as a limitation of the study in the revised manuscript. Nonetheless, we believe our findings are reliable because we repeated this experiment three times and obtained similar results.

      There is no evidence to support the clinical relevance of our findings. Future clinical research on this topic is highly desired.

      Response to the reviewer #3 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Despite the controversial role of Th17 cells in atherosclerosis, we understand the possible involvement of Th17 cells and the Th1 cell/Th17 cell balance in lymphoid tissues and aortic lesions in accelerated inflammation and atherosclerosis in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. Although we could not evaluate the changes in these immune responses in detail, future studies may elucidate interesting mechanisms mediated by Th17 cell responses.

      As the reviewer suggested, we understand that it is necessary to provide in vivo evidence for the Treg suppressive effects on DC activation. Based on the results of the in vitro experiments, we added a discussion of the in vivo evidence to the revised manuscript.

      We understand the methodological limitations of the flow cytometric analysis of immune cells in the aorta and of the in vivo Treg homing assay. We described this issue as a limitation of the study in the revised manuscript. Regarding the in vivo Treg homing assay, we statistically re-analyzed the combined data from multiple experiments and observed a tendency toward a reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in migratory capacity between CCR4-intact and CCR4-deficient Kaede-expressing Tregs. Accordingly, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      The reviewer asked us to evaluate aortic inflammation in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice injected with CCR4-intact or CCR4-deficient Tregs. However, we think that this experiment would provide only marginal information because Treg transfer experiments in Apoe<sup>-/-</sup> mice have already shown the protective role of CCR4 in Tregs against aortic inflammation and early atherosclerosis.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) #1 and #2: CD103 and CD86 expression should be discussed in the text and not only in the response to the reviewer.

      In accordance with the reviewer’s suggestion, we added a discussion on the downregulated CD103 expression in peripheral LN Tregs and upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in the discussion section in the revised manuscript.

      (2) #5: The authors' response is not satisfactory. No gate percentage is shown. As it currently stands, the difference in the number of cells shown in the figure could be due to differences in the number of events recorded. Furthermore, the gating strategy is not thorough. Considering the very low frequency of Kaede+ cells detected, it is crucial to properly exclude doublets and dead cells.

      The authors reported a dramatic difference in Kaede+ Treg cells in the aorta across experiments. This could be addressed by normalization followed by appropriate statistical analysis (one-sample t-test).

      The data shown is not strong enough to conclude that there is a reduced migration to the aorta.

      We understand the importance of the reviewer’s suggestion. We reported the percentage of Kaede+ Tregs in the aorta of Apoe<sup>-/-</sup> mice receiving Kaede-expressing CCR4-intact or CCR4-deficient Tregs in Figure 5I.

      As the reviewer pointed out, we understand that it would be important to properly exclude doublets and dead cells in the in vivo Treg homing assay. However, it is difficult for us to resolve this issue because we would need to perform the same experiments again, which would require a large number of additional mice and a substantial amount of time. We deeply regret that these important experimental procedures were not performed. We described this issue as a limitation of the study.

      In accordance with the reviewer’s suggestion, we re-analyzed the combined data from multiple experiments using a one-sample t-test. We observed a tendency toward a reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in migratory capacity between CCR4-intact and CCR4-deficient Kaede-expressing Tregs. By modifying the corresponding descriptions in the manuscript, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.
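
      The normalization followed by a one-sample t-test described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the manuscript: the ratio values are hypothetical placeholders, and the function names (`t_pdf`, `two_sided_p`, `one_sample_t`) are assumptions introduced here. Each experiment's CCR4-deficient value is assumed to be expressed relative to its paired CCR4-intact value, so the null hypothesis is a mean ratio of 1.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=20000):
    """Two-sided p-value via Simpson integration of the t density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central


def one_sample_t(ratios, mu0=1.0):
    """Test whether the mean of normalized ratios differs from mu0."""
    n = len(ratios)
    if n < 2:
        raise ValueError("need at least two experiments")
    sem = statistics.stdev(ratios) / math.sqrt(n)
    if sem == 0:
        raise ValueError("ratios are identical; t statistic is undefined")
    t_stat = (statistics.mean(ratios) - mu0) / sem
    df = n - 1
    return t_stat, df, two_sided_p(t_stat, df)


if __name__ == "__main__":
    # Hypothetical per-experiment ratios (CCR4-deficient / CCR4-intact), not real data.
    hypothetical_ratios = [0.6, 0.9, 0.8]
    t_stat, df, p = one_sample_t(hypothetical_ratios)
    print(f"t = {t_stat:.3f}, df = {df}, p = {p:.3f}")
```

      With only three experiments the test has two degrees of freedom, so a consistent downward trend can still fail to reach significance, which matches the hedged wording adopted in the response.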

      (3) #8: Several data are still not shown.

      In accordance with the reviewer’s suggestion, we showed the data on the responses of Tregs and effector memory T cells in 8-week-old wild-type or Ccr4<sup>-/-</sup> mice and Ccr4 mRNA expression in Tregs and non-Tregs from Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figures 4 and 7.

      Reviewer #3 (Recommendations for the authors):

      (1) Issue 1. For future studies, I recommend not omitting viability controls during cell staining. Removal of dead cells and doublets should always be included during the gating strategy to avoid undesirable artefacts, especially when analysing less-represented cell populations. According to your previous report (ref #40), I agree that isotype controls were unnecessary using the same staining protocol. FMO controls should always be included in flow cytometry analysis (not mentioned in the methodology description and ref#40).

      As the reviewer suggested, we understand that it would be important to properly exclude dead cells and doublets and to prepare FMO controls in flow cytometric analysis. We deeply regret that these important experimental procedures were not performed. We described this issue as a limitation of the study.

      (2) Issue 3. Although Th17's role in atherosclerosis remains controversial, the data obtained in this work could provide valuable insights if discussed appropriately. As noted in my public review, I found it noteworthy that RORγt+ cells represented around 13% of effector CD45+CD3+CD4+ T lymphocytes in the aorta of Apoe<sup>-/-</sup> mice, while Th1 cells represented less than 5% (Fig 4H and F, respectively). I recognise that differences in cell staining sensitivity and robustness for different transcription factors may influence these percentages. However, analysing how CCR4 deficiency influences the Th1/Th17 balance would yield interesting data, similar to what was done for the Th1/Treg ratio.

      Considering the higher proportion of Th17 cells than Th1 or Th2 cells in the atherosclerotic aorta, we understand the importance of the reviewer’s suggestion. However, we could not evaluate the effect of CCR4 deficiency on the Th1/Th17 balance in the aorta because we did not perform flow cytometric analysis of aortic Th1 and Th17 cells in the same mice. We could, however, examine the Th1/Th17 balance in peripheral lymphoid tissues by flow cytometry. We found a significant increase in the Th1/Th17 ratio in the peripheral LNs of Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, while there was no change in this ratio in the spleen or para-aortic LNs of these mice, which limits the contribution of the Th1/Th17 balance to exacerbated atherosclerosis. These data are shown below.

      Author response image 1.

      (3) Issue 4. I appreciate the authors for sharing data on the flow cytometry analysis of Tregs in para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice, which would have been included as a Supplementary figure. These results reinforce the notion that Treg dysfunction in CCR4-deficient mice may not be due to the downregulation of regulatory cell surface receptors.

      We showed the data on the expression of CTLA-4, CD103, and PD1 in Tregs in the para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figure 8.

      (4) Issue 5. I agree that CD4+ T cell responses are substantially regulated by DCs. While CD80 and CD86 on DC primarily serve as costimulatory signals for T-cell activation, cytokines secreted by DCs are primordial signals for determining the differentiation phenotype of effector Th cells. Since the analysis of DC phenotype in lymphoid tissues of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice could not be addressed in this study, it is not possible to differentiate which processes may be mainly affected by CCR4-deficiency during CD4+ T cell activation. In this scenario, and considering in vitro studies, the results suggest a possible role of CCR4 in controlling the extent of activation of CD4+T cells rather than shifting the CD4+T cell differentiation profile in peripheral lymphoid tissues, where a predominant Th1 profile was already established in Apoe<sup>-/-</sup> mice. Therefore, I advise caution when concluding about shifts in CD4+ T cell responses.

      We thank the reviewer for these thoughtful comments. As the reviewer pointed out, we understand that we should carefully interpret the mechanisms underlying the shift in CD4+ T cell responses caused by CCR4 deficiency.

      (5) Regarding migration studies in the revised manuscript. I fully understand that Treg transference assays are challenging. The results do not suggest that CCR4 was critical for Treg migration to lymphoid tissues in the conditions assayed. Concerning migration to the aorta, I found the results inconclusive since the authors mention that: i) there was a dramatic difference in the absolute numbers of Kaede-expressing Tregs that migrated to the aorta impairing statistical analysis; ii) the number of Kaede-expressing Tregs that migrated to the aorta was extremely low; iii) dead cells and doublets were not removed in the flow cytometry analysis. In this context, I do not agree with the following statements and recommend revising them:

      - "CCR4 deficiency in Tregs impaired their migration to the atherosclerotic aorta" (lines 36-7),

      - "…we found a significant reduction in the proportion of CCR4 deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice" (lines 356-7),

      - "CCR4 expression on Tregs regulates the development of early atherosclerosis by....... mediating Treg migration to the atherosclerotic aorta" (lines 409-411),

      - "…we found that CCR4 expression on Tregs is critical for regulating atherosclerosis by mediating their migration to the atherosclerotic aorta" (lines 437-438),

      - "CCR4 protects against early atherosclerosis by mediating Treg migration to the aorta...." (lines 464-465),

      - "We showed that CCR4 expression on Tregs is critical for ...... mediating Treg migration to the atherosclerotic aorta" (lines 503-505).

      We understand the importance of the reviewer’s suggestion. We described this issue as a limitation of the study. In accordance with the reviewer’s suggestion, we modified the above descriptions and toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      (6) Line 206: Mention the increased expression of CD86 by DCs

      We mentioned this result in the revised manuscript. We also added a discussion on the upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in the discussion section in the revised manuscript.

      (7) Lines 304-305. According to Fig 4F-H, a selective accumulation of Th1 cells seems to have occurred only in the aorta, coinciding with a higher Th1/Treg ratio. No selective accumulation of Th1 cells was observed in para-aortic lymph nodes. These results could be clarified.

      We modified the above description in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews

      Reviewer #1 (Public Review):

      Comment: The fact that there are Arid1a transcripts that escape the Cre system in the Arid1a KO mouse model might complicate the interpretation of the data. The phenotype of the Arid1a knockout is probably masked by the fact that many of the sequencing techniques used here are done on a heterogeneous population of knockout and wild type spermatocytes. In relation to this, I think that the use of the term "pachytene arrest" might be overstated, since this is not the phenotype truly observed. Knockout mice produce sperm, and probably litters, although a full description of the subfertility phenotype is lacking, along with identification of the stage at which cell death is happening by detection of apoptosis.

      Response: As the reviewer indicates, we did not observe a complete arrest at pachynema. In fact, the histology shows the presence of spermatids and sperm in seminiferous tubules and epididymides (Fig. S3). However, our data argue that the wild-type haploid gametes produced were derived from spermatocyte precursors that have likely escaped Cre-mediated activity (Fig. S4). Furthermore, diplotene and metaphase-I spermatocytes lacking ARID1A protein by IF were undetectable in the Arid1acKO testes (Fig. S4B). Therefore, although we do not demonstrate a strict pachytene arrest, it is reasonable to conclude that ARID1A is necessary to progress beyond pachynema. We have revised the manuscript to reflect this point (Abstract lines 17,18; Results lines 153,154).

      Comment: It is clear from this work that ARID1a is part of the protein network that contributes to silencing of the sex chromosomes. However, it is challenging to understand the timing of the role of ARID1a in the context of the well-known DDR pathways that have been described for MSCI.

      Response: With respect to the comment on the lack of clarity as to which stage of meiosis we observe cell death, our data do suggest that it is reasonable to conclude that mutant spermatocytes (ARID1A-) undergo cell death at pachynema given their inability to execute MSCI, which is a well-established phenotype.

      Comment: Staining of chromosome spreads with Arid1a antibody showed localization at the sex chromosomes by diplonema; however, analysis of gene expression in Arid1a KO was performed on pachytene spermatocytes. Therefore, it is not very clear how the chromatin remodeling activity of Arid1a in diplonema affects gene expression at a previous stage. CUT&RUN showed that ARID1a is present at the sex chromatin in earlier stages, leading us to hypothesize that immunofluorescence with ARID1a antibody might not reflect the real localization of ARID1a.

      Response: It is unclear what the reviewer means about not understanding how ARID1A activity at diplonema affects gene expression at earlier stages. Our interpretations were not based solely on the observation of ARID1A associations with the XY body at diplonema. In fact, mRNA expression and CUT&RUN analyses were performed on pachytene-enriched populations. ARID1A's association with the XY body is not exclusive to diplonema. Based on both CUT&RUN and IF data, ARID1A associates with XY chromatin as early as pachynema. Only at late diplonema did we observe ARID1A hyperaccumulation on the XY body by IF.

      Reviewer #2 (Public Review):

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: As explained in our response to these comments in the first revision, we respectfully disagree with this reviewer’s conclusions. We have been quantitative by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer #3 (Public Review):

      Comment: The data demonstrate that the mutant cells fail to progress past pachytene, although it is unclear whether this specifically reflects pachytene arrest, as accumulation in other stages of Prophase also is suggested by the data in Table 1. The western blot showing ARID1A expression in WT vs. cKO spermatocytes (Fig. S2) is supportive of the cKO model but raises some questions. The blot shows many bands that are at lower intensity in the cKO, at MWs from 100-250kDa. The text and accompanying figure legend have limited information. Are the various bands with reduced expression different isoforms of ARID1A, or something else? What is the loading control 'NCL'? How was quantification done given the variation in signal across a large range of MWs?

      Response: The loading control is Nucleolin. With respect to the other bands in the range of 100-250 kDa, it is difficult to say whether they represent ARID1A isoforms. The UniProt entry for mouse ARID1A only indicates a large molecular weight sequence of ~242 kDa; therefore, the band corresponding to that size was quantified. There is no evidence to suggest that lower molecular weight isoforms are translated. Although speculative, it is possible that the lower molecular weight bands represent proteolytic/proteasomal degradation products or products of antibody non-specificity. These points are addressed in the revised manuscript (Legend to Fig S2, lines 926-931). Blots were scanned on a LI-COR Odyssey CLx imager and viewed and quantified using Image Studio Version 5.2.5 (Methods, lines 640-642).

      Comment: An additional weakness relates to how the authors describe the relationship between ARID1A and DNA damage response (DDR) signaling. The authors don't see defects in a few DDR markers in ARID1A CKO cells (including a low-resolution assessment of ATR), suggesting that ARID1A may not be required for meiotic DDR signaling. However, as previously noted the data do not rule out the possibility that ARID1A is downstream of DDR signaling and the authors even indicate that "it is reasonable to hypothesize that DDR signaling might recruit BAF-A to the sex chromosomes (lines 509-510)." It therefore is difficult to understand why the authors continue to state that "...the mechanisms underlying ARID1A-mediated repression of the sex-linked transcription are mutually exclusive to DDR pathways regulating sex body formation" (p. 8) and that "BAF-A-mediated transcriptional repression of the sex chromosomes occurs independently of DDR signaling" (p. 16). The data provided do not justify these conclusions, as a role for DDR signaling upstream of ARID1A would mean that these mechanisms are not mutually exclusive or independent of one another.

      Response: The reviewer’s argument is reasonable, and we have made the recommended changes (Results, lines 212-215; Discussion, lines 499-500).

      Comment: A final comment relates to the impacts of ARID1A loss on DMC1 focus formation and the interesting observation of reduced sex chromosome association by DMC1. The authors additionally assess the related recombinase RAD51 and suggest that it is unaffected by ARID1A loss. However, only a single image of RAD51 staining in the cKO is provided (Fig. S11) and there are no associated quantitative data provided. The data are suggestive but it would be appropriate to add a qualifier to the conclusion regarding RAD51 in the discussion which states that "...loss of ARID1a decreases DMC1 foci on the XY chromosomes without affecting RAD51" given that the provided RAD51 data are not rigorous. In the long-term it also would be interesting to quantitatively examine DMC1 and RAD51 focus formation on autosomes as well.

      Response: We agree with the reviewer’s comment and have made the recommended changes (Discussion, lines 518-519).

      Response to non-public recommendations

      Reviewer 2:

      Comment: Meiotic arrest is usually judged based on testicular phenotypes. If mutant testes do not have any haploid spermatids, we can conclude that meiotic arrest is a phenotype. In this case, mutant testes have haploid spermatids and are fertile. The authors cannot conclude meiotic arrest. The mutant cells appear to undergo cell death in the pachytene stage, but the authors cannot say "meiotic arrest."

      Response: We disagree with this comment. By IF, we see that ~70% of the spermatocytes have deleted ARID1A. Furthermore, we never observed diplotene spermatocytes that lacked ARID1A. The conclusion that the absence of ARID1A results in a pachynema arrest and that the escapers produce the haploid spermatids is firm.

      Comment: Fig. S2 and S3 have wrong figure legends.

      Response: The figure legends for Fig. S2 and S3 are correct.

      Comment: The authors do not appear to evaluate independent mice for scoring (the result is about 74% deletion above, Table S1). Sup S2: how many independent mice did the authors examine?

      Response: These were Sta-Put purified fractions obtained from 14-15 WT and mutant mice. It is difficult to isolate pachytene spermatocytes by Sta-Put at the required purity in sufficient yields using one mouse at a time. We used three technical replicates to quantify the band intensity, and the error bars represent the standard error of the mean (S.E.M.) of the band intensity.

      Comment: "Comparison of cKO and wild-type littermate yielded nearly identical results (Avg total conc WT = 32.65 M/ml; Avg total conc cKO = 32.06 M/ml)". This sounds like a negative result (i.e., no difference between WT and cKO).
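
      The error-bar calculation described above (S.E.M. over three technical replicates) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mean_and_sem` and the replicate values are hypothetical placeholders, not measurements from the blots, and the sample standard deviation (n − 1 denominator) is assumed.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and standard error of the mean for replicate values."""
    n = len(values)
    if n < 2:
        raise ValueError("S.E.M. requires at least two replicates")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


if __name__ == "__main__":
    # Hypothetical normalized band intensities from three technical replicates.
    hypothetical_intensities = [1.0, 2.0, 3.0]
    mean, sem = mean_and_sem(hypothetical_intensities)
    print(f"mean = {mean:.3f}, S.E.M. = {sem:.3f}")
```

      Because the replicates are technical rather than biological, an S.E.M. computed this way reflects measurement variability on the pooled sample, not animal-to-animal variability.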

      Response: This is correct. There is no difference between Arid1aWT and Arid1aCKO sperm production. This is because wild-type haploid gametes produced were derived from spermatocyte precursors that have escaped Cre-mediated activity (Fig. S4). These data merely serve to highlight an inherent caveat of our conditional knockout model and are not intended to support the main conclusion that ARID1A is necessary for pachytene progression.

      Comment: The authors now admit ~ 70 % efficiency in deletion, and the authors did not show the purity of these samples. If the purity of pachytene spermatocytes is ~ 80%, the real proportion of mutant cells can be ~ 56%. It is very difficult to interpret the data.

      Response: The original submission did refer to inefficient Cre-induced recombination. The reviewer asked for the % efficiency, which was provided in the revised version. Also, please refer to Fig. S2, where Western blot analysis demonstrates a significant loss of ARID1A protein levels in CKO relative to WT pachytene spermatocyte populations that were used for CUT&RUN data generation.
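
      The reviewer's estimate in the comment above is a simple product of two fractions, which can be sketched as follows. This is only an illustration of that arithmetic, not a measurement: the function name `mutant_fraction` is hypothetical, the 70% and 80% inputs are the figures quoted in the comment, and the calculation assumes that deletion efficiency applies uniformly to the pachytene cells and that contaminating cells are not mutant pachytene spermatocytes.

```python
def mutant_fraction(deletion_efficiency, pachytene_purity):
    """Estimate the fraction of a purified sample that is mutant pachytene cells.

    Both arguments are fractions between 0 and 1.
    """
    for name, value in (("deletion_efficiency", deletion_efficiency),
                        ("pachytene_purity", pachytene_purity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return deletion_efficiency * pachytene_purity


if __name__ == "__main__":
    # Values quoted in the reviewer's comment: ~70% deletion, ~80% purity.
    estimate = mutant_fraction(0.70, 0.80)
    print(f"estimated mutant fraction = {estimate:.2f}")
```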

      Comment: The authors should not use the other study to justify their own data. The H3.3 ChIP-seq data in the NAR paper detected clear peaks on autosomes. However, in this study, as shown in Fig. S7A, the authors detected only 4 peaks on autosomes based on MACS2 peak calling. This must be a failed experiment. Also, S7A appears to have labeling errors.

      Response: We believe the reviewer is referring to supplementary figure 8A. Here, it is not clear which labeling errors the reviewer is referring to. In the wild type, the identified peaks were overwhelmingly sex-linked intergenic sites. This is consistent with the fact that H3.3 is hyper-accumulated on the sex chromosomes at pachynema.

      The authors of the NAR paper did not perform a peak-calling analysis using MACS2 or any other peak-calling algorithm. They merely compared the coverage of H3.3 relative to input. Therefore, it is not clear on what basis the reviewer says that the NAR paper identified autosomal peaks. Their H3.3 signal appears widely distributed over a 6 kb window centered at the TSS of autosomal genes, which, compared to input, appears enriched. Our data clearly demonstrate a less noisy and narrower window of H3.3 enrichment at autosomal TSSs in WT pachytene spermatocytes, albeit at levels lower than that seen in CKO pachytene spermatocytes (Fig S8B and see data copied below for each individual replicate). Moreover, the lack of peaks does not mean that there was an absence of H3.3 at these autosomal TSSs (Supp. Fig. S8B). Therefore, we disagree with the reviewer’s comment that the H3.3 CUT&RUN was a failed experiment.

      Author response image 1.

      H3.3 Occupancy at genes mis-regulated in the absence of ARID1A

      Comment: If the author wishes to study the function of ARID2 in spermatogenesis, they may need to try other cre-lines to have more robust phenotypes, and all analyses must be redone using a mouse model with efficient deletion of ARID2.

      Response: As noted, we chose Stra8-Cre to conditionally knock out Arid1a because ARID1A is haploinsufficient during embryonic development. The lack of Cre expression in the maternal germline allows for transmission of the floxed allele, allowing the experiments to proceed.

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: In many experiments, we have been quantitative when possible by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer 3:

      Comment: The Methods section refers to antibodies as being in Supplementary Table 3, but the table is labeled as Supplementary Table 2.

      Response: This has been corrected.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations For The Authors):

      The additional data included in this revision nicely strengthens the major claim.

      I apologize that my comment about K+ concentration in the prior review was unclear. The cryoEM structure of KCNQ1 with S4 in the resting state was obtained with lowered K+ relative to the active state. Throughout the results and discussion it seems implied that the change in voltage sensor state is somehow causative of the change in selectivity filter state while the paper that identified the structures attributes the change in selectivity filter state not to voltage sensors, but to the change in [K+] between the 2 structures. Unless there is a flaw in my understanding of the conditions in which the selectivity filter structures used in modeling were generated, it seems misleading to ignore the change in [K+] when referring to the activated vs resting or up vs down structures. My understanding is that the closed conformation adopted in the resting/low [K+] is similar to that observed in low [K+] previously and is more commonly associated with [K+]-dependent inactivation, not resulting from voltage sensor deactivation as implied here. The original article presenting the low [K+] structure also suggests this. When discussing conformational changes in the selectivity filter, I strongly suggest referring to these structures as activated/high [K+] vs resting/low [K+] or something similar, as the [K+] concentration is a salient variable.

      There seems to be some confusion here, and we will try to explain our thinking. Note that in the Mandala and MacKinnon paper, there is no significant difference in the amino acid positions in the selectivity filter between low and high K+ when S4 is in the activated position (see Mandala and MacKinnon, PNAS, Suppl. Fig. S5C and D). There are only fewer K+ ions in the selectivity filter in low K+. So, the structure with the distorted selectivity filter is not due to low K+ by itself. Note that there is no real difference between macroscopic currents recorded in low and high K+ solutions (except what is expected from changes in driving force) for KCNQ1/KCNE1 channels (Larsen et al., Biophys J 2011), suggesting that low K+ does not promote the non-conductive state (Figure 1). We now include a section in the Discussion about high/low K+ in the structures and the absence of effects of K+ on the function of KCNQ1/KCNE1 channels.

      Author response image 1.

      Macroscopic KCNQ1/KCNE1 currents recorded in different K+ conditions. Note that there is no difference between currents recorded in low (2 mM) K+ conditions and high (96 mM) K+ conditions (n=3 oocytes). Currents were normalized with respect to high K+.

      Note also that, in the previous version of the manuscript, we did not propose that the position of S4 is what determines the state of the selectivity filter. We only reported that the CryoEM structure with S4 resting shows a distorted selectivity filter. It seems like our text confused the reviewer to think that we proposed that S4 determines the state of the selectivity filter, when we did not propose this earlier. We previously did not want to speculate too much about this, but we have now included a section in the Discussion to make our view clear in light of the confusion of the reviewers.

      It is clear from our data that the majority of sweeps are empty (which we assume is with S4 up), suggesting that the selectivity filter can be (and is in the majority of sweeps) in the non-conducting state even with S4 up. We think that the selectivity filter switches between a non-conductive and a conductive conformation both with S4 down and with S4 up. The cryoEM structure in low K+ and S4 down just happened to catch the non-conductive state of the selectivity filter. We have now added a section in the Discussion to clarify all this and explain how we think it works.

      However, S4 in the active conformation seems to stabilize the conductive conformation of the selectivity filter, because during long pulses the channel seems to stay open once opened (See Suppl Fig S2). So, one possibility is that the selectivity filter goes more readily into the non-conductive state when S4 is down (and maybe, or not, low K+ plays a role) and then when S4 moves up the selectivity filter sometimes recovers into the conductive state and stays there. We now have included a section in the Discussion to present our view. Since this whole discussion was initiated and pushed by the reviewer, we hope that the reviewers will not demand more data to support these ideas. We think that this addition makes sense since other readers might have the same questions and ideas as the reviewer, and we would like to prevent any confusion about this topic.

      Figure 1

      It remains unclear in the manuscript itself what "control" refers to. Are control patches the same patches that later receive LG?

      Yes, the control means the same patch before LG. We now indicate that in legends and text throughout.

      Supplementary Figure S1

      Unclear if any changes occur after addition of LG in left panel and if the LG data on right is paired in any way to data on left.

      Yes, in all cases the left and right panel in all figures are from the same patch. We now indicate that in legends and text throughout.

      The letter p is used both to represent open probability from the all-point amplitude histogram and as a p-value statistical probability indicator, sometimes lower case, sometimes upper case. This was confusing.

      We now exclusively use lower case p for statistical probability and Po for open probability.

      "This indicates that mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases and that the interactions that stabilize the channel involve only residues located near the external region part of the selectivity filter. "

      Seems too strongly worded, it remains possible that mutations of other residues in the more intracellular region of the selectivity filter could affect the Gmax increases.

      We have changed the text to: "Mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases, as if the interactions that stabilize the channel involve residues located near the external region part of the selectivity filter. "

      Supplementary Figure S7

      Please report Boltzmann fit parameters. What are "normalized" uA?

      We removed the uA, which was mistakenly inserted. The lines in the graphs are just lines connecting the dots and not Boltzmann fits, since we don’t have saturating curves in all panels to make unique fits.

      "We have previously shown that the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites." Was binding to the sites actually shown? Suggest changing to: "We have previously proposed models in which the effects of PUFAs..."

      We have now changed this as the Reviewer suggested: " We have previously proposed models in which the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites."

      Statistics used not always clear. Methods refer to multiple statistical tests but it is not clear which is used when.

      We use two different tests and it is now explained in figure legends when either was used.

      n values confusing. Sometimes # of sweeps used as n. Sometimes # patches used as n. In one instance "The average current during the single channel sweeps was increased by 2.3 ± 0.33 times (n = 4 patches, p =0.0006)" ...this seems a low p value for this n=4 sample?

      We have now more clearly indicated what n stands for in each case. There was an extra 0 in the p value, so now it is p = 0.006. Thanks for catching that error.

      Reviewer #2 (Recommendations For The Authors):

      I still have some comments for the revised manuscript.

      (1) (From the previous minor point #6) Since D317E and T309S did not show statistical significance in Figure 5A, the sentences such as "This data shows that Y315 and D317 are necessary for the ability of Lin-Glycine to increase Gmax" or "the effect of Lin-Glycine on Gmax of the KCNQ1/KCNE1 mutant was noticeably reduced compared to the WT channel showing the this residue contributes to the Gmax effect (Figure 5A)." may need to be toned down. Alternatively, I suggest the authors refer to Supplementary Figure S7 to confirm that Y315 and D317 are critical for increasing Gmax.

      We have redone the analysis and statistical evaluation in Fig 5. We now use the more appropriate value of the fitted Gmax (which uses the whole dose response curve instead of only the 20 mM value) in the statistical evaluation, and Y315F and D317E are now statistically different from wt.

      (2) Supplementary Fig. S1. All control diary plots include the green arrows to indicate the timing of lin-glycine (LG) application. It is a bit confusing why they are included. Is it to show that LG application did not have an immediate effect? Are the LG-free plots not available?

      We are not sure what the Reviewer is asking. In the previous review round, the Reviewers specifically asked for this. The arrow shows when LG was applied, and the plot on the right shows the effect of LG from the same patch.

      (3) The legend to Supplementary Figure S4, "The side chain of residues ... are highlighted as sticks and colored based on the atomic displacement values, from white to blue to red on a scale of 0 to 9 Å." They look mostly blue (or light blue). Which one is colored white? It might be better to use a different color code. It would also be nice to link the color code to the colors of Supplementary Figure S5, which currently uses a single color.

      We have removed “from white to blue to red on a scale of 0 to 9 Å” and instead now include a color scale directly in Fig S4 to show how much each atom moved based on the color.

      We feel it is not necessary to include color in Fig S5 since the scale of how much each atom moves is shown on the y axis.

      (4) Add unit (pA) to the y-axis of Supplementary Figure S2.

      pA has been added.

      Reviewer #3 (Recommendations For The Authors):

      Some issues on how data support conclusions are identified. Further justifications are suggested.

      186: “The decrease in first latency is most likely due to an effect of Lin-Glycine on Site I in the VSD and related to the shift in voltage dependence caused by Lin-Glycine.” The results in Fig S1B do not seem to support this statement, since the mutation Y315F in the pore helix seemed to have eliminated the effect of Lin-Glycine in reducing first latency. The authors may want to show that a mutation eliminating Site I would eliminate the effect of Lin-Glycine on first latency. On the other hand, it will also be interesting to examine whether another pore mutation, such as P320L (Fig 5), also reduces the effect of Lin-Glycine on first latency.

      These experiments are very hard and laborious, and we feel these are outside the scope of this paper which focuses on Site II and the mechanism of increasing Gmax. Further studies of the voltage shift and latency will have to be for a future study.

      The mutation D317E did not affect the effect of Lin-Glycine on Gmax significantly (Fig 5A, and Fig S7F comparing with Fig S7A), but the authors conclude that D317 is important for Lin-Glycine association. This conclusion needs a better justification.

      We have redone the analysis and statistical evaluation in Fig 5. We now use the more appropriate value of the fitted Gmax (which uses the whole dose response curve instead of only the 20 mM value) in the statistical evaluation, and D317E is now statistically different from wt.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach as mentioned by the editor. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to #3 below.

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below.

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. This is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”
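
      The unique and common effects described above can be sketched as a three-predictor commonality analysis computed from the R2 of every subset regression. This is a minimal sketch, not the authors' code: the function name `commonality_three` and all R2 values below are hypothetical placeholders, chosen only so that the unique effect of Brain Cognition comes out at 0.11 against 0.32 for Brain Age plus chronological age.

```python
def commonality_three(r2, a, b, c):
    """Partition the full-model R2 into unique and common effects.

    `r2` maps a frozenset of predictor names to the R2 of the regression
    that uses exactly those predictors (all 7 non-empty subsets required).
    """
    def R(*names):
        return r2[frozenset(names)]

    full = R(a, b, c)
    return {
        # Unique effects: what is lost when one predictor is dropped.
        (a,): full - R(b, c),
        (b,): full - R(a, c),
        (c,): full - R(a, b),
        # Effects shared by exactly two predictors.
        (a, b): R(a, c) + R(b, c) - R(c) - full,
        (a, c): R(a, b) + R(b, c) - R(b) - full,
        (b, c): R(a, b) + R(a, c) - R(a) - full,
        # Effect shared by all three predictors.
        (a, b, c): R(a) + R(b) + R(c) - R(a, b) - R(a, c) - R(b, c) + full,
    }


# Hypothetical R2 values (illustrative only, NOT results from the study).
AGE, BA, BC = "age", "brain_age", "brain_cognition"
r2 = {
    frozenset([AGE]): 0.28,
    frozenset([BA]): 0.20,
    frozenset([BC]): 0.25,
    frozenset([AGE, BA]): 0.32,
    frozenset([AGE, BC]): 0.40,
    frozenset([BA, BC]): 0.32,
    frozenset([AGE, BA, BC]): 0.43,
}

effects = commonality_three(r2, AGE, BA, BC)
unique_bc = effects[(BC,)]
# Share of the variance explained by age + Brain Age that would be gained.
relative_gain = unique_bc / r2[frozenset([AGE, BA])]
print(effects)
print(round(relative_gain, 2))
```

      The seven components always sum to the R2 of the full model, which is a convenient sanity check when adapting the sketch to real regression output.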

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally.

      Thank you for your comments on this issue.

      We now discuss the broader considerations in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders,

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and

      (3) the solutions that we and others have suggested to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance, combining different MRI modalities into the prediction models, similar to our stacked models, often leads to the highest performance of age-prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applied prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study into just the normative type of study. 
      Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age-prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder.

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age-prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest. Accordingly, this calls for a change in research practice, as recently pointed out by Jirasarie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to better understand the stacked regression models used to ensure that these models are not overfit.

      Thank you for giving us an opportunity to clarify our stacked model. We have added further explanation (see below). We confirm that we did not use the test sets to build the stacked models at either the lower or the higher level of the Elastic Net models. The test sets were used only for testing the performance of the models.

      From Methods: “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
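
      The three performance measures named above can be sketched as follows. This is a minimal illustration with made-up observed and predicted values; the function names are ours and are not taken from the authors' code. Note that R2 under the sum-of-squares definition is not the square of Pearson's r and can be negative when predictions are worse than the mean.

```python
def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / (var_o * var_p) ** 0.5


def r2_sum_of_squares(observed, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Made-up values standing in for pooled outer-fold test predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

r = pearson_r(observed, predicted)
r2 = r2_sum_of_squares(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(r, r2, mae)
```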

      Note that some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both the first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice, with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
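
      The data partitioning behind the three steps can be sketched as below. This is only an index-bookkeeping sketch under stated assumptions: the sample size of 500, the seeds and the function names are hypothetical, and no models are fitted. It shows that the first and second steps draw their inner folds from the same outer-fold training set, and that the outer-fold test set is touched only in the third step.

```python
import random


def k_fold_indices(items, k, seed):
    """Shuffle `items` and split them into k roughly equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_plan(n_participants=500, k_outer=5, k_inner=5, seed=0):
    """List, for each outer fold, the data that each step is allowed to use."""
    outer_folds = k_fold_indices(range(n_participants), k_outer, seed)
    plan = []
    for i, outer_test in enumerate(outer_folds):
        outer_train = [p for j, fold in enumerate(outer_folds) if j != i
                       for p in fold]
        plan.append({
            # Step 1: inner folds for tuning one model per set of features.
            "step1_inner_folds": k_fold_indices(outer_train, k_inner, seed + 1),
            # Step 2: the SAME outer-fold training set, re-divided into new
            # inner folds for tuning the stacked models.
            "step2_inner_folds": k_fold_indices(outer_train, k_inner, seed + 2),
            # Step 3: held-out participants, used only to test tuned models.
            "outer_test": outer_test,
            "outer_train": outer_train,
        })
    return plan


plan = nested_cv_plan()
for fold in plan:
    print(len(fold["outer_train"]), len(fold["outer_test"]),
          [len(f) for f in fold["step1_inner_folds"]])
```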

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits?

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate this, we now examine the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ values for each prediction model of the same features. We found that Spearman’s ρ varied dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 principal components, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients with either ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction of features as a group or selection of one feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.
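
      The rank-stability check described above can be sketched as follows: with five outer folds there are 10 fold pairs, and each pair yields one Spearman’s ρ between the two coefficient vectors. This is a minimal sketch; the coefficient values and function names are hypothetical and do not come from the study.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_rank_stability(coefs_by_fold):
    """One Spearman's rho per pair of outer folds."""
    return [spearman_rho(a, b) for a, b in combinations(coefs_by_fold, 2)]


# Hypothetical Elastic Net coefficients: 5 outer folds x 4 features.
coefs_by_fold = [
    [0.50, 0.10, -0.20, 0.00],
    [0.45, 0.12, -0.25, 0.01],
    [0.40, -0.05, -0.10, 0.02],
    [0.55, 0.08, -0.30, 0.00],
    [0.10, 0.30, -0.20, 0.05],
]
rhos = pairwise_rank_stability(coefs_by_fold)
print(len(rhos), min(rhos), max(rhos))
```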

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.

      Thank you for the opportunity to provide more methodological details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009).

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a button to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go].

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the left or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.”

      Second, for MRI processing procedures, we included the following statements.

      From Methods: “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 19 sets of features.”

      “Sets of Features 1-10: Task fMRI contrast (Task Contrast)

      Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linearly detrended (see https://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned of potential artifacts using ICA-FIX (Glasser et al., 2016).

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high-pass cutoff at 200s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and FreeSurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features.”

      “Sets of Features 11-13: Task fMRI functional connectivity (Task FC)

      Task FC reflects functional connectivity (FC) among the brain regions during each task, which is considered an important source of individual differences (Elliott et al., 2019; Fair et al., 2007; Gratton et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors in FSL and regressed them out in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliott et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note that, to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task.

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC)

      Similar to Task FC, Rest FC reflects functional connectivity (FC) among the brain regions, except that Rest FC occurred during the resting (as opposed to task-performing) period. HCP-A collected Rest FC from four 6.42-min (488 frames) runs across two days, yielding about 26 min of data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was pre-processed and concatenated across the four runs. We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) as for Task FC.

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established pre-processing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey matter volume, “FS_Tot_WM_Vol” or total cortical white matter volume, “FS_SubCort_GM_Vol” or total subcortical grey matter volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”
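      The Task FC and Rest FC steps quoted above (Pearson’s correlations between regions, r-to-z transformation, and a PCA defined on the training set only) can be sketched as follows. This is a minimal illustration on synthetic time series, not the code used for the manuscript; the function names `fc_features` and `reduce_fc` are hypothetical, and the demonstration uses 20 regions and 5 components instead of 379 regions and 75 components to keep it small.

```python
import numpy as np
from sklearn.decomposition import PCA


def fc_features(timeseries):
    """Vectorised, Fisher z-transformed FC for one participant.

    timeseries: array of shape (n_timepoints, n_regions).
    Returns the upper triangle (excluding the diagonal) of the
    region-by-region Pearson correlation matrix after r-to-z.
    """
    r = np.corrcoef(timeseries, rowvar=False)
    upper = np.triu_indices_from(r, k=1)    # non-overlapping region pairs
    return np.arctanh(r[upper])             # Fisher r-to-z transformation


def reduce_fc(train_fc, test_fc, n_components=75):
    """Fit the PCA on the training set only, then apply it to the test set."""
    pca = PCA(n_components=n_components)
    train_pcs = pca.fit_transform(train_fc)
    test_pcs = pca.transform(test_fc)        # no test data enters the PCA fit
    return train_pcs, test_pcs, pca


# 379 regions give 379 * 378 / 2 = 71,631 non-overlapping pairs.
n_pairs_full = 379 * 378 // 2

# Small synthetic demonstration (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
n_regions, n_timepoints, n_participants = 20, 120, 40
fc = np.stack([
    fc_features(rng.normal(size=(n_timepoints, n_regions)))
    for _ in range(n_participants)
])
train_pcs, test_pcs, _ = reduce_fc(fc[:30], fc[30:], n_components=5)
print(fc.shape, train_pcs.shape, test_pcs.shape)
```

      Fitting the PCA inside each training set, rather than once on all participants, is what keeps information from the test set out of the features.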

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par with, if not better than, many non-linear and more-complicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below).

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the feature’s coefficients is determined by a shrinkage hyperparameter ‘α’: the greater the α, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘l1 ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; l1 ratio=0) or absolute (known as ‘Lasso’; l1 ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

      \[ \min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \, l_1\text{-ratio} \, \lVert \beta \rVert_1 + \frac{\alpha \, (1 - l_1\text{-ratio})}{2} \lVert \beta \rVert_2^2 \]

      where X is the features, y is the target, β is the coefficient, and n is the number of observations. In our grid search, we tuned two Elastic Net hyperparameters: α using 70 numbers in log space, ranging from .1 to 100, and l_1-ratio using 25 numbers in linear space, ranging from 0 to 1.

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weighs relatively more strongly in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross-validation, different outer folds may have different values of ‘α’ and ‘l1 ratio’, making the final coefficients differ across folds. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘α’ leads to similar predictive performance), resulting in different ‘α’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without splitting it into five folds and visualised the coefficients on brain images using the Brainspace (Vos De Wael et al., 2020) and Nilearn (Abraham et al., 2014) packages. Note that, unlike other sets of features, Task FC and Rest FC were modelled after data reduction via PCA. Thus, for Task FC and Rest FC, we first multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and then summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices.”
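      The tuning and visualisation steps quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the manuscript’s code: the helper names (`make_grid`, `nested_cv`, `project_to_pairs`) are hypothetical, the data are synthetic, model selection uses the estimator’s default R² score (the quoted text does not state the tuning metric), and the demonstration uses a much smaller grid than the 70 × 25 grid so that it runs quickly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold


def make_grid(n_alpha=70, n_l1=25):
    """Hyperparameter grid as described in the text."""
    return {
        "alpha": np.logspace(-1, 2, n_alpha),    # .1 to 100 in log space
        "l1_ratio": np.linspace(0, 1, n_l1),     # 0 (Ridge) to 1 (Lasso)
    }


def nested_cv(X, y, grid, n_outer=5, n_inner=5, seed=0):
    """Tune in the inner folds, evaluate in the held-out outer fold."""
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    results = []
    for train_idx, test_idx in outer.split(X):
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(ElasticNet(max_iter=10000), grid, cv=inner)
        search.fit(X[train_idx], y[train_idx])   # test rows never seen here
        score = search.score(X[test_idx], y[test_idx])
        results.append((search.best_params_, score))
    return results


def project_to_pairs(pca, enet_coef):
    """Sum over components of |PCA component| * Elastic Net coefficient."""
    return np.abs(pca.components_).T @ enet_coef


# Synthetic demonstration with a reduced grid (assumption, for speed).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=120)
results = nested_cv(X, y, make_grid(n_alpha=5, n_l1=3))
for params, score in results:
    print(params, round(score, 2))               # tuned values can differ by fold

# Back-projection of coefficients fitted on principal components.
pca = PCA(n_components=4).fit(X)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(pca.transform(X), y)
pair_importance = project_to_pairs(pca, enet.coef_)
print(pair_importance.shape)
```

      Because the selected ‘α’ and ‘l1 ratio’ can differ between outer folds, the per-fold coefficients are not directly comparable, which is the motivation given above for refitting on the full dataset before visualisation.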

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 14. https://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. https://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. https://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Elliott, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffitt, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. https://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. https://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. https://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175–1187. https://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. https://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. https://doi.org/10.1093/cercor/bhu239

      Gratton, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. https://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapretto, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. https://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. Patterns, 4(4), 100712. https://doi.org/10.1016/j.patter.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. https://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Hutter, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. https://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapretto, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. https://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. https://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. https://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.-J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. https://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. https://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x


      The following is the authors’ response to the previous reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      Public Reviews:

      Reviewer 1 (Public Review):

      In this paper, the authors evaluate the utility of brain-age-derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain-age-derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ("brain-cognition") as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      (1) I thank the authors for addressing many of my concerns with this revision. However, I do not feel they have addressed them all. In particular I think the authors could do more to address the concern I raised about the instability of the regression coefficients and about providing enough detail to determine that the stacked regression models do not overfit.

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #1 and #2 (see below).

      (2) In considering my responses to the authors’ revision, I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular, the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. To be fair, these conceptual problems are more widespread than this paper alone, so I do not believe the authors should be penalised for that. However, I would recommend making these concerns more explicit in the manuscript.

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #3 (see below).

      Reviewer 2 (Public Review):

      In this study, the authors aimed to evaluate the contribution of brain-age indices in capturing variance in cognitive decline and proposed an alternative index, brain-cognition, for consideration.

      The study employs suitable methods and data to address the research questions, and the methods and results sections are generally clear and easy to follow.

      I appreciate the authors' efforts in significantly improving the paper, including some considerable changes, from the original submission. While not all reviewer points were tackled, the majority of them were adequately addressed. These include additional analyses, more clarity in the methods and a much richer and nuanced discussion. While recognising the merits of the revised paper, I have a few additional comments.

      (1) Perhaps it would help the reader to note that it might be expected for brain-cognition to account for a significantly larger variance (11%) in fluid cognition, in contrast to brain-age. This stems from the fact that the authors specifically trained brain-cognition to predict fluid cognition, the very variable under consideration. In line with this, the authors later recommend that researchers considering the use of brain-age should evaluate its utility using a regression approach. The latter involves including a brain index (e.g. brain-cognition) previously trained to predict the regression's target variable (e.g. fluid cognition) alongside a brain-age index (e.g., corrected brain-age gap). If the target-trained brain index outperforms the brain-age metric, it suggests that relying solely on brain-age might not be the optimal choice. Although not necessarily the case, is it surprising for the target-trained brain index to demonstrate better performance than brain-age? This harks back to the broader point raised in the initial review: while brain-age may prove useful (though sometimes with modest effect sizes) across diverse outcomes as a generally applicable metric, a brain index tailored for predicting a specific outcome, such as brain-cognition in this case, might capture a considerably larger share of variance in that specific context but could lack broader applicability. The latter aspect needs to be empirically assessed.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (please see our responses to Reviewer 1 Recommendations For The Authors #3 below).

      Briefly, as in our 2nd revision, we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      (2) Furthermore, the discussion pertaining to training brain-age models on healthy populations for subsequent testing on individuals with neurological or psychological disorders seems somewhat one-sided within the broader debate. This one-sidedness might potentially confuse readers. It is worth noting that the choice to employ healthy participants in the training model is likely deliberate, serving as a norm against which atypical populations are compared. To provide a more comprehensive understanding, referencing Tim Hahn’s counterargument to Bashyam’s perspective could offer a more complete view (https://academic.oup.com/brain/article/144/3/e31/6214475?login=false).

      Thank you Reviewer 2 for bringing up this issue. We have now revised the paragraph in question and added nuances on the usage of Brain Age for normative vs. case-control studies. We also cited Tim Hahn’s article that explained the conceptual foundation of the use of Brain Age in case-control studies. Please see below. Additionally, we also made a statement about our study not being able to address issues about the case-control studies directly in the newly written conclusion (see Reviewer 3 Recommendations for the Authors #3).

      Discussion:

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applying prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). In contrast, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study to just the normative type of study. Future work is still needed to test the utility of Brain Age in the case-control setting.”

      (3) Overall, this paper makes a significant contribution to the field of brain-age and related brain indices and their utility.

      Thank you for the encouragement.

      Reviewer 3 (Public Review):

      The main question of this article is as follows: "To what extent does having information on brain-age improve our ability to capture declines in fluid cognition beyond knowing a person's chronological age?" This question is worthwhile, considering that there is considerable confusion in the field about the nature of brain-age.

      (1) Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain-age metrics.

      Thank you Reviewer 3 for the comment. We addressed them in our response to Reviewer 3 Recommendations For The Authors #1-3 (see below).

      Recommendations for the authors:

      Reviewer 1 (Recommendations For The Authors):

      (1) I do not feel the authors have fully addressed the concern I raised about the stacked regression models. Despite the new figure, it is still not entirely clear what the authors are using as the training set in the final step. To be clear, the problem occurs because of the parameters, not the hyperparameters (which the authors now state that they are optimising via nested grid search). In other words, given a regression model y = X*beta, if the X are taken to be predictions from a lower level regression model, then they contain information that is derived from both the training set and the test set for the model that this was trained on. If the split is the same (i.e. the predictions are derived on the same test set as is being used at the second level), then this can lead to overfitting. It is not clear to me whether the authors have done this or not. Please provide additional detail to clarify this point.

      Thank you for allowing us an opportunity to clarify our stacked model. We confirm that we did not use test sets to build the stacked models at either the lower or the higher level of the Elastic Net models. Test sets were used only for testing the performance of the models. We have added clarification to make this clearer (see below). Let us explain what we did and provide the rationale below.

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We then re-divided this outer-fold training set into five new inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
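      The three performance measures named in the excerpt above can be written out in a few lines. The following is only an illustrative sketch, not the authors' code: the function names and the toy numbers are assumptions made here. It shows why the sum-of-squares definition of R2 matters: predictions that are perfectly correlated with the observed values but shifted by a constant give a Pearson's r of about 1 and an R2 of about -1.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def r2_sum_of_squares(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy values (made up for illustration): predictions are offset by a constant.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 2.0

r_value = pearson_r(y_true, y_pred)
r2_value = r2_sum_of_squares(y_true, y_pred)
mae_value = mean_absolute_error(y_true, y_pred)
print(r_value, r2_value, mae_value)
```

      Unlike a squared correlation, the sum-of-squares R2 can be negative, which is what makes it sensitive to systematic bias in the predictions.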

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models.

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
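      The three steps described above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `nested_cv_stacking`, the synthetic data, the three toy feature sets (instead of 18) and the hyperparameter grid are all hypothetical, and the sketch assumes that each tuned model is refitted on the whole outer-fold training set after tuning (the default behaviour of `GridSearchCV`). The point it illustrates is that the outer-fold test set is only touched in the third step.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold


def nested_cv_stacking(feature_sets, y, param_grid, n_outer=5, n_inner=5, seed=0):
    """feature_sets: dict mapping a set name to an (n_samples, n_features) array."""
    names = list(feature_sets)
    n = len(y)
    single_pred = {name: np.full(n, np.nan) for name in names}
    stacked_pred = np.full(n, np.nan)

    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    for train_idx, test_idx in outer.split(np.arange(n)):
        y_train = y[train_idx]

        # Step 1: tune one Elastic Net per feature set, outer-fold training set only.
        inner_1 = KFold(n_splits=n_inner, shuffle=True, random_state=seed + 1)
        tuned = {}
        for name in names:
            search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid,
                                  scoring="r2", cv=inner_1)
            search.fit(feature_sets[name][train_idx], y_train)
            tuned[name] = search.best_estimator_  # refitted on the full training set

        # Step 2: tune the stacked model on predictions for the SAME training set,
        # re-divided into new inner folds. The test set is still untouched.
        z_train = np.column_stack(
            [tuned[name].predict(feature_sets[name][train_idx]) for name in names])
        inner_2 = KFold(n_splits=n_inner, shuffle=True, random_state=seed + 2)
        stack = GridSearchCV(ElasticNet(max_iter=10_000), param_grid,
                             scoring="r2", cv=inner_2)
        stack.fit(z_train, y_train)

        # Step 3: apply the already tuned models to the outer-fold test set.
        z_test = np.column_stack(
            [tuned[name].predict(feature_sets[name][test_idx]) for name in names])
        for k, name in enumerate(names):
            single_pred[name][test_idx] = z_test[:, k]
        stacked_pred[test_idx] = stack.predict(z_test)

    return single_pred, stacked_pred


# Synthetic example (made-up data, three toy feature sets).
rng = np.random.default_rng(0)
n_participants = 100
feature_sets = {f"set_{k}": rng.normal(size=(n_participants, 5)) for k in range(3)}
y = sum(X[:, 0] for X in feature_sets.values()) + 0.1 * rng.normal(size=n_participants)
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}

single_pred, stacked_pred = nested_cv_stacking(feature_sets, y, param_grid)
```

      In this layout every participant receives exactly one out-of-sample stacked prediction, produced by models whose parameters and hyperparameters were estimated without that participant.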

      (2) I also do not feel the authors have fully addressed the concern I raised about stability of the regression coefficients over splits of the data. I wanted to see the regression coefficients, not the predictions. The predictions can be stable when the coefficients are not.

      The focus of this article is on the predictions. Still, as pointed out by Reviewer 1, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found that Spearman’s ρ varied dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 principal components, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model. The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.
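      The rank-stability computation described above (one Spearman’s ρ for every pair of outer folds, hence 10 values for five folds) can be sketched as follows. This is an illustrative sketch only: the function name `rank_stability` and the toy coefficients are assumptions made here, not taken from the authors' code.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def rank_stability(coefs_by_fold):
    """Spearman's rho between coefficient vectors for every pair of outer folds.

    coefs_by_fold: array-like of shape (n_folds, n_features), one row of
    Elastic Net coefficients per outer fold for the same set of features.
    """
    coefs = np.asarray(coefs_by_fold, dtype=float)
    rhos = []
    for a, b in combinations(range(coefs.shape[0]), 2):
        rho, _ = spearmanr(coefs[a], coefs[b])
        rhos.append(rho)
    return np.array(rhos)


# Toy coefficients (made up): each fold rescales the same pattern by a positive
# constant, so the rank order is identical across folds.
base = np.array([0.5, -0.2, 0.1, 0.0, 0.3, -0.4])
coefs_by_fold = np.array([base * (k + 1) for k in range(5)])

rhos = rank_stability(coefs_by_fold)
print(len(rhos), rhos.mean())
```

      With real Elastic Net coefficients, many entries can be exactly zero under a Lasso-like penalty; `spearmanr` assigns average ranks to such ties, which is one reason ρ may differ between sparse and dense models.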

      (3) I also must say that I agree with Reviewer 3 about the limitations of the brain-age and brain-cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain-age model that is trained to predict age. This suffers from the same problem the authors raise with brain-age and I agree that this would probably disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain-age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain-cognition.

      Thank you so much for raising this point. Reviewer 2 (Public Review #1) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. That is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”

      Reviewer #3 (Recommendations For The Authors):

      Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to: 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain age metrics.

      (1) I understand your point here. I think the distinction is that it is fine to build predictive models, but then there is no need to go through this intermediate step of "brain-cognition". Just say that brain features can predict cognition XX well, and brain-age (or some related metric) can predict cognition YY well. It creates a confusing framework for the reader that can lead them to believe that "brain-cognition" is not just a predicted value of fluid cognition from a model using brain features to predict cognition. While you clearly state that that is in fact what it is in the text, which is a huge improvement, I do not see what is added by going through brain-cognition instead of simply just obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or vice versa, depending on the question. Please do this analysis, and either compare and contrast it with going through "brain-cognition" in your paper, or switch to this analysis, as it more directly addresses the question of the incremental predictive utility of brain-age above and beyond brain features.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 2 (Public Review #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (see our responses to Reviewer 1 Recommendations For The Authors #3 above).

      Briefly, as in our 2nd revision, we made it explicitly clear that we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. And, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      We have thought about changing the name Brain Cognition into something along the lines of “predicted values of prediction models predicting fluid cognition based on brain MRI.” However, this made the manuscript hard to follow, especially with the commonality analyses. For instance, the sentence, “Here, we tested Brain Cognition’s unique effects in multiple regression models with a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition” would become “Here, we tested predicted values of prediction models predicting fluid cognition based on brain MRI unique effects in multiple regression models with a Brain Age index, chronological age and predicted values of prediction models predicting fluid cognition based on brain MRI as regressors to explain fluid cognition.” We believe, given our additional explanation (see our responses to Reviewer 1 Recommendations For The Authors #3 above), readers should understand what Brain Cognition is, and that we did not intend to compare Brain Age and Brain Cognition directly.

      As for the suggested analysis, “obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or vice versa,” we have already done this in the form of commonality analysis (Nimon et al., 2008) (see Figure 7 below). That is, to obtain the unique and common effects of the regressors, we need to look at all of the possible changes in R2 when every possible subset of regressors is excluded or included (see equations 12 and 13 below).

      From Methods:

      “Similar to the above multiple regression model, we had chronological age, each Brain Age index and Brain Cognition as the regressors for fluid cognition:

      $$\text{Fluid Cognition}_i = \beta_0 + \beta_1\,\text{Chronological Age}_i + \beta_2\,\text{Brain Age Index}_{i,j} + \beta_3\,\text{Brain Cognition}_i + \varepsilon_i, \tag{12}$$

      Applying the commonality analysis here allowed us, first, to investigate the additive, unique effects of Brain Cognition, over and above chronological age and Brain Age indices. More importantly, the commonality analysis also enabled us to test the common, shared effects that Brain Cognition had with chronological age and Brain Age indices in explaining fluid cognition. We calculated the commonality analysis as follows (Nimon et al., 2017):

      $$
      \begin{aligned}
      \text{Unique Effect}_{\text{chronological age}} &= \Delta R^2_{\text{chronological age}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{Brain Age index, Brain Cognition}} \\
      \text{Unique Effect}_{\text{Brain Age index}} &= \Delta R^2_{\text{Brain Age index}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{chronological age, Brain Cognition}} \\
      \text{Unique Effect}_{\text{Brain Cognition}} &= \Delta R^2_{\text{Brain Cognition}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{chronological age, Brain Age index}} \\
      \text{Common Effect}_{\text{chronological age, Brain Age index}} &= R^2_{\text{chronological age, Brain Cognition}} + R^2_{\text{Brain Age index, Brain Cognition}} - R^2_{\text{Brain Cognition}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{chronological age, Brain Cognition}} &= R^2_{\text{chronological age, Brain Age index}} + R^2_{\text{Brain Age index, Brain Cognition}} - R^2_{\text{Brain Age index}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{Brain Age index, Brain Cognition}} &= R^2_{\text{chronological age, Brain Age index}} + R^2_{\text{chronological age, Brain Cognition}} - R^2_{\text{chronological age}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{chronological age, Brain Age index, Brain Cognition}} &= R^2_{\text{chronological age}} + R^2_{\text{Brain Age index}} + R^2_{\text{Brain Cognition}} - R^2_{\text{chronological age, Brain Age index}} - R^2_{\text{chronological age, Brain Cognition}} - R^2_{\text{Brain Age index, Brain Cognition}} + R^2_{\text{chronological age, Brain Age index, Brain Cognition}}
      \end{aligned}
      \tag{13}
      $$
      ”
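      The commonality decomposition for three regressors can be sketched in code as follows. This is an illustrative sketch, not the authors' analysis: the helper names `r2_ols` and `commonality_three` and the simulated data are assumptions made here, with A, B and C standing for chronological age, a Brain Age index and Brain Cognition. The sketch uses the standard decomposition, in which the full-model R2 enters the three-way common effect with a plus sign; this is what makes the seven components sum to the R2 of the full model.

```python
import numpy as np


def r2_ols(X, y):
    """R2 of an ordinary least squares fit of y on the columns of X (plus intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    centred = y - y.mean()
    return float(1.0 - (resid @ resid) / (centred @ centred))


def commonality_three(age, brain_age, brain_cog, y):
    """Unique and common effects for three regressors A, B and C."""
    cols = {"A": age, "B": brain_age, "C": brain_cog}
    # R2 for every non-empty subset of regressors, keyed e.g. "AB" for A and B.
    R = {key: r2_ols(np.column_stack([cols[k] for k in key]), y)
         for key in ("A", "B", "C", "AB", "AC", "BC", "ABC")}
    return {
        "unique_A": R["ABC"] - R["BC"],
        "unique_B": R["ABC"] - R["AC"],
        "unique_C": R["ABC"] - R["AB"],
        "common_AB": R["AC"] + R["BC"] - R["C"] - R["ABC"],
        "common_AC": R["AB"] + R["BC"] - R["B"] - R["ABC"],
        "common_BC": R["AB"] + R["AC"] - R["A"] - R["ABC"],
        "common_ABC": (R["A"] + R["B"] + R["C"]
                       - R["AB"] - R["AC"] - R["BC"] + R["ABC"]),
        "total_R2": R["ABC"],
    }


# Simulated data (made up for illustration only).
rng = np.random.default_rng(1)
n = 500
age = rng.normal(size=n)
brain_age = age + 0.5 * rng.normal(size=n)          # closely tied to age
brain_cog = -0.5 * age + rng.normal(size=n)         # shares variance with age
fluid_cog = -0.6 * age + 0.4 * brain_cog + rng.normal(size=n)

effects = commonality_three(age, brain_age, brain_cog, fluid_cog)
components = sum(v for k, v in effects.items() if k != "total_R2")
print(effects)
```

      Each unique effect is the change in R2 obtained by adding one regressor to a model that already contains the other two, so the change-in-R2 comparison requested by the reviewer is contained in this decomposition.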

      (2) I agree that the solution is not to exclude age as a covariate, and that there is a big difference between inevitable and obvious. I simply think a further discussion of the inevitability of the results would be clarifying for the readers. There is a big opportunity in the brain-age literature to be as direct as possible about why you are finding what you are finding. People need to know not only what you found, but why you found what you found.

      Thank you. We agree that we need to make this point more explicit and direct. In the revised manuscript, we added statements to both the Introduction and the Discussion (see below) about the tight relationship between Brain Age and chronological age by design, which makes the small unique effects of Brain Age inevitable.

      Introduction:

      “Accordingly, by design, Brain Age is very close to chronological age. Because chronological age usually has a strong relationship with fluid cognition to begin with, it is unclear how much Brain Age adds to what is already captured by chronological age.”

      Discussion:

      “First, Brain Age itself did not add much more information to help us capture fluid cognition than what we had already known from a person’s chronological age. This can clearly be seen from the small unique effects of Brain Age indices in the multiple regression models having Brain Age and chronological age as the regressors. While the unique effects of some Brain Age indices from certain age-prediction models were statistically significant, they were all relatively small. Without Brain Age indices, chronological age by itself already explained around 32% of the variation in fluid cognition. Including Brain Age indices only added around 1.6% at best. We believe the small unique effects of Brain Age were inevitable because, by design, Brain Age is very close to chronological age. Therefore, chronological age and Brain Age captured mostly the same variation in fluid cognition.

      Investigating the simple regression models and the commonality analysis between each Brain Age index and chronological age provided additional insights….”

      (3) I believe it is very important to critically examine the use of brain-age and related metrics. As part of this process, I think we should be asking ourselves the following questions (among others): Why go through age prediction? Wouldn't the predictions of cognition (or another variable) using the same set of brain features always be as good or better? You still have not justified the use of brain-age. As I said before, if you are going to continue to recommend the use of brain-age, you need a very strong argument for why you are recommending this. What does it truly add? Otherwise, temper your statements to indicate possible better paths forward.

      Thank you Reviewer 3 for making an argument against the use of Brain Age. We largely agree with you. However, our work only focuses on one phenotype, fluid cognition, and on the normative situation (i.e., not having a case vs control group). As Reviewer 2 pointed out, Brain Age might still have utility in other cases, not studied here. Still, future studies that focus on other phenotypes may consider using our approach as a template to test the utility of Brain Age in other situations. We added the conclusion statement to reflect this.

      From Discussion:

      “Altogether, we examined the utility of Brain Age as a biomarker for fluid cognition. Here are the three conclusions. First, Brain Age failed to add substantially more information over and above chronological age. Second, a higher ability to predict chronological age did not correspond to a higher utility to capture fluid cognition. Third, Brain Age missed up to around one-third of the variation in fluid cognition that could have been explained by brain MRI. Yet, given our focus on fluid cognition, future empirical research is needed to test the utility of Brain Age on other phenotypes, especially when Brain Age is used for anomaly detection in case-control studies (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We hope that future studies may consider applying our approach (i.e., using the commonality analysis that includes predicted values from a model that directly predicts the phenotype of interest) to test the utility of Brain Age as a biomarker for other phenotypes.”

      References

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We thank you for the time you took to review our work and for your feedback! 

      The major changes to the manuscript are:

      (1) We have added visual flow speed and locomotion velocity traces to Figure 5 as suggested.

      (2) We have rephrased the abstract to more clearly indicate that our statement regarding acetylcholine enabling faster switching of internal representations in layer 5 is speculative.

      (3) We have further clarified the positioning of our findings regarding the basal forebrain cholinergic signal in visual cortex in the introduction.

      (4) We have added a video (Video S1) to illustrate different mouse running speeds covered by our data.

      A detailed point-by-point response to all reviewer concerns is provided below.

      Reviewer #1 (Recommendations For The Authors):

      The authors have addressed most of the concerns raised in the initial review. While the paper has been improved, there are still some points of concern in the revised version. 

      Major comments

      (1) Page 1, Line 21: The authors claim, "Our results suggest that acetylcholine augments the responsiveness of layer 5 neurons to inputs from outside of the local network, enabling faster switching between internal representations during locomotion." However, it is not clear which specific data or results support the claim of "switching between internal representations." ... 

      Authors' response: "... That acetylcholine enables a faster switching between internal representations in layer 5 is a speculation. We have attempted to make this clearer in the discussion. ..." 

      In the revised version, there is no new data added to directly support the claim - "Our results suggest acetylcholine ..., enabling faster switching between internal representations during locomotion" (in the abstract). The authors themselves acknowledge that this statement is speculative. The present data only demonstrate that ACh reduces the response latency of L5 neurons to visual stimuli, but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another. To maintain scientific rigor and clarity, I recommend the authors amend this sentence to more accurately reflect the findings. 

      This might be a semantic disagreement? We would argue both a gray screen and a grating are visual stimuli. Hence, we are not sure we understand what the reviewer means by “but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another”. We concur, our data only address one of many possible transitions, but it is a switch between distinct visual stimuli that is sped up by ACh. Nevertheless, we have rephrased the sentence in question by changing “our data suggest” to “based on this we speculate” - but are not sure whether this addresses the reviewer’s concern.  

      (2) Page 4, Line 103: "..., a direct measurement of the activity of cholinergic projection from basal forebrain to the visual cortex during locomotion has not been made." This statement is incorrect. An earlier study by Reimer et al. indeed imaged cholinergic axons in the visual cortex of mice running on a wheel. 

      Authors' response: "We have clarified this as suggested. However, we disagree slightly with the reviewer here. The key question is whether the cholinergic axons imaged originate in basal forebrain. While Reimer et al. 2016 did set out to do this, we believe a number of methodological considerations prevent this conclusion: ... Collins et al. 2023 inject more laterally and thus characterize cholinergic input to S1 and A1, ..."

      The authors pointed out some methodological caveats in previous studies that measured the BF input in V1, and I agree with them on several points. Nonetheless, the statement that "a direct measurement of the activity of cholinergic projection from basal forebrain to visual cortex during locomotion has not been made. ... Prior measurements of the activity of cholinergic axons in visual cortex have all relied on data from a cross of ChAT-Cre mice with a reporter line ..." (Page 4, Line 103) seems to be an oversimplification. In fact, contrary to what the authors noted, Collins et al. (2023) conducted direct imaging of BF cholinergic axons in V1 (Fig. 1) - "Selected axon segments were chosen from putative retrosplenial, somatosensory, primary and secondary motor, and visual cortices". They used a viral approach to express GCaMP in BF axons to bypass the limitations associated with the use of a GCaMP reporter mouse line - "Viral injections were used for BF- ACh studies to avoid imaging axons or dendrites from cholinergic projections not arising from the BF (e.g. cortical cholinergic interneurons)." The authors should reconsider the text. 

      The reason we think that our statement here was – while simplified – accurate, is that Collins et al. do record from cholinergic axons in V1, but they don’t show these data (they only show data pooled across all recording sites). By superimposing the recording locations of the Collins paper on the Allen mouse brain atlas (Figure R1), we estimate that of the approximately 50 recording sites, most are in somatosensory and somatomotor areas of cortex, and only 1 appears to be in V1, something that is often missed as it is not really highlighted in that paper. If this is indeed correct, we would argue that the data in the Collins et al. paper are not representative of cholinergic activity in visual cortex (we fear only the authors would know for sure). Nevertheless, we have rephrased again. 

      Author response image 1.

      Overlay of the Collins et al. imaging sites (red dots, black outline and dashed circle) on the Allen mouse brain atlas (green shading). Very few (we estimate that it was only 1) of the recording sites appear to be in V1 (the lightest green area), and maybe an additional 4 appear to be in secondary visual areas.  

      Minor comments

      (1) It is unclear which BF subregion(s) were targeted in this study. 

      Authors' response: Thanks for pointing this out. We targeted the entire basal forebrain (medial septum, vertical and horizontal limbs of the diagonal band, and nucleus basalis) with our viral injections. ... We have now added the labels for basal forebrain subregions targeted next to the injection coordinates in the manuscript. 

      The authors provided the coordinates for their virus injections targeting the BF subregions - "(AP, ML, DV (in mm): ... ; +0.6, +0.6, -4.9 (nucleus basalis) ..." Are these the right coordinates for the nucleus basalis? 

      Thank you for catching this - this was indeed incorrect. The coordinates were correct, but our annotation of brain region was not (as the reviewer correctly points out, these coordinates are in the horizontal limb of the diagonal band, not the nucleus basalis). We have corrected this.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for addressing most of the points raised in my original review. I still have some concerns relating to the analysis of the data. 

      (1) I appreciate the authors' point that getting mice to run reliably during head-fixed recordings can require training. Since mice in this study were not trained to run, their low speed of locomotion limits the interpretation of the results. I think this is an important potential caveat and I have retained it in the public review. 

      This might be a misunderstanding. The Jordan paper was a bit of an outlier in that we needed mice to run at very high rates due to the fact that our recording times were only minutes. Mice were chosen such that they would more or less continuously run, to maximize the likelihood that they would run during the intracellular recordings. This was what we tried to convey in our previous response. The speed range covered by the analysis in this paper is 0 cm/s to 36 cm/s. 36 cm/s is not far away from the top speed mice can reach on this treadmill (30 cm/s is 1 revolution of the treadmill per second). In our data, the top speed we measured across all mice was 36 cm/s. In the Jordan paper, the peak running speed across the entire dataset was 44 cm/s. Based on the reviewer’s comment, we suspect that the reviewer may be under the impression that 30 cm/s is a relatively slow running speed. To illustrate what this looks like, we have added a video (Video S1) showing different running speeds. 

      (2) The majority of the analyses in the revised manuscript focus on grand average responses, which may mask heterogeneity in the underlying neural populations. This could be addressed by analysing the magnitude and latency of responses for individual neurons. For example, if I understand correctly, the analyses include all neurons, whether or not they are activated, inhibited, or unaffected by visual stimulation and locomotion. For example, while on average layer 2/3 neurons are suppressed by the grating stimulus (Figure 4A), presumably a subset are activated. Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions. This could be presented in the form of a scatter plot, depicting the magnitude of neuronal responses in locomotion vs stationary condition, and opto+ vs no opto conditions. 

      We might be misunderstanding. The first part of the comment is a bit too unspecific to address directly. In cases in which we find the variability is relevant to our conclusions, we do show this for individual cells (e.g., the latencies to running onset are shown as histograms for all cells and axons in Figure S1). It is also unclear to us what the reviewer means by “Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions”. Our conclusions relate to the average responses in L2/3, consistent with the analysis shown. All data will be freely available for anyone to perform follow-up analysis of things we may have missed. For example, the specific suggestion of presenting the data shown in Figure 4 as a scatter plot is shown below (Figure R2). This is something we had looked at but found not to be relevant to our conclusions. The problem with this analysis is that it is difficult to estimate how much the different sources of variability contribute to the total variability observed in the data, and no interesting pattern is clearly apparent. All relevant and clear conclusions are already captured by the mean differences shown in Figure 4. 

      Author response image 2.

      Optogenetic activation of cholinergic axons in visual cortex primarily enhances responses of layer 5, but not layer 2/3 neurons. Related to Figure 4. (A) Average calcium response of layer 2/3 neurons in visual cortex to full field drifting grating in the absence or presence of locomotion. Each dot is the average calcium activity of an individual neuron during the two conditions. (B) As in A, but for layer 5 neurons. (C) As in A, but comparing the average response while the mice were stationary, to that while cholinergic axons were optogenetically stimulated. (D) As in C, but for layer 5 neurons. (E) Average calcium response of layer 2/3 neurons in visual cortex to visuomotor mismatch, without and with optogenetic stimulation of cholinergic axons in visual cortex. (F) As in E, but for layer 5 neurons. (G) Average calcium response of layer 2/3 neurons in visual cortex to locomotion onset in closed loop, without and with optogenetic stimulation of cholinergic axons in visual cortex. (H) As in G, but for layer 5 neurons.

      (3) To help the reader understand the experimental conditions in open loop experiments, please include average visual flow speed traces for each condition in Figure 5. 

      We have added the locomotion velocity and visual flow speeds to the corresponding conditions in Figure

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Authors' experimental designs have some caveats to definitely support their claims. Authors claimed that aged LT-HSCs have no myeloid-biased clone expansion using transplantation assays. In these experiments, authors used 10 HSCs and young mice as recipients. Given the huge expansion of old HSC by number and known heterogeneity in immunophenotypically defined HSC populations, it is questionable how 10 out of so many old HSCs (an average of 300,000 up to 500,000 cells per mouse; Mitchell et al., Nature Cell Biology, 2023) can faithfully represent old HSC population. The Hoxb5+ old HSC primary and secondary recipient mice data (Fig. 2C and D) support this concern. In addition, they only used young recipients. Considering the importance of inflammatory aged niche in the myeloid-biased lineage output, transplanting young vs old LT-HSCs into aged mice will complete the whole picture. 

      We sincerely appreciate your insightful comment regarding the existence of approximately 500,000 HSCs per mouse in older mice. To address this, we have conducted a statistical analysis to determine the appropriate sample size needed to estimate the characteristics of a population of 500,000 cells with a 95% confidence level and a ±5% margin of error. This calculation was performed using the finite population correction applied to Cochran’s formula.

      For our calculations, we used a proportion of 50% (p = 0.5), as it has been reported that approximately 50% of HSCs are myeloid-biased[1,2]. The formula used is as follows:

      N = 500,000 (total population size)

      Z = 1.96 (Z-score for a 95% confidence level)

      p = 0.5 (expected proportion)

      e = 0.05 (margin of error)

      Applying this formula, we determined that the required sample size is approximately 384 cells. This sample size ensures that the observed proportion in the sample will reflect the characteristics of the entire population. In our study, we have conducted functional experiments across Figures 2, 3, 5, 6, S3, and S6, with a total sample size of n = 126, which corresponds to over 1260 cells. While it would be ideal to analyze all 500,000 cells, this would necessitate the use of 50,000 recipient mice, which is not feasible. We believe that the number of cells analyzed is reasonable from a statistical standpoint. 
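      The calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' own script: it assumes the standard form of Cochran's formula, n0 = Z²·p·(1 − p)/e², followed by the usual finite population correction, n = n0 / (1 + (n0 − 1)/N). The function name `cochran_sample_size` is hypothetical.

```python
import math


def cochran_sample_size(population, z, p, margin):
    """Cochran's sample size with the finite population correction.

    Assumed standard forms (not taken from the manuscript):
        n0 = z^2 * p * (1 - p) / margin^2
        n  = n0 / (1 + (n0 - 1) / population)
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population estimate
    return n0 / (1 + (n0 - 1) / population)      # correct for finite population


# Values quoted in the response: N = 500,000, Z = 1.96, p = 0.5, e = 0.05
required_cells = math.ceil(cochran_sample_size(500_000, 1.96, 0.5, 0.05))
print(required_cells)
```

      With the values listed in the response, this gives roughly 384 cells, in line with the figure quoted. Note that the formula treats each cell as an independent random draw from the population, which is itself an assumption about how the transplanted cells were sampled.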

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      (2) Authors' molecular data analyses need more rigor with unbiased approaches. They claimed that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment but aged bulk HSCs, which are just a sum of LT-HSCs and ST-HSCs by their gating scheme (Fig. 4A), showed the "tendency" of enrichment of myeloid-related genes based on the selected gene set (Fig. 4D). Although the proportion of ST-HSCs is reduced in bulk HSCs upon aging, since ST-HSCs do not exhibit lymphoid gene set enrichment based on their data, it is hard to understand how aged bulk HSCs have more myeloid gene set enrichment compared to young bulk HSCs. This bulk HSC data rather suggest that there could be a trend toward certain lineage bias (although not significant) in aged LT-HSCs or ST-HSCs. Authors need to verify the molecular lineage priming of LT-HSCs and ST-HSCs using another comprehensive dataset. 

      Thank you for your thoughtful feedback regarding the lack of myeloid or lymphoid gene set enrichment in aged LT-HSCs and aged ST-HSCs, despite the observed tendency for myeloid-related gene enrichment in aged bulk HSCs.

      First, we acknowledge that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Additionally, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[1]. These factors highlight the challenges of interpreting lineage bias in HSCs based solely on previously published transcriptomic data.

      Given these points, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. In this regard, we have confirmed that young and aged LT-HSCs have similar differentiation capacity (Figure 3), while myeloid-biased hematopoiesis is observed in aged bulk HSCs (Figure S3). These findings are further corroborated by independent functional experiments. We sincerely appreciate your insightful comments.

      Reference

      (1) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      (3) Although authors could not find any molecular evidence for myeloid-biased hematopoiesis from old HSCs (either LT or ST), they argued that the ratio between LT-HSC and ST-HSC causes myeloid-biased hematopoiesis upon aging based on young HSC experiments (Fig. 6). However, old ST-HSC functional data showed that they barely contribute to blood production unlike young Hoxb5- HSCs (ST-HSC) in the transplantation setting (Fig. 2). Is there any evidence that in unperturbed native old hematopoiesis, old Hoxb5- HSCs (ST-HSC) still contribute to blood production?

      If so, what are their lineage potential/output? Without this information, it is hard to argue that the different ratio causes myeloid-biased hematopoiesis in aging context. 

      Thank you for the insightful and important question. The post-transplant chimerism of ST-HSCs was low in Fig. 2, indicating that transplantation induced a short-term loss of hematopoietic potential due to hematopoietic stress per cell. 

      To reduce this stress, we increased the number of HSCs in the transplantation setting. In Fig. S6, old LT-HSCs and old ST-HSCs were transplanted in a 50:50 or 20:80 ratio, respectively. As shown in Fig. S6D, the 20:80 group, which had a higher proportion of old ST-HSCs, exhibited a statistically significant increase in the lymphoid percentage in the peripheral blood post-transplantation. 

      These findings suggest that old ST-HSCs contribute to blood production following transplantation. 

      Reviewer #2 (Public review):

      While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Fig 3; Fig 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1:

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that a larger sample size may be needed. On the other hand, our method for determining the sample size in Figure 3 is as follows:

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10.

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, the increase in myeloid-biased HSCs in the LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3.

      (3) However, neither the p-value nor the means themselves indicated a difference: young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82, with n = 8 in Figure 3. Since the means themselves did not differ, it is highly likely that no difference would be detected even if n were increased further.

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have obtained consistent results that expansion of myeloid-biased HSCs does not occur with aging in the Hoxb5+ HSC fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging.
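      To make the sensitivity question concrete, the following sketch estimates the power of a two-sided comparison of two group means from summary statistics. It is an illustration under stated assumptions, not the analysis that produced the power of 0.319 quoted above (the effect size and test behind that value are not given here). It assumes that the ± values are standard deviations, that n is the number of recipients per group, and that a normal approximation is acceptable; for groups this small the approximation is somewhat optimistic compared with a t-test. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mean_a, sd_a, mean_b, sd_b, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test of means."""
    std_normal = NormalDist()
    # Standard error of the difference between the two group means
    se = sqrt(sd_a ** 2 / n_per_group + sd_b ** 2 / n_per_group)
    shift = abs(mean_a - mean_b) / se              # standardized true difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)     # two-sided critical value
    # Probability of landing in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Summary values as reported above (assumed to be mean ± SD, n per group)
power_fig_s3 = approx_power(7.2, 8.9, 42.1, 35.5, n_per_group=10)
power_fig_3 = approx_power(51.4, 31.5, 47.4, 39.0, n_per_group=8)
print(f"Figure S3 contrast: {power_fig_s3:.2f}")
print(f"Figure 3 contrast:  {power_fig_3:.2f}")
```

      Under these assumptions the large difference reported for Figure S3 is detected with high power, whereas a difference as small as the one observed in Figure 3 would be detected only rarely at n = 8. The power for other hypothetical differences can be explored by changing the means passed to the function.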

      As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided. 

      Response #2-2:

      Thank you for the comments. As the reviewer pointed out, we hope to reconfirm our results using single-cell level technology in the future.

      On the other hand, we have reported that the ratio of myeloid to lymphoid cells in the peripheral blood changes when the number of HSCs transplanted, or the number of supporting cells transplanted with HSCs, is varied[1-2]. Therefore, single-cell transplant data need to be interpreted very carefully to determine differentiation potential.

      From this viewpoint, future experiments will combine the Hoxb5 reporter system with a lineage tracing system that can track HSCs at the single-cell level over time. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. We have reflected this comment by adding the following sentences in the manuscript.

      [P19, L451] “In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system[3-4]. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.” 

      It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation. 

      Response #2-3:

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloid-biased hematopoiesis, it seems more accurate to highlight that the relative decrease in the proportion of ST-HSCs, which remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as 

      a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!!!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output in competitive transplants (mind low n-numbers and large std!!!). 

      However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data does not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment. 

      Response #2-4:

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpretation of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is indeed possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved.

      However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently generally accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs[1]. Since there is no change in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Fig. 3, B and C)." 

      [Comment to the authors]: Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1.

      Line 293: "Based on these findings, we concluded that myeloid-biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." 

      Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs? 

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using Figure 8 from the paper.

      First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5). 

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSCs and the relative decrease in ST-HSCs within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.

      As mentioned earlier, since the differentiation capacity of LT-HSCs remains unchanged with age, it seems more accurate to describe that the relative decrease in the proportion of ST-HSCs, which retain long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Recommendations for the authors: 

      Reviewer #2 (Recommendations for the authors):

      Summary: 

      Comment #2-1: While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Figure 3; Figure 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that a larger sample size may be needed. On the other hand, our method for determining the sample size in Figure 3 is as follows: 

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10. 

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, the increase in myeloid-biased HSCs in the LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3. 

      (3) However, neither the p-value nor the means themselves indicated a difference: young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82, with n = 8 in Figure 3. Since the means themselves did not differ, it is highly likely that no difference would be detected even if n were increased further. 

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have obtained consistent results that expansion of myeloid-biased HSCs does not occur with aging in the Hoxb5+ HSC fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging. 

      [Comment for authors]  

      Paradigm-shifting extraordinary claims require extraordinary data. Unfortunately, the authors do not provide additional data to further support their claims. Instead, the authors argue the following: Because they were able to find significant differences between experimental groups in some experiments, the absence of significant differences in the results of other experiments must be correct, too. 

      This logic is in my view flawed. Any assay/experiment with highly variable data has a very low sensitivity to detect significant differences between groups. If, as in this case, the variance is as large as the entire dynamic range of the readout, it becomes impossible to be able to detect any difference. In these cases, it is not surprising and actually expected that the mean of the group is located close to the center of the dynamic range as is the case here (center of dynamic range: 50%). In other words, this means that the experiments are simply not reproducible. It is absolutely critical to remember that any experiment and its associated statistical analysis has 3 (!!!) instead of 2 possible outcomes: 

      (1) There is a statistically significant difference 

      (2) There is no statistically significant difference 

      (3) The results of the experiment are inconclusive because the replicates are too variable and the results are not reproducible.  

      While most of us are inclined to think about outcomes (1) or (2), outcome (3) cannot be neglected. While it might be painful to accept, the only way to address concerns about data reproducibility is to provide additional data, improve reproducibility, and raise the power of the analysis to an acceptable level (e.g. able to detect a difference of 5-10% between groups).

      Without going into the technical details, the example graph from the link below illustrates that with a power of 0.319, as stated by the authors, approx. 25 transplants, instead of 8, would be required.

      Typically, however, a power of 0.8 is a reasonable value for any power analysis (although it's not a very strong power either). Even if we are optimistic and assume that there might be a reasonably large difference between experimental groups (in the example above P2 = 0.6, which is actually not that large) we can estimate that we would need over 10 transplants per group to say with confidence that two experimental groups likely do not differ. With smaller differences, these numbers increase quickly to 20+ transplants per group as can be seen in the example graph using an Alpha of 0.1 above. 
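
      The kind of calculation referred to above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis behind the authors' value of 0.319: it uses a two-sided normal approximation, equal group sizes, independent groups, and the SDs quoted for Figure 3 (31.5 and 39.0 percentage points). All function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(delta, sd1, sd2, n, alpha=0.05):
    """Approximate power to detect a true difference `delta` with n per group."""
    se = sqrt((sd1 ** 2 + sd2 ** 2) / n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    # Probability that the statistic falls in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group(delta, sd1, sd2, power=0.8, alpha=0.05):
    """Approximate group size needed to detect `delta` at the given power."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return ceil(z_sum ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)


def min_detectable_difference(sd1, sd2, n, power=0.8, alpha=0.05):
    """Smallest true difference detectable at the given power with n per group."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return z_sum * sqrt((sd1 ** 2 + sd2 ** 2) / n)


SD_YOUNG, SD_AGED = 31.5, 39.0  # SDs quoted for Figure 3 (percentage points)

print("smallest detectable difference at n = 8:",
      round(min_detectable_difference(SD_YOUNG, SD_AGED, 8), 1))
for delta in (10, 20, 30):
    print(f"difference of {delta} points: n per group =",
          n_per_group(delta, SD_YOUNG, SD_AGED))
```

      With variances this large, the required group size grows with the inverse square of the difference one wants to detect, which is why halving the target difference roughly quadruples the number of transplants.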

      Further reading can be found here and in many textbooks or other online resources: https://power-analysis.com/effect_size.htm  https://tss.awf.poznan.pl/pdf-188978-110207?filename=Using%20power%20analysis%20to.pdf

      Response:

      Thank you for your feedback. We fully agree with the reviewer that paradigm-shifting claims must be supported by equally robust data. It has been well documented that the frequency of myeloid-biased HSCs increases with age, with reports indicating that over 50% of the HSC compartment in aged mice consists of myeloid-biased HSCs[1,2]. Based on this, we believe that if aged LT-HSCs were substantially myeloid-biased, the difference should be readily detectable.

      To further validate our findings, we present a similar preliminary experiment. The resulting data are shown below (n = 8).

      Author response image 1.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n = 8). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse. *P < 0.05. **P < 0.01.

      While a slight increase in myeloid-biased hematopoiesis was observed in the aged LT-HSC fraction, the difference was not statistically significant. These new results are presented alongside the original Figure 3, which was generated using a larger sample size (n = 16).

      Author response image 2.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n = 16). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse.

      Consistent with the original data, aged LT-HSCs exhibited a lineage output that was nearly identical to that of young LT-HSCs. Nonetheless, as the reviewer rightly pointed out, we cannot completely exclude the possibility that subtle differences may exist but remain undetected. To address this, we have added the following sentence to the manuscript:  

      [P9, L200] “These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.”

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      Comment #2-3: It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation.

      Response #2-3:  

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloid-biased hematopoiesis, it seems more accurate to say that the relative decrease in the proportion of ST-HSCs, which remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis."

      [Comment for authors] 

      While this interpretation of the data might make sense, the shown data do not exclude alternative explanations. The authors do not exclude the possibility that LT-HSCs expand with age and that this expansion in combination with an aging microenvironment drives myeloid bias. The authors should quantify the frequency [%] and absolute number of LT-HSCs and ST-HSCs in young vs. aged animals. Especially analyzing the abs. numbers of cells will be important to support their claims as % can be affected by changes in the frequency of other populations.

      Thank you for raising this very important point. As the reviewer pointed out, we do not exclude the possibility that LT-HSC expansion in combination with an aged microenvironment drives myeloid bias. Additionally, we acknowledge that myeloid-biased hematopoiesis with age is a complex process likely influenced by multiple factors. We would like to discuss the mechanism mentioned as a future research direction. Thank you for the insightful feedback. Regarding the point about the absolute cell numbers mentioned in the latter half of the paragraph, we will address this in detail in our subsequent response (Response #2-4).

      Comment #2-4: Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output in competitive transplants (mind low n-numbers and large std!). However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data does not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment.

      Response #2-4:  

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpretation of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved. However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently generally accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs[1]. Since there are no changes in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      Reference 

      (1) Akashi K and others, 'A Clonogenic Common Myeloid Progenitor That Gives Rise to All Myeloid Lineages', Nature, 404.6774 (2000), 193-97. 

      [Comment for authors] 

      As the relative frequency of a cell population can be misleading, the authors should compare the absolute numbers of progenitors in young vs. aged mice to strengthen their argument. It would also be helpful to quantify the absolute numbers and relative frequencies in WT mice to exclude the possibility that the HoxB5-trimcherry mouse model suffers from unexpected aging phenotypes and that its hematopoietic system differs from that of wild-type animals.

      Thank you for your valuable feedback. We understand the importance of comparing the absolute numbers of progenitors in young versus aged mice to provide a more accurate representation of the changes in cell populations.

      Therefore, we quantified the absolute cell count of hematopoietic cells in the bone marrow using flow cytometry data. 

      Author response image 3.

      As previously reported, we observed a 10-fold increase in the number of pHSCs in aged mice compared to young mice. Additionally, our analysis revealed a statistically significant decrease in the number of Flk2+ progenitors and CLPs in aged mice. On the other hand, there was no statistically significant change in the number of myeloid progenitors between the two age groups. We appreciate the suggestion and hope that this additional information strengthens our argument and addresses your concerns.

      Comment #2-5:  

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Figure 3, B and C)." Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:  

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1. 

      [Comment for authors]  

      As explained in detail in the response to #2-1, the provided arguments are not convincing. As the authors pointed out, the power of these experiments is too low to make strong claims. If the authors do not intend to provide new data, the language of the manuscript needs to be adjusted to reflect this weakness. A paragraph discussing the limitations of the study and mentioning the limited power of the data should be included, beyond the above-mentioned rather vague statement that the data should be validated (which is almost always necessary anyway).

      Thank you for your valuable comment. We agree with the importance of discussing potential limitations in our experimental design. In response to the reviewer’s suggestion, we have revised the manuscript to include the following sentences:

      [P19, L434] "In the co-transplantation assay shown in Figure 3, the myeloid lineage output derived from young and aged LT-HSCs was comparable (Young LT-HSC: 51.4 ± 31.5% vs. Aged LT-HSC: 47.4 ± 39.0%, p = 0.82). Although no significant difference was detected, the small sample size (n = 8) may limit the sensitivity of the assay to detect subtle myeloid-biased phenotypes."

      This addition acknowledges the potential limitations of our analysis and highlights the need for further investigation with larger cohorts.

      Comment #2-6:

      Line 293: "Based on these findings, we concluded that myeloid biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using attached Figure 8 from the paper. First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5).

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSC and the relative decrease in ST-HSC within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.
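
      The mixing argument in this paragraph can be sketched as a weighted average. The toy model below keeps the lineage output of each population fixed and changes only the LT-HSC share of the transplanted compartment. Every number in it is hypothetical and chosen purely for illustration; none is a measured value from the study, and the function name `myeloid_fraction` is an assumption.

```python
def myeloid_fraction(lt_share, lt_myeloid, st_myeloid,
                     lt_output=1.0, st_output=1.0):
    """Myeloid fraction of donor-derived blood cells from an LT/ST mixture.

    lt_share    fraction of transplanted HSCs that are LT-HSCs (0 to 1)
    lt_myeloid  myeloid fraction among LT-HSC-derived blood cells
    st_myeloid  myeloid fraction among ST-HSC-derived blood cells
    *_output    relative number of blood cells contributed per HSC
    """
    lt_cells = lt_share * lt_output
    st_cells = (1.0 - lt_share) * st_output
    total = lt_cells + st_cells
    return (lt_cells * lt_myeloid + st_cells * st_myeloid) / total


# Hypothetical per-population outputs, identical for "young" and "aged".
LT_MYELOID = 0.50   # LT-HSC progeny: balanced output
ST_MYELOID = 0.05   # ST-HSC progeny: mostly long-lived lymphocytes

young_like = myeloid_fraction(0.2, LT_MYELOID, ST_MYELOID)  # few LT-HSCs
aged_like = myeloid_fraction(0.6, LT_MYELOID, ST_MYELOID)   # many LT-HSCs

print(f"young-like ratio: {young_like:.0%} myeloid")
print(f"aged-like ratio:  {aged_like:.0%} myeloid")
```

      In this sketch the apparent myeloid bias rises with the LT-HSC share even though neither population changes its own lineage output, which is the logic of the ratio-reconstitution experiment described above.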

      As mentioned earlier, since the differentiation capacity of LT-HSCs remains unchanged with age, it seems more accurate to say that the relative decrease in the proportion of ST-HSCs, which retain long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis."

      [Comment for authors]

      While I can follow the logic of the argument, my concerns about the interpretation remain as I see discrepancies with other findings in the published literature. For instance, what the authors call ST-HSCs differs from the classical functional definition of ST-HSCs. It is thus difficult to relate the described observations to previous reports. ST-HSCs typically can contribute significantly to multiple lineages for several weeks (see for example PMID: 29625072). It is somewhat surprising that the ST-HSCs in this study don't show this potential and lose their potential much quicker.

      The authors should thus provide a more comprehensive depth of immunophenotypic and molecular characterization to compare their LT-HSCs to ST-HSCs. For instance, are LT-HSCs CD41- HSCs? How do ST-HSCs differ in their surface marker expression from previously used definitions of ST-HSCs? A list of differentially expressed genes between young and old LT-HSCs and ST-HSCs should be done and will likely provide important insights into the molecular programs/markers (beyond the provided GO analysis, which seems superficial).

      Thank you for your valuable feedback. As the reviewer noted, there are indeed multiple definitions of ST-HSCs. We appreciate the opportunity to clarify our definitions of ST-HSCs. We define ST-HSCs functionally, rather than by surface antigens, which we believe is the most classical and widely accepted definition [1]. In our study, we define long-term hematopoietic stem cells (LT-HSCs) as those HSCs that continue to contribute to hematopoiesis after a second transplantation and possess long-term self-renewal potential. Conversely, we define short-term hematopoietic stem cells (ST-HSCs) as those HSCs that do not contribute to hematopoiesis after a second transplantation and only exhibit self-renewal potential in the short term. 

      Next, in the paper referenced by the reviewer[2], the chimerism of each fraction of ST-HSCs also peaked at 4 weeks and then decreased to approximately 0.1% after 12 weeks post-transplantation. Author response image 4 illustrates our ST-HSC donor chimerism in Figure 2. We believe that the data in the paper referenced by the reviewer[2] are consistent with our own observations of the hematopoietic pattern following ST-HSC transplantation, indicating a characteristic loss of hematopoietic potential 4 weeks after the transplantation. Furthermore, as shown in Figures 2D and 2F, the fraction of ST-HSCs does not exhibit hematopoietic activity after the second transplantation. Therefore, we consider this fraction to be ST-HSCs.

      Author response image 4.

      Additionally, the RNAseq data presented in Figures 4 and S4 revealed that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Moreover, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[3]. From the above, while RNAseq data is indeed helpful, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. Thank you once again for your insightful feedback.

      References

      (1) Kiel, Mark J et al. “SLAM family receptors distinguish hematopoietic stem and progenitor cells and reveal endothelial niches for stem cells.” Cell vol. 121,7 (2005): 1109-21. doi:10.1016/j.cell.2005.05.026

      (2) Yamamoto, Ryo et al. “Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment.” Cell stem cell vol. 22,4 (2018): 600-607.e4. doi:10.1016/j.stem.2018.03.013

      (3) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      Reviewer #3 (Public review): 

      Although the topic is appropriate and the new model provides a new way to think about lineage-biased output observed in multiple hematopoietic contexts, some of the experimental design choices, as well as some of the conclusions drawn from the results could be substantially improved. Also, they do not propose any potential mechanism to explain this process, which reduces the potential impact and novelty of the study. 

      The authors have satisfactorily replied to some of my comments. However, there are multiple key aspects that still remain unresolved.

      Reviewer #3 (Recommendations for the authors): 

      Comment #3-1,2:  

      Although the additional details are much appreciated, the core of my original comments remains unanswered. There are still no details about the irradiation dose for each particular experiment. Is any transplant performed using a 9.1 Gy dose? If yes, please indicate it in text or figure legend. If not, please remove this number from the corresponding method section.

      Again, 9.5 Gy (split in two doses) is commonly reported as sublethal. The fact that the authors used a methodology that deviates from the "standard" for the field makes it difficult to put these results in context with previous studies. It is not possible to know if the direct and indirect effects of this conditioning method in the hematopoietic system have any consequences in the presented results.

      Thank you for your clarification. We confirm that none of the transplantation experiments described were performed using a 9.1 Gy irradiation dose. We have therefore removed the mention of "9.1 Gy" from the relevant section of the Materials and Methods. We appreciate the helpful suggestion to improve the clarity of the manuscript.

      [P22, L493] “12-24 hours prior to transplantation, C57BL/6-Ly5.1 mice, or aged C57BL/6J recipient mice were lethally irradiated with single doses of 8.7 Gy.”

      Regarding the reviewer’s concern about the radiation dose used in our experiments, we will address this point in more detail in our subsequent response (see Response #3-4).

      Comment #3-4(Original): When representing the contribution to PB from transplanted cells, the authors show the % of each lineage within the donor-derived cells (Figures 3B-C, 5B, 6B-D, 7C-E, and S3 B-C). To have a better picture of total donor contribution, total PB and BM chimerism should be included for each transplantation assay. Also, for Figures 2C-D and Figures S2A-B, do the graphs represent 100% of the PB cells? Are there any radioresistant cells?

      Response #3-4 (Original): Thank you for highlighting this point. Indeed, donor contribution to total peripheral blood (PB) is important information. We have included the donor contribution data for each of the figures mentioned above.

      In Figure 2C-D and Figure S2A-B, the percentage of donor chimerism in PB was defined as the percentage of CD45.1-CD45.2+ cells among total CD45.1-CD45.2+ and CD45.1+CD45.2+ cells, as described in the Methods section.
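
      The definition above translates into a simple ratio. The sketch below implements it next to a variant whose denominator also counts CD45.1+CD45.2- recipient-derived cells, so the two can be compared. The event counts and function names are hypothetical and only illustrate the arithmetic.

```python
def donor_chimerism(donor, supporting):
    """Percent donor chimerism as defined above.

    donor       count of CD45.1-CD45.2+ (donor-derived) cells
    supporting  count of CD45.1+CD45.2+ (supporting-cell-derived) cells
    """
    total = donor + supporting
    if total == 0:
        raise ValueError("no donor or supporting cells were counted")
    return 100.0 * donor / total


def donor_chimerism_of_total(donor, supporting, recipient):
    """Percent donor cells when CD45.1+CD45.2- recipient cells are included."""
    total = donor + supporting + recipient
    if total == 0:
        raise ValueError("no cells were counted")
    return 100.0 * donor / total


# Hypothetical flow cytometry event counts for one recipient mouse.
donor_events, supporting_events, recipient_events = 300, 600, 100

print(f"{donor_chimerism(donor_events, supporting_events):.1f}% "
      "of donor + supporting cells")
print(f"{donor_chimerism_of_total(donor_events, supporting_events, recipient_events):.1f}% "
      "of all CD45+ cells")
```

      The two values diverge whenever recipient-derived cells persist, so reporting the recipient fraction alongside the chimerism makes the choice of denominator explicit.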

      Comment for our #3-4 response:  

      Thanks for sharing these data. These graphs should be included in their corresponding figures along with donor contribution to BM. 

      Regarding Figure 2C-D, as currently shown, the graphs only account for CD45.1-CD45.2+ (donor-derived) and CD45.1+CD45.2+ (supporting-derived). What is the percentage of CD45.1+CD45.2- (recipient-derived)? Since the irradiation regimen is atypical, including this information would help to know more about the effects of this conditioning method.

      Thank you for your insightful comment regarding Figure 2C-D. To address the concern that the reviewer pointed out, we provide the kinetics of the percentage of CD45.1+CD45.2- (recipient-derived) cells in Author response image 5.

      Author response image 5.

      As the reviewer pointed out, we observed the persistence of recipient-derived cells, particularly in the secondary transplant. As noted, this suggests that our conditioning regimen may have been suboptimal. In response, we will include the donor chimerism analysis in the total cells and add the following statement in the study limitations section to acknowledge this point:

      [P19, L439] “Additionally, in this study, we purified LT-HSCs using the Hoxb5 reporter system and employed a moderate conditioning regimen (8.7 Gy). To give a better picture of total donor contribution, total PB chimerism is presented in Figure S7, and we cannot exclude the possibility that these factors may have influenced the results. Therefore, it would be ideal to validate our findings using alternative LT-HSC markers and different conditioning regimens.”

      Comment #3-5: For BM progenitor frequencies, the authors present the data as the frequency of cKit+ cells. This normalization might be misleading as changes in the proportion of cKit+ between the different experimental conditions could mask differences in these BM subpopulations. Representing this data as the frequency of BM single cells or as absolute numbers (e.g., per femur) would be valuable.

      Response #3-5:

      We appreciate the reviewer's comment on this point. 

      Firstly, as shown in Supplemental Figures S1B and S1C, we analyze the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in different panels. Therefore, normalization is required to assess the differentiation of HSCs from upstream to downstream.

      Additionally, the reason for normalizing by c-Kit+ is that the bone marrow analysis was performed after enrichment using the Anti-c-Kit antibody for both upstream and downstream fractions. Based on this, we calculated the progenitor populations as a frequency within the c-Kit positive cells. Next, the results of normalizing the whole bone marrow cells (live cells) are shown below. 

      Author response image 6.

      As with normalization to c-Kit+ cells, the pattern for myeloid progenitors was unchanged, including a statistically significant decrease in CMP in aged mice. Additionally, there were no significant differences in CLP. In conclusion, similar results were obtained with normalization to c-Kit+ cells and with normalization to whole bone marrow cells (live cells).

      However, as the reviewer pointed out, it is necessary to explain the reason for normalization with c-Kit. Therefore, we will add the following description.

      [P21, L502] For the combined analysis of the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in Figure 1B, we normalized by cKit+ cells because we performed a c-Kit enrichment for the bone marrow analysis.

      Comment for our #3-5 response:

      I understand that normalization is necessary to compare across different BM populations. However, the best way would be to normalize to single cells. As I mentioned in my original comment, normalizing to cKit+ cells could be misleading, as the proportion of cKit+ cells could be different across the experimental conditions. Further, enriching for cKit+ cells when analyzing BM subpopulation frequencies could introduce similar potential errors. The enrichment would depend on the level of expression of cKit for each of these populations, which would alter the final quantification. Indeed, CLP are typically defined as cKit-med/low. Thus, cKit enrichment would not be a great method to analyze the frequency of these cells.

      The graph in the authors' response to my comment shows a similar trend to what is represented in Figure 1B for some populations. However, there are multiple statistically significant changes that disappear in this new version. This supports my original concern and, in consequence, I would encourage the authors to represent this data as the frequency of BM single cells or as absolute numbers (e.g., per femur).
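
      The denominator effect described here can be made concrete with a small sketch. The cell counts below are hypothetical and chosen only to show how the same progenitor count yields different frequencies depending on the reference population; the function name `frequency` is an assumption.

```python
def frequency(subset_count, parent_count):
    """Frequency (%) of a subset within a chosen parent population."""
    if parent_count <= 0:
        raise ValueError("parent population must contain cells")
    return 100.0 * subset_count / parent_count


# Hypothetical counts per sample: the CMP number is identical in both
# conditions, but the c-Kit+ compartment is twice as large in the second.
samples = {
    "condition A": {"cmp": 2_000, "ckit_pos": 50_000, "live_bm": 1_000_000},
    "condition B": {"cmp": 2_000, "ckit_pos": 100_000, "live_bm": 1_000_000},
}

for name, counts in samples.items():
    of_ckit = frequency(counts["cmp"], counts["ckit_pos"])
    of_live = frequency(counts["cmp"], counts["live_bm"])
    print(f"{name}: {of_ckit:.1f}% of c-Kit+ cells, "
          f"{of_live:.2f}% of live BM cells")
```

      Here the frequency among c-Kit+ cells halves although neither the CMP count nor its share of live BM cells changes, which is why a frequency of BM single cells or an absolute count per femur is less ambiguous.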

      Thank you for your thoughtful follow-up comment. In response to the reviewer’s suggestion, we will represent the data as the frequency among total BM single cells. These revised graphs have been incorporated into the updated Figure 7F, and the corresponding figure legend has been revised to accurately reflect these representations. We appreciate your valuable input, which has helped us improve the clarity and rigor of our data presentation.

      Comment #3-6: Regarding Figure 1B, the authors argue that if myeloid-biased HSC clones increase with age, they should see increased frequency of all components of the myeloid differentiation pathway (CMP, GMP, MEP). This would imply that their results (no changes or reduction in these myeloid subpopulations) suggest the absence of myeloid-biased HSC clones expansion with age. This reviewer believes that differentiation dynamics within the hematopoietic hierarchy can be more complex than a cascade of sequential and compartmentalized events (e.g., accelerated differentiation at the CMP level could cause exhaustion of this compartment and explain its reduction with age and why GMP and MEP are unchanged) and these conclusions should be considered more carefully.

      Response #3-6:

      We wish to thank the reviewer for this comment. We agree that the differentiation pathway may not be a cascade of sequential events but could be influenced by various factors, such as extrinsic factors.

      In Figure 1B, we hypothesized that there may be other mechanisms causing myeloid-biased hematopoiesis besides the age-related increase in myeloid-biased HSCs, given that the percentage of myeloid progenitor cells in the bone marrow did not change with age. However, we do not discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B. 

      Our newly proposed theories—that the differentiation capacity of LT-HSCs remains unchanged with age and that age-related myeloid-biased hematopoiesis is due to changes in the ratio of LT-HSCs to ST-HSCs—are based on functional experiment results. As the reviewer pointed out, to discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B, it is necessary to apply a system that can track HSC differentiation at single-cell level. The technology would clarify changes in the self-renewal capacity of individual HSCs and their differentiation into progenitor cells and peripheral blood cells. The authors believe that those single-cell technologies will be beneficial in understanding the differentiation of HSCs. Based on the above, the following statement has been added to the text.

      [P19, L440] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with a lineage tracing system[1,2]. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.

      Comment for our #3-6 response:

      Thanks for the response. My original comments referred to the statement "On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP would increase in aged mice if myeloid-biased HSC clones increase with age (Fig. 1 B)" (lines #129-133). Again, the absence of an increase in CMP, GMP and MEP with age does not mean the absence of an increase in myeloid-biased HSC clones. This statement should be considered more carefully.

      Thank you for the insightful comment. We agree that the absence of an increase in CMP, GMP and MEP with age does not mean the absence of an increase in myeloid-biased HSC clones. In our revised manuscript, we have refined the statement to acknowledge this nuance more clearly. The updated text now reads as follows:

      [P6, L129] On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP may increase in aged mice, if myeloid-biased HSC clones increase with age.

      Comment #3-7: Within the few recipients showing good donor engraftment in Figure 2C, there is a big proportion of T cells that are "amplified" upon secondary transplantation (Figure 2D). Is this expected?

      Response #3-7:

      We wish to express our deep appreciation to the reviewer for the insightful comment on this point. As the reviewer pointed out, in Figure 2D, a few recipients show a very high percentage of T cells. We had the same question and considered this phenomenon as follows:

      (1) One reason for the very high percentage of T cells is that we used 1 × 10<sup>7</sup> whole bone marrow cells in the secondary transplantation. Consequently, the donor cells in the secondary transplantation contained more T-cell progenitor cells, leading to a greater increase in T cells compared to the primary transplantation.

      (2) We also consider that this phenomenon may be influenced by the reduced self-renewal capacity of aged LT-HSCs, resulting in decreased sustained production of myeloid cells in the secondary recipient mice. As a result, long-lived memory-type lymphocytes may preferentially remain in the peripheral blood, increasing the percentage of T cells in the secondary recipient mice.

      We have discussed our hypothesis regarding this interesting phenomenon. To further clarify the characteristics of the increased T-cell count in the secondary recipient mice, we will analyze TCR clonality and diversity in the future.

      Comment for our #3-7 response:

      Thanks for the potential explanations to my question. This fact is not commonly reported in previous transplantation studies using aged HSCs. Could Hoxb5 label a fraction of HSCs that is lymphoid/T-cell biased upon secondary transplantation? The number of recipients with a high frequency of lymphoid cells in the peripheral blood (even from young mice) is remarkable.

      Response:

      Thank you for your insightful suggestion. Based on this comment, we calculated the percentage of lymphoid cells in the donor fraction at 16 weeks following the secondary transplantation, which was 56.1 ± 25.8% (L/M = 1.27). According to the Müller-Sieburg criteria, lymphoid-biased hematopoiesis is defined as having an L/M ratio greater than 10. 

      Given our findings, we concluded that the Hoxb5-labeled fraction does not specifically indicate lymphoid-biased hematopoiesis. We sincerely appreciate the valuable input, which helped us to further clarify the interpretation of our results.
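
      The arithmetic behind this comparison can be sketched as follows. This is only an illustration, not the authors' analysis code: it assumes that donor-derived cells are either lymphoid or myeloid, so that the myeloid percentage is 100 minus the lymphoid percentage, and the function names are hypothetical. The authors may have computed the ratio per mouse, so the result need not match the reported 1.27 exactly.

```python
# Minimal sketch (assumed definitions, hypothetical function names).
# Assumption: donor cells are either lymphoid or myeloid, so
# myeloid % = 100 - lymphoid %.

def lymphoid_myeloid_ratio(lymphoid_pct: float) -> float:
    """Return the L/M ratio from the lymphoid percentage of donor cells."""
    if not 0.0 <= lymphoid_pct < 100.0:
        raise ValueError("lymphoid_pct must be in the range [0, 100)")
    return lymphoid_pct / (100.0 - lymphoid_pct)


def is_lymphoid_biased(lm_ratio: float, threshold: float = 10.0) -> bool:
    """Apply the cutoff quoted in the response: L/M ratio greater than 10."""
    return lm_ratio > threshold


# Mean lymphoid percentage reported at 16 weeks after secondary transplantation.
ratio = lymphoid_myeloid_ratio(56.1)
print(f"L/M = {ratio:.2f}, lymphoid-biased: {is_lymphoid_biased(ratio)}")
```

      Under these assumptions the ratio is far below the cutoff of 10, which is consistent with the conclusion stated above.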

      Comment #3-8: Do the authors have any explanation for the high level of variability within the recipients of Hoxb5+ cells in Figure 2C?

      Response #3-8:

      We appreciate the reviewer's comment on this point. As noted in our previous report, transplantation of a sufficient number of HSCs results in stable donor chimerism, whereas a small number of HSCs leads to increased variability in donor chimerism [1]. Additionally, other studies have observed high variability when fewer than 10 HSCs are transplanted [2-3]. Based on this evidence, we consider that the transplantation of a small number of cells (10 cells) is the primary cause of the high level of variability observed.

      Comment for our #3-8 response:

      I agree that transplanting a low number of HSCs increases the mouse-to-mouse variability. For that reason, a larger cohort of recipients for this kind of experiment would be ideal.

      Response:

      Thank you for the insightful comment. We agree that a larger cohort of recipients would be ideal for this type of experiment. In Figure 2, the difference between Hoxb5<sup>+</sup> and Hoxb5⁻ cells is robust, allowing for a clear statistical distinction despite the cohort size. However, we also recognize that a larger cohort would be necessary to detect more subtle differences, particularly in Figure 3. In response, we have added the following statement to the main text to acknowledge this limitation.

      [P9, L200] These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.

      Comment #3-10: Is Figure 2G considering all primary recipients or only the ones that were used for secondary transplants? The second option would be a fairer comparison.

      Response #3-10:

      We appreciate the reviewer's comment on this point. We considered all primary recipients in Figure 2G to ensure a fair comparison, given the influence of various factors such as the radiosensitivity of individual recipient mice[1]. Comparing only the primary recipients used in the secondary transplantation would result in n = 3 (primary recipient) vs. n = 12 (secondary recipient). Including all primary recipients yields n = 11 vs. n = 12, providing a more balanced comparison. Therefore, we analyzed all primary recipient mice to ensure the reliability of our results.

      Comment for our #3-10 response:

      I respectfully disagree. Secondary recipients are derived from only 3 of the primary recipients. Therefore, the BM composition is determined by the composition of their donors. Including primary recipients that are not transplanted into secondary recipients is not the fairest comparison for this analysis.

      Thank you for your comment and for highlighting this important issue. We acknowledge the concern that including primary recipients that are not transplanted into secondary recipients is not the fairest comparison for this analysis. In response, we have reanalyzed the data using only the primary recipients whose bone marrow was actually transplanted into secondary recipients. 

      Author response image 7.

      Importantly, the reanalysis confirmed that the kinetics of myeloid cell proportions in peripheral blood were consistent between primary and secondary transplant recipients. We sincerely appreciate your thoughtful feedback, which has helped us improve the clarity.

      Comment #3-11: When discussing the transcriptional profile of young and aged HSCs, the authors claim that genes linked to myeloid differentiation remain unchanged in the LT-HSC fraction while there are significant changes in the ST-HSCs. However, 2 out of the 4 genes shown in Figure S4B show ratios higher than 1 in LT-HSCs.

      Response #3-11:

      Thank you for highlighting this important point. As the reviewer pointed out, when we analyze the expression of myeloid-related genes, some genes are elevated in aged LT-HSCs compared to young LT-HSCs. However, the GSEA analysis using myeloid-related gene sets, which include several hundred genes, shows no significant difference between young and aged LT-HSCs (see Figure S4C in this paper). Furthermore, functional experiments using the co-transplantation system show no difference in differentiation capacity between young and aged LT-HSCs (see Figure 3 in this paper). Based on these results, we conclude that LT-HSCs do not exhibit any change in differentiation capacity with aging.

      Comment for our #3-11 response:

      The authors used the data in Figure S4 to claim that "myeloid genes were tended to be enriched in aged bulk-HSCs but not in aged LT-HSCs compared to their respective controls" (this is the title of the figure; line # 1326). This is based on an increase in gene expression of CD150, vWF, Selp, Itgb3 in aged cells compared to young cells (Figure S4B). However, an increase in Selp and Itgb3 is also observed for LT-HSCs (lower magnitude, but still an increase).

      Also, regarding the GSEA, the only term showing statistical significance in bulk HSCs is "Myeloid gene set", which does not reach significance in LT-HSCs but presents a trend for enrichment (q = 0.077). None of the terms shown in this panel reaches statistical significance in ST-HSCs.

      Thank you for your valuable point. As the reviewer noted, the current title may cause confusion. Therefore, we propose changing it to the following:

      [P52, L1331] “Figure S4. Compared to their respective young controls, aged bulk-HSCs exhibit greater enrichment of myeloid gene expression than aged LT-HSCs”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aim to assess the effect of salt stress on root:shoot ratio, identify the underlying genetic mechanisms, and evaluate their contribution to salt tolerance. To this end, the authors systematically quantified natural variations in salt-induced changes in root:shoot ratio. This innovative approach considers the coordination of root and shoot growth rather than exploring biomass and the development of each organ separately. Using this approach, the authors identified a gene cluster encoding eight paralog genes with a domain-of-unknown-function 247 (DUF247), with the majority of SNPs clustering into SR3G (At3g50160). In the manuscript, the authors utilized an integrative approach that includes genomic, genetic, evolutionary, histological, and physiological assays to functionally assess the contribution of their genes of interest to salt tolerance and root development.

      Comments on revisions:

      As the authors correctly noted, variations across samples, genotypes, or experiments make achieving statistical significance challenging. Should the authors choose to emphasize trends across experiments to draw biological conclusions, careful revisions of the text, including titles and figure legends, will be necessary to address some of the inconsistencies between figures (see examples below). However, I would caution that this approach may dilute the overall impact of the work on SR3G function and regulation. Therefore, I strongly recommend pursuing additional experimental evidence wherever possible to strengthen the conclusions.

      (1) Given the phenotypic differences shown in Figures S17A-B, 10A-C, and 6A, the statement that "SR3G does not play a role in plant development under non-stress conditions" (lines 680-681) requires revision to better reflect the observed data.

      Thank you to the reviewer for the comment. We appreciate the acknowledgment that variations among experiments are inherent to biological studies. Figures 6A and S17 represent the same experiment, which initially indicated a phenotype for the sr3g mutant under salt stress. To ensure that growth changes were specifically normalized for stress conditions, we calculated the Stress Tolerance Index (Fig. 6B). In Figure 10, we repeated the experiment including all five genotypes, which supported our original observation that the sr3g mutant exhibited a trend toward reduced lateral root number under 75 mM NaCl compared to Col-0, although this difference was not significant (Fig. 10B). Additionally, we confirmed that the wrky75 mutant showed a significant reduction in main root growth under salt stress compared to Col-0, consistent with findings reported in The Plant Cell by Lu et al. 2023. For both main root length and lateral root number, we demonstrated that the double mutants of wrky75/sr3g displayed growth comparable to wild-type Col-0. This result suggests that the sr3g mutation compensates for the salt sensitivity of the wrky75 mutant.

      We completely agree with the reviewer that there is a variation in our results regarding the sr3g phenotype under control conditions, as presented in Fig. 6A/Fig. S17 and Fig. 10A-C. In Fig. 6A/Fig. S17, we did not observe any consistent trends in main root or lateral root length for the sr3g mutant compared to Col-0 under control conditions. However, in Fig. 10A-C, we observed a significant reduction in main root length, lateral root number, and lateral root length for the sr3g mutant under control conditions. We believe this may align with SR3G’s role as a negative regulator of salt stress responses. While loss of this gene benefits plants in coping with salt stress, it might negatively impact overall plant growth under non-stress conditions. This interpretation is further supported by our findings on the root suberization pattern in sr3g mutants under control conditions (Fig. 8B), where increased suberization in root sections 1 to 3, compared to Col-0, could inhibit root growth. While SR3G's role in overall plant fitness is intriguing, it is beyond the scope of this study. We cannot rule out the possibility that SR3G contributes positively to plant growth, particularly root growth. That said, we observed no differences in shoot growth between Col-0 and the sr3g mutant under control conditions (Fig. 7). Additionally, we calculated the Stress Tolerance Index for all aspects of root growth shown in Fig. 10 and presented it in Fig. S25.

      To address the reviewer's request to rephrase the statement "SR3G does not play a role in plant development under non-stress conditions" (lines 680-681): this statement is found in lines 652-653 and corresponds to Fig. 7, where we evaluated rosette growth in the WT and sr3g mutant under both control and salt stress conditions. We did not observe any significant differences or even trends between the two genotypes under control conditions, confirming the accuracy of the statement. To clarify further, we have revised it to read “SR3G does not play a role in rosette growth and development under non-stress conditions”.

      (2) I agree with the authors that detecting expression differences in lowly expressed genes can be challenging. However, as demonstrated in the reference provided (Lu et al., 2023), a significant reduction in WRKY75 expression is observed in T-DNA insertion mutant alleles of WRKY75. In contrast, Fig. 9B in the current manuscript shows no reduction in WRKY75 expression in the two mutant alleles selected by the authors, which suggests that these alleles cannot be classified as loss-of-function mutants (line 745). Additionally, the authors note that the wrky75 mutant exhibits reduced main root length under salt stress, consistent with the phenotype reported by Lu et al. (2023). However, other phenotypic discrepancies exist between the two studies. For example, 1) Lu et al. (2023) report that wrky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT; 2) under salt stress, Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT. To confirm the loss of WRKY75 function in these T-DNA insertion alleles, the authors should provide additional evidence (e.g., Western blot analysis).

      We sincerely appreciate the reviewer acknowledging the challenge of detecting expression differences in lowly expressed genes, such as transcription factors. Transcription factors are typically expressed at lower levels compared to structural or enzymatic proteins, as they function as regulators where small quantities can have substantial effects on downstream gene expression.

      That said, we respectfully disagree with the reviewer’s interpretation that there is no reduction in WRKY75 expression in the two mutant lines tested in Fig. 9C. Among the two independent alleles examined, wrky75-3 showed a clear reduction in expression compared to WT Col-0 under both control and salt stress conditions. Using the Tukey test to compare all groups, we observed distinct changes in the assigned significance letters for each case:

      Col/root/control (cd) vs wrky75-3/root/control (cd): Although the same significance letter was assigned, we still observed a clear reduction in WRKY75 transcript abundance. More importantly, the variation in expression is notably lower compared to Col-0.

      Col/shoot/control (bcd) vs wrky75-3/shoot/control (a): This is a significant reduction compared to Col-0.

      Col/root/salt (cd) vs wrky75-3/root/salt (bcd): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.

      Col/shoot/salt (bc) vs wrky75-3/shoot/salt (ab): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.

      To address the reviewer’s comment regarding the significant reduction in WRKY75 expression observed in T-DNA insertion mutant alleles of WRKY75 in the reference by Lu et al., 2023, we would like to draw the reviewer’s attention to the following points:

      a) Different alleles: The authors in The Plant Cell used different alleles than those used in our study, with one of their alleles targeting regions upstream of the WRKY75 gene. While we identified one of their described alleles (WRKY75-1, SALK_101367) on the T-DNA express website, which targets upstream of WRKY75, the other allele (wrky75-25) appears to have been generated through a different mechanism (possibly an RNAi line) that is not defined in the Plant Cell paper and does not appear on the T-DNA express website. The authors state in their acknowledgements that they received these seeds as gifts from other labs: “We thank Prof. Hongwei Guo (Southern University of Science and Technology, China) and Prof. Diqiu Yu (Yunnan University, China) for kindly providing the WRKY75<sub>pro</sub>:GUS, 35S<sub>pro</sub>:WRKY75-GFP, wrky75-1, and wrky75-25 seeds. We thank Man-cang Zhang (Electrophysiology platform, Henan University) for performing the NMT experiment”.

      However, in our study, we selected two different T-DNAs that target the coding regions. While this may explain slight differences in the observed responses, both studies independently link WRKY75 to salt stress, regardless of the alleles used. For your reference, we have included a screenshot of the different alleles used.

      Author response image 1.

      b) Different developmental stages: They measured WRKY75 expression in 5-day-old seedlings. In our experiment, we used seedlings grown on 1/2x MS for 4 days, followed by transfer to treatment plates with or without 75 mM NaCl for one week. As a result, we analyzed older plants (12 days old) for gene expression analysis. Despite the difference in developmental stage, we were still able to observe a reduction in gene expression.

      c) Different tissues: The authors of The Plant Cell used whole seedlings for gene expression analysis, whereas we separated the roots and shoots and measured gene expression in each tissue type individually. This approach is logical, as WRKY75 is a root cell-specific transcription factor with higher expression in the roots compared to the shoots, as demonstrated in our analysis (Fig. 9C).

      Based on the reasoning above, we did work with loss-of-function mutants of WRKY75, particularly wrky75-3. To more accurately reflect the nature of the mutation, we have changed the term "loss-of-function" to "knock-down" in line 717.

      The reviewer mentioned phenotypic discrepancies between the two studies. We agree that there are some differences, particularly in the magnitude of responses or expression levels. However, despite variations in the alleles used, developmental stages, and tissue types, both studies reached the same conclusion: WRKY75 is involved in the salt stress response and acts as a positive regulator. We have discussed the differences between our study and The Plant Cell in the section above, summarizing them into three main points: different alleles, different developmental stages, and different tissue types.

      To address the reviewer’s comment regarding "Lu et al. (2023) report that wrky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT": We evaluated root growth differently than The Plant Cell study. In The Plant Cell (Fig. 5, H-J), root elongation was measured in 10-day-old plants with a single time point measurement. They transferred five-day-old wild-type, wrky75-1, wrky75-25, and WRKY75-OE plants to 1/2× MS medium supplemented with 0 mM or 125 mM NaCl for further growth and photographed them 5 days after transfer. In contrast, our study used 4-day-old seedlings, which were transferred to 1/2 MS supplemented with 0, 75, or 125 mM salt for additional growth (9 days). Rather than measuring root growth only at the end, we scanned the roots every other day, up to five times, to assess root growth rates. Essentially, the precision of our method is higher as we captured growth changes throughout the developmental process, compared to the approach used in The Plant Cell. We do not underestimate the significance of the work conducted by other colleagues in the field, but we also recognize that each laboratory has its own approach and specific practices. This variation in experimental setup is intrinsic to biology, and we believe it is important to study biological phenomena in different ways. This is especially so because common or contrasting conclusions reached by different labs using different experimental setups shed more light on reproducibility and on gene contribution across conditions, which is intrinsic to phenotypic plasticity and GxE interactions.

      The Plant Cell used a very high salt concentration, starting at 125 mM, while we were more cautious in our approach, as such a high concentration can inhibit and obscure more subtle phenotypic changes.

      To address the reviewer’s comment on "Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT," we would like to highlight the differences in the methodologies used in both studies. The Plant Cell measured Na+ accumulation in the wrky75 mutant using xylem sap (Supplemental Figure S10), which appears to be a convenient and practical approach in their laboratory. In their experiment, wild-type and wrky75 mutant plants were grown in soil for 3 weeks, watered with either a mock solution or 100 mM NaCl solution for 1 day, and then xylem sap was collected for Na+ content analysis. In contrast, our study employed a different method to measure Na+ and K+ ion content, using Inductively Coupled Plasma Atomic Emission Spectroscopy (ICP-AES) for root and shoot Na+ and K+ measurements. Additionally, we collected samples after two weeks on treatment plates and focused on the Na+/K+ ratio, which we consider more relevant than net Na+ or K+ levels, as the ratio of these ions is a critical determinant of plant salt tolerance. With this in mind, we observed a considerable non-significant increase in the Na+/K+ ratio in the shoots of the wrky75-3 mutant (assigned Tukey’s letter c) compared to the Col-0 WT (assigned Tukey’s letters abc) under 125 mM salt, suggesting that this mutant is salt-sensitive. Importantly, the Na+/K+ ratio in the double wrky75/sr3g mutants was reduced to the WT level under the same salt conditions, further indicating that the salt sensitivity of wrky75 is mitigated by the sr3g mutation.

      Based on the reasons mentioned above, we believe that conducting additional experiments, such as Western blot analysis, is unnecessary and would not contribute new insights or alter the context of our findings.

      Reviewer #2 (Public review):

      Summary:

      Salt stress is a significant and growing concern for agriculture in some parts of the world. While the effects of sodium excess have been studied in Arabidopsis and (many) crop species, most studies have focused on Na uptake, toxicity and overall effects on yield, rather than on developmental responses to excess Na, per se. The work by Ishka and colleagues aims to fill this gap.

      Working from an existing dataset that exposed a diverse panel of A. thaliana accessions to control, moderate, and severe salt stress, the authors identify candidate loci associated with altering the root:shoot ratio under salt stress. Following a series of molecular assays, they characterize a DUF247 protein which they dub SR3G, which appears to be a negative regulator of root growth under salt stress.

      Overall, this is a well-executed study which demonstrates the functional role played by a single gene in plant response to salt stress in Arabidopsis.

      Review of revised manuscript:

      The authors have addressed my point-by-point comments to my satisfaction. In the cases where they have changed their manuscript language, clarified figures, or added analyses I have no further comment. In some cases, there is a fruitful back-and-forth discussion of methodology which I think will be of interest to readers.

      I have nothing to add during this round of review. I think that the paper and associated discussion will make a nice contribution to the field.

      We sincerely appreciate the reviewer’s recognition of the significance of our work to the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Lines 518-519: The statement that other DUF247s exhibit similar expression patterns to SR3G, suggesting their responsiveness to salt stress, is not fully supported by Fig. S14. Please clarify the specific similarities (and differences) in the expression patterns of the DUF247s shown in Fig. S14, as their expression appears to be spatially and temporally diverse. Additionally, the scale is missing in Fig. S14.

      We thank the reviewer. We fixed the text and added expression scales to Figure S14.

      Line 684, Fig. 6A should be 7A.

      Thanks. It is fixed.

      Line 686, Fig. 7A should be 7B.

      Thanks. It is fixed.

      Lines 721-723: The signal quantification in Fig. 8B does not support the claim that "in section one,..., sr3g-5 showed more suberization compared to Col-0." Given the variability and noise often associated with histological dyes such as Fluorol Yellow staining, conclusions should be cautiously grounded in robust signal quantification. Additionally, please specify the number of biological replicates used in both Fig. 8B and C.

      We thank the reviewer for their comments. We believe the statement in the text accurately reflects our results presented in Figure 8B, where we stated “non-significant, but substantially higher levels of root suberization in sr3g-5 compared to Col-0 in sections one to three of the root under control condition (Fig. 8B).” Therefore, we kept the statement and have included the number of biological replicates in the figure legend.

      Lines 731-732: Please provide a more detailed explanation of how the significant changes in suberin monomer levels align with the Fluorol Yellow staining results, and clarify how these findings support the proposed negative role of SR3G in root suberization.

      Fluorol Yellow is a lipophilic dye widely used to label suberin in plant tissues, specifically in roots in this study. Given the inherent variability in histological assays, we confirmed the increase in suberization using an alternative method, Gas Chromatography–Mass Spectrometry (GC-MS). Both approaches revealed elevated suberin levels in the sr3g mutant compared to Col-0. Since the overall suberin content was higher in the mutant under both control and salt stress conditions, we proposed that SR3G acts as a negative regulator of root suberization.

      Lines 686-688 and Figure S24: The authors calculated water mass as FW-DW. A more standard approach for calculating water content is (FW-DW)/FW x 100. Please update the text or adjust the calculation accordingly. Additionally, if the goal is to test differences between WT and the mutant within each condition, a t-test would be a more appropriate statistical method.

      We thank the reviewer. We added water content (%) to Figure S24. We kept the statistical test as is because we wanted to be able to observe changes across conditions and genotypes.
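
      The two quantities discussed in this comment can be contrasted with a short sketch. It is an illustration only: the function names and the example weights are hypothetical and do not come from the manuscript. Water mass is FW-DW, whereas water content normalizes that difference by fresh weight, (FW-DW)/FW x 100.

```python
# Minimal sketch; function names and example weights are hypothetical.

def water_mass(fresh_weight: float, dry_weight: float) -> float:
    """Absolute water mass: FW - DW (same units as the inputs)."""
    return fresh_weight - dry_weight


def water_content_percent(fresh_weight: float, dry_weight: float) -> float:
    """Relative water content: (FW - DW) / FW x 100."""
    if fresh_weight <= 0:
        raise ValueError("fresh_weight must be positive")
    if dry_weight < 0 or dry_weight > fresh_weight:
        raise ValueError("dry_weight must lie between 0 and fresh_weight")
    return (fresh_weight - dry_weight) / fresh_weight * 100.0


# Two made-up plants of different size but the same relative water content.
small = (0.50, 0.05)  # (FW, DW) in grams, illustrative values only
large = (1.00, 0.10)

for fw, dw in (small, large):
    print(f"FW={fw} g, water mass={water_mass(fw, dw):.2f} g, "
          f"water content={water_content_percent(fw, dw):.1f}%")
```

      In this made-up case the larger plant holds twice the water mass while the percentage is identical, which is why the normalized value is less sensitive to differences in plant size.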

      Lines 633-635 states that "No significant difference was observed between sr3g-4 and Col-0 (Fig. S18), except for the Stress Tolerance Index (STI) calculated using growth rates of lateral root length and number." However, based on the Figure S18 legend and statistical analysis (i.e., ns), it appears that the sr3g-4 mutant shows no alterations in root system architecture compared to Col-0. Please revise the text to accurately reflect the results of the statistical analysis.

      We thank the reviewer. We have now revised the text to reflect the results.

      Lines 698-707: The statistical analysis does not support the reported differences in the Na+/K+ ratio for the single and double mutants of sr3g-5 and wrky75-3 (Fig. 10D, where levels connected by the same letters indicate they are not significantly different). Furthermore, the conclusion that "the SR3G mutation indeed compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress" is also based on non-significant differences (Fig. S25B). Please revise the text to accurately reflect the results of the statistical analysis. Additionally, since each mutant is compared to the WT, I recommend using Dunnett's test for statistical analysis.

      We thank the reviewer for their feedback. We have carefully revised the text to better support our findings. As previously mentioned, variations among samples are evident and are well-reflected across all our datasets. We have presented all data and focused on identifying trends within our samples to guide interpretation.

      We observed that the SR3G mutation effectively compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress. A closer examination of the shoot Na+/K+ ratio under 125 mM salt shows that the wrky75 single mutant has a higher Na+/K+ ratio (indicated by the letter "c") compared to Col-0 (indicated by "abc") and the two double mutants (also indicated by "abc"). Therefore, we have retained the statistical analysis as originally conducted, and maintain our conclusions as is.

      Figure 6: data in panel C present the Na/K ratio, not Na+ content. Based on the statistical analysis of root Na+ levels presented in Fig. S17C, there is no significant difference between sr3g-5 and WT. Please update the title of Fig. 6. In addition, in panel A, the title of the Y-axis and figure legend should be "Lateral root growth rate" without the word length, and in panel C, the statistical analysis is missing.

      We thank the reviewer. We updated the title of Fig. 6, fixed the Y-axis in panel A, and added statistical letters to panel C. The legend was updated to reflect the changes.

      Figure 7: Please clearly label the time points where significant differences between genotypes are observed for both early and late salt treatments. Was there a significant difference recorded between WT and sr3g-5 on day 0 under early salt stress? Such differences may arise from initial variations in plant size within this experiment, as indicated by Fig. 7B, where significant differences in rosette area are evident starting from day 0. Additionally, please indicate the statistical analysis in panel E.

      We thank the reviewer for this suggestion. We updated the figure with a statistical test added to panel E. Although the difference in growth rate between the sr3g mutant and Col-0 is indeed significant at day 0, we would like to draw the reviewer's attention to the fact that this growth rate was calculated over the 24 hours after adding salt stress. Therefore, this difference in growth rate is related to exposure to salt stress. Moreover, the growth rate of Col-0 and the sr3g mutant does not differ in the two other treatments (Control and Late Salt Stress), further supporting the conclusion that sr3g affects rosette size and growth rate only under early salt stress conditions.

      We have also added the Salt Tolerance Index calculation to Figure S24 as additional evidence, controlling for potential differences in size between Col-0 and sr3g mutant.

      Figure S17: statistical analysis is not indicated in panels A, B, and D.

      We thank the reviewer for spotting that. We updated the figure with a statistical test.

      Figures S21-23: The quality of these figures is insufficient, hindering the ability to effectively interpret the authors' results and main message. Furthermore, a Dunnett's test, rather than a t-test, is the appropriate statistical method for this analysis.

      We thank the reviewer for this observation. We have now provided high-resolution versions of all supplemental figures. As we are comparing each genotype to Col-0 one by one, the results of individual t-tests are sufficient for this analysis.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      (1) Substantial revision of the claims and interpretation of the results is needed, especially in the setting of additional data showing enhanced erythrophagocytosis with decreased RBC lifespan.

      Thank you for your valuable feedback and suggestion for a substantial revision of the claims and interpretation of our results. We acknowledge the importance of considering additional data that shows enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have revised our manuscript and incorporated additional experimental data to support and clarify our findings.

      (1) In our original manuscript, we reported a decrease in the number of splenic red pulp macrophages (RPMs) and phagocytic erythrocytes after hypobaric hypoxia (HH) exposure. This conclusion was primarily based on our observations of reduced phagocytosis in the spleen.

      (2) Additional experimental data on RBC labeling and erythrophagocytosis:

      • Experiment 1 (RBC labeling and HH exposure)

      We conducted an experiment where RBCs from mice were labeled with PKH67 and injected back into the mice. These mice were then exposed to normal normoxia (NN) or HH for 7 or 14 days. The subsequent assessment of RPMs in the spleen using flow cytometry and immunofluorescence detection revealed a significant decrease in both the population of splenic RPMs (F4/80hiCD11blo, new Figure 5A and C) and PKH67-positive macrophages after HH exposure (as depicted in new Figure 5A and C-E). This finding supports our original claim of reduced phagocytosis under HH conditions.

      Author response image 1.

      • Experiment 2 (erythrophagocytosis enhancement)

      To examine the effects of enhanced erythrophagocytosis, we injected Tuftsin after administering PKH67-labelled RBCs. Our observations showed a significant decrease in PKH67 fluorescence in the spleen, particularly after Tuftsin injection compared to the NN group. This result suggests a reduction in RBC lifespan when erythrophagocytosis is enhanced (illustrated in new Figure 7, A-B).

      Author response image 2.

      (3) Revised conclusions:

      • The additional data from these experiments support our original findings by providing a more comprehensive view of the impact of HH exposure on splenic erythrophagocytosis.

      • The decrease in phagocytic RPMs and phagocytic erythrocytes after HH exposure, along with the observed decrease in RBC lifespan following enhanced erythrophagocytosis, collectively suggest a more complex interplay between hypoxia, erythrophagocytosis, and RBC lifespan than initially interpreted.

      We think that these revisions and additional experimental data provide a more robust and detailed understanding of the effects of HH on splenic erythrophagocytosis and RBC lifespan. We hope that these changes adequately address the concerns raised and strengthen the conclusions drawn in our manuscript.

      (2) F4/80 high; CD11b low are true RPMs which the cells which the authors are presenting, i.e. splenic monocytes / pre-RPMs. To discuss RPM function requires the presentation of these cells specifically rather than general cells in the proper area of the spleen.

      Thank you for your feedback requesting a substantial revision of our claims and interpretation, particularly considering additional data showing enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have thoroughly revised our manuscript and included new experimental data that further elucidate the effects of HH on RPMs and erythrophagocytosis.

      (1) Re-evaluation of RPMs population after HH exposure:

      • Flow cytometry analysis (new Figure 3G, Figure 5A and B): We revisited the analysis of RPMs (F4/80hiCD11blo) in the spleen after 7 and 14 days of HH exposure. Our revised flow cytometry data consistently showed a significant decrease in the RPMs population post-HH exposure, reinforcing our initial findings.

      Author response image 3.

      Author response image 4.

      • In situ expression of RPMs (Figure S1, A-D):

      We further confirmed the decreased population of RPMs through in situ co-staining with F4/80 and CD11b, and F4/80 and CD68, in spleen tissues. These results clearly demonstrated a significant reduction in F4/80hiCD11blo (Figure S1, A and B) and F4/80hiCD68hi (Figure S1, C and D) cells following HH exposure.

      Author response image 5.

      (2) Single-cell sequencing analysis of splenic RPMs:

      • We conducted a single-cell sequencing analysis of spleen samples after 7 days of HH exposure (Figure S2, A-C). This analysis revealed a notable shift in the distribution of RPMs, which were predominantly associated with Cluster 0 under NN conditions and showed a reduced presence in this cluster after HH exposure.

      • Pseudo-time series analysis indicated a transition pattern change in spleen RPMs, with a shift from Cluster 2 and Cluster 1 towards Cluster 0 under NN conditions, and a reverse transition following HH exposure (Figure S2, B and D). This finding implies a decrease in resident RPMs in the spleen under HH conditions.

      (3) Consolidated findings and revised interpretation:

      • The comprehensive analysis of flow cytometry, in situ staining, and single-cell sequencing data consistently indicates a significant reduction in the number of RPMs following HH exposure.

      • These findings, taken together, strongly support the revised conclusion that HH exposure leads to a decrease in RPMs in the spleen, which in turn may affect erythrophagocytosis and RBC lifespan.

      Author response image 6.

      In conclusion, our revised manuscript now includes additional experimental data and analyses, strengthening our claims and providing a more nuanced interpretation of the impact of HH on spleen RPMs and related erythrophagocytosis processes. We believe these revisions and additional data address your concerns and enhance the scientific validity of our study.

      (3) RBC retention in the spleen should be measured anyway quantitatively, eg, with proper flow cytometry, to determine whether it is increased or decreased.

      Thank you for your query regarding the quantitative measurement of RBC retention in the spleen, particularly in relation to HH exposure. We have utilized a combination of techniques, including flow cytometry and histological staining, to investigate this aspect comprehensively. Below is a summary of our findings and methodology.

      (1) Flow cytometry analysis of labeled RBCs:

      • Our study employed both NHS-biotin (new Figure 4, A-D) and PKH67 labeling (new Figure 4, E-H) to track RBCs in mice exposed to HH. Flow cytometry results from these experiments (new Figure 4, A-H) showed a decrease in the proportion of labeled RBCs over time, both in the blood and spleen. Notably, the reduction in fluorescently labeled RBCs was significantly greater under NN exposure than under HH exposure, in both blood and spleen. The observed decrease in labeled RBCs was initially counterintuitive, as we expected an increase in RBC retention due to reduced erythrophagocytosis. However, this decrease can be attributed to the significantly increased production of RBCs following HH exposure, which dilutes the proportion of labeled cells.

      • Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure. These findings suggest that HH exposure leads to a decrease in the clearance rate of RBCs.
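      The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the percentage decreases quoted in this response; the dictionary layout and the variable names (`decrease`, `gap`) are hypothetical and are not part of the original analysis. It computes, for each tissue and label, how many percentage points smaller the decrease was under HH than under NN.

```python
# Percentage decreases in labeled RBCs quoted in the response (NN vs HH).
# The data structure and names are illustrative assumptions.
decrease = {
    ("blood", "biotin"): {"NN": 12.06, "HH": 7.82},
    ("blood", "PKH67"): {"NN": 9.70, "HH": 4.09},
    ("spleen", "biotin"): {"NN": 3.13, "HH": 0.46},
    ("spleen", "PKH67"): {"NN": 1.16, "HH": 0.92},
}

# Difference (NN minus HH) in percentage points, rounded to two decimals.
gap = {key: round(v["NN"] - v["HH"], 2) for key, v in decrease.items()}

for (tissue, label), points in gap.items():
    print(f"{tissue:6s} {label:6s} decrease is {points} points smaller under HH")
```

      In every tissue and label combination the difference is positive, which is the pattern the response relies on when it describes a weaker reduction under HH.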

      Author response image 7.

      (2) Detection of erythrophagocytosis in spleen:

      To assess erythrophagocytosis directly, we labeled RBCs with PKH67 and analyzed their uptake by splenic macrophages (F4/80hi) after HH exposure. Our findings (new Figure 5, D-E) indicated a decrease in PKH67-positive macrophages in the spleen, suggesting reduced erythrophagocytosis.

      Author response image 8.

      (3) Flow cytometry analysis of RBC retention:

      Our flow cytometry analysis revealed a decrease in PKH67-positive RBCs in both blood and spleen (Figure S4). We postulated that this was due to increased RBC production after HH exposure. However, this method might not accurately reflect RBC retention, as it measures the proportion of PKH67-labeled RBCs relative to the total number of RBCs, which increased after HH exposure.

      Author response image 9.

      (4) Histological and immunostaining analysis:

      Histological examination using HE staining and band3 immunostaining in situ (new Figure 6, A-D, and G-H) revealed a significant increase in RBC numbers in the spleen after HH exposure. This was further confirmed by detecting retained RBCs in splenic single cells using Wright-Giemsa composite stain (new Figure 6, E and F) and retained PKH67-labelled RBCs in spleen (new Figure 6, I and J).

      Author response image 10.

      (5) Interpreting the data:

      The comprehensive analysis suggests a complex interplay between increased RBC production and decreased erythrophagocytosis in the spleen following HH exposure. While flow cytometry indicated a decrease in the proportion of labeled RBCs, histological and immunostaining analyses demonstrated an actual increase in RBC retention in the spleen. These findings collectively suggest that while overall RBC production is upregulated following HH exposure, the spleen's capacity for erythrophagocytosis is concurrently diminished, leading to increased RBC retention.

      (6) Conclusion:

      Taken together, our results indicate a significant increase in RBC retention in the spleen after HH exposure, likely due to reduced residual RPMs and erythrophagocytosis. This conclusion is supported by a combination of flow cytometry, histological staining, and immunostaining techniques, providing a comprehensive view of RBC dynamics under HH conditions. We think these findings offer a clear quantitative measure of RBC retention in the spleen, addressing the concerns raised in your question.

      (4) Numerous other methodological problems as listed below.

      We appreciate your question, which highlights the importance of using multiple analytical approaches to understand complex physiological processes. Please find below our point-by-point response to the methodological comments.

      Reviewer #1 (Recommendations For The Authors):

      (1) Decreased BM and spleen monocytes due to increased liver monocyte migration is unclear. There is no evidence that this happens or why it would be a reasonable hypothesis, even in splenectomized mice.

      Thank you for highlighting the need for further clarification and justification of our hypothesized decrease in BM and spleen monocytes due to increased monocyte migration to the liver, particularly in the context of splenectomized mice. Indeed, our study has not explicitly verified an augmentation in mononuclear cell migration to the liver in splenectomized mice.

      Nonetheless, our investigations have revealed a notable increase in monocyte migration to the liver after HH exposure. Noteworthy is our discovery of a significant upregulation in colony stimulating factor-1 (CSF-1) expression in the liver, observed after both 7 and 14 days of HH exposure (data not included). This observation was substantiated through flow cytometry analysis (as depicted in Figure S4), which affirmed an enhanced migration of monocytes to the liver. Specifically, we noted a considerable increase in the population of transient macrophages, monocytes, and Kupffer cells in the liver following HH exposure.

      Author response image 11.

      Considering these findings, we hypothesize that hypoxic conditions may activate a compensatory mechanism that directs monocytes towards the liver, potentially linked to the liver’s integral role in the systemic immune response. In accordance with these insights, we intend to revise our manuscript to reflect the speculative nature of this hypothesis more accurately, and to delineate the strategies we propose for its further empirical investigation. This amendment ensures that our hypothesis is presented with full consideration of its speculative basis, supported by a coherent framework for future validation.

      (2) While F4/80+CD11b+ population is decreased, this is mainly driven by CD11b and F4/80+ alone population is significantly increased. This is counter to the hypothesis.

      Thank you for addressing the apparent discrepancy in our findings concerning the F4/80+CD11b+ population and the increase in the F4/80+ alone population, which seems to contradict our initial hypothesis. Your observation is indeed crucial for the integrity of our study, and we appreciate the opportunity to clarify this matter.

      (1) Clarification of flow cytometry results:

      • In response to the concerns raised, we revisited our flow cytometry experiments with a focus on more clearly distinguishing the cell populations. Our initial graph had some ambiguities in cell grouping, which might have led to misinterpretations.

      • The revised flow cytometry analysis, specifically aimed at identifying red pulp macrophages (RPMs) characterized as F4/80hiCD11blo in the spleen, demonstrated a significant decrease in the F4/80 population. This finding is now in alignment with our immunofluorescence results.

      Author response image 12.

      Author response image 13.

      (2) Revised data and interpretation:

      • The results presented in new Figure 3G and Figure 5 (A and B) consistently indicate a notable reduction in the RPMs population following HH exposure. This supports our revised understanding that HH exposure leads to a decrease in the specific macrophage subset (F4/80hiCD11blo) in the spleen.

      We’ve updated our manuscript to reflect these new findings and interpretations. The revised manuscript details the revised flow cytometry analysis and discusses the potential mechanisms behind the observed changes in macrophage populations.

      (3) HO-1 expression cannot be used as a surrogate to quantify number of macrophages as the expression per cell can decrease and give the same results. In addition, the localization of effect to the red pulp is not equivalent to an assertion that the conclusion applies to macrophages given the heterogeneity of this part of the organ and the spleen in general.

      Thank you for your insightful comments regarding the use of HO-1 expression as a surrogate marker for quantifying macrophage numbers, and for pointing out the complexity of attributing changes in HO-1 expression specifically to macrophages in the splenic red pulp. Your observations are indeed valid and warrant a detailed response.

      (1) Role of HO-1 in macrophage activity:

      • In our study, HO-1 expression was not utilized as a direct marker for quantifying macrophages. Instead, it was considered an indicator of macrophage activity, particularly in relation to erythrophagocytosis. HO-1, being upregulated in response to erythrophagocytosis, serves as an indirect marker of this process within splenic macrophages.

      • The rationale behind this approach was that increased HO-1 expression, induced by erythrophagocytosis in the spleen’s red pulp, could suggest an augmentation in the activity of splenic macrophages involved in this process.

      (2) Limitations of using HO-1 as an indicator:

      • We acknowledge your point that HO-1 expression per cell might decrease, potentially leading to misleading interpretations if used as a direct quantifier of macrophage numbers. The variability in HO-1 expression per cell indeed presents a limitation in using it as a sole indicator of macrophage quantity.

      • Furthermore, your observation about the heterogeneity of the spleen, particularly the red pulp, is crucial. The red pulp is a complex environment with various cell types, and asserting that changes in HO-1 expression are exclusive to macrophages could oversimplify this complexity.
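      The limitation acknowledged above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that a bulk HO-1 signal behaves like the number of expressing cells multiplied by the mean expression per cell; the function name `total_signal` and all values are hypothetical and are not measurements from the study.

```python
def total_signal(n_cells, expression_per_cell):
    """Bulk signal modeled as cell count times mean per-cell expression.

    This multiplicative model is a simplifying assumption for illustration.
    """
    return n_cells * expression_per_cell


# Scenario A: half as many cells, each expressing at the original level.
fewer_cells = total_signal(n_cells=50, expression_per_cell=2.0)

# Scenario B: the original number of cells, each expressing half as much.
lower_expression = total_signal(n_cells=100, expression_per_cell=1.0)

print(fewer_cells, lower_expression)
```

      Both scenarios give the same bulk value under this model, so a bulk HO-1 readout alone cannot distinguish a loss of cells from a drop in per-cell expression. This is why cell-specific markers are needed alongside it.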

      (3) Addressing the concerns:

      • To address these concerns, we propose to supplement our HO-1 expression data with additional specific markers for macrophages. This would help in correlating HO-1 expression more accurately with macrophage numbers and activity.

      • We also plan to conduct further studies to delineate the specific cell types in the red pulp contributing to HO-1 expression. This could involve techniques such as immunofluorescence or immunohistochemistry, which would allow us to localize HO-1 expression to specific cell populations within the splenic red pulp.

      We’ve revised our manuscript to clarify the role of HO-1 expression as an indirect marker of erythrophagocytosis and to acknowledge its limitations as a surrogate for quantifying macrophage numbers.

      (4) line 63-65 is inaccurate as red cell homeostasis reaches a new steady state in chronic hypoxia.

      Thank you for pointing out the inaccuracy in lines 63-65 of our manuscript regarding red cell homeostasis in chronic hypoxia. Your feedback is invaluable in ensuring the accuracy and scientific integrity of our work. We’ve revised lines 63-65 to reflect that red cell homeostasis reaches a new steady state in chronic hypoxia.

      (5) Eryptosis is not defined in the manuscript.

      Thank you for highlighting the omission of a definition for eryptosis in our manuscript. We acknowledge the significance of precisely defining such key terms, particularly when they play a crucial role in the context of our research findings. Eryptosis, a term referenced in our study, is a specialized form of programmed cell death unique to erythrocytes. Similar to apoptosis in other cell types, eryptosis is characterized by distinct physiological changes including cell shrinkage, membrane blebbing, and the externalization of phosphatidylserine on the erythrocyte surface. These features are indicative of the RBC life cycle and its regulated destruction process.

      However, it is pertinent to note that our current study does not delve extensively into the mechanisms or implications of eryptosis. Our primary focus has been to elucidate the effects of HH exposure on the processes of splenic erythrophagocytosis and the resultant impact on the lifespan of RBCs. Given this focus, and to maintain the coherence and relevance of our manuscript, we have decided to exclude specific discussions of eryptosis from our revised manuscript. This decision aligns with our aim to provide a clear and concentrated exploration of the influence of HH exposure on RBC dynamics and splenic function.

      We appreciate your input, which has significantly contributed to enhancing the clarity and accuracy of our manuscript. The revision ensures that our research is presented with a focused scope, aligning closely with our experimental investigations and findings.

      (6) Physiologically, there is no evidence that there is any "free iron" in cells, making line 89 point inaccurate.

      Thank you for highlighting the concern regarding the reference to "free iron" in cells in line 89 of our manuscript. The term "free iron" in our manuscript was intended to refer to divalent iron (Fe2+), rather than unbound iron ions freely circulating within cells. We acknowledge that the term "free iron" might lead to misconceptions, as it implies the presence of unchelated iron, which is not physiologically common due to the potential for oxidative damage. To rectify this and provide clarity, we’ve revised line 89 of our manuscript to reflect our meaning more accurately. Instead of "free iron," we use "divalent iron (Fe2+)" to avoid any misunderstanding regarding the state of iron in cells. We also ensure that any implications drawn from the presence of Fe2+ in cells are consistent with current scientific literature and understanding.

      (7) Fig 1f no stats

      We appreciate your critical review and suggestions, which help to improve the accuracy and clarity of our research. We’ve revised the statistical graph in the new Figure 1F.

      (8) Splenectomy experiments demonstrate that erythrophagocytosis is almost completely replaced by functional macrophages in other tissues (likely Kupffer cells in the liver). there is only a minor defect and no data on whether it is in fact the liver or other organs that provide this replacement function and makes the assertions in lines 345-349 significantly overstated.

      Thank you for your critical assessment of our interpretation of the splenectomy experiments, especially concerning the role of erythrophagocytosis by macrophages in other tissues, such as Kupffer cells in the liver. We appreciate your observation that our assertions may be overstated and acknowledge the need for more specific data to identify which organs compensate for the loss of splenic erythrophagocytosis.

      (1) Splenectomy experiment findings:

      • Our findings in Figure 2D do indicate that in the splenectomized group under NN conditions, erythrophagocytosis is substantially compensated for by functional macrophages in other tissues. This is an important observation that highlights the body's ability to adapt to the loss of splenic function.

      • However, under HH conditions, our data suggest that the spleen plays an important role in managing erythrocyte turnover, as indicated by the significant impact of splenectomy on erythrophagocytosis and subsequent erythrocyte dynamics.

      (2) Addressing the lack of specific organ identification:

      • We acknowledge that our study does not definitively identify which organs, such as the liver or others, take over the erythrophagocytosis function post-splenectomy. This is an important aspect that needs further investigation.

      • To address this, we plan to perform additional experiments that could more accurately pinpoint the specific tissues compensating for the loss of splenic erythrophagocytosis. This could involve tracking labeled erythrocytes or using specific markers to identify macrophages actively engaged in erythrophagocytosis in various organs.

      (3) Revising manuscript statements:

      Considering your feedback, we’ve revised the statements in lines 345-349 (lines 378-383 in revised manuscript) to enhance the scientific rigor and clarity of our research presentation.

      (9) M1 vs M2 macrophage experiments are irrelevant to the main thrust of the manuscript, there are no references to support the use of only CD16 and CD86 for these purposes, and no stats are provided. It is also unclear why bone marrow monocyte data is presented and how it is relevant to the rest of the manuscript.

      Thank you for your critical evaluation of the relevance and presentation of the M1 vs. M2 macrophage experiments in our manuscript. We appreciate your insights, especially regarding the use of specific markers and the lack of statistical analysis, as well as the relevance of bone marrow monocyte data to our study's main focus.

      (1) Removal of M1 and M2 macrophage data:

      Based on your feedback and our reassessment, we agree that the results pertaining to M1 and M2 macrophages did not align well with the main objectives of our manuscript. Consequently, we have decided to remove the related content on M1 and M2 macrophages from the revised manuscript. This decision was made to ensure that our manuscript remains focused and coherent, highlighting our primary findings without the distraction of unrelated or insufficiently supported data.

      The use of only CD16 and CD86 markers for M1 and M2 macrophage characterization, without appropriate statistical analysis, was indeed a methodological limitation. We recognize that a more comprehensive set of markers and rigorous statistical analysis would be necessary for a meaningful interpretation of M1/M2 macrophage polarization. Furthermore, the relevance of these experiments to the central theme of our manuscript was not adequately established. Our study primarily focuses on erythrophagocytosis and red pulp macrophage dynamics under hypobaric hypoxia, and the M1/M2 polarization aspect did not contribute significantly to this narrative.

      (2) Clarification on bone marrow monocyte data:

      Regarding the inclusion of bone marrow monocyte data, we acknowledge that its relevance to the main thrust of the manuscript was not clearly articulated. In the revised manuscript, we provide a clearer rationale for its inclusion and how it relates to our primary objectives.

      (3) Commitment to clarity and relevance:

      We are committed to ensuring that every component of our manuscript contributes meaningfully to our overall objectives and research questions. Your feedback has been instrumental in guiding us to streamline our focus and present our findings more effectively.

      We appreciate your valuable feedback, which has led to a more focused and relevant presentation of our research. These changes enhance the clarity and impact of our manuscript, ensuring that it accurately reflects our key research findings.

      (10) Biotinylated RBC clearance is enhanced, demonstrating that RBC erythrophagocytosis is in fact ENHANCED, not diminished, calling into question the founding hypothesis that the manuscript proposes.

      Thank you for your critical evaluation of our data on biotinylated RBC clearance, which suggests enhanced erythrophagocytosis under HH conditions. This observation indeed challenges our founding hypothesis that erythrophagocytosis is diminished in this setting. Below is a summary of our findings and methodology.

      (1) Interpretation of RBC labeling results:

      Both the previous results with NHS-biotin-labeled RBCs (new Figure 4, A-D) and the current results with PKH67-labeled RBCs (new Figure 4, E-H) demonstrated a decrease in the number of labeled RBCs with increasing time after injection. RBC production, including that in bone marrow and spleen, was significantly increased following HH exposure, resulting in a consistent decrease in the proportion of labeled RBCs detected by flow cytometry in both the blood and spleen of mice compared to the NN group. However, the reduction in fluorescently labeled RBCs in blood and spleen was significantly weaker under HH exposure than under NN exposure. Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure.

      Author response image 14.

      (2) Increased RBCs production under HH conditions:

      It's important to note that RBC production, including that in bone marrow and spleen, was significantly increased following HH exposure. This increase in RBC production could contribute to the decreased proportion of labeled RBCs observed in flow cytometry analyses, as more unlabeled RBCs dilute the proportion of labeled cells in the blood and spleen.

      (3) Analysis of erythrophagocytosis in RPMs:

      Our analysis of PKH67-labeled RBC content within RPMs following HH exposure showed a significant reduction in the number of PKH67-positive RPMs in the spleen (new Figure 5). This finding suggests a decrease in erythrophagocytosis by RPMs under HH conditions.

      Author response image 15.

      (4) Reconciling the findings:

      The apparent contradiction between enhanced RBC clearance (suggested by the reduced proportion of labeled RBCs) and reduced erythrophagocytosis in RPMs (indicated by fewer PKH67-positive RPMs) may be explained by the increased overall production of RBCs under HH. This increased production could mask the actual erythrophagocytosis activity in terms of the proportion of labeled cells. Therefore, while the proportion of labeled RBCs decreases more significantly under HH conditions, this does not necessarily indicate an enhanced erythrophagocytosis rate, but rather an increased dilution effect due to higher RBC turnover.
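      The dilution argument can be illustrated with a toy calculation. This is a minimal sketch, assuming a simple two-pool model (labeled and unlabeled RBCs) in which clearance removes the same fraction of both pools and newly produced cells are all unlabeled; the function names and the cell counts are hypothetical and are not taken from the experiments.

```python
def labeled_fraction(labeled, unlabeled):
    """Proportion of labeled RBCs among all RBCs, as reported by flow cytometry."""
    return labeled / (labeled + unlabeled)


def after_interval(labeled, unlabeled, clearance, new_cells):
    """Remove the same fraction from both pools, then add new unlabeled cells.

    Uniform clearance and unlabeled-only production are simplifying assumptions.
    """
    labeled_left = labeled * (1 - clearance)
    unlabeled_left = unlabeled * (1 - clearance) + new_cells
    return labeled_left, unlabeled_left


start = labeled_fraction(20, 80)

# Production without any clearance: the labeled fraction still falls.
diluted = labeled_fraction(*after_interval(20, 80, clearance=0.0, new_cells=25))

# Uniform clearance without production: the labeled fraction is unchanged.
cleared_only = labeled_fraction(*after_interval(20, 80, clearance=0.5, new_cells=0))

print(start, diluted, cleared_only)
```

      Under these assumptions, a falling labeled proportion does not by itself indicate clearance, which is why the proportion of labeled cells is interpreted alongside direct measures of uptake by RPMs.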

      (5) Revised interpretation and manuscript changes:

      Given these factors, we have updated our manuscript to reflect this detailed interpretation and to clarify the implications of increased RBC production under HH conditions for our observations of labeled RBC clearance and erythrophagocytosis. We appreciate your insightful feedback, which has prompted a careful re-examination of our data and interpretations. We hope that these revisions provide a more accurate and comprehensive understanding of the effects of HH on erythrophagocytosis and RBC dynamics.

      (11) Legend in Fig 4c-4d looks incorrect and Fig 4e-4f is very non-specific since Wright stain does not provide evidence of what type of cells these are and making for a significant overstatement in the contribution of this data to "confirming" increased erythrophagocytosis in the spleen under HH exposure (line 395-396).

      Thank you for your insightful observations regarding the data presentation and figure legends in our manuscript, particularly in relation to Figure 4 (renamed as Figure 6 in the revised manuscript) and the use of Wright-Giemsa composite staining. We appreciate your constructive feedback and acknowledge the importance of presenting our data with utmost clarity and precision.

      (1) Amendments to Figure legends:

      We recognize the necessity of rectifying inaccuracies in the legends of the previously labeled Figure 4C and D. Corrections have been made to ensure that the legends accurately describe the data presented. Additionally, we acknowledge the error concerning the description of Wright staining. The method employed in our study is Wright-Giemsa composite staining, which, unlike Wright staining that solely stains cytoplasm (RBC), is capable of staining both nuclei and cytoplasm.

      (2) Addressing the specificity of Wright-Giemsa Composite staining:

      Our approach involved quantifying RBC retention using Wright-Giemsa composite staining on single splenic cells post-perfusion at 7 and 14 days post HH exposure. We understand and appreciate your concerns regarding the nonspecific nature of Wright staining. Although Wright stain is a general hematologic stain and not explicitly specific for certain cell types, its application in our study aimed to provide preliminary insights. The spleen cells, devoid of nuclei and thus likely to be RBCs, were stained and observed post-perfusion, indicating RBC retention within the spleen.

      (3) Incorporating additional methods for RBC identification:

      To enhance the specificity of our findings, we integrated supplementary methods for RBC identification in the revised manuscript. We employed band3 immunostaining (in the new Figure 6, C-D and G-H) and PKH67 labeling (Figure 6, I-J) for a more targeted identification of RBCs. Band3, serving as a reliable marker for RBCs, augments the specificity of our immunostaining approach. Likewise, PKH67 labeling affords a direct and definitive means to assess RBC retention in the spleen following HH exposure.

      Author response image 16. same as 10

      (4) Revised interpretation and manuscript modifications:

      Based on these enhanced methodologies, we have refined our interpretation of the data and updated the manuscript accordingly. The revised narrative underscores that our conclusions regarding reduced erythrophagocytosis and RBC retention under HH conditions are corroborated not only by Wright-Giemsa composite staining but also by band3 immunostaining and PKH67 labeling, each contributing distinctively to our comprehensive understanding.

      We are committed to ensuring that our manuscript precisely reflects the contribution of each method to our findings and conclusions. Your thorough review has been invaluable in identifying and rectifying areas for improvement in our research report and interpretation.

      (12) Ferroptosis data in Fig 5 is not specific to macrophages and Fer-1 data confirms the expected effect of Fer-1 but there is no data that supports that Fer-1 reverses the destruction of these cells or restores their function in hypoxia. Finally, these experiments were performed in peritoneal macrophages which are functionally distinct from splenic RPM.

      Thank you for your critique of our presentation and interpretation of the ferroptosis data in Figure 5 (renamed as Figure 9 in the revised manuscript), as well as your observations regarding the specificity of the experiments to macrophages and the effects of Fer-1. We value your input and acknowledge the need to clarify these aspects in our manuscript.

      (1) Clarification on cell type used in experiments:

      • We appreciate your attention to the details of our experimental setup. The experiments presented in Figure 9 were indeed conducted on splenic macrophages, not peritoneal macrophages, as incorrectly mentioned in the original figure legend. This was an error in our manuscript, and we have revised the figure legend accordingly to accurately reflect the cell type used.

      (2) Specificity of ferroptosis data:

      • We recognize that the data presented in Figure 9 need to be more explicitly linked to the specific macrophage population being studied. In the revised manuscript, we ensure that the discussion around ferroptosis data is clearly situated within the framework of splenic macrophages.

      • We also provide additional methodological details in the 'Methods' section to reinforce the specificity of our experiments to splenic macrophages.

      (3) Effects of Fer-1 on macrophage function and survival:

      • Regarding the effect of Fer-1, we agree that while our data confirms the expected effect of Fer-1 in inhibiting ferroptosis, we have not provided direct evidence that Fer-1 reverses the destruction of macrophages or restores their function in hypoxia.

      • To address this, we propose additional experiments to specifically investigate the impact of Fer-1 on the survival and functional restoration of splenic macrophages under hypoxic conditions. This would involve assessing not only the inhibition of ferroptosis but also the recovery of macrophage functionality post-treatment.

      (4) Revised interpretation and manuscript changes:

      • We’ve revised the relevant sections of our manuscript to reflect these clarifications and proposed additional studies. This includes modifying the discussion of the ferroptosis data to more accurately represent the cell types involved and the limitations of our current findings regarding the effects of Fer-1.

      • The revised manuscript presents a more detailed interpretation of the ferroptosis data, clearly describing what our current experiments demonstrate and what remains to be investigated.

      We are grateful for your insightful feedback, which has highlighted important areas for improvement in our research presentation. We think that these revisions will enhance the clarity and scientific accuracy of our manuscript, ensuring that our findings and conclusions are well-supported and precisely communicated.

      Reviewer #2 (Recommendations For The Authors):

      The following questions and remarks should be considered by the authors:

      (1) The methods should clearly state whether the HH was discontinued during the 7 or 14 day exposure for cleaning, fresh water etc. Moreover, how was CO2 controlled? The procedure for splenectomy needs to be described in the methods.

      Thank you for your inquiry regarding the specifics of our experimental methods, particularly the management of HH exposure and the procedure for splenectomy. We appreciate your attention to detail and the importance of these aspects for the reproducibility and clarity of our research.

      (1) HH exposure conditions:

      In our experiments, mice were continuously exposed to HH for the entire duration of 7 or 14 days, without interruption for activities such as cleaning or providing fresh water. This uninterrupted exposure was crucial for maintaining consistent hypobaric conditions throughout the experiment. The hypobaric chamber was configured to ensure a ventilation rate of 25 air exchanges per minute. This high ventilation rate was effective in regulating the concentration of CO2 inside the chamber, thereby maintaining a stable environment for the mice.

      (2) The splenectomy was performed as follows:

      After anesthesia, the mice were placed in a supine position, and their limbs were fixed. The abdominal operation area was shaved, disinfected, and covered with a sterile towel. A median incision was made in the upper abdomen, followed by laparotomy to locate the spleen. The spleen was then carefully pulled out through the incision. The arterial and venous directions in the splenic pedicle were examined, and two vascular forceps were used to clamp all the tissue in the main trunk of the blood vessels below the splenic portal. The splenic pedicle was cut between the forceps to remove the spleen. The end of the proximal hepatic artery was clamped with a vascular clamp, and double or through ligation was performed to secure the site. The abdominal cavity was then cleaned to ensure there was no bleeding at the ligation site, and the incision was closed. Post-operatively, the animals were housed individually. Generally, they were able to feed themselves after recovering from anesthesia and did not require special care.

      We hope this detailed description addresses your queries and provides a clear understanding of the experimental conditions and procedures used in our study. These methodological details are crucial for ensuring the accuracy and reproducibility of our research findings.

      (2) The lack of changes in MCH needs explanation? During stress erythropoiesis some limit in iron availability should cause MCH decrease particularly if the authors claim that macrophages for rapid iron recycling are decreased. Fig 1A is dispensable. Fig 1G NN control 14 days does not make sense since it is higher than 7 days of HH.

      Thank you for your inquiry regarding the lack of changes in Mean Corpuscular Hemoglobin (MCH) in our study, particularly in the context of stress erythropoiesis and decreased macrophage-mediated iron recycling. We appreciate the opportunity to provide further clarification on this aspect.

      (1) Explanation for stable MCH levels:

      • Our research identified a decrease in erythrophagocytosis and iron recycling in the spleen following HH exposure. Despite this, the MCH levels remained stable. This observation can be explained by considering the compensatory roles of other organs, particularly the liver and duodenum, in maintaining iron homeostasis.

      • Specifically, our investigations revealed an enhanced capacity of the liver to engulf RBCs and process iron under HH conditions. This increased hepatic erythrophagocytosis likely compensates for the reduced splenic activity, thereby stabilizing MCH levels.

      (2) Role of hepcidin and DMT1 expression:

      Additionally, hypoxia is known to influence iron metabolism through the downregulation of Hepcidin and upregulation of Divalent Metal Transporter 1 (DMT1) expression. These alterations lead to enhanced intestinal iron absorption and increased blood iron levels, further contributing to the maintenance of MCH levels despite reduced splenic iron recycling.

      (3) Revised Figure 1 and data presentation

      To address the confusion regarding the data presented in Figure 1G, we have made revisions in our manuscript. The original Figure 1G, which did not align with the expected trends, has been removed. In its place, we have included a statistical chart of Figure 1F in the new version of Figure 1G. This revision will provide a clearer and more accurate representation of our findings.

      (4) Manuscript updates and future research:

      • We have updated our manuscript to incorporate these explanations, ensuring that the rationale behind the stable MCH levels is clearly articulated. This includes a discussion of the role of the liver and duodenum in iron metabolism under hypoxic conditions.

      • Future research could explore in greater detail the mechanisms by which different organs contribute to iron homeostasis under stress conditions like HH, particularly focusing on the dynamic interplay between hepatic and splenic functions.

      We thank you for your insightful question, which has prompted a thorough re-examination of our findings and interpretations. We believe that these clarifications will enhance the overall understanding of our study and its implications in the context of iron metabolism and erythropoiesis under hypoxic conditions.

      (3) Fig 2 the difference between sham and splenectomy is really marginal and not convincing. Is there also a difference at 7 days? Why does the spleen size decrease between 7 and 14 days?

      Thank you for your observations regarding the marginal differences observed between sham and splenectomy groups in Figure 2, as well as your inquiries about spleen size dynamics over time. We appreciate this opportunity to clarify these aspects of our study.

      (1) Splenectomy vs. Sham group differences:

      • In our experiments, the difference between the sham and splenectomy groups under HH conditions, though subtle, was consistent with our hypothesis regarding the spleen's role in erythrophagocytosis and stress erythropoiesis. Under NN conditions, no significant difference was observed between these groups, which aligns with the expectation that the spleen's contribution is more pronounced under hypoxic stress.

      (2) Spleen size dynamics and peak stress erythropoiesis:

      • The observed splenic enlargement prior to 7 days can be attributed to a combination of factors, including the retention of RBCs and extramedullary hematopoiesis, which is known to be a response to hypoxic stress.

      • Prior research has shown that splenic stress erythropoiesis triggered by hypoxia typically peaks within 3 to 7 days. This aligns with our Toluidine Blue (TO) staining results, which indicated that the response peaks at day 7 (Figure 1, F-G). The peak is typically followed by a decline in extramedullary hematopoiesis, which could explain the reduction in spleen size between 7 and 14 days.

      • This pattern of splenic response under prolonged hypoxic stress is corroborated by studies such as those conducted by Wang et al. (2021), Harada et al. (2015), and Cenariu et al. (2021). These references collectively underscore that the spleen undergoes significant dynamism in reaction to sustained hypoxia. This dynamism is initially manifested as an enlargement of the spleen, attributable to escalated erythropoiesis and erythrophagocytosis. Subsequently, as these processes approach normalization, a regression in spleen size ensues.

      We’ve revised our manuscript to include a more detailed explanation of these splenic dynamics under HH conditions, referencing the relevant literature to provide a comprehensive context for our findings. We will also consider performing additional analysis or providing further data on spleen size changes at 7 days to support our observations and ensure a thorough understanding of the splenic response to hypoxic stress over time.

      (4) Fig 3 B the clusters should be explained in detail. If the decrease in macrophages in Fig 3K/L is responsible for the effect, why does splenectomy not have a much stronger effect? How do the authors know which cells died in the calcein stained population in Fig 3D?

      Thank you for your insightful questions regarding the details of our data presentation in Figure 3, particularly about the identification of cell clusters and the implications of macrophage reduction. We appreciate the opportunity to address these aspects and clarify our findings.

      (1) Explanation of cell clusters in Figure 3B:

      • In the revised manuscript, we have included detailed notes for each cell population represented in Figure 3B (Figure 3D in revised manuscript). These notes provide a clearer understanding of the cell types present in each cluster, enhancing the interpretability of our single-cell sequencing data.

      • This detailed annotation will help readers to better understand the composition of the splenic cell populations under study and how they are affected by hypoxic conditions.

      (2) Impact of splenectomy vs. macrophage reduction:

      • The interplay between the reduction in macrophage populations, as evidenced by our single-cell sequencing data, and the ramifications of splenectomy presents a multifaceted scenario. Notably, the observed decline in macrophage numbers following HH exposure does not straightforwardly equate to a comparable alteration in overall splenic function, as might be anticipated with splenectomy.

      • In splenectomized mice under HH conditions, the RBC count rose significantly, surpassing that in non-splenectomized mice exposed to HH. This finding underscores the spleen's critical role in modulating RBC dynamics under HH. It also indirectly suggests that the diminished phagocytic capacity of the spleen following HH exposure contributes to an increased RBC count, albeit to a lesser extent than in the splenectomy group. We attribute this difference to the fact that, although the number of RPMs in the spleen is reduced after HH, they are still present, whereas after splenectomy they are entirely absent.

      • Splenectomy entails the complete removal of the spleen, thus eliminating a broad spectrum of functions beyond erythrophagocytosis and iron recycling mediated by macrophages. The nuanced changes observed in our study may be reflective of the spleen's diverse functionalities and the organism's adaptive compensatory mechanisms in response to the loss of this organ.

      (3) Calcein stained population in Figure 3D:

      • Regarding the identification of cell death in the calcein-stained population in Figure 3D (Figure 3A in revised manuscript), we acknowledge that the specific cell types undergoing death could not be distinctly determined from this analysis alone.

      • The calcein staining method allows for the visualization of live (calcein-positive) and dead (calcein-negative) cells, but it does not provide specific information about the cell types. The decrease in macrophage population was inferred from the single-cell sequencing data, which offered a more precise identification of cell types.

      (4) Revised manuscript and data presentation:

      • Considering your feedback, we have revised our manuscript to provide a more comprehensive explanation of the data presented in Figure 3, including the nature of the cell clusters and the interpretation of the calcein staining results.

      • We have also updated the manuscript to reflect the removal of Figure 3K/L results and to provide a more focused discussion on the relevant findings.

      We are grateful for your detailed review, which has helped us to refine our data presentation and interpretation. These clarifications and revisions will enhance the clarity and scientific rigor of our manuscript, ensuring that our conclusions are well-supported and accurately conveyed.

      (5) Is the reduced phagocytic capacity in Fig 4B significant? Erythrophagocytosis is compromised due to the considerable spontaneous loss of labelled erythrocytes; could other assays help? (potentially by a modified Chromium release assay?). Is it necessary to stimulate phagocytosis to see a significant effect?

      Thank you for your inquiry regarding the significance of the reduced phagocytic capacity observed in Figure 4B, and the potential for employing alternative assays to elucidate erythrophagocytosis dynamics under HH conditions.

      (1) Significance of reduced phagocytic capacity:

      The observed reduction in the amplitude of fluorescently labeled RBCs in both the blood and spleen under HH conditions suggests a decrease in erythrophagocytosis. This is indicative of a diminished phagocytic capacity, particularly when contrasted with NN conditions.

      (2) Investigation of erythrophagocytosis dynamics:

      To examine erythrophagocytosis under HH more closely, we used Tuftsin to enhance this process. Following the injection of PKH67-labeled RBCs and subsequent HH exposure, we noted a significant decrease in PKH67 fluorescence in the spleen, particularly marked after the administration of Tuftsin. This finding implies that stimulated erythrophagocytosis can influence RBC lifespan.

      (3) Erythrophagocytosis under normal and hypoxic conditions:

      Under normal conditions, the reduction in phagocytic activity is less apparent without stimulation. However, under HH conditions, our findings demonstrate a clear weakening of the phagocytic effect. While we established that promoting phagocytosis under NN conditions affects RBC lifespan, the impact of enhanced phagocytosis under HH on RBCs numbers was not explicitly investigated.

      (4) Potential for alternative assays:

      Considering the considerable spontaneous loss of labeled erythrocytes, alternative assays such as a modified Chromium release assay could provide further insights. Such assays might offer a more nuanced understanding of erythrophagocytosis efficiency and the stability of labeled RBCs under different conditions.

      (5) Future research directions:

      The implications of these results suggest that future studies should focus on comparing the effects of stimulated phagocytosis under both NN and HH conditions. This would offer a clearer picture of the impact of hypoxia on the phagocytic capacity of macrophages and the subsequent effects on RBC turnover.

      In summary, our findings indicate a diminished erythrophagocytic capacity, with enhanced phagocytosis affecting RBCs lifespan. Further investigation, potentially using alternative assays, would be beneficial to comprehensively understand the dynamics of erythrophagocytosis in different physiological states.

      (6) Can the observed ferroptosis be influenced by bi- and not trivalent iron chelators?

      Thank you for your question regarding the potential influence of bi- and trivalent iron chelators on ferroptosis under hypoxic conditions. We appreciate the opportunity to discuss the implications of our findings in this context.

      (1) Analysis of iron chelators on ferroptosis:

      In our study, we did not specifically analyze the effects of bi- and trivalent iron chelators on ferroptosis under hypoxia. However, our observations with Deferoxamine (DFO), a well-known iron chelator, provide some insights into how iron chelation may influence ferroptosis in splenic macrophages under hypoxic conditions.

      (2) Effect of DFO on oxidative stress markers:

      Our findings showed that under 1% O2, there was an increase in Malondialdehyde (MDA) content, a marker of lipid peroxidation, and a decrease in Glutathione (GSH) content, indicative of oxidative stress. These changes are consistent with the induction of ferroptosis, which is characterized by increased lipid peroxidation and depletion of antioxidants. Treatment with Ferrostatin-1 (Fer-1) and DFO effectively reversed these alterations. This suggests that DFO, like Fer-1, can mitigate ferroptosis in splenic macrophages under hypoxia, primarily by impacting MDA and GSH levels.

      Author response image 17.

      (3) Potential role of iron chelators in ferroptosis:

      The effectiveness of DFO in reducing markers of ferroptosis indicates that iron availability plays a crucial role in the ferroptotic process under hypoxic conditions. It is plausible that both bi- and trivalent iron chelators could influence ferroptosis, given their ability to modulate iron availability within cells. Since ferroptosis is an iron-dependent form of cell death, chelating iron, irrespective of its valence state, could potentially disrupt the process by limiting the iron necessary for the generation of reactive oxygen species and lipid peroxidation.

      (4) Additional research and manuscript updates:

      Our study highlights the need for further research to explore the differential effects of various iron chelators on ferroptosis, particularly under hypoxic conditions. Such studies could provide a more comprehensive understanding of the role of iron in ferroptosis and the potential therapeutic applications of iron chelators. We have updated our manuscript to include these findings and to discuss the potential implications of iron chelation in the context of ferroptosis under hypoxic conditions. This provides a broader perspective on our research and its significance for understanding the mechanisms of ferroptosis.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their manuscript entitled 'The domesticated transposon protein L1TD1 associates with its ancestor L1 ORF1p to promote LINE-1 retrotransposition', Kavaklıoğlu and colleagues delve into the role of L1TD1, an RNA binding protein (RBP) derived from a LINE1 transposon. L1TD1 proves crucial for maintaining pluripotency in embryonic stem cells and is linked to cancer progression in germ cell tumors, yet its precise molecular function remains elusive. Here, the authors uncover an intriguing interaction between L1TD1 and its ancestral LINE-1 retrotransposon.

      The authors delete the DNA methyltransferase DNMT1 in a haploid human cell line (HAP1), inducing widespread DNA hypo-methylation. This hypomethylation prompts abnormal expression of L1TD1. To scrutinize L1TD1's function in a DNMT1 knock-out setting, the authors create DNMT1/L1TD1 double knock-out cell lines (DKO). Curiously, while the loss of global DNA methylation doesn't impede proliferation, additional depletion of L1TD1 leads to DNA damage and apoptosis.

      To unravel the molecular mechanism underpinning L1TD1's protective role in the absence of DNA methylation, the authors dissect L1TD1 complexes in terms of protein and RNA composition. They unveil an association with the LINE-1 transposon protein L1-ORF1 and LINE-1 transcripts, among others.

      Surprisingly, the authors note fewer LINE-1 retro-transposition events in DKO cells compared to DNMT1 KO alone.

      Strengths:

      The authors present compelling data suggesting the interplay of a transposon-derived human RNA binding protein with its ancestral transposable element. Their findings spur interesting questions for cancer types, where LINE1 and L1TD1 are aberrantly expressed.

      Weaknesses:

      Suggestions for refinement:

      The initial experiment, inducing global hypo-methylation by eliminating DNMT1 in HAP1 cells, is intriguing and warrants more detailed description. How many genes experience misregulation or aberrant expression? What phenotypic changes occur in these cells? Why did the authors focus on L1TD1? Providing some of this data would be helpful to understand the rationale behind the thorough analysis of L1TD1.

      The finding that L1TD1/DNMT1 DKO cells exhibit increased apoptosis and DNA damage but decreased L1 retro-transposition is unexpected. Considering the DNA damage associated with retro-transposition and the DNA damage and apoptosis observed in L1TD1/DNMT1 DKO cells, one would anticipate the opposite outcome. Could it be that the observation of fewer transposition-positive colonies stems from the demise of the most transposition-positive colonies? Further exploration of this phenomenon would be intriguing.

      Reviewer #2 (Public review):

      In this study, Kavaklıoğlu et al. investigated and presented evidence for a role for domesticated transposon protein L1TD1 in enabling its ancestral relative, L1 ORF1p, to retrotranspose in HAP1 human tumor cells. The authors provided insight into the molecular function of L1TD1 and shed some clarifying light on previous studies that showed somewhat contradictory outcomes surrounding L1TD1 expression. Here, L1TD1 expression was correlated with L1 activation in a hypomethylation-dependent manner, due to DNMT1 deletion in the HAP1 cell line. The authors then identified L1TD1-associated RNAs using RIPSeq, which display a disconnect between transcript and protein abundance (via Tandem Mass Tag multiplex mass spectrometry analysis). This disconnect, for which the one exception was L1TD1 itself, is consistent with a model in which the RNA transcripts associated with L1TD1 are not directly regulated at the translation level. Instead, the authors found L1TD1 protein associated with L1-RNPs, and this interaction is associated with increased L1 retrotransposition, at least in the context of HAP1 cells. Overall, these results support a model in which L1TD1 is restrained by DNA methylation, but in the absence of this repressive mark, L1TD1 is expressed and collaborates with L1 ORF1p (either directly or through interaction with L1 RNA, which remains unclear based on current results), leading to enhanced L1 retrotransposition. These results establish feasibility of this relationship existing in vivo in either development or disease, or both.

      Comments on revised version:

      In general, the authors did an acceptable job addressing the major concerns throughout the manuscript. This revision is much clearer and has improved in terms of logical progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have addressed all my questions in the revised version of the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Revised comments:

      A few points we'd like to see addressed are our comments about the model (Figure S7C), as this is important for the readership to understand this complex finding. Please try to apply some quantification, if possible (question 8). Please do your best to tone down the direct relationship of these findings to embryology (question 11). Based on both reviewers' comments, we believe addressing Reviewer #1's "Suggestions for refinement" (2 points) would help us change our view from solid to convincing.

      Responses to changes:

      Major

      (1) The study only used one knockout (KO) cell line generated by CRISPR/Cas9.

      Considering the possibility of an off-target effect, I suggest the authors attempt one or both of these suggestions.

      A)  Generate or acquire a similar DNMT1 deletion that uses distinct sgRNAs, so that the likelihood of off-targets is negligible. A few simple experiments such as qRT-PCR would be sufficient to suggest the same phenotype.

      B)  Confirm the DNMT1 depletion also by siRNA/ASO KD to phenocopy the KO effect.

      (2) In addition to the strategies to demonstrate reproducibility, a rescue experiment restoring DNMT1 to the KO or KD cells would be more convincing. (Partial rescue would suffice in this case, as exact endogenous expression levels may be hard to replicate).

      We have undertaken several approaches to study the effect of DNMT1 loss or inactivation: As described above, we have generated a conditional KO mouse with ablation of DNMT1 in the epidermis. DNMT1-deficient keratinocytes isolated from these mice show a significant increase in L1TD1 expression. In addition, treatment of primary human keratinocytes and two squamous cell carcinoma cell lines with the DNMT inhibitor aza-deoxycytidine led to upregulation of L1TD1 expression. Thus, the derepression of L1TD1 upon loss of DNMT1 expression or activity is not a clonal effect.

      Also, the spectrum of RNAs identified in RIP experiments as L1TD1-associated transcripts in HAP1 DNMT1 KO cells showed a strong overlap with the RNAs isolated by a related yet different method in human embryonic stem cells. When it comes to the effect of L1TD1 on LINE-1 retrotransposition, a recent study has reported a similar effect of L1TD1 upon overexpression in HeLa cells [4].

      All of these points together convince us that our findings with HAP1 DNMT1 KO cells are in agreement with results obtained in various other cell systems and are therefore not due to off-target effects. With that in mind, we would pursue the suggestion of Reviewer 1 to analyze the effects of DNA hypomethylation upon DNMT1 ablation.

      Thank you for addressing this concern. The reference to Beck 2021 and the additional cell lines (R2: keratinocytes and R3: squamous cell carcinoma) provide sufficient evidence that this result is unlikely to be a result of clonal expansion or off-target effects.

      Question: Was the human ES Cell RIP Experiment shown here? What is the overlap?

      We refer to the recently published study by Jin et al. (PMID: 38165001). As stated in the Discussion, the majority of L1TD1-associated transcripts in HAP1 cells (69%) identified in our study were also reported as L1TD1 targets in hESCs suggesting a conserved binding affinity of this domesticated transposon protein across different cell types.  

      (3) As stated in the introduction, L1TD1 and ORF1p share "sequence resemblance" (Martin 2006). Is the L1TD1 antibody specific or do we see L1 ORF1p if Fig 1C were uncropped?

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      This is a relevant question. We are convinced that the L1TD1 antibody does not cross-react with L1 ORF1p for the following reasons: Firstly, the antibody does not recognize L1 ORF1p (40 kDa) in the uncropped Western blot for Figure 1C (Figure R4A). Secondly, the L1TD1 antibody gives only background signals in DKO cells in the indirect immunofluorescence experiment shown in Figure 1E of the manuscript.

      Thirdly, the immunogen sequence of L1TD1 that determines the specificity of the antibody was checked in the antibody data sheet from Sigma Aldrich. The corresponding epitope is not present in the L1 ORF1p sequence.

      Finally, we have shown that the ORF1p antibody does not cross-react with L1TD1 (Figure R4B).

      Response: Thank you for sharing these images. These full images relieve concerns about specificity. The increase of ORF1P in R4B and Main figure 3C is interesting and pointed out in the manuscript. Not for the purposes of this review, but the observation of reduced transposition despite increased ORF1P could be an interesting follow up to this study (combined with the similar UPF1 result could indicate a complex of some kind).

      (4) In abstract (P2), the authors mentioned that L1TD1 works as an RNA chaperone, but in the result section (P13), they showed that L1TD1 associates with L1 ORF1p in an RNA independent manner. Those conclusions appear contradictory. Clarification or revision is required.

      Our findings that both proteins bind L1 RNA, and that L1TD1 interacts with ORF1p are compatible with a scenario where L1TD1/ORF1p heteromultimers bind to L1 RNA. The additional presence of L1TD1 might thereby enhance the RNA chaperone function of ORF1p. This model is visualized now in Suppl. Figure S7C.

      Response: Thank you for the model. To further clarify, do you mean that L1TD1 can bind L1 RNA, but this is not needed for the effect, however this "bonus" binding (that is enabled by heteromultimerization) appears to enhance the retrotransposition frequency? Do you think L1TD1 is binding L1 RNA in this context or simply "stabilizing" ORF1P (Trimer) RNP?

      Based on our data, L1TD1 associates with L1 RNA and interacts with L1 ORF1p. Both features might contribute to the enhanced retrotransposition frequency. Interestingly, the L1TD1 protein shares with its ancestor L1 ORF1p the non-canonical RNA recognition motif and the coiled-coil motif required for the trimerization but has two copies instead of one of the C-terminal domain (CTD), a structure with RNA binding and chaperone function. We speculate that the presence of an additional CTD within the L1TD1 protein might thereby enhance the RNA binding and chaperone function of L1TD1/ORF1p heteromultimers.

      (5) Figure 2C fold enrichment for L1TD1 and ARMC1 is a bit difficult to fully appreciate. A 100 to 200-fold enrichment does not seem physiological. This appears to be a "divide by zero" type of result, as the CT for these genes was likely near 40 or undetectable. Another qRT-PCR based approach (absolute quantification) would be a more revealing experiment.

      This is the validation of the RIP experiments, and the presentation mode was specifically developed for quantification of RIP assays (Sigma Aldrich RIP-qRT-PCR: Data Analysis Calculation Shell). The unspecific binding of the transcript in the absence of L1TD1 in DNMT1/L1TD1 DKO cells is set to 1, and the value in KO cells represents the specific binding relative to the unspecific binding. The calculation also corrects for potential differences in the abundance of the respective transcript in the two cell lines. This is not a physiological value but the quantification of specific binding of transcripts to L1TD1. GAPDH as a negative control shows no enrichment, whereas specifically associated transcripts show strong enrichment. We have explained the details of the RIP-qRT-PCR evaluation in Materials and Methods (page 14) and the legend of Figure 2C in the revised manuscript.
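
      The calculation described above can be sketched as follows. This is a minimal illustration that assumes the widely used ΔΔCt scheme for RIP-qRT-PCR: each IP is first normalized to its own input, and the KO value is then expressed relative to the DKO negative control. The function names, the 10% input fraction, and any Ct values passed in are hypothetical and are not taken from the manuscript.

```python
import math


def normalized_delta_ct(ct_ip, ct_input, input_fraction):
    """Return the Ct of an IP sample relative to its own input.

    The input Ct is first adjusted for the fraction of lysate that was
    saved as input (for example 0.1 for a 10% input). This fraction is
    an assumed parameter, not a value from the manuscript.
    """
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return ct_ip - adjusted_input_ct


def fold_enrichment(ko, dko, input_fraction=0.1):
    """Specific binding in KO cells relative to unspecific binding in DKO cells.

    Both arguments are dicts with the keys "ip" and "input" (Ct values).
    Normalizing each IP to its own input corrects for differences in
    transcript abundance between the two cell lines; the DKO value is
    implicitly set to 1.
    """
    delta_ko = normalized_delta_ct(ko["ip"], ko["input"], input_fraction)
    delta_dko = normalized_delta_ct(dko["ip"], dko["input"], input_fraction)
    return 2.0 ** -(delta_ko - delta_dko)
```

      In this sketch, a transcript with identical Ct values in the KO and DKO samples yields a fold enrichment of 1, as expected for a negative control, while a transcript whose IP Ct is seven cycles lower in KO than in DKO (with equal inputs) yields a value of about 128. Large values therefore reflect a low background in the DKO sample and are not a physiological expression ratio.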

      Response: Thank you for the clarification and additional information in the manuscript.

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      See response to (3).

      Response: Thanks.

      (7) Figure S4A and S4B: There appear to be a few unusual aspects of these figures that should be pointed out and addressed. First, there doesn't seem to be any ORF1p in the Input (if there is, the exposure is too low). Second, there might be some L1TD1 in the DKO (lane 2) and lane 3. This could be non-specific, but the size is concerning. Overexposure would help see this.

      The ORF1p IP gives rise to strong ORF1p signals in the immunoprecipitated complexes even after short exposure. Under these conditions ORF1p is hardly detectable in the input. Regarding the faint band in DKO HAP1 cells, this might be due to a technical problem during Western blot loading. Therefore, the input samples were loaded again on a Western blot and analyzed for the presence of ORF1p, L1TD1 and beta-actin (as loading control); they are shown as a separate panel in Suppl. Figure S4A.

      The enhanced image is clearer. Thanks.

      S4A and S4B now appear to be S6A and S6B, is that correct? (This is due to the addition of new S1 and S2, but please verify image orders were not disturbed).

      Yes, the input is shown now as a separate panel in Suppl. Figure S6A.

      (8) Figure S4C: This is related to our previous concerns involving antibody cross-reactivity. Figure 3E partially addresses this, where it looks like the L1TD1 "speckles" outnumber the ORF1p puncta, but overlap with all of them. This might be consistent with the antibody cross-reacting. The western blot (Figure 3C) suggests an upregulation of ORF1p by at least 23x in the DKO, but from the IF image in 3E it is hard to tell whether this is the case (slightly more signal, but fewer foci). Can you return to the images and confirm that the contrast settings are comparable? Can you massively overexpose the red channel in 3E to see if there is residual overlap?

      In Figure 3E the L1TD1 antibody gives no signal in DNMT1/L1TD1 DKO cells, confirming that it does not recognize ORF1p. In agreement with the Western blot in Figure 3C, the L1 ORF1p signal in Figure 3E is stronger in DKO cells. In DNMT1 KO cells the L1 ORF1p antibody does not recognize all L1TD1 speckles. This result is in agreement with the Western blot shown above in Figure R4B and indicates that the L1 ORF1p antibody does not recognize the L1TD1 protein. The contrast is comparable, and after overexposure there are still L1TD1-specific speckles. This might be due to differences in abundance of the two proteins.

      Response: Suggestion: Would it be possible to use a program like ImageJ to supplement the western blot observation? Qualitatively, in Figure 3E, it appears that there is more signal in the DKO, but this could also be due to there being multiple cells clustered together or a particularly nicely stained region. Could you randomly sample 20-30 cells across a few experiments to see if this holds up? I am interested in whether the puncta in the KO image(s) are a very highly concentrated region and whether in the DKO the signal is more dispersed. Also, the representative DKO seems to be cropped slightly wrong. (Please use puncta as a guide to make the cropping more precise.)

      As suggested by the reviewer, we have quantified the signals of 60 KO cells and 56 DKO cells in three different IF experiments using ImageJ. We measured a 1.4-fold higher expression level of L1 ORF1p in DKO cells. However, the difference is not statistically significant. This is most probably due to the change in cell size and protein content during the cell cycle, with protein content increasing from G1 to G2. Western blot analysis provides signals from comparable protein amounts, representing an average expression level over tens of thousands of cells. Nevertheless, the quantification results reflect in principle the IF pictures shown in Figure 3E, but IF is probably not the best method to quantify protein amounts. We have also corrected Figure 3E.

      Author response image 1.
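
      A per-cell comparison of this kind can be sketched as follows. This is a minimal illustration only: it assumes one integrated-intensity value per cell exported from ImageJ, and it uses a two-sided permutation test on the difference of means, which is an assumption and not necessarily the test used for the quantification reported here. All names and numbers are hypothetical.

```python
import random
from statistics import mean


def fold_change(ko_values, dko_values):
    """Ratio of the mean per-cell signal in DKO cells to that in KO cells."""
    return mean(dko_values) / mean(ko_values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of group means.

    Cell labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose mean difference is at least as large as the observed one
    (with the usual +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical per-cell intensities (arbitrary units), for illustration only.
    ko_cells = [10.0, 14.0, 9.0, 22.0, 12.0, 18.0]
    dko_cells = [15.0, 30.0, 11.0, 19.0, 25.0, 13.0]
    print("fold change:", fold_change(ko_cells, dko_cells))
    print("p-value:", permutation_p_value(ko_cells, dko_cells))
```

      With widely scattered single-cell values, a modest fold change of the means can easily fail to reach significance, which is consistent with the cell-to-cell variability discussed above.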

      (9) The choice of ARMC1 and YY2 is unclear. What are the criteria for the selection?

      ARMC1 was one of the top hits in a pilot RIP-seq experiment (IP versus input and IP versus IgG IP). In the actual RIP-seq experiment with DKO HAP1 cells instead of IgG IP as a negative control, we found ARMC1 as an enriched hit, although it was not among the top 5 hits. The results from the 2nd RIP-seq further confirmed the validity of ARMC1 as an L1TD1-interacting transcript. YY2 was of potential biological relevance as an L1TD1 target due to the fact that it is a processed pseudogene originating from YY1 mRNA as a result of retrotransposition. This is mentioned on page 6 of the revised manuscript.

      Response: Appreciated!

      (10) (P16) L1 is the only protein-coding transposon that is active in humans. This is perhaps too generalized of a statement as written. Other examples are readily found in the literature.

      Please clarify.

      We will tone down this statement in the revised manuscript.

      Response: Appreciated! To further clarify, the term "active" when it comes to transposable elements, has not been solidified. It can span "retrotransposition competent" to "transcripts can be recovered". There are quite a few reports of GAG transcripts and protein from various ERV/LTR subfamilies in various cells and tissues (in mouse and human at least), however whether they contribute to new insertions is actively researched.

      (11) In both the abstract and last sentence in the discussion section (P17), embryogenesis is mentioned, but this is not addressed at all in the manuscript. Please refrain from implying normal biological functions based on the results of this study unless appropriate samples are used to support them.

      Much of the published data on L1TD1 function are related to embryonic stem cells [3-7]. Therefore, it is important to discuss our findings in the context of previous reports.

      Response: It is well established that embryonic stem cells are not perfect or direct proxies for the inner cell mass of embryos, as multiple reports have demonstrated transcriptomic, epigenetic, and chromatin accessibility differences. The exact origin of ES cells is also considered controversial. We maintain that embryos/embryogenesis and the results presented in the manuscript are not yet interchangeable. An important exception would be complex models of embryogenesis such as embryoids (or synthetic/artificial embryo models that have been carefully termed as such so as not to suggest direct implications for embryos).

      https://www.nature.com/articles/ncb2965

      https://link.springer.com/article/10.1007/s00018-018-2965-y  

      https://www.cell.com/developmental-cell/abstract/S1534-5807(24)00363-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS1534580724003630%3Fshowall%3Dtrue

      We have deleted the corresponding paragraph in the Discussion.

      (12) Figure 3E: The format of Figures 1A and 3E are internally inconsistent. Please present similar data/images in a cohesive way throughout the manuscript.

      We show now consistent IF Figures in the revised manuscript.

      Response: Thanks

      Minor:

      In general:

      Still need checking for typos, mostly in Materials and Methods section; Please keep a consistent writing style throughout the whole manuscript. If you use L1 ORF1p, then please use L1 instead of LINE-1, or if you keep LINE-1 in your manuscript, then you should use LINE-1 ORF1p.

      A lab member from the US checked again the Materials and Methods section for typos. We keep the short version L1 ORF1p.

      (1) Intro:

      - Is L1TD1 in mice and humans? How "conserved" is it and does this suggest function?

      Murine and human L1TD1 proteins share 44% identity at the amino acid level, and it was suggested that the corresponding genes were under positive selection during evolution with functions in transposon control and maintenance of pluripotency [8].

      - Why HAP1? (Haploid?) The importance of this cell line is not clear.

      HAP1 is a nearly haploid human cancer cell line derived from the KBM-7 chronic myelogenous leukemia (CML) cell line [9, 10]. Due to its haploidy, it is perfectly suited and widely used for loss-of-function screens and gene editing. After gene editing, cells can be used in the nearly haploid or in the diploid state. We usually perform all experiments with diploid HAP1 cell lines. Importantly, in contrast to other human tumor cell lines, this cell line tolerates ablation of DNMT1. We have included a corresponding explanation in the revised manuscript on page 5, first paragraph.

      - Global methylation status in DNMT1 KO? (Methylations near L1 insertions, for example?)

      The HAP1 DNMT1 KO cell line with a 20 bp deletion in exon 4 used in our study was validated in the study by Smits et al. [11]. The authors report a significant reduction in overall DNA methylation. However, we are not aware of a DNA methylome study on this cell line. We show now data on the methylation of L1 elements in HAP1 cells and upon DNMT1 deletion in the revised manuscript in Suppl. Figure S1B.

      Response: Looks great!

      (2) Figure 1:

      - Figure 1C. Why is LMNB used instead of Actin (Fig1D)?

      We show now beta-actin as loading control in the revised manuscript.

      - Figure 1G shows increased Caspase 3 in KO, while the matching sentence in the result section skips over this. It might be more accurate to mention this and suggest that the single KO has perhaps an intermediate phenotype (Figure 1F shows a slight but not significant trend).

      We fully agree with the reviewer and have changed the sentence on page 6, 2nd paragraph accordingly.

      - Would 96 hrs trend closer to significance? An interpretation is that L1TD1 loss could speed up this negative consequence.

      We thank the reviewer for the suggestion. We have performed a time course experiment with 6 biological replicates for each time point up to 96 hours and found significant changes in viability upon loss of DNMT1 and a further significant reduction in viability upon additional loss of L1TD1 (shown in Figure 1F). These data suggest that, as expected, loss of DNMT1 leads to a significant reduction in viability and that additional ablation of L1TD1 further enhances this effect.

      Response: Looks good!

      - What are the "stringent conditions" used to remove non-specific binders and artifacts (negative control subtraction?)

      Yes, we considered only hits from both analyses, L1TD1 IP in KO versus input and L1TD1 IP in KO versus L1TD1 IP in DKO. This is now explained in more detail in the revised manuscript on page 6, 3rd paragraph.

      (3) Figure 2:

      - Figure 2A is a bit too small to read when printed.

      We have changed this in the revised manuscript.

      - Since WT and DKO lack detectable L1TD1, would you expect any difference in RIP-Seq results between these two?

      Due to the lack of DNMT1 and the resulting DNA hypomethylation, DKO cells are more similar to KO cells than WT cells with respect to the expressed transcripts.

      - Legend says selected dots are in green (it appears blue to me).

      We have changed this in the revised manuscript.

      - Would you recover L1 ORF1p and its binding partners in the KO? (Is the antibody specific in the absence of L1TD1 or can it recognize L1?) I noticed an increase in ORF1p in the KO in Figure 3C.

      Thank you for the suggestion. Yes, L1 ORF1p shows slightly increased expression in the proteome analysis and we have marked the corresponding dot in the Volcano plot (Figure 3A).

      - Should the figure panel reference near the (Rosspopoff & Trono) reference instead be Sup S1C as well? Otherwise, I don't think S1C is mentioned at all.

      - What are the red vs. green dots in 2D? Can you highlight ERV and ALU with different colors?

      We added the reference to Suppl. Figure S1C (now S3C) in the revised manuscript. In Figure 2D L1 elements are highlighted in green, ERV elements in yellow, and other associated transposon transcripts in red.

      Response: Much better, thanks!

      - Which L1 subfamily from Figure 2D is represented in the qRT-PCR in 2E "LINE-1"? Do the primers match a specific L1 subfamily? If so, which?

      We used primers specific for the human L1.2 subfamily.

      - Pulling down SINE element transcripts makes some sense, as many insertions "borrow" L1 sequences for non-autonomous retro transposition, but can you speculate as to why ERVs are recovered? There should be essentially no overlap in sequence.

      In the L1TD1 evolution paper [8], a potential link between L1TD1 and ERV elements was discussed:

      "Alternatively, L1TD1 in sigmodonts could play a role in genome defense against another element active in these genomes. Indeed, the sigmodontine rodents have a highly active family of ERVs, the mysTR elements [46]. Expansion of this family preceded the death of L1s, but these elements are very active, with 3500 to 7000 species-specific insertions in the L1-extinct species examined [47]. This recent ERV amplification in Sigmodontinae contrasts with the megabats (where L1TD1 has been lost in many species); there are apparently no highly active DNA or RNA elements in megabats [48]. If L1TD1 can suppress retroelements other than L1s, this could explain why the gene is retained in sigmodontine rodents but not in megabats."

      Furthermore, Jin et al. report the binding of L1TD1 to repetitive sequences in transcripts [12]. It is possible that some of these sequences are also present in ERV RNAs.

      Response: Interesting, thanks for sharing

      - Is S2B a screenshot? (the red underline).

      No, it is a Powerpoint figure, and we have removed the red underline.

      (4) Figure 3:

      - Text refers to Figure 3B as a western blot. Figure 3B shows a volcano plot. This is likely 3C but would still be out of order (3A>3C>3B referencing). I think this error is repeated in the last result section.

      - Figure and legends fail to mention what gene was used for ddCT method (actin, gapdh, etc.).

      - In general, the supplemental legends feel underwritten and could benefit from additional explanations. (Main figures are appropriate but please double-check that all statistical tests have been mentioned correctly).

      Thank you for pointing this out. We have corrected these errors in the revised manuscript.

      (5) Discussion:

      - AluY connection is interesting. Is there an "Alu retrotransposition reporter assay" to test whether L1TD1 enhances this as well?

      Thank you for the suggestion. There is indeed an Alu retrotransposition reporter assay reported by Dewannieux et al. [13]. The assay is based on a Neo selection marker. We have previously tested a Neo selection-based L1 retrotransposition reporter assay, but this system failed to work properly in HAP1 cells; therefore, we switched to a blasticidin-based L1 retrotransposition reporter assay. A corresponding blasticidin-based Alu retrotransposition reporter assay might be interesting for future studies (mentioned in the Discussion, page 11, paragraph 4 of the revised manuscript).

      (6) Material and Methods :

      - The number of typos in the materials and methods is too numerous to list. Instead, please refer to the next section that broadly describes the issues seen throughout the manuscript.

      Writing style

      (1) Keep a consistent style throughout the manuscript: for example, L1 or LINE-1 (also L1 ORF1p or LINE-1 ORF1p); per or "/"; knockout or knock-out; min or minute; 3 times or three times; media or medium. Additionally, as TE naming conventions are not uniform, it is important to maintain internal consistency so as to not accidentally establish an imprecise version.

      (2) There's a period between "et al" and the comma, and "et al." should be italic.

      (3) The authors should explain what the key jargon is when it is first used in the manuscript, such as "retrotransposon" and "retrotransposition".

      (4) The authors should show the full spelling of some acronyms when they use it for the first time, such as RNA Immunoprecipitation (RIP).

      (5) Use a space between numbers and alphabets, such as 5 μg.

      (6) 2.0 × 10^5 cells, that's not an "x".

      (7) Numbers in the reference section are lacking (hard to parse).

      (8) In general, there are a significant number of typos in this draft which at times becomes distracting. For example, (P3) Introduction: Yet, co-option of TEs thorough (not thorough, it should be through) evolution has created so-called domesticated genes beneficial to the gene network in a wide range of organisms. Please carefully revise the entire manuscript for these minor issues that collectively erode the quality of this submission.

      Thank you for pointing out these mistakes. We have corrected them in the revised manuscript. A native speaker from our research group has carefully checked the paper.

      In summary, we have added Supplementary Figure S7C and have changed Figures 1C, 1E, 1F, 2A, 2D, 3A, 4B, S3A-D, S4B and S6A based on these comments.

      Response: Thank you for taking these comments on board!

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This is an interesting and somewhat unusual paper supporting the idea that creatine is a neurotransmitter in the central nervous system of vertebrates. The idea is not entirely new, and the authors carefully weigh the evidence, both past and newly acquired, to make their case. The strength of the paper lies in the importance of the potential discovery - as the authors point out, creatine ticks more boxes on criteria of neurotransmitters than some of the ones listed in textbooks - and the list of known transmitters (currently 16) certainly is textbook material. A further strength of the manuscript is the careful consideration of a list of criteria for transmitters and newly acquired evidence for four of these criteria: 1. evidence that creatine is stored in synaptic vesicles, 2. mutants for creatine synthesis and a vesicular transporter show reduced storage and release of creatine, 3. functional measurement that creatine release has an excitatory or inhibitory (here inhibitory) effect in vivo, and 4. ATP-dependence. The key weakness of the paper is that there is no single clear 'smoking gun', like a postsynaptic creatine receptor, that would really demonstrate the function as a transmitter. Instead, the evidence is of a cumulative nature, and not all bits of evidence are equally strong. On balance, I found the path to discovery and the evidence assembled in this manuscript to establish a clear possibility, positive evidence, and to provide a foundation for further work in this direction.

      It is notable that, historically, no neurotransmitter has ever been established in a single paper. While creatine will not be an exception, the data presented in this paper go further than those of any previous paper in demonstrating the possibility of a new neurotransmitter. Nevertheless, we have added an entire paragraph to the Discussion about the differences between Cr and classic neurotransmitters such as Glu, beginning with the absence of a molecularly defined receptor at this point and the Ca2+-independent component of Cr release induced by extracellular K+.

      We thank the reviewer for noting that the evidence we have obtained now supports the view that creatine satisfies all 4 criteria for transmitters.

      We respectfully disagree with the point about a smoking gun: any one of these four is a smoking gun, while the satisfaction of all 4 is quite strong, more than a smoking gun.

      We disagree that a receptor “would really demonstrate the function of a transmitter”. Textbook criteria for a transmitter usually require postsynaptic responses, not a molecularly defined receptor. Molecularly defining the receptors for many of the known transmitters required many years of work, and these molecules were accepted as transmitters before their receptors were finally molecularly defined. As long as there is a postsynaptic response, there is of course a receptor, though its molecular properties should be further studied. For example, responses to choline were discovered in 1900 (Hunt, Am J Physiol 3, xviii-xix, 1900), those to acetylcholine in 1906 (Hunt and Taveau, Br Med J 2:1788-1789, 1906), and those to suprarenal gland extracts before 1894 (Oliver and Schäfer, J Physiol 18:230-276 1895). Henry Dale was awarded a Nobel prize in 1936 partly for his work on acetylcholine. Receptors for acetylcholine and noradrenaline were not molecularly defined until the 1970s and 1980s. Before then, they were known only through the responses they mediated to natural transmitters and synthesized chemicals.

      There were two previous reports that creatine could be taken into brain slices (Almeida et al., 2006) or synaptosomes (Peral, Vázquez-Carretero and Ilundain, 2010). These were used by the reviewer to argue that the idea of creatine as a neurotransmitter “is not entirely new”. However, no one has followed up these studies for 10 years, thus they would not be considered as good smoking guns. While we have reproduced the synaptosome uptake result (together with our new finding that this uptake was dependent on SLC6A8), it should be noted that uptake of molecules into synaptosomes is not absolutely required for a neurotransmitter because degradation of a transmitter is equally valid. Furthermore, molecules required synaptically but not as a transmitter can also be transported into the synaptic terminal.

      Our detection of Cr in the synaptic vesicles provides much stronger evidence supporting its importance. If a smoking gun is important, the detection of creatine in the SVs is the best smoking gun, whose discovery in fact was the reason leading us to study its release, postsynaptic responses as well as repeating the uptake experiment with genetic mutants.

      Reviewer #2 (Public Review):

      Summary:

      Bian et al studied creatine (Cr) in the context of central nervous system (CNS) function. They detected Cr in synaptic vesicles purified from mouse brains with anti-Synaptophysin using capillary electrophoresis-mass spectrometry. Cr levels in the synaptic vesicle fraction were reduced in mice lacking the Cr synthetase AGAT, or the Cr transporter SLC6A8. They provide evidence for Cr release within several minutes after treating brain slices with KCl. This KCl-induced Cr release was partially calcium-dependent and was attenuated in slices obtained from AGAT and SLC6A8 mutant mice. Cr application also decreased the excitability of cortical pyramidal cells in one third of the cells tested. Finally, they provide evidence for SLC6A8-dependent Cr uptake into synaptosomes, and ATP-dependent Cr loading into synaptic vesicles. Based on these data, the authors propose that Cr may act as a neurotransmitter in the CNS.

      Strengths:

      1) A major strength of the paper is the broad spectrum of tools used to investigate Cr.

      2) The study provides strong evidence that Cr is present in/loaded into synaptic vesicles.

      Weaknesses:

      (in sequential order)

      1) Are Cr levels indeed reduced in Agat-/-? The decrease in Cr IgG in Agat-/- (and Agat+/-) is similar to the corresponding decrease in Syp (Fig. 3B). What is the explanation for this? Is the decrease in Cr in Agat-/- significant when considering the drop in IgG? The data should be normalized to the respective IgG control.

      We measured the Cr concentration in whole brain lysates using a Creatine Assay Kit (Sigma, MAK079). Cr levels in the brain were reduced in Agat-/- mice. The Cr concentration in AGAT-/- mice was reduced to about 1/10 of that in AGAT+/+ and AGAT+/- mice (Author response image 1).

      Author response image 1.

      Cr concentration in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=5 male mice for each group). , p<0.05, **, p<0.001, one-way ANOVA with Tukey’s correction.

      As pointed out by the reviewer, the decrease in Cr IgG in Agat-/- seems similar to the corresponding decrease in Syp (Fig. 3B in the paper). Cr pulled down by IgG was 0.46 ± 0.04, 0.37 ± 0.06 and 0.17 ± 0.03 pmol/μg anti-syp antibody for Agat+/+, Agat+/-, and Agat-/- mice, respectively. There was a trend toward reduced Cr pulled down by IgG in Agat-/-; however, there were no statistically significant differences between Agat-/- and Agat+/+, or between Agat-/- and Agat+/-, as determined by one-way ANOVA (Fig. 3B in the paper). Because Agat-/- reduced the Cr concentration in the brain, we speculate that the apparent drop in Cr pulled down by IgG may have partially resulted from the overall reduction of Cr content in the brain.

      The absolute content of Cr pulled down by Syp in Agat-/- mice was reduced to 21.6% of that in Agat+/+ mice and 23.6% of that in Agat+/- mice (Fig. 3B in the paper). As suggested by the reviewer, we normalized the Cr pulled down by Syp to the respective IgG control (Author response image 2). The normalized Cr content in AGAT-/- mice tended to decrease as compared to Agat+/+ and Agat+/- mice, but the difference was not statistically significant (n=10 for each group, one-way ANOVA).

      Author response image 2.

      Normalized Cr content in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=10 for each group). Cr pulled down by anti-Syp antibody was normalized to that of IgG.

      2) The data supporting that depolarization-induced Cr release is SLC6A8 dependent is not convincing because the relative increase in KCl-induced Cr release is similar between SLC6A8-/Y and SLC6A8+/Y (Fig. 5D). The data should be also normalized to the respective controls.

      As suggested by the reviewer, we normalized the Cr release during KCl stimulation to the baseline (Author response image 3). The ratio of Cr release evoked by high KCl stimulation to the baseline was similar in WT and Slc6a8 knockouts. This suggests that Cr is not released through the SLC6A8 transporter.

      Author response image 3.

      Normalized Cr release from slices from Slc6a8+/Y and Slc6a8-/Y mice (n=7 slices for each group). Cr released evoked by high KCl stimulation was normalized to baseline.

      However, without Slc6a8, KCl-induced release of Cr was significantly reduced (Figure 5D in the paper). This is because Slc6a8 is a transporter for Cr uptake into synaptic terminals (Figure 5D and 8C in the paper). Therefore, the reduced Cr content in SVs (Figure 2C in the paper) indirectly reduced Cr release.
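
      The distinction between relative and absolute release can be illustrated with a short sketch. The numbers are hypothetical placeholders, not measured values, and the function names are introduced only for illustration: two genotypes with the same evoked-to-baseline ratio can still differ in the absolute amount of Cr released when the baseline differs.

```python
def normalized_release(baseline, evoked):
    """Ratio of KCl-evoked Cr release to the baseline release of the same slice."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return evoked / baseline


def absolute_increase(baseline, evoked):
    """Absolute amount of Cr released on top of the baseline."""
    return evoked - baseline


# Hypothetical values in arbitrary units, chosen only to illustrate the argument.
wild_type = {"baseline": 2.0, "evoked": 6.0}
knockout = {"baseline": 0.5, "evoked": 1.5}

for name, slice_data in (("wild type", wild_type), ("knockout", knockout)):
    ratio = normalized_release(slice_data["baseline"], slice_data["evoked"])
    increase = absolute_increase(slice_data["baseline"], slice_data["evoked"])
    print(f"{name}: ratio={ratio:.2f}, absolute increase={increase:.2f}")
```

      In this example both genotypes show the same ratio, yet the absolute increase is fourfold smaller in the knockout, which is the pattern expected if the transporter affects how much Cr is available for release and not the release step itself.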

      3) The majority (almost 3/4) of depolarization-induced Cr release is Ca2+ independent (Fig. 5G). Furthermore, KCl-induced, Ca2+-independent release persists in SLC6A8-/Y (Fig. 5G). What is the model for Ca2+-independent Cr release? Why is there Ca2+-independent Cr release from SLC6A8 KO neurons? How does this relate to the prominent decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G)? They show a prominent decrease in Cr control levels in SLC6A8-/Y in Fig. 5D. Were the data shown in Fig. 5D obtained in the presence or absence of Ca2+? Could the decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G) be due to decreased Cr baseline levels in the presence of Ca2+ (Fig. 5D)?

      These are interesting questions that, at this point, can only be answered by reference to the literature. For example, one possibility is that Ca2+-independent Cr release might occur in glia since, as pointed out by the reviewer in Point 6, high GAMT levels were reported for astrocytes and oligodendrocytes (Schmidt et al. 2004; Rosko et al. 2023). As reported, other neuromodulators such as taurine can be released from astrocytes (Philibert, Rogers, and Dutton 1989) or slices (Saransaari and Oja 2006) in a Ca2+-independent manner. In addition, in the absence of potassium stimulation, Ca2+ depletion led to increased release of taurine in cultured astrocytes (Takuma et al. 1996) or in the striatum in vivo (Molchanova, Oja, and Saransaari 2005). Similarly, in SLC6A8 KO slices, Ca2+ depletion (Figure 5G) also increased creatine baseline levels as compared to those in normal ACSF (Figure 5D). Another possibility is that Ca2+-independent Cr release might occur in neurons lacking SLC6A8 expression.

      As mentioned in the paper, the data shown in Figure 5D were obtained in the presence of Ca2+. The reduction of Ca2+-dependent Cr release evoked by potassium in SLC6A8-/Y (Figure 5G) may be due to decreased Cr baseline levels in the presence of Ca2+ and reduced Cr in synaptic vesicles (Figure 5D).

      4) Cr levels are strongly reduced in Agat-/- (Figure 6B). However, KCl-induced Cr release persists after loss of AGAT (Figure 6B). These data do not support that Cr release is Agat dependent.

      Although KCl-induced Cr release persisted in AGAT-/- mutants, it dropped to 11.6% of the level in WT mice (Figure 6B). AGAT is not directly involved in the release, but is required for providing sufficient Cr.

      5) The authors show that Cr application decreases excitability in ~1/3 of the tested neurons (Figure 7). How were responders and non-responders defined? What justifies this classification? The data for all Cr-treated cells should be pooled. Are there indeed two distributions (responders/non-responders)? Running statistics on pre-selected groups (Figure 7H-J) is meaningless. Given that the effects could be seen 2-8 minutes after Cr application - at what time points were the data shown in Figure 7E-J collected? Is the Cr group shown in Figure 7F significantly different from the control group/wash?

      The responders were defined by three criteria: (1) When Cr was applied, the rheobase was increased as compared to both control and wash conditions. (2) The number of total evoked spikes was lower during Cr application than in both control and wash. (3) The number of total evoked spikes was at least 10% lower than in control or wash.

      For all the individual responders, the rheobase was increased when Cr was applied (Figure 7E and 7F). In individual non-responders, by contrast, the rheobase following Cr application was either identical to both control and wash (n=19/35), identical to either control or wash (n=11/35), between control and wash (n=2/35), or smaller than both control and wash (n=3/35). Thus, the responders and non-responders were separable. When the rheobase data were pooled, many points overlapped, so we did not pool the data here.

      As suggested, we pooled the data on the ratio of spike changes in response to 100 μM Cr application for all neurons (Author response image 4). Evoked spikes of non-responders typically (34/35) changed within the range of -10% to 10%.

      Author response image 4.

      Relative changes of total evoked spikes in response to 100 μM Cr. Responders are represented by red dots and non-responders by black dots. Dashed black line indicates 10%. Relative change = (Cr-(Control +wash)/2)/((Control +wash)/2)*100%.
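
The relative-change formula in this caption and the three responder criteria listed above can be written out as a short sketch. This is only an illustration: the function names and the dictionary layout are hypothetical, and criterion (3) is read here as "at least 10% below control or below wash", which is one possible interpretation of the wording.

```python
def relative_change(cr, control, wash):
    """Percent change of the Cr value relative to the mean of control and wash."""
    baseline = (control + wash) / 2  # assumes a non-zero baseline
    return (cr - baseline) / baseline * 100


def is_responder(rheobase, spikes, min_drop=10.0):
    """Apply the three criteria to one neuron.

    rheobase and spikes are dicts with the (hypothetical) keys
    "control", "cr" and "wash".
    """
    # (1) rheobase during Cr is higher than in both control and wash
    rheobase_up = rheobase["cr"] > max(rheobase["control"], rheobase["wash"])
    # (2) fewer evoked spikes during Cr than in both control and wash
    spikes_down = spikes["cr"] < min(spikes["control"], spikes["wash"])
    # (3) spike count at least min_drop percent below control or wash
    #     (assumed reading of "or": either comparison is sufficient)
    drop_vs_control = (spikes["control"] - spikes["cr"]) / spikes["control"] * 100
    drop_vs_wash = (spikes["wash"] - spikes["cr"]) / spikes["wash"] * 100
    big_drop = max(drop_vs_control, drop_vs_wash) >= min_drop
    return rheobase_up and spikes_down and big_drop
```

For instance, with made-up values of 40 spikes in control and wash and 30 spikes during Cr, `relative_change(30, 40, 40)` gives a 25% decrease.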

      In Figure 7E-J, we collected data at time points when the maximal response was reached. The Cr group shown in Figure 7F was indeed significantly different from the control group/wash (p<0.05, paired t test, for data points collected under 75-500 pA current injection).

      6) Indirect effects: The phenotypes could be partially caused by indirect effects of perturbing the Cr/PCr/CK system, which is known to play essential roles in ATP regeneration, Ca2+ homeostasis, neurotransmission, intracellular signaling systems, axonal and dendritic transport... Similarly, high GAMT levels were reported for astrocytes (e.g., Schmidt et al. 2004; doi: 10.1093/hmg/ddh112), and changes in astrocytic Cr may underlie the phenotypes. Cr has been also reported to be an osmolyte: a hyperosmotic shock of astrocytes induced an increase in Cr uptake, suggesting that Cr can work as a compensatory osmolyte (Alfieri et al. 2006; doi: 10.1113/jphysiol.2006.115006). Potential indirect effects are also consistent with a trend towards decreased KCl-induced GABA (and Glutamate) release in SLC6A8-/Y (Figure 5C). These indirect effects may in part explain the phenotypes seen after perturbing Agat, SLC6A8, and should be thoroughly discussed.

      We discussed the possibility of a non-transmitter role for creatine/phosphocreatine in the Discussion, and we added the possibility of astrocytic Cr there as well. The trend towards decreased KCl-induced GABA (and glutamate) release in SLC6A8-/Y (Figure 5C) was not significant.

      7) As stated by the authors, there is some evidence that Cr may act as a co-transmitter for GABAA receptors (although only at high concentrations). Would a GABAA blocker decrease the fraction of cells with decreased excitability after Cr exposure?

      We performed another experiment in hippocampal CA1 pyramidal neurons showing that Cr at 100 μM did not change GABAergic neurotransmission (n=8, Author response image 5). Inhibitory postsynaptic currents (IPSCs) recorded in the presence of glutamate receptor blockers (10 μM APV and 10 μM CNQX) were not changed by 100 μM creatine. These results do not support Cr activation of GABAA receptors.

      Author response image 5.

      IPSCs recorded in hippocampal CA1 pyramidal neurons. (A) Representative raw traces before (Control), during (Creatine) and after (Wash) the application of 100 μM creatine. (B&C) Group data of IPSC frequency (B) and amplitude (C) averaged over 1 min.

      8) The statement "Our results have also satisfied the criteria of Purves et al. 67,68, because the presence of postsynaptic receptors can be inferred by postsynaptic responses." (l.568) is not supported by the data and should be removed.

      We have deleted this sentence, though what could mediate postsynaptic responses other than receptors?

      Reviewer #3 (Public Review):

      SUMMARY:

      The manuscript by Bian et al. promotes the idea that creatine is a new neurotransmitter. The authors conduct an impressive combination of mass spectrometry (Fig. 1), genetics (Figs. 2, 3, 6), biochemistry (Figs. 2, 3, 8), immunostaining (Fig. 4), electrophysiology (Figs. 5, 6, 7), and EM (Fig. 8) in order to offer support for the hypothesis that creatine is a CNS neurotransmitter.

      We thank the reviewer for the summary.

      STRENGTHS:

      There are many strengths to this study.

      • The combinatorial approach is a strength. There is no shortage of data in this study.

      • The careful consideration of specific criteria that creatine would need to meet in order to be considered a neurotransmitter is a strength.

      • The comparison studies that the authors have done in parallel with classical neurotransmitters are helpful.

      • Demonstration that creatine has inhibitory effects is another strength.

      • The new genetic mutations for Slc6a8 and AGAT are strengths and potentially incredibly helpful for downstream work.

      WEAKNESSES:

      • Some data are indirect. Even though Slc6a8 and AGAT are helpful sentinels for the presence of creatine, they are not creatine themselves. Therefore, the conclusions that are drawn should be circumspect.

      SLC6A8 and AGAT mutants are not essential for Cr’s role as a neurotransmitter.

      • Regarding Slc6a8, it seems to work only as a reuptake transporter - not as a transporter into SVs. Therefore, we do not know what the transporter is.

      Indeed, SLC6A8 is only a transporter on the cytoplasmic membrane, not a transporter on synaptic vesicles. We have shown biochemistry here, and we have unpublished data that showed other SLCs on SVs, which did not include SLC6A8.

      • Puzzlingly, Slc6a8 and AGAT are in different cells, setting up the complicated model that creatine is created in one cell type and then processed as a neurotransmitter in another.

      • No candidate receptor for creatine has been identified postsynaptically.

      • Because no candidate receptor has been identified, is it possible that creatine is exerting its effects indirectly through other inhibitory receptors (e.g., GABAergic Rs)?

      As shown in our response to Question 7 of Reviewer 2, Cr did not exert its effects through inhibitory GABAA receptors.

      • More broadly, what are the other possibilities for roles of creatine that would explain these observations other than it being a neurotransmitter? Could it simply be a modifier that exists in the SVs (lots of molecules exist in SVs)?

      We discussed the possibility of a non-transmitter role for creatine/phosphocreatine in the Discussion.

      • The biochemical studies are helpful in terms of comparing relevant molecules (e.g., Figs. 8 and S1), but the images of the westerns are all so fuzzy that there are questions about processing and the accuracy of the quantification.

      Multiple members (>4) of our group have carried out SV purifications repeatedly over the last decade, so we are highly confident of the SV purifications presented in Figs. 8 and S1.

      There are several criteria that define a neurotransmitter. The authors nicely delineated many criteria in their discussion, but it is worth it for readers to do the same with their own understanding of the data.

      By this reviewer's understanding (and the Purves' textbook definition) a neurotransmitter: 1) must be present within the presynaptic neuron and stored in vesicles; 2) must be released by depolarization of the presynaptic terminal; 3) must require Ca2+ influx upon depolarization prior to release; 4) must bind specific receptors present on the postsynaptic cell; 5) exogenous transmitter can mimic presynaptic release; 6) there exists a mechanism of removal of the neurotransmitter from the synaptic cleft.

      The six criteria seem to be required only by the reviewer. As discussed in our Discussion, Purves’ textbook did not list six criteria but only three: “the substance must be present within the presynaptic neuron; the substance must be released in response to presynaptic depolarization, and the release must be Ca2+ dependent; specific receptors for the substance be present on the postsynaptic cell” (Purves et al., 2001, 2016).

      Kandel et al. (2013, 2021) listed 4 criteria for a neurotransmitter: “it is synthesized in the presynaptic neuron; it is present within vesicles and is released in amounts sufficient to exert a defined action on the postsynaptic neuron or effector organ; when administered exogenously in reasonable concentrations it mimics the action of the endogenous transmitter; a specific mechanism usually exists for removing the substance from the synaptic cleft”.

      While we agree that any neuroscientist can have his/her own criteria, it is more reasonable to accept the textbooks that have been widely read for decades.

      For a paper to claim that the work has identified a new neurotransmitter, several of these criteria would be met - and the paper would acknowledge in the discussion which ones have not been met. For this particular paper, this reviewer finds that condition 1 is clearly met.

      Conditions 2 and 3 seem to be met by electrophysiology, but there are caveats here. High KCl stimulation is a blunt instrument that will depolarize absolutely everything in the prep all at once and could result in any number of non-specific biological reactions as a result of K+ rushing into all neurons in the prep. Moreover, the results in 0 Ca2+ are puzzling. For creatine (and for the other neurotransmitters), why is there such a massive uptick in release, even when the extracellular saline is devoid of calcium?

      To avoid the disadvantage of high KCl stimulation, we performed optogenetic experiments recently, with encouraging preliminary data. We do not know the source of Ca2+-independent release of Cr and neurotransmitters, though astrocytes are a possibility.

      Condition 4 is not discussed in detail at all. In the discussion, the authors elide the criterion of receptors specified by Purves by inferring that the existence of postsynaptic responses implies the existence of receptors. True, but does it specifically imply the existence of creatinergic receptors? This reviewer does not think that is necessarily the case. The authors should be appropriately circumspect and consider other modes of inhibition that are induced by activation or potentiation of other receptors (e.g., GABAergic or glycinergic).

      Our results did not support Cr stimulation of inhibitory GABAA receptors (see our answer to Point 7 of Reviewer 2).

      Condition 5 may be met, because the authors applied exogenous creatine and observed inhibition (Fig. 7). However, this is tough to know without understanding the effects of endogenous release of creatine. If they were to test whether the absence of creatine caused excess excitation (at putative creatinergic synapses), then that would be supportive of the same.

      After the submission of our manuscript, we found a recent paper showing that slc6a8 knockout led to increased excitation in pyramidal neurons in the prefrontal cortex (PFC), with increased firing frequency (Ghirardini et al., 2023). Because we have shown that slc6a8 knockout causes a decrease of Cr in SVs (Figure 2 in our paper), this result provides the evidence described under Condition 5 by this reviewer: that a decrease of Cr in SVs leads to excess excitation.

      For condition 6, the authors made a great effort with Slc6a8. This is a very tough criterion to understand for many synapses and neurotransmitters.

      In terms of fundamental neuroscience, the story would be impactful if proven correct. There are certainly more neurotransmitters out there than currently identified.

      The impact as framed by the authors in the abstract and introduction for intellectual disability is uncertain (forming a "new basis for ID pathogenesis") and it seems quite speculative beyond the data in this paper.

      We deleted this sentence.

      Reviewer #1 (Recommendations For The Authors):

      To strengthen the manuscript, I suggest the following considerations:

      1) The key missing evidence to my mind is a receptor - but this is clearly outside the scope of this paper. Yet, I am surprised that in the list of criteria for neurotransmitters in general there is no mention of a receptor. Furthermore, many receptors have been identified through receptor agonists or antagonists, like neurotoxins or drugs. The authors do not talk about putative receptors except for a sentence in the discussion where they speculate on a GPCR. There are numerous GPCR agonists and antagonists, which may be a long-shot, or something even a bit more designed based on knowledge about creatine? I do not think the publication of this manuscript should have been made dependent on finding an agonist or antagonist of this specific unknown receptor (if it exists), but it would be good to have at least some leads on this from the authors what has been tried or what could be done? How about a manipulation of G-protein-coupled signal transduction to support the idea that there IS such a GPCR? There may be a real opportunity here to test existing compounds in wild type, the slc6a8 and agat mutants.

      We will keep trying, but accept the reality that Rome was not built in a single day and that no transmitter was proven by one single paper.

      A key new puzzle piece of evidence is the identification of creatine in synaptic vesicles. The experiment relies heavily on the purity of the SV fraction using the anti-synaptophysin antibody. I am quite sure that these preparations contain many other compartments - and of course a big mix of synaptic (and other) vesicles. Would it be possible to purify with an anti slc6a8 antibody?

      Slc6a8 is expressed on the plasma membrane of neurons (refs. 7-9) rather than on synaptic vesicles. Consistent with this, we could not detect an obvious Slc6a8-HA signal in the starting material (Lane S in Author response image 6) used for SV purification. We also tried to purify SVs with an HA antibody from Slc6a8-HA mice, but SV markers could not be detected.

      Author response image 6.

      Lack of Slc6a8-HA in our starting material. In Slc6a8-HA knock-in mice, the HA signal was present in whole brain homogenate (H), but not obvious in supernatants (S) following centrifugation at 35,000 × g. In contrast, the SV marker Syp was present in supernatants.

      The K stimulation protocol in slices is relatively crude, as all neurons in the slice get simultaneously overactivated - and some of the effects on Ca-dependent release are not very strong (e.g. the 35 neurons that were not responsive to creatine at all). A primary neuronal culture of neurons that respond to creatine would strengthen this section.

      To avoid the disadvantage of K stimulation, we also performed optogenetic experiments recently and obtained encouraging preliminary results.

      Reviewer #2 (Recommendations For The Authors):

      1) The different sections of the manuscript are not separated by headers.

      2) The beginning of the results section either does not reference the underlying literature or refers to unpublished data.

      We have kept a bit of background at the beginning of the Results section.

      3) The text contains many opinions and historical information that are not required (e.g., "It has never been easy to discover a new neurotransmitter, especially one in the central nervous system (CNS). We have been searching for new neurotransmitters for 12 years."; l. 17).

      This is a field that has been dormant for decades and such background introductions are helpful for at least some readers.

      4) Almeida et al. (2008; doi: 10.1002/syn.20280) provided evidence for electrical activity-, and Ca2+-dependent Cr release from rat brain slices. This paper should be introduced in the introduction.

      Those were stand-alone papers which have not been reproduced or paid attention to. Our introduction part did not mention them because our research did not begin with those papers. We had no idea that those papers existed when we began. We started with SV purification and only read those papers afterwards. Thus, they were not necessary background to our paper but can be discussed after we discovered Cr in SVs.

      5) Fig. 7: A Y-scale for the stimulation protocol is missing.

      Revised.

      Reviewer #3 (Recommendations For The Authors):

      The main suggestion by this reviewer (beyond the details in the public review) is to consider the full spectrum of biology that is consistent with these results. By my reading, creatine could be a neurotransmitter, but other possibilities also exist, and the authors need to highlight those too.

      We have discussed a possible non-transmitter role in the Discussion.

      References

      Ghirardini, E., G. Sagona, A. Marquez-Galera, F. Calugi, C. M. Navarron, F. Cacciante, S. Chen, F. Di Vetta, L. Dada, R. Mazziotti, L. Lupori, E. Putignano, P. Baldi, J. P. Lopez-Atalaya, T. Pizzorusso, and L. Baroncelli. 2023. Cell-specific vulnerability to metabolic failure: the crucial role of parvalbumin expressing neurons in creatine transporter deficiency. Acta Neuropathol Commun, 11: 34. doi: 10.1186/s40478-023-01533-w.

      Lowe, M. T., Faull, R. L., Christie, D. L. & Waldvogel, H. J. Distribution of the creatine transporter throughout the human brain reveals a spectrum of creatine transporter immunoreactivity. J Comp Neurol 523, 699-725 (2015). https://doi.org/10.1002/cne.23667

      Mak, C. S. et al. Immunohistochemical localisation of the creatine transporter in the rat brain. Neuroscience 163, 571-585 (2009). https://doi.org/10.1016/j.neuroscience.2009.06.065

      Molchanova, S. M., Oja, S. S. & Saransaari, P. Mechanisms of enhanced taurine release under Ca2+ depletion. Neurochem Int 47, 343-349 (2005). https://doi.org/10.1016/j.neuint.2005.04.027

      Philibert, R. A., Rogers, K. L. & Dutton, G. R. K+-evoked taurine efflux from cerebellar astrocytes: on the roles of Ca2+ and Na+. Neurochem Res 14, 43-48 (1989). https://doi.org/10.1007/BF00969756

      Rosko, L. M. et al. Cerebral Creatine Deficiency Affects the Timing of Oligodendrocyte Myelination. J Neurosci 43, 1143-1153 (2023). https://doi.org/10.1523/JNEUROSCI.2120-21.2022

      Saransaari, P. & Oja, S. S. Characteristics of taurine release in slices from adult and developing mouse brain stem. Amino Acids 31, 35-43 (2006). https://doi.org/10.1007/s00726-006-0290-5

      Schmidt, A. et al. Severely altered guanidino compound levels, disturbed body weight homeostasis and impaired fertility in a mouse model of guanidinoacetate N-methyltransferase (GAMT) deficiency. Hum Mol Genet 13, 905-921 (2004). https://doi.org/10.1093/hmg/ddh112

      Speer, O. et al. Creatine transporters: a reappraisal. Mol Cell Biochem 256-257, 407-424 (2004). https://doi.org/10.1023/b:mcbi.0000009886.98508.e7

      Takuma, K. et al. Ca2+ depletion facilitates taurine release in cultured rat astrocytes. Jpn J Pharmacol 72, 75-78 (1996). https://doi.org/10.1254/jjp.72.75

    1. Author Response

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This valuable paper examines gene expression differences between male and female individuals over the course of flower development in the dioecious angiosperm Trichosanthes pilosa. The authors show that male-biased genes evolve faster than female-biased and unbiased genes. This is frequently observed in animals, but this is the first report of such a pattern in plants. In spite of the limited sample size, the evidence is mostly solid and the methods appropriate for a non-model organism. The resources produced will be used by researchers working in the Cucurbitaceae, and the results obtained advance our understanding of the mechanisms of plant sexual reproduction and its evolutionary implications: as such they will broadly appeal to evolutionary biologists and plant biologists.

      Public Reviews:

      Reviewer #1 (Public Review):

      The evolution of dioecy in angiosperms has significant implications for plant reproductive efficiency, adaptation, evolutionary potential, and resilience to environmental changes. Dioecy allows for the specialization and division of labor between male and female plants, where each sex can focus on specific aspects of reproduction and allocate resources accordingly. This division of labor creates an opportunity for sexual selection to act and can drive the evolution of sexual dimorphism.

      In the present study, the authors investigate sex-biased gene expression patterns in juvenile and mature dioecious flowers to gain insights into the molecular basis of sexual dimorphism. They find that a large proportion of the plant transcriptome is differentially regulated between males and females with the number of sex-biased genes in floral buds being approximately 15 times higher than in mature flowers. The functional analysis of sex-biased genes reveals that chemical defense pathways against herbivores are up-regulated in the female buds along with genes involved in the acquisition of resources such as carbon for fruit and seed production, whereas male buds are enriched in genes related to signaling, inflorescence development and senescence of male flowers. Furthermore, the authors implement sophisticated maximum likelihood methods to understand the forces driving the evolution of sex-biased genes. They highlight the influence of positive and relaxed purifying selection on the evolution of male-biased genes, which show significantly higher rates of non-synonymous to synonymous substitutions than female or unbiased genes. This is the first report (to my knowledge) highlighting the occurrence of this pattern in plants. Overall, this study provides important insights into the genetic basis of sexual dimorphism and the evolution of reproductive genes in Cucurbitaceae.

      Reviewer #2 (Public Review):

      Summary:

      This study uses transcriptome sequence from a dioecious plant to compare evolutionary rates between genes with male- and female-biased expression and distinguish between relaxed selection and positive selection as causes for more rapid evolution. These questions have been explored in animals and algae, but few studies have investigated this in dioecious angiosperms, and none have so far identified faster rates of evolution in male-biased genes (though see Hough et al. 2014 https://doi.org/10.1073/pnas.1319227111).

      Strengths:

      The methods are appropriate to the questions asked. Both the sample size and the depth of sequencing are sufficient, and the methods used to estimate evolutionary rates and the strength of selection are appropriate. The data presented are consistent with faster evolution of genes with male-biased expression, due to both positive and relaxed selection.

      This is a useful contribution to understanding the effect of sex-biased expression in genetic evolution in plants. It demonstrates the range of variation in evolutionary rates and selective mechanisms, and provides further context to connect these patterns to potential explanatory factors in plant diversity such as the age of sex chromosomes and the developmental trajectories of male and female flowers.

      Weaknesses:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Reviewer #3 (Public Review):

      The potential for sexual selection and the extent of sexual dimorphism in gene expression have been studied in great detail in animals, but hardly examined in plants so far. In this context, the study by Zhao, Zhou et al. represents a welcome addition to the literature.

      Relative to the previous studies in Angiosperms, the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers). Some aspects of the presentation have been improved in this new version of the manuscript.

      Specifically:

      • the link between sex-biased and tissue-biased genes is now slightly clearer,

      • the limitation related to the de novo assembled transcriptome is now formally acknowledged,

      • the interpretation of functional categories of the genes identified is more precise,

      • the legends of supplementary figures have been improved - a large number of typos have been fixed.

      in response to this first round of reviews. As I detail below, many of the relevant and constructive suggestions by the previous reviewers were not taken into account in this revision.

      For instance:

      • Reviewer 2 made precise suggestions for trying to take into account the potential confounding factor of sex-chromosomes. This suggestion was not followed.

      For the question of reviewer 2:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Empirically, the analyses could be expanded by an attempt to distinguish between genes on the autosomes and the sex chromosomes. Genotypic patterns can be used to provisionally assign transcripts to XY or XX-like behavior when all males are heterozygous and all females are homozygous (fixed X-Y SNPs) and when all females are heterozygous and males are homozygous (lost or silenced Y genes). Comparing such genes to autosomal genes with sex-biased expression would sharpen the results because there are different expectations for the efficacy of selection on sex chromosomes. See this paper (Hough et al. 2014; https://www.pnas.org/doi/abs/10.1073/pnas.1319227111), which should be cited and does in fact identify faster substitution rates in Y-linked genes.

      Authors’ response: We have cited Hough et al. (2014) and Sandler et al. (2018) in the revised manuscript. We agree that the presence of sex chromosomes is potentially a confounding factor. By adopting the methods of Hough et al. (2014) and Sandler et al. (2018), we tried to distinguish transcripts on sex chromosomes from those on autosomes. Of a total of 2,378 unbiased genes, 36 were putatively sex-chromosomal: 20 were exclusively heterozygous in males and exclusively homozygous in females, while the other 16 showed the opposite genotypic pattern. Of the 343 male-biased genes, only three exhibited a potentially sex-linked pattern. Of the 1,145 female-biased genes, we identified 19 that might be located on the sex chromosomes: five were exclusively heterozygous in males and exclusively homozygous in females, while the other 14 showed the reverse pattern. Thus, sex-linked genes may contribute relatively little to the rapid evolution of male-biased genes. An alternative explanation is that these results are unreliable because of the small sample sizes. We therefore did not describe them in the Results section. We will investigate this issue when whole genome sequences and population datasets become available in the near future.
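
The genotype-based screen suggested by the reviewer can be illustrated with a minimal sketch. It is not the pipeline of Hough et al. (2014) or Sandler et al. (2018); the function name, the category labels, and the genotype representation (a pair of alleles per individual) are assumptions made for illustration only.

```python
def classify_snp(male_genotypes, female_genotypes):
    """Provisionally classify one SNP from per-individual genotypes.

    Each genotype is a pair of alleles, e.g. ("A", "G").
    Labels are hypothetical names for the two patterns in the review.
    """
    if not male_genotypes or not female_genotypes:
        return "unassigned"  # no call without data from both sexes

    def is_het(genotype):
        return genotype[0] != genotype[1]

    males_het = [is_het(g) for g in male_genotypes]
    females_het = [is_het(g) for g in female_genotypes]

    # All males heterozygous, all females homozygous: fixed X-Y difference
    if all(males_het) and not any(females_het):
        return "XY-like"
    # All females heterozygous, all males homozygous: lost or silenced Y copy
    if all(females_het) and not any(males_het):
        return "lost-Y-like"
    return "unassigned"
```

With only a few individuals per sex, autosomal SNPs can match either pattern by chance, which is consistent with the authors' caution about small sample sizes.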

      • Reviewer 1 & 3 indicated that results were mentioned in the discussion section without having been described before. This was not fixed in this new version.

      For the question of reviewer 1:

      2) Paragraph (407-416) describes the analysis of duplicated genes under relaxed selection but there is no mention of this in the results.

      Authors’ response: Following this suggestion, we have added a sentence to the Results section, “We also found that most of them were members of different gene families generated by gene duplication (Table S13)”, on lines 310-311 of the revised manuscript (Rapid_evolution_of_malebiased_genes_Trichosanthes_pilosa_Tracked_change_2023_11_06.docx).

      For the question of reviewer 1:

      38- line 417-424. The discussion should not contain new results.

      Authors’ response: Thank you for pointing this out. In the Results section, we have added a few sentences as follows: “Similarly, given that dN/dS values of sex-biased genes were higher due to codon usage bias, lower dS rates would be expected in sex-biased genes relative to unbiased genes (Ellegren & Parsch, 2007; Parvathy et al., 2022). However, in our results, the median of dS values in male-biased genes were much higher than those in female-biased and unbiased genes in the results of ‘free-ratio’ (Fig. S4A, female-biased versus male-biased genes, P = 6.444e-12 and male-biased versus unbiased genes, P = 4.564e-13) and ‘two-ratio’ branch model (Fig. S4B, female-biased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e-08, respectively).” on lines 323-331, and consequently removed the following sentences, “female-biased vs male-biased genes, P = 6.444e-12 and male-biased vs unbiased genes, P = 4.564e-13” and “female-biased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e-08, respectively”, from the Discussion section.

      • Reviewer 1 asked for a comparison between the number of de novo assembled unigenes in this transcriptome and the number of genes in other Cucurbitaceae species. I could not see this comparison reported.

      Authors’ response: In the first revision, we described only percentages. We have now added the number of genes. We modified this part as follows: “The majority of unigenes were annotated by homologs in species of Cucurbitaceae (61.6%, 36,375), including Momordica charantia (16.3%, 9,625), Cucumis melo (11.9%, 7,027), Cucurbita pepo (11.9%, 7,027), Cucurbita moschata (11.5%, 6,791), Cucurbita maxima (10.1%, 5,964) and other species (38.4%, 22,676) (Fig. S1C).”.

      • Reviewer 1 pointed out that permutation tests were more appropriate, but no change was made to the manuscript.

      Authors’ response: Thank you for your suggestion. In the first revision, we responded to this issue only indirectly. The Wilcoxon rank sum test is more commonly used for comparisons between sex-biased and unbiased genes in many papers. Additionally, we tested the datasets using permutation t-tests, and the results are consistent with those of the Wilcoxon rank sum test. For example, we found that only in floral buds are there significant differences in ω values in the results of the ‘free-ratio’ model (female-biased versus male-biased genes, P = 0.04282 and male-biased versus unbiased genes, P = 0.01114) and the ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.01992 and male-biased versus unbiased genes, P = 0.02127, respectively). We also described these results in the Results section accordingly (lines 278-284).
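
For readers unfamiliar with permutation tests, the idea can be sketched as follows. This is a simplified two-sided test on the difference in group means, not necessarily the exact permutation t-test the authors ran; the function name, the default number of permutations, and the seed are illustrative assumptions.

```python
import random


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    a and b are sequences of values (for example, omega estimates for
    two gene categories). Group labels are shuffled n_perm times.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # small tolerance guards against floating-point ties
        if diff >= observed - 1e-12:
            hits += 1
    # add-one correction keeps the estimated p-value above zero
    return (hits + 1) / (n_perm + 1)
```

Unlike the Wilcoxon rank sum test, which works on ranks, this version compares raw means, so a few extreme ω values can influence the result more strongly.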

      • Reviewer 3 pointed out the small sample size (both for the RNA-seq and the phylogenetic analysis), but again this limitation is not acknowledged very clearly.

      Authors’ response: We acknowledge that our sample size is relatively small. In the revised version, we have added the following sentence to the Discussion section: “Additionally, our sample size is relatively small, and may provide low power to detect differential expression.”

      • Reviewers 1 & 3 pointed out that Fig 3 was hard to understand and asked for clarifications that I did not see in the text, and the figure is unchanged.

      Authors’ response: Thank you for your suggestions. We have revised the manuscript to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Fig. 3.

      • Reviewer 3 suggested to combine all genes with sex-bias expression when evaluating the evolutionary rate, in addition to the analyses already done. This suggestion was not followed.

      For the question of reviewer 3:line 196 and following: In these analyses, I could not understand the rationale for keeping buds vs mature flowers as separate analyses throughout. Why not combine both and use the full set of genes showing sex-bias in any tissue? This would increase the power and make the presentation of the results a lot more straightforward.

      Authors’ response: Thank you for your suggestions. In the first revision, we tried to respond to these issues. First, we observed strong sexual dimorphism in floral buds, such as racemose versus solitary, early-flowering versus late-flowering. Second, as you pointed out earlier, “the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers)”, and we totally agree with you on this point. Third, following your suggestion, we combined all genes with sex-biased expression to evaluate the evolutionary rates. We found significant differences (please see the figure below) in ω values in the results of the ‘free-ratio’ model (female-biased versus male-biased genes, P = 0.005622 and male-biased versus unbiased genes, P = 0.001961) and the ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.008546 and male-biased versus unbiased genes, P = 0.009831, respectively) using the Wilcoxon rank sum test. However, the significance is lower than in the previous results for floral buds because sex-biased genes from mature flowers were included, especially for the ‘free-ratio’ model. Additionally, we also tested all combined genes with sex-biased expression using the permutation t-test. Unfortunately, there are no significant differences in ω values except for male-biased versus unbiased genes in the results of the ‘free-ratio’ model (P = 0.03034) and the ‘two-ratio’ model (P = 0.0376), respectively. To a certain extent, combining all genes with sex-biased expression may mask the signals of rapid evolution of sex-biased genes in floral buds. Therefore, these results are not described in our manuscript. In the near future, we would like to investigate further using more developmental stages of flowers and new technologies (e.g., single-cell methods; see Murat et al., 2023) in each sex to consolidate the conclusion, and we hope to find more meaningful results.

      Author response image 1.

      • Reviewer 3 pointed out that hand-picking specific categories of genes was not statistically valid, and in fact not necessary in the present context. This was not changed.

      For the question of reviewer3: removing genes on a post-hoc basis seems statistically suspicious to me. I don't think your analysis has enough power to hand-pick specific categories of genes, and it is not clear what this brings here. I suggest simply removing these analyses and paragraphs.

      Authors’ response: Thank you for your suggestions. We have changed them accordingly. We removed a part of the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding male-biased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected male-biased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on line 290 to 300.

      However, we kept the following paragraph, “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare, in the Discussion section, the proportions of female-biased, unbiased and male-biased genes that have undergone positive selection and relaxed selection in floral buds.

      • Reviewer 1 asked for all data to be public, but I could not find in the manuscript where the link to the data on ResearchGate was provided.

      Authors’ response: We have added a link in the Data Availability section.

      • Reviewers 1 & 3 pointed out that since only two tissues were compared, the claims on pleiotropy should have been toned down, but no change was made to the text.

      Authors’ response: Thank you for your suggestions. We revised “due to low pleiotropic constraints” to “due to low evolutionary constraints” and revised “low pleiotropy” to “low constraints”.

      • Reviewer 1 asked for a clarification on which genes are plotted on the heatmap of Fig3C and an explanation of the color scale. No change was made.

      Authors’ response: Sorry for the confusion. Reviewer 1 actually asked, “Fig. 2C, which genes are plotted on the heatmap and what is the color scale corresponding to?” We addressed this in the previous revision (see Fig. 2, Sex-biased gene expression for floral buds and flowers at anthesis in males and females of Trichosanthes pilosa). Sex-biased genes (the union of sex-biased genes in F1, M1, F2 and M2) are plotted on the heatmap. The color gradient (from red to green) represents gene expression from high to low.

      • Reviewer 1 asked for panel B in Fig S5 and S6 to be removed. They are still there. They asked for abbreviations to be explained in the legend of Fig S8. This was not done. They asked for details about column headers. Such details were not added. They asked for more recent references on line 53-56: this was not done.

      Authors’ response: We have removed panel B in Fig. S5 and S6. We explained abbreviations in text and Fig. S8. We added more details about the column headers in Supplementary Table S4, S5, S6, S7, S8, S9 and S10. We also added more recent references on line 53-56.

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      Authors’ response: Thank you for your suggestions. We have revised/fixed these issues following your concerns and suggestions.

      Line 46-48 would be clearer as « Sexual dimorphism is the condition where sexes of the same species exhibit different morphological, ecological and physiological traits in gonochoristic animals and dioecious plants, despite male and female individuals sharing the same genome except for sex chromosomes or sex-determining loci »

      Authors’ response: Thanks. We have revised it accordingly.

      Line 50: replace «in both » by «between the two »

      Authors’ response: We have revised it.

      Line 51: « genes exclusively » -> « genes expressed exclusively »

      Authors’ response: We have revised it.

      Line 58: « in many animals » -> « in several animal species »

      Authors’ response: We have revised it to “in some animal species”.

      Line 58: « to which » -> « of this bias »

      Authors’ response: We have revised it.

      Line 64: « Most dioecious plants possess homomorphic sex-chromosomes that are roughly similar in size when viewed by light microscopy. » : a reference is missing

      Authors’ response: We have added the reference.

      Line 67: remove « that »

      Authors’ response: We have revised it.

      line 96: change to: « only the five above-mentioned studies »

      Authors’ response: We have revised it.

      Line 97: remove « the »

      Authors’ response: We have revised it.

      Line 111: « Drosophia » -> Drosophila

      Authors’ response: We have revised it.

      Line 114: exhibiting -> « exhibited »

      Authors’ response: We have revised it.

      Line 115: suggest -> « suggesting »

      Authors’ response: We have revised it.

      Line 117: « studies in plants have rarely reported elevated rates of sex-biased genes » : is it « rarely » or « never » ?

      Authors’ response: We have revised to “never”.

      Line 143: « It’s » -> « Its »

      Authors’ response: We have revised it.

      Line 143-146: say whether the male parts (e.g. anthers) are still present in females flowers, and the female parts (pistil+ ovaries) in the male flowers, or whether these respective organs are fully aborted.

      Authors’ response: We have added the following sentence, “The male parts (e.g., anthers) of female flowers, and the female parts (e.g., pistil and ovaries) of male flowers are fully aborted” in line 148-150 of the Introduction section.

      Line 158: this is now clearer, but please specify whether you are talking about 12 floral buds in total, or 12 per individual (i.e. 72 buds in total).

      Authors’ response: We have revised it to “Using whole transcriptome shotgun sequencing, we sequenced floral buds and flowers at anthesis from female and male of dioecious T. pilosa. We set up three biological replicates from three female and three male plants, including 12 samples in total (six floral buds and six flowers at anthesis)”.

      Line 194-198: These sentences are unclear and hard to link to the figure. Consider changing for « In male plants, the number of tissue-biased genes in flowers at anthesis (M2TGs: n = 2795) was higher than that in floral buds (M1TGs: n = 1755, Fig. 3A and 3B). » Figure 3 is also very hard to read. Adding a label on the side to indicate that panels A and B correspond to male-biased genes and C and D to female-biased genes could be useful.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

      Line 208: explain the approach: e.g. « We then compared rates of protein evolution among male-biased, female-biased and unbiased genes. To do this, we sequenced floral bud transcriptomes from the closely related T. anguina, as well as two more distant outgroups, T. kirilowii and Luffa cylindrica. T. kirilowii is a dioecious species like T. pilosa, and the other two are monoecious. We identified one-to-one orthologous groups (OGs) for 1,145 female-biased, 343 male-biased, and 2,378 unbiased genes. »

      Authors’ response: We have revised this paragraph to the following, “We compared rates of protein evolution among male-biased, female-biased and unbiased genes in four species with phylogenetic relationships (((T. anguina, T. pilosa), T. kirilowii), Luffa cylindrica), including dioecious T. pilosa, dioecious T. kirilowii, monoecious T. anguina in Trichosanthes, together with monoecious Luffa cylindrica. To do this, we sequenced transcriptomes of T. pilosa. We also collected transcriptomes of T. kirilowii, as well as genomes of T. anguina and Luffa cylindrica.”

      Line 220: « the same ω value was in all branches » -> « all branches are constrained to have the same ω value ».

      Authors’ response: We have revised it.

      Line 221: « results of the 'two-ratio' branch model ... »

      Authors’ response: We have revised it.

      Line 235: add a few words to explain why the effect size is bigger than for buds, but still is not significant: e.g. «possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis »

      Authors’ response: We have revised this to “However, there is no statistically significant difference in the distribution of ω values using Wilcoxon rank sum tests for female-biased versus male-biased genes (P = 0.0556), female-biased versus unbiased genes (P = 0.0796), and male-biased versus unbiased genes (P = 0.3296) possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis.” in line 260-261.

      Line 255: explain in plain English what the « A model » is. This was already requested in the previous version.

      Authors’ response: We have revised “A model” to “classical branch-site model A”.

      Line 258: explain in plain English what the « foreground 2b ω value » corresponds to

      Authors’ response: We have revised “foreground 2b ω value” to “foreground ω > 1”. Additionally, we added the sentence “The classical branch-site model assumes four site classes (0, 1, 2a, 2b), with different ω values for the foreground and background branches. In site classes 2a and 2b, the foreground branch undergoes positive selection when there is ω > 1.” in line 624-627.

      Line 259: explain how these different approaches complement each other rather than being redundant. This was also already requested in the previous version.

      Authors’ response: Sorry. We have now revised it as follows, “As a complementary approach, we utilized the aBSREL and BUSTED methods that are implemented in HyPhy v.2.5 software, which avoids false positive results by classical branch-site models due to the presence of rate variation in background branches, and detected significant evidence of positive selection.” in line 292-295.

      Line 270: remove « dramatically », and also remove « or eliminated at both gene-wide and genome-wide levels », as well as « relative to positive selection »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 290-309: remove this section - this was already pointed out in the previous reviews as a « ad hoc » procedure, and this point has already been made clear with the RELAX analysis.

      Authors’ response: Thank you for your suggestions. We revised this section accordingly. We removed the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding male-biased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected male-biased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on line 334-344.

      However, we kept the other parts “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare, in the Discussion section, the proportions of female-biased, unbiased and male-biased genes that have undergone positive selection and relaxed selection in floral buds.

      Line 348: Here you talk about « Numerous studies », but then only report three studies. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised it to “Several studies”.

      Line 352: Cut the sentence: « In contrast, the wind-pollinated dioecious plant Populus balsamifera ... »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 357: « In contrast to the above studies... »: If I understand correctly, this is not in contrast to the observation in Populus balsamifera. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised to “Similar to the above study of Populus balsamifera.”.

      Line 420: « our results » -> « we »; « that underwent » -> « undergoing »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Figure 3 is very hard to read and poorly labeled (see my comments on line 194 above). It is also hard to link to the text, since the numbers reported in the text are actually not present in the figure unless readers make some calculations themselves. This should be improved. Also, the use of acronyms (e.g. M1BG, F2TG etc.) contributes to making the text very difficult to read. The acronyms should at least be explained very clearly in the text when they are used.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and give the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      Reviewer #2 (Public Review): 

      Regarding reviewer #2's public review, we update our answers here to reflect the new analyses and the modifications made to the manuscript.

      This manuscript is missing a direct phenotypic comparison of control cells to complement that of cells expressing RhoGEF2-DHPH at "low levels" (the cells that would respond to optogenetic stimulation by retracting); and cells expressing RhoGEF2-DHPH at "high levels" (the cells that would respond to optogenetic stimulation by protruding). In other words, the authors should examine cell area, the distribution of actin and myosin, etc in all three groups of cells (akin to the time zero data from figures 3 and 5, with a negative control). For example, does the basal expression meaningfully affect the PRG low-expressing cells before activation e.g. ectopic stress fibers? This need not be an optogenetic experiment, the authors could express RhoGEF2-DHPH without SspB (as in Fig 4G).

      Updated answer: We thank reviewer #2 for this suggestion. PRG-DHPH overexpression is known to affect the phenotype of the cell as shown in Valon et al., 2017. In our experiments, we could not identify any evidence of a particular phenotype before optogenetic activation apart from the area and spontaneous membrane speed that were already reported in our manuscript (Fig 2E and SuppFig 2). Regarding the distribution of actin and myosin, we did not observe an obvious pattern that will be predictive of the protruding/retracting phenotype. Trying to be more quantitative, we have classified (by eye, without knowing the expression level of PRG nor the future phenotype) the presence of stress fibers, the amount of cortical actin, the strength of focal adhesions, and the circularity of cells. As shown below, when these classes are binned by levels of expression of PRG (two levels below the threshold and two above) there is no clear determinant. Thus, we concluded that the main driver of the phenotype was the PRG basal expression rather than any particularity of the actin cytoskeleton/cell shape.

      Author response image 1.

      Author response image 2.

      Relatedly, the authors seem to assume ("recruitment of the same DH-PH domain of PRG at the membrane, in the same cell line, which means in the same biochemical environment." supplement) that the only difference between the high and low expressors is the level of expression. Given the chronic overexpression and the fact that the capacity for this phenotypic shift is not recruitment-dependent, this is not necessarily a safe assumption. The expression of this GEF could well induce e.g. gene expression changes.

      Updated answer: We agree with reviewer #2 that there could be changes in gene expression. In the next point of this supplementary note, we had specified it, by saying « that overexpression has an influence on cell state, defined as protein basal activity or concentration before activation. »  We are sorry if it was not clear, and we changed this sentence in the revised manuscript (in red in the supp note). 

      One of the interests of the model is that it does not require any change in absolute concentrations, beside the GEF. The model is thought to be minimal and fits well and explains the data with very few parameters. We do not show that there is no change in concentration, but we show that it is not required to invoke it. We revised a sentence in the new version of the manuscript to include this point.

      Additional answer: During the revision process, we looked for an experimental demonstration that the phenotypic switch is independent of any change in global gene expression pattern caused by the chronic overexpression of PRG. Our idea was to start from a condition of high PRG overexpression, in which cells protrude upon optogenetic activation, and then acutely deplete PRG to see if cells were then retracting. To deplete PRG on a timescale that prevents any change in gene expression, we considered the recently developed CATCHFIRE (PMID: 37640938) chemical dimerizer. We designed an experiment in which the PRG DH-PH domain was expressed in fusion with a FIRE-tag, together with the FIRE-mate fused to TOM20 and the optoPRG tool. Upon incubation with the MATCH small molecule, we should be able to recruit the overexpressed PRG to the mitochondria within minutes, thereby preventing it from forming a complex with active RhoA in the vicinity of the plasma membrane. Unfortunately, despite numerous trials we never achieved the required conditions: we could not obtain cells with high enough expression of PRG-FIRE-tag (for a protrusive response) and low enough expression of optoPRG (for retraction upon PRG-FIRE-tag depletion). We still think this would be a nice experiment to perform, but it will require the establishment of a stable cell line with finely tuned expression levels of the CATCHFIRE system, which goes beyond the timeline of our present work.

      Concerning the overall model summarizing the authors' observations, they "hypothesized that the activity of RhoA was in competition with the activity of Cdc42"; "At low concentration of the GEF, both RhoA and Cdc42 are activated by optogenetic recruitment of optoPRG, but RhoA takes over. At high GEF concentration, recruitment of optoPRG lead to both activation of Cdc42 and inhibition of already present activated RhoA, which pushes the balance towards Cdc42."

      These descriptions are not precise. What is the nature of the competition between RhoA and Cdc42? Is this competition for activation by the GEFs? Is it a competition between the phenotypic output resulting from the effectors of the GEFs? Is it competition from the optogenetic probe and Rho effectors and the Rho biosensors? In all likelihood, all of these effects are involved, but the authors should more precisely explain the underlying nature of this phenotypic switch. Some of these points are clarified in the supplement, but should also be explicit in the main text. 

      Updated answer: We consider the competition between RhoA and Cdc42 as a competition between retraction due to the protein network triggered by RhoA (through ROCK-Myosin and mDia-bundled actin) and the protrusion triggered by Cdc42 (through PAK-Rac-ARP2/3-branched Actin). We made this point explicit in the main text.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):  

      Major 

      - why this is only possible for such few cells. Can the authors comment on this in the discussion? Does the model provide any hints? 

      As said in our answer to the public comment of reviewer #1, we think that the low number of cells being able to switch can be explained by two different reasons: 

      (1) First, we were looking for clear inversions of the phenotype, where we could see clear ruffles in the case of the protrusion, and clear retractions in the other case. Thus, we discarded cells that would show in-between phenotypes, because we had no quantitative parameter to compare how protrusive or retractile they were. This reduced the number of switching cells.

      (2) Second, we were limited by the dynamics of the optogenetic dimer used here. Indeed, the control of the frequency was limited by the unbinding dynamics of the optogenetic dimer. These recruitment dynamics (~20s) are comparable to the deactivation dynamics of RhoA and Cdc42. Thus, the differences in frequency are smoothed out and we could not vary the frequency enough to increase the number of switches. Thanks to the model, we can predict that increasing the unbinding rate of the optogenetic tool (shorter dimer lifetime) should allow us to increase the number of switching cells.

      We have added a sentence in the discussion to make this second point explicit.

      - I would encourage the authors to discuss this molecular signaling switch in the context of general design principles of switches. How generalizable is this network/mechanism? Is it exclusive to activating signaling proteins or would it work with inhibiting mechanisms? Is the competition for the same binding site between activators and effectors a common mechanism in other switches? 

      The most common design principle for molecular switches is the bistable switch that relies on a nonlinear activation (for example through cooperativity) with a linear deactivation. Such a design allows the switch between low and high levels. In our case, there is no need for a non-linearity since the core mechanism is a competition between the activator and the effectors for the same binding site on active RhoA. Thus, the design principle would be closer to the notion of a minimal “paradoxical component” (PMID: 23352242) that both activates and limits signal propagation, which in our case can be thought of as a self-limiting mechanism to prevent uncontrolled RhoA activation by the positive feedback. Yet, as we show in our work, this core mechanism is not enough for the phenotypic switch to happen since the dual activation of RhoA and Cdc42 is ultimately required for the protrusion phenotype to take over the retracting one. Given the particularity of the switch we observed here, we do not feel comfortable speculating on any general design principles in the main text, but we thank reviewer #1 for his/her suggestion.

      - Supplementary figures - there is a discrepancy between the figures called in the text and the supplementary files, which only include SF1-4. 

      We apologize for this error and we made the correction. 

      - In the text, the authors use Supp Figure 7 to show that the phenotype could not be switched by varying the fold increase of recruitment through changing the intensity/duration of the light pulse. Aside from providing the figure, could you give an explanation or speculation of why? Does the model give any prediction as to why this could be difficult to achieve experimentally (is the range of experimentally feasible fold change of 1.1-3 too small)? Also, could you clarify why the range is different than the 3 to 10-fold mentioned at the beginning of the results section? 

      We thank the reviewer for this question, and this difference between frequency and intensity can be indeed understood in a simple manner through the model. 

      All the reactions in our model were modeled as linear reactions. Thus, at any timepoint, changing the intensity of the pulse will only change proportionally the amount of the different components (amount of active RhoA, amount of sequestered RhoA, and amount of active Cdc42). This explains why we cannot change the balance between RhoA activity and Cdc42 activity only through the pulse strength. We observed the same experimentally: when we changed the intensity of the pulses, the phenotype would be smaller/stronger, but would never switch, supporting our hypothesis on the linearity of all biochemical reactions. 

      On the contrary, changing the frequency has an effect, for a simple reason: the dynamics of RhoA and Cdc42 activation are not the same as the dynamics of inhibition of RhoA by the PH domain (see Figure 4). The inhibition of RhoA by the PH is almost instantaneous while the activation of RhoGTPases has a delay (set by the deactivation parameter k_2). Intuitively, increasing the frequency will lead to sustained inhibition of RhoA, promoting the protrusion phenotype. Decreasing the frequency – with a stronger pulse to keep the same amount of recruited PRG – restricts this inhibition of RhoA to the first seconds following the activation. The delayed activation of RhoA will then take over. 
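
      The linearity argument for pulse intensity given above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the model from the manuscript: it uses two independent first-order (linear) equations for active RhoA and active Cdc42 driven by the same pulsed input, it omits the sequestration of RhoA by the PH domain, and every rate constant and pulse setting below is a hypothetical placeholder.

```python
import math


def simulate(intensity, k_on_rho=0.08, k_off_rho=0.05,
             k_on_cdc=0.03, k_off_cdc=0.02,
             dt=0.1, t_end=300.0, period=30.0, pulse_len=2.0):
    """Euler integration of two linear equations driven by a pulsed input.

    d(rho)/dt = k_on_rho * u(t) - k_off_rho * rho
    d(cdc)/dt = k_on_cdc * u(t) - k_off_cdc * cdc
    where u(t) equals `intensity` during each pulse and 0 otherwise.
    All parameter values are illustrative assumptions.
    """
    steps_per_period = round(period / dt)
    steps_per_pulse = round(pulse_len / dt)
    rho = 0.0
    cdc = 0.0
    for i in range(round(t_end / dt)):
        u = intensity if (i % steps_per_period) < steps_per_pulse else 0.0
        rho += dt * (k_on_rho * u - k_off_rho * rho)
        cdc += dt * (k_on_cdc * u - k_off_cdc * cdc)
    return rho, cdc


rho_1, cdc_1 = simulate(1.0)
rho_3, cdc_3 = simulate(3.0)

# Both outputs scale with the input intensity ...
print(rho_3 / rho_1, cdc_3 / cdc_1)
# ... so the balance between them does not depend on it.
print(rho_1 / cdc_1, rho_3 / cdc_3)
```

      In any linear system of this kind, multiplying the input by a constant multiplies every output by the same constant at every time point, so the ratio between the two activities cannot be shifted by pulse strength alone.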

      We added two sentences in the manuscript to explain in greater detail the difference between intensity and frequency.

      Regarding the difference between the 1.3-3 fold and the 3 to 10 fold, the explanation is the following: the 3 to 10 fold referred to the cumulative amount of proteins being recruited after multiple activations (steady state amount reached after 5 minutes with one activation every 30s); while the 1.3-3 fold is what can be obtained after only one single pulse of activation.  

      - The transient expression achieves a large range of concentration levels which is a strength in this case. To solve the experimental difficulties associated with this, i.e. finding transfected cells at low cell density, the authors developed a software solution (Cell finder). Since this approach will be of interest for a wide range of applications, I think it would deserve a mention in the discussion part. 

      We thank the reviewer for his/her interest in this small software solution.

      We expanded the description of the tool in the Methods section. The Cell finder is also available with comments on GitHub (https://github.com/jdeseze/cellfinder) and usable for anyone using Metamorph or Micromanager imaging software. 

      Minor 

      - Can the authors describe what they mean with "cell state"? It is used multiple times in the manuscript and can be interpreted as various things. 

      We now explain what we mean by ‘cell state’ in the main text :

      “protein basal activities and/or concentrations - which we called the cell state”

      - “(from 0% to 45%, Figure 2D)", maybe add here: "compare also with Fig. 2A". 

      We completed the sentence as suggested, which clarifies the data for the readers.

      - The sentence "Given that the phenotype switch appeared to be controlled by the amount of overexpressed optoPRG, we hypothesized that the corresponding leakiness of activity could influence the cell state prior to any activation." might be hard to understand for readers unfamiliar with optogenetic systems. I suggest adding a short sentence explaining dark-state activity/leakiness before putting the hypothesis forward. 

      We changed this whole beginning of the paragraph to clarify.

      - Figure 2E and SF2A. I would suggest swapping these two panels as the quantification of the membrane displacement before activation seems more relevant in this context. 

      We thank reviewer #1 for this suggestion and we agree with it (we swapped the two panels)

      - Fig. 2B is missing the white frames in the mixed panels. 

      We are sorry for this mistake, we changed it in the new version.  

      - In the text describing the experiment of Fig. 4G, it would again be helpful to define what the authors mean by cell state, or to state the expected outcome for both hypotheses before revealing the result.

      We clarified above what we mean by cell state, namely the basal protein activities and/or concentrations prior to optogenetic activation. We added the expected outcome as follows: 

      To discriminate between these two hypotheses, we overexpressed the DH-PH domain alone in another fluorescent channel (iRFP) and recruited the mutated PH at the membrane. “If the binding to RhoA-GTP was only required to change the cell state, we would expect the same statistics than in Figure 2D, with a majority of protruding cells due to DH-PH overexpression. On the contrary, we observed a large majority of retracting phenotype even in highly expressing cells (Figure 4G), showing that the PH binding to RhoA-GTP during recruitment is a key component of the protruding phenotype.”

      - Figure 4H,I: "of cells that overexpress PRG, where we only recruit the PH domain" doesn't match with the figure caption. Are these two constructs in the same cell? If not please clarify the main text. 

      We agree that it was not clear. Both constructs are in the same cell, and we changed the figure caption accordingly.  

      - "since RhoA dominates Cdc42" is this concluded from experiments (if yes, please refer to the figure) or is this known from the literature (if yes, please cite). 

      The assumption that RhoA dominates Cdc42 comes from the fact that we see retraction at low PRG concentration. We assumed that RhoA is responsible for the retraction phenotype. Our assumption is based on the literature (Burridge 2004 as an example of a review, confirmed by many experiments, such as the direct recruitment of RhoA to the membrane, see Berlew 2021) and is supported by our observations of immediate increase of RhoA activity at low PRG. We modified the text to clarify it is an assumption.

      - Fig. 6G, left: is not intuitive, why are the number of molecules different to start with? 

      The number of molecules is different because they represent the active molecules: increasing the amount of PRG increases the amount of active RhoA and active Cdc42. We updated the figure to clarify this point.

      - Fig. 6G, right: the y-axis label says "phenotype", maybe change it to "activity" or add a second y-axis on the right with "phenotype"? 

      We updated the figure following reviewer #1's suggestion.

      - Discussion: "or a retraction in the same region" sounds like in the same cell. Perhaps rephrase to state retraction in a similar region? 

      Sorry for the confusion, we changed it to be really clear: “a protrusion in the activation region when highly expressed, or a retraction in the activation region when expressed at low concentrations.”

      Typos: 

      - "between 3 and 10 fold" without s. 

      - Fig. 1H, y-axis label. 

      - "whose spectrum overlaps" with s. 

      - "it first decays, and then rises" with s. 

      - Fig 4B and Fig 6B. Is the time in sec or min? (Maybe double-check all figures). 

      - "This result suggests that one could switch the phenotype in a single cell by selecting it for an intermediate expression level of the optoPRG.". 

      - "GEF-H1 PH domain has almost the same inhibition ability as PRG PH domain". 

      We corrected all these mistakes and thank the reviewer for his careful reading of the manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Likewise, the model assumes that at high PRG GEF expression, the "reaction is happening far from saturation ..." and that "GTPases activated with strong stimuli -giving rise to strong phenotypic changes- lead to only 5% of the proteins in a GTP-state, both for RhoA and Cdc42". Given the high levels of expression (the absolute value of which is not known) this assumption is not necessarily safe to assume. The shift to Cdc42 could indeed result from the quantitative conversion of RhoA into its active state. 

      We agree with the reviewer that the hypothesis that RhoA is fully converted into its active state cannot be completely ruled out. However, we think that the two following points can justify our choice.

      - First, we see that even in the protruding phenotype, RhoA activity is increasing upon optoPRG recruitment (Figure 3). This means that RhoA is not completely turned into its active GTP-loaded state. The biosensor intensity rises by a factor of 1.5 after 5 minutes (and continues to increase, even if not shown here). Admittedly, this could be explained by the relocation of RhoA to the place of activation, but it still shows that cells with high PRG expression are not completely saturated in RhoA-GTP. 

      - We agree that linearity (no saturation) is still a hypothesis and very difficult to rule out, because it is not only a question of absolute concentrations of GEFs and RhoA, but also a question of their reaction kinetics, which are unknown parameters in vivo. Yet, adding a saturation parameter would mean adding 3 unknown parameters (the absolute concentration of RhoA, as well as two reaction constants). The fact that they are not needed to fit the complex curves of RhoA, as we do with only one parameter, tends to show that the minimal ingredients representing the interaction are captured here.  

      The observed "inhibition of RhoA by the PH domain of the GEF at high concentrations" could result from the ability of the probe to, upon membrane recruitment, bind to active RhoA (via its PH domain) thereby outcompeting the RhoA biosensor (Figure 4A-C). This reaction is explicitly stated in the supplemental materials ("PH domain binding to RhoA-GTP is required for protruding phenotype but not sufficient, and it is acting as an inhibitor of RhoA activity."), but should be more explicit in the main text. Indeed, even when PRG DHPH is expressed at high concentrations, it does activate RhoA upon recruitment (figure 3GH). Not only might overexpression of this active RhoA-binding probe inhibit the cortical recruitment of the RhoA biosensor, but it may also inhibit the ability of active RhoA to activate its downstream effectors, such as ROCK, which could explain the decrease in myosin accumulation (figure 3D-F). It is not clear that there is a way to clearly rule this out, but it may impact the interpretation. 

      This hypothesis is actually what we claim in the manuscript. We think that the inhibition of RhoA by the PH domain is explained by its direct binding. We may have missed what Reviewer #2 wanted to say, but we think that we state it explicitly in the main text :

      “Knowing that the PH domain of PRG triggers a positive feedback loop thanks to its binding to active RhoA 18, we hypothesized that this binding could sequester active RhoA at high optoPRG levels, thus being responsible for its inhibition.”

      And also in the Discussion:

      “However, this feedback loop can turn into a negative one for high levels of GEF: the direct interaction between the PH domain and RhoA-GTP prevents RhoA-GTP binding to effectors through a competition for the same binding site.”

      We may not have been clear, but we think that this is what is happening: the PH domain prevents the binding to effectors and decreases RhoA activity (as was shown in Chen et al. 2010).  

      The X-axis in Figure 4C time is in seconds not minutes. The Y-axis in Figure 4H is unlabeled. 

      We are sorry for the mistake in Figure 4C. We changed the Y-axis in Figure 4H.  

      Although this publication cites some of the relevant prior literature, it fails to cite some particularly relevant works. For example, the authors state, "The LARG DH domain was already used with the iLid system" and refers to a 2018 paper (ref 19), whereas that domain was first used in 2016 (PMID 27298323). Indeed, the authors used the plasmid from this 2016 paper to build their construct. 

      We thank the reviewer for pointing out this error, we have corrected the citation and put the seminal one in the revised version.

      An analogous situation pertains to previous work that showed that an optogenetic probe containing the DH and PH domains in RhoGEF2 is somewhat toxic in vivo (table 6; PMID 33200987). Furthermore, it has previously been shown that mutation of the equivalent of F1044A and I1046E eliminates this toxicity (table 6; PMID 33200987) in vivo. This is particularly important because the Rho probe expressing RhoGEF2-DHPH is in widespread usage (76 citations in PubMed). The ability of this probe to activate Cdc42 may explain some of the phenotypic differences described resulting from the recruitment of RhoGEF2-DHPH and LARG-DH in a developmental context (PMID 29915285, 33200987). 

      We thank reviewer #2 for these comments, and added a small section in the discussion, for optogenetic users: 

      This underlines the attention that needs to be paid to the choice of specific GEF domains when using optogenetic tools. Tools using DH-PH domains of PRG have been widely used, both in mammalian cells and in Drosophila (with the orthologous gene RhoGEF2), and have been shown to be toxic in some contexts in vivo 28. Our study confirms the complex behavior of this domain which cannot be reduced to a simple RhoA activator.   

      Concerning the experiment shown in 4D, it would be informative to repeat this experiment in which a non-recruitable DH-PH domain of PRG is overexpressed at high levels and the DH domain of LARG is recruited. This would enable the authors to distinguish whether the protrusion response is entirely dependent on the cell state prior to activation or the combination of the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. 

      We thank the reviewer for his suggestion. Yet, we think that we have enough direct evidence that the protruding phenotype is due to both the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. First, we see a direct increase in Cdc42 activity following optoPRG recruitment (see Figure 6). This increase is sustained in the protruding phenotype and precedes Rac1 and RhoA activity, which shows that it is the first of these three GTPases to be activated. Moreover, we showed that inhibition of PAK by the very specific drug IPA3 completely abolishes only the protruding phenotype, which shows that PAK, a direct effector of Cdc42 and Rac1, is required for the protruding phenotype to happen. We also know that the cell state prior to activation defines the phenotype, thanks to the data presented in Figure 2. 

      We further showed in Figure 1 that LARG DH-PH domain was not able to promote protrusion. The proposed experiment would be interesting to confirm that LARG does not have the ability to activate another GTPase, even in a different cell state with overexpressed PRG. However, we are not sure it would bring any substantial findings to understand the mechanism we describe here, given the facts provided above.  

      Similarly, as PRG activates both Cdc42 and Rho at high levels, it would be important to determine the extent to which the acute Rho activation contributes to the observed phenotype (e.g. with Rho kinase inhibitor). 

      We agree with the reviewer that it would be interesting to know whether RhoA activation contributes to the observed phenotype, and we have tried such experiments. 

      For the Rho kinase inhibitor, we tried Y-27632 and could never prevent the protruding phenotype from happening. However, we could not completely abolish the retracting phenotype either (even when the effect on the cells was quite strong and visible), which could be due to other effectors compensating for this inhibition. As RhoA has many other effectors, this does not tell us that RhoA is not required for protrusion. 

      We also tried C3, which is a direct inhibitor of RhoA. However, it had too much impact on the basal state of the cells, making it impossible to recruit (cells were becoming round and clearly dying). As both the basal state and optogenetic activation require the activation of RhoA, it is hard to draw conclusions from experiments where no cell is responding. 

      The ability of PRG to activate Cdc42 in vivo is striking given the strong preference for RhoA over Cdc42 in vitro (2400X) (PMID 23255595). Is it possible that at these high expression levels, much of the RhoA in the cell is already activated, so that the sole effect that recruited PRG can induce is activation of Cdc42? This is related to the previous point pertaining to absolute expression levels.  

      As discussed before, we think that it is not only a question of absolute expression levels, but also of the affinities between the different partners. But Reviewer #2 is right, there is a competition between the activation of RhoA and Cdc42 by optoPRG, and activation of Cdc42 probably happens at higher concentration because of a smaller effective affinity.

      Still, we know that activation of Cdc42 by the PRG DH-PH domain is possible in vivo, as it was very clearly shown in Castillo-Kauil et al., 2020 (PMID 33023908). They show that this activation requires the linker between the DH and PH domains of PRG, as well as Gαs activation, which requires a change in PRG DH-PH conformation. This conformational switch does not happen in vitro, which might explain why the affinity for Cdc42 was found to be very low. 

      Minor points 

      In both the abstract and the introduction the authors state, "we show that a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, polarizing the cell in two opposite directions." However, the cells do not polarize in opposite directions, ie the cells that retract do not protrude in the direction opposite the retraction (or at least that is not shown). Rather a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, depending upon expression levels. 

      We thank the reviewer for this remark, and we agree that we had not shown any data supporting a change in polarization. We addressed this issue by now showing in Supplementary Figure 1 the change in area in both the activated and the non-activated regions. The data clearly show that when a protrusion is happening, the cell retracts in the non-activated region. On the other hand, when the cell retracts, a protrusion happens in the other part of the cell, while the total area stays approximately constant. 

      We added the following sentence to describe our new figure:

      Quantification of the changes in membrane area in both the activated and non-activated part of the cell (Supp Figure 1B-C) reveals that the whole cell is moving, polarizing in one direction or the other upon optogenetic activation.

      While the authors provide extensive quantitative data in this manuscript and quantify the relative differences in expression levels that result in the different phenotypes, it would be helpful to quantify the absolute levels of expression of these GEFs relative to e.g. an endogenously expressed GEF. 

      We agree with the reviewer's comment, and we also wanted to have an idea of the absolute level of expression of GEFs present in these cells to be able to relate fluorescence intensities to absolute concentrations. We tried different methods, especially with the purified fluorescent protein, but obtaining exact numbers is a hard task.

      We ended up quantifying the amount of fluorescent protein within a stable cell line thanks to ELISA and comparing it with the mean fluorescence seen under the microscope. 

      We estimated that the switch concentration was around 200 nM, which is 8 times more than the mean endogenous concentration according to https://opencell.czbiohub.org/, but should be reachable locally in wild-type cells, or globally in mutated cancer cells. 

      Given the numerical data (mostly) in hand, it would be interesting to determine whether RhoGEF2 levels, cell area, the pattern of actin assembly, or some other property is most predictive of the response to PRG DHPH recruitment. 

      We think that the manuscript made it clear that the concentration of PRG DHPH is almost 100% predictive of the response to PRG DHPH. We believe that other phenotypes such as the cell area or the pattern of actin assembly would only be consequences of this. Interestingly, as experimenters we were absolutely not able to predict the behavior by only seeing the shape of the cell, even after hundreds of activation experiments. We also tried to find characteristics that would distinguish both populations with the data in our hands and could not find any.

      There is some room for general improvement/editing of the text. 

      We tried our best to improve the text, following the reviewers' suggestions.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Audio et al. measured cerebral blood volume (CBV) across cortical areas and layers using high-resolution MRI with contrast agents in non-human primates. While the non-invasive CBV MRI methodology is often used to enhance fMRI sensitivity in NHPs, its application for baseline CBV measurement is rare due to the complexities of susceptibility contrast mechanisms. The authors determined the number of large vessels and the areal and laminar variations of CBV in NHP, and compared those with various other metrics.

      Strengths:

      Noninvasive mapping of relative cerebral blood volume is novel for non-human primates. A key finding was the observation of variations in CBV across regions; primary sensory cortices had high CBV, whereas other higher areas had low CBV. The measured CBV values correlated with previously reported neuronal and receptor densities.

      We appreciate your recognition of the novelty of our non-invasive relative cerebral blood volume (CBV) mapping in non-human primates, as well as the observed areal variations and their correlations with neuronal and receptor densities. However, we are concerned that key contributions of our work—such as cortical layer-specific vasculature mapping and benchmarking surface vessel density estimations against anatomical ground truth—are being framed as limitations rather than significant advances in the field pushing the boundaries of current neuroimaging capabilities and providing a valuable foundation for future research. Additionally, we would like to clarify that dynamic susceptibility contrast (DSC) MRI using gadolinium is the gold standard for CBV measurement in clinical settings and the argument that “baseline CBV measurements are rare due to the complexities of susceptibility contrast” is simply not true. The limited use of ferumoxytol for CBV imaging is primarily due to previous FDA regulatory restrictions, rather than inherent methodological shortcomings.

      Changes in text:

      Compared to clinically used gadolinium-based agents, ferumoxytol's substantially longer half-life and stronger R<sub>2</sub>* effect allows for higher-resolution and more sensitive vascular volume measurements (Buch et al., 2022), albeit these methodologies are hampered by confounding factors such as vessel orientation relative to the magnetic field (B<sub>0</sub>) direction (Ogawa et al., 1993).

      Weaknesses:

      A weakness of this manuscript is that the quantification of CBV with postprocessing approaches to remove susceptibility effects from pial and penetrating vessels is not fully validated, especially on a laminar scale. Further specific comments follow.

      (1) Baseline CBV indices were determined using contrast agent-enhanced MRI (deltaR<sub>2</sub>*). Although this approach is suitable for areal comparisons, its application at a laminar scale poses challenges due to significant contributions from large vessels including pial vessels. The primary concern is whether large-vessel contributions can be removed from the measured deltaR<sub>2</sub>* through processing techniques.

      Eliminating the contribution of large vessels completely is unlikely, and we agree with the reviewer that ΔR<sub>2</sub>* results likely reflect a weighted combination of signals from both large vessels and capillaries. However, the distribution of ΔR<sub>2</sub>* more closely aligns with capillary density in areas V1–V5 than with large vessel distributions (Weber et al., 2008), suggesting that our ΔR<sub>2</sub>* results are more weighted toward capillaries. Moreover, we demonstrated that the pial vessel induced signal-intensity drop-outs are clearly limited to the superficial layers and exhibit smaller spatial extent than generally thought (Supp. Figs. 2 and 4).

      (2) High-resolution MRI with a critical sampling frequency estimated from previous studies (Weber 2008, Zheng 1991) was performed to separate penetrating vessels. However, this approach is still insufficient to accurately identify the number of vessels due to the blooming effects of susceptibility and insufficient spatial resolution. The reported number of penetrating vessels is only applicable to the experimental and processing conditions used in this study, which cannot be generalized.

      Our intention was not to suggest that our measurements provide a general estimate of vessel density across the macaque cerebral cortex. At 0.23 mm isotropic resolution, we successfully delineated approximately 30% of the penetrating vessels in V1. Our primary objective was to demonstrate a proof-of-concept quantifiable measurement rather than to establish a generalized vessel density metric for all brain regions. We have consistently emphasized this throughout the manuscript, but if there is a specific point of misunderstanding, we would be happy to consider revisions for clarity.

      (3) Baseline R<sub>2</sub>* is sensitive to baseline R<sub>2</sub>, vascular volume, iron content, and susceptibility gradients. Additionally, it is sensitive to imaging parameters; higher spatial resolution tends to result in lower R<sub>2</sub>* values (closer to the R<sub>2</sub> value). Thus, it is difficult to correlate baseline R<sub>2</sub>* with physiological parameters.

      The observed correlation between R<sub>2</sub>* and neuron density is likely indirect, as R<sub>2</sub>* is strongly influenced by iron, myelin, and deoxyhemoglobin densities. However, the robust correlation between R<sub>2</sub>* and neuron density, peaking in the superficial layers (R = 0.86, p < 10<sup>-10</sup>), is striking and difficult to ignore (revised Supp. Fig. 6D-E). Upon revision, we identified an error in Supp. Fig. 6D-E, where the previous version used single-subject R<sub>2</sub>* and ΔR<sub>2</sub>* maps instead of the group-averaged maps. The revised correlations are slightly stronger than in the earlier version.

      Given that the correlation between neuron density and R<sub>2</sub>* is strongest in the superficial layers, we suggest this relationship reflects an underlying association with tissue cytochrome oxidase (CO) activity and cumulative effect of deoxygenated venous blood drainage toward the pial network. The superficial cortical layers are also less influenced by myelin and iron densities, which are more concentrated in the deeper cortical layers. Additional factors may contribute to this relationship, including the iron dependence of mitochondrial CO activity, as iron is an essential component of CO’s heme groups. Moreover, myelin maintenance depends on iron, which is predominantly stored in oligodendrocytes. The presence of myelinated thin axons and a higher axonal surface density may, in turn, be a prerequisite for high neuron density.

      In this context, it is also valuable to note the absolute range of superficial R<sub>2</sub>* values (≈ 6 s<sup>-1</sup>; Supp. Fig. 6D). This variation in cortical surface R<sub>2</sub>* is about 12-30 times larger than the signal changes observed during task-based fMRI (6 vs. 0.2-0.5 s<sup>-1</sup>). This relation seems reasonable because regional increases in absolute blood flow associated with imaging signals, as measured by PET, typically do not exceed 5%–10% of the brain's resting blood flow (Raichle and Mintun, 2006). The venous oxygenation level is typically 60%, with task-induced activation increasing it by only a few percent. We suggest that this ~40% oxygen extraction is reflected in the superficial R<sub>2</sub>*. Finally, the large intercept (≈ 14.5 s<sup>-1</sup>; Supp. Fig. 6D), which is not equivalent to the water R<sub>2</sub>* (≈ 1 s<sup>-1</sup>), suggests that R<sub>2</sub>* is substantially influenced by factors other than neuron density, such as receptor, myelin, and iron densities, susceptibility gradients, and spatial resolution.
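
      The ratios quoted in this paragraph follow directly from the stated values. As a minimal check, using only numbers given above (the helper names `fold_range` and `extraction_fraction` are illustrative):

```python
def fold_range(reference, low, high):
    """Return (min_fold, max_fold) of `reference` relative to a [low, high] range."""
    return reference / high, reference / low


def extraction_fraction(venous_oxygenation):
    """Oxygen extraction implied by a venous oxygenation level (both as fractions)."""
    return 1.0 - venous_oxygenation


if __name__ == "__main__":
    # Superficial R2* range (about 6 s^-1) versus task-based changes (0.2-0.5 s^-1)
    print(fold_range(6.0, 0.2, 0.5))
    # Venous oxygenation of about 60%
    print(extraction_fraction(0.60))
```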

      The R<sub>2</sub>* values are well known to be influenced by intra-voxel phase coherence and thus spatial resolution. However, our view is that the proposed methodology of acquiring cortical-layer thickness adjusted high-resolution (spin-echo) R<sub>2</sub> maps poses more methodological limitations and is less practical. Notwithstanding, to further corroborate the relationship between R<sub>2</sub>* and neuron density, we investigated whether a similar correlation exists between the signal intensity of non-quantitative T2w SPACE-FLAIR images (0.32 mm isotropic) and neuron density. Using B<sub>1</sub> bias-field and B<sub>0</sub> orientation bias corrected T2w SPACE-FLAIR images (N=7), we parcellated the equivolumetric surface maps using Vanderbilt sections. Our findings showed that signal intensity (where regions with high signal intensity correspond to low R<sub>2</sub> values, and areas with low signal intensity correspond to high R<sub>2</sub> values) was positively correlated with neuron density, particularly in the superficial layers (R = 0.77, p = 10<sup>-11</sup>; Author response image 1). This analysis confirmed that the correlation between neuron density and R<sub>2</sub> peaks in the superficial layers. However, this correlation was slightly weaker compared to quantitative R<sub>2</sub>* (Supp. Fig. 6D), suggesting that the variable flip-angle spin-echo train refocused signal-phase coherence loss from large draining vessels or that non-quantitative T2w-FLAIR images may be confounded by other factors such as B<sub>1</sub> transmission field biases (Glasser et al., 2022). Nevertheless, this non-quantitative fast spin-echo with variable flip-angles approach, which is in principle less dependent on image resolution and closer to R<sub>2,intrinsic</sub> than R<sub>2</sub>*, yields similar findings in comparison to quantitative gradient-echo.

      Author response image 1.

      (A) T2w-FLAIR SPACE normalized signal-intensity plotted vs neuron density. Note that low signal-intensity corresponds to high R<sub>2</sub> and high neuron density, consistent with findings using ME-GRE. (B) Correlation between T2w-FLAIR SPACE and neuron density across equivolumetric layers. Notably, a similar relationship with neuron density was observed using a variable spin-echo pulse sequence as with quantitative gradient-echo-based imaging.

      Changes in text:

      Results:

      “Because the Julich cortical area atlas covers only a section of the cerebral cortex, and the neuron density estimates are interpolated maps, we extended our analysis using the original Collins sample borders encompassing the entire cerebral cortex (Supp. Fig. 6A-C). This analysis reaffirmed the positive correlation with ΔR<sub>2</sub>* (peak at EL2, R = 0.80, p < 10<sup>-11</sup>) and baseline R<sub>2</sub>* (peak at EL2a, R = 0.86, p < 10<sup>-13</sup>), yielding linear coefficients of ΔR<sub>2</sub>* = 102 × 10<sup>3</sup> neurons/s and R<sub>2</sub>* = 41 × 10<sup>3</sup> neurons/s (Supp. Fig. 6D-G). This suggests that the sensitivity of quantitative layer R<sub>2</sub>* MRI in detecting neuronal loss is relatively weak, and the introduction of the Ferumoxytol contrast agent has the potential to enhance this sensitivity by a factor of 2.5.”

      A new paragraph was added into discussion section 4.3 corroborating the relation between R<sub>2</sub>* and neuron density:

      “Another key finding of this study was the strong correlation between baseline R<sub>2</sub>* and neuron density (Supp. Fig. 6D, E). While R<sub>2</sub>* is well known to be influenced by iron, myelin, and deoxyhemoglobin densities, this correlation peaks in the superficial layers (Supp. Fig. 6E), suggesting a link to CO activity and the accumulation of deoxygenated venous blood draining from all cortical layers toward the pial network. Notably, the absolute range of superficial R<sub>2</sub>* values (max - min ≈ 6 s<sup>-1</sup>; Supp. Fig. 6D) is approximately 12-30 times larger than the ΔR<sub>2</sub>* observed during task-based BOLD fMRI at 3T (0.2-0.5 1/s) (Yablonskiy and Haacke 1994). Since venous oxygenation is around 60% and task-induced changes in blood flow account for only 5%–10% of the brain's resting blood flow (Raichle & Mintun, 2006), these results suggest that superficial R<sub>2</sub>* (Fig. 1D) may serve as a more accurate proxy for total deoxyhemoglobin content (and thus total oxygen consumption), which scales with the neuron density of the underlying cortical gray matter. Importantly, superficial layers may also provide a more specific measure of deoxyhemoglobin, as they are less influenced by myelin and iron, which are more concentrated in deeper cortical layers. Additionally, smaller but direct contributors, such as mitochondrial CO density—an iron-dependent factor—may also play a role in this relationship.”

      References:

      Raichle, M.E., Mintun, M.A., 2006. BRAIN WORK AND BRAIN IMAGING. Annu. Rev. Neurosci. 29, 449–476. https://doi.org/10.1146/annurev.neuro.29.051605.112819

      (4) CBV-weighted deltaR<sub>2</sub>* is correlated with various other metrics (cytoarchitectural parcellation, myelin/receptor density, cortical thickness, CO, cell-type specificity, etc.). While testing the correlation between deltaR<sub>2</sub>* and these other metrics may be acceptable as an exploratory analysis, it is challenging for readers to discern a causal relationship between them. A critical question is whether CBV-weighted deltaR<sub>2</sub>* can provide insights into other metrics in diseased or abnormal brain states.

      We acknowledge that a multivariate analysis using dense histological maps would be valuable to establish causality among these several metrics:

      “To comprehensively understand the factors contributing to the vascular organization of the brain, experimental disentanglement through multivariate analysis of laminar cell types and receptor densities is needed (Hayashi et al., 2021, Froudist-Walsh et al., 2023). Moreover, employing more advanced statistical modeling, including considerations for synapse-neuron interactions, may be important for refined evaluations.”

      We think the primary contributors to the brain's energy budget are neurons and receptors, as shown in several references and stated in the manuscript. To investigate the relationship between neuron density and CBV, we estimated the energy budget allocated to neurons and extrapolated the remaining CBV to other contributing factors:

      Changes in text:

      “However, this is a simplified estimation, and a more comprehensive assessment would need to account for an aggregate of biophysical factors such as neuron types, neuron membrane surface area, firing rates, dendritic and synaptic densities (Fig. 6F-G), neurotransmitter recycling, and other cell types (Kageyama 1982; Elston and Rose 1997; Perge et al., 2009; Harris et al., 2012). Indeed, the majority of the mitochondria reside in the dendrites and synaptic transmission is widely acknowledged to drive the majority of the energy consumption and blood flow (Wong-Riley, 1989; Attwell et al., 2001).

      Extrapolating cortical ΔR<sub>2</sub>* to zero neuron density results in a large intercept (~35 1/s), corresponding to 60% of the maximum cortical CBV (57 1/s; Supp. Fig. 6F). This supports the view that the majority of energy consumption occurs in the neuropil—comprising dendrites, synapses, and axons—which accounts for ~80–90% of cortical gray matter volume, whereas neuronal somata constitute only ~10–20% (Wong-Riley, 1989). Although neuronal cell bodies exhibit higher CO activity per unit volume due to their dense mitochondrial content, these results suggest their overall contribution to the total CBV per mm<sup>3</sup> tissue remains lower than that of the neuropil, given the latter's substantially larger volume fraction in cortical tissue.

      Contrary to our initial expectations, we observed a relatively smaller CBV in regions and layers with high receptor density (Fig. 6B, D, F). This relationship extends to other factors, such as number of spines (putative excitatory inputs) and dendrite tree size across the entire cerebral cortex (Supp. Fig. 7) (Froudist-Walsh et al., 2023, Elston 2007). These results align with the work of Weber and colleagues, who reported a similar negative correlation between vascular length density and synaptic density, as well as a positive correlation with neuron density in macaque V1 across cortical layers (Weber et al., 2008).”
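
      The intercept argument in the revised text above can be checked with a short worked calculation. The sketch below is illustrative only and is not the authors' analysis code; the function name `intercept_fraction` is hypothetical, and the two inputs are the values quoted in the text (an intercept of ~35 1/s and a maximum of 57 1/s).

```python
def intercept_fraction(intercept, maximum):
    """Fraction of the maximum value that remains at zero neuron density."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return intercept / maximum


# Values quoted in the text, in units of 1/s.
fraction = intercept_fraction(35.0, 57.0)
print(f"{fraction:.0%}")  # prints "61%", consistent with the rounded 60% in the text
```

      Under these numbers, roughly three fifths of the maximum cortical ΔR<sub>2</sub>* is not accounted for by the neuron-density term, which is the basis for attributing the remainder to the neuropil.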

      Variations in neurons and receptors are reflected in cytoarchitecture, myelin (axon density likely scales with neuron density and myelin inhibits synaptic connections), and cell-type composition. For example, fast-spiking parvalbumin interneurons, which target the soma or axon hillock, are well-suited for regulating activity in regions with high neuron density, whereas bursting calretinin interneurons, which target distal dendrites, are more adapted to areas with high synaptic density. These factors, in turn, change gradually along the cortical hierarchy (higher levels have a thinner cortical layer IV, more complex dendritic trees, and more numerous inter-areal connectivity patterns). In our view, these factors are tightly interlinked and explain the strong correlations and metabolic demands observed across different metrics.

      We also agree that cortical layer imaging of vasculature in diseased or abnormal brain states is an intriguing direction for future research; however, it falls beyond the scope of the present study.

      Reviewer #2 (Public review):

      Summary:

      This manuscript presents a new approach for non-invasive, MRI-based measurements of cerebral blood volume (CBV). Here, the authors use ferumoxytol, a high-contrast agent, and apply specific sequences to infer CBV. The authors then move to statistically compare measured regional CBV with the known distribution of different types of neurons, markers of metabolic load and others. While the presented methodology captures an estimated 30% of the vasculature, the authors corroborated previous findings regarding lack of vascular compartmentalization around functional neuronal units in the primary visual cortex.

      Strengths:

      Non invasive methodology geared to map vascular properties in vivo.

      Implementation of a highly sensitive approach for measuring blood volume.

      Ability to map vascular structural and functional vascular metrics to other types of published data.

      Weaknesses:

      The key issue here is the underlying assumption about the appropriate spatial sampling frequency needed to capture the architecture of the brain vasculature, namely ~7 penetrating vessels/mm<sup>2</sup> as derived from Weber et al 2008 (Cer Cor). The cited work begins by characterizing the spacing of penetrating arteries and ascending veins using vascular casts of 7 monkeys (Macaca mulatta, same as in the current paper). The ~7 penetrating vessels/mm<sup>2</sup> is computed by dividing the total number of identified vessels by the area imaged. The problem here is that all measurements were made in a "non-volumetric" manner and only in V1. Extrapolating from here to the entire brain seems like an over-assumption, particularly given the region-dependent heterogeneity that the current paper reports.

      We appreciate the reviewer’s concerns regarding spatial sampling frequency and its implications for characterizing brain vasculature, which we investigated in this study. To clarify, our analysis of surface vessel density was explicitly restricted to V1 precisely due to the limitations of our experimental precision. While we reported the total number of vessels identified in the cortex, we intentionally chose not to present density values across regions in this manuscript. Although these calculations are feasible, we focused on the data directly analyzed and avoided extrapolating density values beyond the scope of our findings. Thus, we are uncertain about the suggestion that we extrapolated vessel density values across the entire brain, as we have taken care to limit our conclusions about vessel density and its precision to V1.

      Regarding methodology, we conducted two independent analyses of vessel density specifically in V1. The first involved volumetric analysis using the Frangi filter, while the second used surface-based analysis of local signal-intensity gradients (as illustrated in Fig. 2E and Supp. Figs. 3 and 4), although the final surface density analysis was performed using the ultra-high resolution equivolumetric layers. Notably, these two approaches produced consistent and comparable vessel density estimates, supporting the reliability of our findings within the scope of V1 (we found 30% of the vessels relative to the ground-truth).

      Comments on revisions:

      I appreciate the effort made to improve the manuscript. That said, the direct validation of the underlying assumption about spatial resolution sampling remains unaddressed in the final version of this manuscript. With the only intention to further strengthen the methodology presented here, I would encourage again the authors to seek a direct validation of this assumption for other brain areas.

      In their reply, the authors stated "... line scanning or single-plane sequences, at least on first impression, seem inadequate for whole-brain coverage and cortical surface mapping. ". This seems to emanate from a misunderstanding, as the method could be used to validate the mapping, not to map per se.

      We apologize for any misunderstanding in our previous response and appreciate your clarification. We now understand that you were suggesting the use of line-scanning or single-plane sequences as a method to validate, rather than map, our spatial sampling assumptions.

      We agree that single-plane sequences at very high in-plane resolution (e.g., 50 × 50 × 1000 µm) have great potential to detect penetrating vessels and even vessel branching patterns. These techniques could indeed provide valuable insights into region-specific vessel density variations which could then be used to validate whole brain 3D acquisitions. However, as noted above, we have refrained from reporting vessel densities outside V1 precisely due to sampling limitations (we only found 30% of the penetrating vessels in V1, or only 2 mm<sup>2</sup>/30mm<sup>2</sup> ≈ 7% of branching vessel ground-truth, see discussion).

      We acknowledge the merit of incorporating such methods to validate regional vessel densities and agree that this would be an important avenue for future research. Thank you for raising this point; we now briefly mention the advantage of single-plane EPI in the Discussion.

      Changes in text:

      “4.1 Methodological considerations - vessel density informed MRI

      …anatomical studies accounting for branching patterns have reported much higher vessel densities up to 30 vessels/mm<sup>2</sup> (Keller et al., 2011; Adams et al., 2015). Further investigations are warranted, taking into account critical sampling frequencies associated with vessel branching patterns (Duverney 1981), achieving higher SNR through ultra-high B<sub>0</sub> MRI (Bolan et al., 2006; Harel et al., 2010; Kim et al., 2013), and utilizing high-resolution single-plane sequences and prospective motion correction schemes to accurately characterize regional vessel densities. Such advancements hold promise for improving vessel quantification, classification of veins and arteries, and construction of detailed cortical surface maps of the vascular networks, which may have diagnostic and neurosurgical utility (Fig. 2A, B) (Iadecola, 2013; Qi and Roper, 2021; Sweeney et al., 2018).”

      During the revision we found a typo and corrected it in Supp. Fig. 8: Dosal -> Dorsal.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their insightful comments and recommendations. We have extensively revised the manuscript in response to the valuable feedback. We believe the result is a more rigorous and thoughtful analysis of the data. Furthermore, our interpretation and discussion of the findings are more focused and highlight the importance of the circuit and its role in the response to stress. Thank you for helping to improve the presented science.

      Key changes made in response to the reviewers' comments include:

      • Revision of statistical analyses for nearly all figures, with the addition of a new table of summary statistics to include F and/or t values alongside p-values.

      • Addition of statistical analyses for all fiber photometry data.

      • Examination of data for possible sex-dependent effects.

      • Clarification of breeding strategies and genotype differences, with added details to methods to improve clarity.

      • Addressing concerns about the specificity of virus injections and the spread, with additional details added to methods.

      • Modification of terminology related to goal-directed behavior based on reviewer feedback, including removal of the term from the manuscript.

      • Clarification and additional data on the use of photostimulation and its effects, including efforts to inactivate neurons for further insight, despite technical challenges.

      • Correction of grammatical errors throughout the manuscript.

      Reviewer 1:

      Despite the manuscript being generally well-written and easy to follow, there are several grammatical errors throughout that need to be addressed.

      Thank you for highlighting this issue. Grammatical errors have been fixed in the revised version of the manuscript.

      Only p values are given in the text to support statistical differences. This is not sufficient. F and/or t values should be given as well.

      In response to this critique and similar comments from Reviewer 2, we re-evaluated our approach to statistical analyses and extensively revised the analyses for nearly all figures. We also added a new table of summary statistics (Supplemental Table 1) containing the type of analysis, statistic, comparison, multiple comparisons, and p value(s). For Figures 4C-E, 5C, 6C-E, 7H-I, and 8H we analyzed the data using two-way repeated measures (RM) ANOVA that examined the main effect of time (either number of sessions or stimulation period) within the same animal, the main effect of genotype (Cre+ vs Cre-), and their interaction. For Supplemental Figure 7A we also conducted a two-way RM ANOVA in Cre+ mice with time as one factor and activity state (number of port activations in the active vs inactive nose port) as the other. For Figures 5D-E we conducted a two-way mixed model ANOVA that accounted and corrected for missing data. In figures that only compared two groups of data (Figures 5F-L, 6F, 8C-D, 8I, and Supp 6F-G) we used a two-tailed t-test for the analysis. If our question and/or hypothesis required us to conduct multiple comparisons between or within treatments, we conducted Bonferroni’s multiple comparisons test for post hoc analysis (we note which groups we compared in Supplemental Table 1). For figures that did or did not show a change in calcium activity (Figure 3G, 3I-K, 7B, 7D-E, 8E-F), we compared waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). The time windows we used for comparison are noted in Supplemental Table 1, along with whether the comparisons were significant at the 95%, 99%, and 99.9% thresholds.
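
      For readers unfamiliar with the post hoc step, the Bonferroni correction mentioned above can be sketched in a few lines. This is a generic illustration rather than the analysis pipeline used for the manuscript; the function names and the example p values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of comparisons, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant(p_values, alpha=0.05):
    """Flag comparisons whose Bonferroni-adjusted p value is below alpha."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]


# Hypothetical raw p values from three post hoc comparisons.
raw = [0.01, 0.04, 0.5]
print(bonferroni_adjust(raw))
print(significant(raw))
```

      With three comparisons, a raw p value of 0.04 no longer passes the 0.05 threshold after adjustment, which illustrates why the corrected values rather than the raw ones should be reported.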

      None of the comparisons that were significant in the prior analyses fell below the threshold for significance. Of those previously found to be not significantly different, only one change was noted. In Figure 6E there was now a significant baseline difference between Cre+ and Cre- mice, with Cre- mice taking longer to first engage the port compared to Cre+ mice (p=0.045). Although the more rigorous approach to the statistical analyses did not change our interpretations, we feel it enhanced the paper and thank the reviewer for pushing for this improvement.

      Moreover, the fibre photometry data does not appear to have any statistical analyses reported - only confidence intervals represented in the figures without any mention of whether the null hypothesis that the elevations in activity observed are different from the baseline.

      This is particularly important where there is ambiguity, such as in Figure 3K, where the spontaneous activity of the animal appears to correlate with a spike in activity but the text mentions that there is no such difference. Without statistics, this is difficult to judge.

      Thank you for highlighting this critical point and providing an opportunity to strengthen our manuscript. We added statistical analyses of all fiber photometry data using a recently described approach based on waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). In the statistical summary (Supplemental Table 1) we note the time window that we used for comparison in each analysis and whether the comparisons were significant at the 95%, 99%, and 99.9% thresholds. Thank you for highlighting this and helping make the manuscript stronger.
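
      The general idea behind a waveform confidence-interval test can be sketched as follows. This is a simplified percentile-bootstrap illustration, assuming trial-aligned waveforms and a baseline of zero; it is not the authors' implementation, and the function names, parameters, and toy data are hypothetical. A time point is flagged when its confidence interval excludes the baseline.

```python
import random
import statistics


def bootstrap_ci(trials, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the across-trial mean at each time point.

    trials: list of equal-length lists, one waveform per trial.
    Returns (lower, upper), each a list with one value per time point.
    """
    rng = random.Random(seed)
    n_trials = len(trials)
    lower, upper = [], []
    for t in range(len(trials[0])):
        values = [trial[t] for trial in trials]
        # Resample trials with replacement and record the mean each time.
        boot_means = sorted(
            statistics.fmean(rng.choices(values, k=n_trials))
            for _ in range(n_boot)
        )
        lower.append(boot_means[int((alpha / 2) * n_boot)])
        upper.append(boot_means[int((1 - alpha / 2) * n_boot) - 1])
    return lower, upper


def significant_timepoints(trials, baseline=0.0, **kwargs):
    """True where the CI lies entirely above or entirely below baseline."""
    lower, upper = bootstrap_ci(trials, **kwargs)
    return [lo > baseline or hi < baseline for lo, hi in zip(lower, upper)]


# Toy data: five trials, two time points (no response, then a clear response).
toy = [[-0.1, 1.9], [0.1, 2.1], [-0.2, 2.2], [0.2, 1.8], [0.0, 2.0]]
print(significant_timepoints(toy))
```

      Tightening `alpha` to 0.01 or 0.001 corresponds to the 99% and 99.9% thresholds referred to in the response.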

      With respect to Figure 3K, we are not certain we understood the spike in activity the reviewer referred to. Figure 3J and K include both velocity data (gold) and Ca2+ dependent signal (blue). We used episodes of velocity that were comparable to the avoidance response during the ambush test and found no significant differences in the Ca2+ signal when gating around changes in velocity in the absence of a stressor (Supplemental Table 1). This is in contrast to the significant change in Ca2+ signal following a mock predator ambush (Figure 3J). We interpret these data together to indicate that locomotion does not correlate with an increase in calcium activity in SuMVGLUT2+::POA neurons, but that coping with a stressor does. This conclusion is further examined in Supplemental Figure 5, which includes a cross-correlation analysis to test for a temporally offset relationship between velocity and Ca2+ signal in SuMVGLUT2+::POA neurons.
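
      A lagged cross-correlation of the kind mentioned above can be sketched as below. This is a minimal illustration assuming two equally sampled traces; it is not the code used for Supplemental Figure 5, and the function names and toy traces are hypothetical. A peak at a non-zero lag would indicate a temporally offset relationship between the two signals.

```python
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(velocity, signal, max_lag):
    """Correlate velocity[t] with signal[t + lag] for each lag in [-max_lag, max_lag]."""
    n = len(velocity)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = velocity[: n - lag], signal[lag:]
        else:
            x, y = velocity[-lag:], signal[: n + lag]
        result[lag] = pearson(x, y)
    return result


# Toy traces: the signal is the velocity trace delayed by two samples.
velocity = [0, 0, 1, 0, 0, 0, 2, 0, 0, 0]
signal = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]
correlations = lagged_correlation(velocity, signal, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag)  # the delay of two samples is recovered
```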

      The use of photostimulation only is unfortunate, it would have been really nice to see some inactivation of these neurons as well. This is because of the well-documented issues with being able to determine whether photostimulation is occurring in a physiological manner, and therefore makes certain data difficult to interpret. For instance, with regards to the 'active coping' behaviours - is this really the correct characterisation of what's going on? I wonder if the mice simply had developed immobile responding as a coping strategy but when they experience stimulation of these neurons that they find aversive, immobility is not sufficient to deal with the summative effects of the aversion from the swimming task as well as from the neuronal activation? An inactivation study would be more convincing.

      We agree with the reviewer that experiments demonstrating the necessity of SuMVGLUT2+::POA neurons would have added to the story here. We carried out multiple experiments aimed at addressing questions about the necessity of SuMVGLUT2+::POA neurons in stress coping behaviors, specifically in the forced swim assay. Efforts included employing chemogenetic, optogenetic, and tetanus toxin-based methods. We observed no effects on locomotor activity or stress coping. These experiments are both technically difficult and challenging to interpret. Interpretation of negative results, such as those we obtained, is particularly difficult because of potential technical confounds. Selective targeting of SuMVGLUT2+::POA neurons for inhibition requires three viral injections and two recombination steps, increasing variability and reducing the number of neurons impacted. Alternatively, photoinhibition targeting SuMVGLUT2+::POA cells can be done using Retro-AAV injected into POA and a fiber implant over SuM. We tried both approaches. The data obtained were difficult to interpret because of questions about adequate coverage of the SuMVGLUT2+::POA population by virally expressed constructs and/or light spread. The challenge of adequate coverage to effectively prevent output from the targeted population is further confounded by challenges inherent in neural inhibition, specifically determining whether the inhibition created at the cellular level is adequate to block output in the context of excitatory inputs, or whether neurons must first be engaged in a particular manner for inhibition to be effective. Baseline neural activity, release probability, and post-synaptic effects could all be relevant, which photo-inhibition will potentially not resolve. So, while the trend is to always show “necessary and sufficient” effects, we have tried nearly everything, and we simply cannot conclude much from our mixed results.
      There are also well-established problems with existing photo-inhibition methods, which are often ignored even as people use and tout them. We have a lot of expertise in photo-inhibition optogenetics, have used it with some success, and have developed new methods, yet in this particular case we are unable to draw conclusions related to inhibition. Others have experienced similar challenges in locus coeruleus neurons, which have very low basal activity: inhibition with chemogenetics is very hard, as is inhibition with optogenetic pump-based approaches, because the neurons fire robust rebound APs. We have spent almost 2.5 years trying to get this to work in this circuit because reviewers have been insistent on this result for the paper to be conclusive. Unfortunately, in our view it simply is not possible until we know more about the cell types involved. This is all in spite of our experience using the approach in many other publications.

      We also employed less selective approaches, such as injecting AAV-DIO-tetanus toxin light chain (Tettox) constructs directly into SuM in VGLUT2-Cre mice, but found off-target effects impacting animal wellbeing and impeding behavioral testing due to viral spread to surrounding areas.

      While we are disappointed to be unable to directly address questions about the necessity of SuMVGLUT2+::POA neurons in active coping with experimental data, we were unable to obtain results allowing for clear interpretation across numerous other domains the reviewers requested. We also feel strongly that until we have a clear picture of the molecular cell type architecture in the SuM, and Cre-drivers to target subsets of neurons, this question will be difficult to resolve for any group. We are now working on RNAseq and related spatial transcriptomics efforts in the SuM and examining additional behavioral paradigms to resolve these issues, so stay tuned for future publications.

      Accordingly, we avoid making statements relating to necessity in the manuscript, in spite of having several lines of physiological data showing strong, robust correlations with behavior related to the SuMVGLUT2+::POA circuit.

      Nose poke is only nominally instrumental as it cannot be shown to have a unique relationship with the outcome that is independent of the stimuli-outcome relationships (in the same way that a lever press can, for example). Moreover, there is nothing here to show that the behaviours are goal-directed.

      Thank you for highlighting this point. We have removed the goal-directed terminology from the manuscript. Since the mice perform highly selective (active vs inactive) port activation robustly across multiple days of training, the behavior likely transitions to habitual behavior. We only tested the valuation of stimulus termination on the final day of training, with a time-limited progressive ratio test. With respect to lever press versus active port activation, we are unclear how using a lever in this context would offer a different interpretation. Lever pressing may be more sensitive to changes in valuation when compared to nose poke port activation (Atalayer and Rowland 2008); however, in this study the focus of the operant behavior is separating innate behaviors from learned action–outcome instrumental behaviors for threat response (LeDoux and Daw 2018). The robust, highly selective activation of the active port illustrated in Figure 6 fits an action–outcome instrumental behavior wherein mice learn to engage the active but not the inactive port to terminate photostimulation. The first activation of the port occurs through exploration of the arena but, as demonstrated by the number of active port activations and the decline in time to the first active port engagement, mice expressing ChR2eYFP learn to engage the port to terminate the stimulation. To aid in illustrating this point we have added Supplemental Figure 7 showing active and inactive port activations for both Cre+ and Cre- mice. This adds clarity to the high rate of selective port activation driven by stimulation of SuMVGLUT2+::POA neurons compared to controls. Eliminating the term goal-directed and providing additional data narrows and supports one of the key points of the operant experiment.

      With regards to Figure 1: This is a nice figure, but I wonder if some quantification of the pathways and their density might be helpful, perhaps by measuring the intensity of fluorescence in ImageJ (as these are processes, not cell bodies that can be counted)? Mind you, they all look pretty dense so perhaps this is not necessary! However, because the authors are looking at projections in so-called 'stress-engaged regions', the amygdala seems conspicuous by its absence. Did the authors look in the amygdala and find no projections? If so it seems that this would be worth noting.

      This is an interesting question, but it has proven to be technically very challenging. We consulted with several leaders in the field who routinely use complementary viral tracing methods. We were unable to devise a satisfactorily meaningful quantitative (as opposed to qualitative) approach to compare SuMVGLUT2+::POA to SuMVGLUT2+ projections. Several limitations hinder a meaningful quantitative approach. One limitation was the need for different viral strategies to label the two populations. Labeling SuMVGLUT2+::POA neurons requires using VGLUT2-Flp mice with two injections into the POA and one into SuM. Two recombinase steps were required, reducing efficiency of overlap. This combination of viral injections, particularly the injections of RetroAAVs in the POA, can induce significant quantitative variability due to tropism, efficacy, and variability of retro-viral methods, and viral infection generally. These issues are often totally ignored in similar studies across the “neural circuit” landscape, but that does not make them less relevant here.

      Although quantification is commonly performed and reported in the field, we believe it can be a quite misleading read-out of functionally relevant circuitry, given that neurotransmitter release is ultimately amplified by receptors post-synaptically, and many examples of robust behavioral effects have been observed with low fiber labeling using complementary tracing methods (McCall, Siuda et al. 2017). In contrast, the broader SuMVGLUT2+ population was labeled using a single injection into the SuM. This means there is likely more efficient expression of the fluorophore. Additionally, in areas that contain both terminals and passing fibers, interpreting the fluorescent signal is challenging. Together, these factors limit a meaningful quantitative comparison and make interpretation difficult. In this context, we focused on a conservative qualitative presentation to demonstrate two central points: 1) SuMVGLUT2+::POA neurons are a subset of SuMVGLUT2+ neurons that project to specific areas, excluding dentate gyrus, and 2) they arborize extensively to multiple areas which have been linked to threat responses. We agree that there is much to be learned about how different populations in SuM connect to targets in different regions of the brain, and we will continue to examine this question with different techniques. A meaningful quantitative study comparing projections is technically complex and, we feel, beyond our ability for this study.

      Also, for the reasons above we do not believe that quantification provides exceptional clarity with respect to the putative function of the circuit, glutamate released, or other cotransmitters given known amplification at the post-synaptic side of the circuit.

      With regard to the amygdala, other studies on SuM projections have found efferent projections to amygdala (Ottersen, 1980; Vertes, 1992). In our study we were unable to definitively determine projections from SuMVGLUT2+::POA neurons to amygdala, which if present are not particularly dense. For this reason we were conservative and do not comment on this particular structure.

      I would suggest removing the term goal-directed from the manuscript and just focusing on the active vs. passive distinction.

      We removed the use of goal-directed. Thank you for helping us clarify our terminology.

      The effect observed in Figure 7I is interesting, and I'm wondering if a rebound effect is the most likely explanation for this. Did the authors inhibit the VGAT neurons in this region at any other times and observe a similar rebound? If such a rebound was not observed it would suggest that it is something specific about this task that is producing the behaviour. I would like it if the authors could comment on this.

      We agree that results showing the change in coping strategy (passive to active) in forced swim after but not during stimulation of SuMVGAT+ neurons is quite interesting (Figure 7I). This experiment activated SuMVGAT+ neurons during a section of the forced swim assay and mice showed a robust shift to mobility after the stimulation of SuMVGAT+ neurons stopped. We did not carry out inhibition of SuMVGAT+ neurons in this manuscript. As the reviewer suggested, strong inhibition of local SuM neurons, including SUMVGLUT2+::POA neurons, could lead to rebound activity that may shift coping behaviors in confusing ways. We agree this is an interesting idea but do not have data to support the hypothesis further at this time.

      Reviewer 2

      (1) These are very difficult, small brain regions to hit, and it is commendable to take on the circuit under investigation here. However, there is no evidence throughout the manuscript that the authors are reliably hitting the targets and the spread is comparable across experiments, groups, etc., decreasing the significance of the current findings. There are no hit/virus spread maps presented for any data, and the representative images are cropped to avoid showing the brain regions lateral and dorsal to the target regions. In images where you can see the adjacent regions, there appears expression of cell bodies (such as Supp 6B), suggesting a lack of SuM specificity to the injections.

      We agree with the reviewer that the areas studied are small and technically challenging to hit. This was one of the driving motivations for using multiple tools in tandem to restrict the area targeted for stimulation. Approaches included using retrograde AAVs to express ChR2eYFP in SuMVGLUT2+::POA neurons, thereby restricting expression to VGLUT2+ neurons that project to the POA. Targeting was further limited by placement of the optic fiber over cell bodies in SuM. Thus, only neurons that are VGLUT2+, project to the POA, and were close enough to the fiber were activated by photostimulation. Regrettably, we were not able to compile images from mice where the fiber was misplaced, leading to loss of behavioral effects. We would have liked to provide that here to address this comment. Unfortunately, generating heat maps for injections is not possible for anatomic studies that use unlabeled recombinase as part of an intersectional approach. In addition, the injection site of a retroAAV can be difficult to determine accurately because neurons remote from the injection site, and their processes, are labeled.

      Experiments described in Supplemental Figure 6B on VGAT neurons in SuM were designed and interpreted to support the point that SuMVGLUT2+::POA neurons are a distinct population that does not overlap with GABAergic neurons. For this point it is important that we targeted SuM, but highly confined targeting is not needed to support the central interpretation of the data. We do see labeling in SuM in VGAT-Cre mice, but photostimulation of SuMVGAT+ neurons does not generate the behavioral changes seen with activation of SuMVGLUT2+::POA neurons. As the reviewer points out, SuM is a small target and viral injection is likely to spread beyond the anatomic boundaries to other VGAT+ neurons in the region, which are not the focus here. The activation would be restricted by the spread of light from the fiber over SuM (estimated to be about a 200 µm sphere in all directions). We did not further examine projections or localization of VGAT+ neurons in this study but focused on the differential behavioral effects of SuMVGLUT2+::POA neurons.

      (2) In addition, the whole brain tracing is very valuable, but there is very little quantification of the tracing. As the tracing is the first several figures and supp figure and the basis for the interpretation of the behavior results, it is important to understand things including how robust the POA projection is compared to the collateral regions, etc. Just a rep image for each of the first two figures is insufficient, especially given the above issue raised. The combination of validation of the restricted expression of viruses, rep images, and quantified tracing would add rigor that made the behavioral effects have more significance.

      For example, in Fig 2, how can one be sure that the nature of the difference between the nonspecific anterograde glutamate neuron tracing and the Sum-POA glutamate neuron tracing is real when there is no quantification or validation of the hits and expression, nor any quantification showing the effects replicate across mice? It could be due to many factors, such as the spread up the tract of the injection in the nonspecific experiment resulting in the labeling of additional regions, etc.

      Relatedly, in Supp 4, why isn’t C normalized to DAPI, which they show, or area? Similar for G what is the mcherry coverage/expression, and why isn’t Fos normalized to that?

      Thank you for highlighting the importance and value of the anatomy. Two points based on the anatomic studies are central to our interpretation of the experimental data. First, SUMVGLUT2+::POA neurons are a distinct population within the SuM. We show this by demonstrating they are not GABAergic and that they do not project to dentate gyrus. Projections from SuM to dentate gyrus have been described in multiple studies (Boulland et al., 2009; Haglund et al., 1987; Hashimotodani et al., 2018; Vertes, 1992) and we demonstrate them here for SuMVGLUT2+ cells. Using an intersectional approach in VGLUT2-Flp mice we show SUMVGLUT2+::POA neurons do not project to dentate gyrus. We show cell bodies of SUMVGLUT2+::POA neurons located in SuM across multiple figures including clear brain images. Thus, SUMVGLUT2+::POA neurons are SuM neurons that are not GABAergic and that send projections to a distinct subset of targets, most notably excluding dentate gyrus. Second, SUMVGLUT2+::POA neurons arborize, sending projections to multiple regions. We show this using a combinatorial genetic and viral approach to restrict expression of eYFP to only neurons that are in SuM (based on viral injection), project to the POA (based on retrograde AAV injection in POA), and are VGLUT2+ (VGLUT2-Flp mice). Thus, any eYFP labeled projection comes from SUMVGLUT2+::POA neurons. We further confirmed projections using retroAAV injection into areas identified using anterograde approaches (Supplemental Figure 2). As discussed above in replies to Reviewer 1, we feel limitations are present that preclude meaningful quantitative analysis. We thus opted for a conservative interpretation as outlined.

      Prior studies have shown efferent projections from SuM to many areas, and projections to dentate gyrus have received substantial attention (Boulland et al., 2009; Haglund, Swanson, and Kohler, 1984; Hashimotodani et al., 2018; Soussi et al., 2010; Vertes, 1992; Pan and McNaughton, 2004). We saw many of the same projections from SuMVGLUT2+ neurons. We found no projections from SUMVGLUT2+::POA neurons to dentate gyrus (Figure 2). Our description of the SuM projection to dentate gyrus is not new, but finding a population of neurons in SuM that does not project to dentate gyrus but does project to other regions in hippocampus is new. This finding cannot be explained by spread of the virus in the tract or non-selective labeling.

      (3) The authors state that they use male and female mice, but they do not describe the n’s for each experiment or address sex as a biological variable in the design here. As there are baseline sex differences in locomotion, stress responses, etc., these could easily factor into behavioral effects observed here.

      Sex-specific effects are possible; however, the studies presented here were not designed or powered to directly examine them. A point about experimental design that helps mitigate against strong sex-dependent effects is that the paradigm we used often examined baseline (pre-stimulation) behavior, how behavior changed during stimulation, and how behavior returned (or not) to baseline after stimulation. Thus, we test changes in individual behaviors. Although we had limited statistical power, we conducted analyses to examine the effects of sex as a variable in the experiments and found no differences between males and females.

      (4) In a similar vein as the above, the authors appear to use mice of different genotypes (however the exact genotypes and breeding strategy are not described) for their circuit manipulation studies without first validating that baseline behavioral expression, habituation, stress responses are not different. Therefore, it is unclear how to interpret the behavioral effects of circuit manipulation. For example in 7H, what would the VGLUT2-Cre mouse with control virus look like over time? Time is a confound for these behaviors, as mice often habituate to the task, and this varies from genotype to genotype. In Fig 8H, it looks like there may be some baseline differences between genotypes- what is normal food consumption like in these mice compared to each other? Do Cre+ mice just locomote and/or eat less? This issue exists across the figures and is related to issues of statistics, potential genotype differences, and other experimental design issues as described, as well as the question about the possibility of a general locomotor difference (vs only stress-induced). In addition, the authors use a control virus for the control groups in VGAT-Cre manipulation studies but do not explain the reasoning for the difference in approach.

      Thank you for highlighting the need for greater clarity about the breeding strategies used and for these related questions. We address the breeding strategy and then move to address the additional concerns raised. We have added details to the methods section to address this point. For VGLUT2-Cre mice we use littermate controls from a Cre/WT x WT/WT cross. The VGLUT2-Cre line (RRID:IMSR_JAX:028863) (Vong L, et al. 2011) used here has been used in many other reports. We are not aware of any reports indicating a phenotype associated with the addition of the IRES-Cre to the Slc17a6 locus and there is no expected impact on expression of VGLUT2. Also, we see in many of the experiments here that the baseline behaviors (Figures 4, 5, and 7) are not different between the Cre+ and Cre- mice. For VGAT-Cre mice we used a different breeding strategy that allowed us to achieve greater control of the composition of litters and more efficient cohorts. A Cre/Cre x WT/WT cross yielded all Cre/WT litters. The AAV injected, ChR2eYFP or eYFP, allowed us to balance the cohort.

      Regarding Figure 7H, which shows time immobile on the second day of a swim test, data from the Cre- mice demonstrate the natural course of progression during the second day of the test. The control mice in the VGAT-Cre cohort (Figure 7I) show a similar trend. The change in behavior during the stimulation period in the Cre+ mice is caused by the activation of SUMVGLUT2+::POA neurons. The behavioral shift largely, but not completely, returns to baseline when the photostimulation stops. We have no reason to believe a VGLUT2-Cre+ mouse injected with control AAV to express eYFP would be different from a WT littermate injected with AAV expressing ChR2eYFP in a Cre-dependent manner.

      Turning to concerns related to 8H, which shows data from fasted mice quantifying time spent interacting with a food pellet immediately after presentation of a chow pellet, we found no significant difference between the control and Cre+ mice. We are unaware of any evidence indicating that the two groups should have a different baseline, since the Cre insertion is not expected to alter gene expression and we are unaware of reports of a phenotype relating to feeding and the presence of the transgene in this mouse line. Even if there were a small baseline shift, this would not explain the large abrupt shift induced by the photostimulation. As noted above, we saw shifts in behavior abruptly induced by the initiation of photostimulation when compared to baseline in multiple experiments. This shift would not be explained by a hypothetical difference in the baseline behaviors of littermates.

      (5) The statistics used throughout are inappropriate. The authors use serial Mann-Whitney U tests without a description of data distributions within and across groups. Further, they do not use any overall F tests even though most of the data are presented with more than two bars on the same graph. Stats should be employed according to how the data are presented together on a graph. For example, stats for pre-stim, stim, and post-stim behavior X between Cre+ and Cre- groups should employ something like a two-way repeated measures ANOVA, with post-hoc comparisons following up on those effects and interactions. There are many instances in which one group changes over time or there could be overall main effects of genotype. Not only is serially using Mann-Whitney tests within the same panel misleading and statistically inaccurate, but it cherry-picks the comparisons to be made to avoid more complex results. It is difficult to comprehend the effects of the manipulations presented without more careful consideration of the appropriate options for statistical analysis.

      We thank the reviewer for pointing this out and suggesting alternative analyses; we agree with the assessment on this topic. Therefore, we have extensively revised the statistical approach to our data using the suggested approach. Reviewer 1 also made a similar comment, and we would like to point to our reply to Reviewer 1’s second point with regard to what we changed and added in the new statistical analyses. Further, we have added to the paper a full table detailing the statistical values for each figure.
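
      For readers unfamiliar with the design the reviewer suggests, the following is a minimal sketch of the F statistics for a balanced mixed (between x within) ANOVA, with genotype as the between-subject factor and epoch (pre-stim, stim, post-stim) as the repeated measure. It is an illustration only and is not the analysis code used for the manuscript; the function name `mixed_anova_f` and all numbers in the example are hypothetical, and equal group sizes are assumed.

```python
from statistics import mean


def mixed_anova_f(data):
    """F statistics for a balanced mixed ANOVA (hypothetical helper).

    data[g][s][t] is the score of subject s in group g (for example
    Cre+ or Cre-) at repeated measure t (for example pre, stim, post).
    Every group must contain the same number of subjects.
    """
    a = len(data)        # levels of the between-subject factor
    n = len(data[0])     # subjects per group
    b = len(data[0][0])  # levels of the within-subject factor
    cells = [(g, s, t) for g in range(a) for s in range(n) for t in range(b)]

    grand = mean(data[g][s][t] for g, s, t in cells)
    group_m = [mean(x for subj in grp for x in subj) for grp in data]
    time_m = [mean(data[g][s][t] for g in range(a) for s in range(n))
              for t in range(b)]
    cell_m = [[mean(data[g][s][t] for s in range(n)) for t in range(b)]
              for g in range(a)]
    subj_m = [[mean(subj) for subj in grp] for grp in data]

    # Sums of squares for each source of variance.
    ss_group = n * b * sum((m - grand) ** 2 for m in group_m)
    ss_subj = b * sum((subj_m[g][s] - group_m[g]) ** 2
                      for g in range(a) for s in range(n))
    ss_time = a * n * sum((m - grand) ** 2 for m in time_m)
    ss_inter = n * sum((cell_m[g][t] - group_m[g] - time_m[t] + grand) ** 2
                       for g in range(a) for t in range(b))
    ss_resid = sum((data[g][s][t] - cell_m[g][t] - subj_m[g][s] + group_m[g]) ** 2
                   for g, s, t in cells)

    df = {"group": a - 1, "subjects": a * (n - 1), "time": b - 1,
          "interaction": (a - 1) * (b - 1), "residual": a * (n - 1) * (b - 1)}
    ms_subj = ss_subj / df["subjects"]    # error term for the group effect
    ms_resid = ss_resid / df["residual"]  # error term for within-subject effects

    return {
        "group": (ss_group / df["group"]) / ms_subj,
        "time": (ss_time / df["time"]) / ms_resid,
        "interaction": (ss_inter / df["interaction"]) / ms_resid,
        "df": df,
    }


if __name__ == "__main__":
    # Hypothetical scores: 3 mice per genotype, 3 epochs each.
    cre_neg = [[10.0, 11.0, 10.0], [12.0, 12.0, 13.0], [9.0, 10.0, 9.0]]
    cre_pos = [[10.0, 25.0, 14.0], [11.0, 30.0, 13.0], [12.0, 27.0, 15.0]]
    print(mixed_anova_f([cre_neg, cre_pos]))
```

      A significant genotype x epoch interaction would then justify post-hoc comparisons within each epoch. Converting each F value to a p-value requires the F distribution with the listed degrees of freedom, for instance via a statistics package.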

      Conceptual:

      (6) What does the signal look like at the terminals in the POA? Any suggestion from the data that the projection to the POA is important?

      This is an interesting question that we will pursue in future investigations into the roles of the POA. We used the projection to the POA from SuM to identify a subpopulation in SuM and we were surprised to find the extensive arborization of these neurons to many areas associated with threat responses. We focused on the cell bodies as “hubs” with many “spokes”. Extensive studies are needed to understand the roles of individual projections and their targets. There is also the technical challenge of manipulating one projection without activating retrograde propagation of action potentials to the soma. At the current time we have no specific insights into the roles of the isolated projection to POA. Interpretation of experiments activating only one “spoke” of the hub would be challenging. Simple terminal stimulation experiments are challenged by the need to separate POA projections from activation of passing fibers targeting more anterior structures of the accumbens and septum.

      (7) Is this distinguishing active coping behavior without a locomotor phenotype? For example, Fig. 5I and other figure panels show a distance effect of stimulation (but see issues raised about the genotype of comparison groups). In addition, locomotor behavior is not included for many behaviors, so it is hard to completely buy the interpretation presented.

      We agree with the reviewer and thank them for highlighting this fundamental challenge in studies examining active coping behaviors in rodents, which require movement. Additionally, actively responding to threatening stressors would include increased locomotor activity. Separation of movement alone from active coping can be challenging. Because of these concerns we undertook experiments using diverse behavioral paradigms to examine the elicited behaviors and the recruitment of SuMVGLUT2+::POA neurons to stressors. We conducted experiments to directly examine behaviors evoked by photoactivation of SuMVGLUT2+::POA neurons. In these experiments we observed a diversity of behaviors including increased locomotion and jumping but also treading/digging (Figure 4). These are behaviors elicited in mice by threatening and noxious stimuli. An increase in only running or only jumping could signify a specific locomotor effect, but this is not what was observed. Based on these behaviors, we expected to find evidence of increased movement in open field (Figure 5G-I) and light dark choice (Figure 5J-L) assays. For many of the assays, reporting distance traveled is not practical. An important set of experiments that argues against a generic increase in locomotion is the operant behavior experiments, which require the animal to engage in a learned behavior while receiving photostimulation of SuMVGLUT2+::POA neurons (Figure 6). This is particularly true for testing using a progressive ratio, when the time of ongoing photostimulation is longer, yet animals actively and selectively engage the active port (Figure 6G-H). Further, we saw a shift in behavioral strategy induced by photoactivation in the forced swim test (Figure 7H). Thus, activation of SUMVGLUT2+::POA neurons elicited a range of behaviors that included swimming, jumping, treading, and learned responses, not just increased movement.
      Together these data strongly argue that SuMVGLUT2+::POA neurons do not only promote increased locomotor behavior. We interpret these data together with the data from fiber photometry studies to show SuMVGLUT2+::POA neurons are recruited during acute stressors, contribute to the aversive affective component of stress, and promote active behaviors without constraining the behavioral pattern.

      Regarding genotype, we address this in comments above as well but believe that clarifying the use of litter mates, the extensive use of the VGLUT2-Cre line by multiple groups, and experimental design allowing for comparison to baseline, stimulation evoked, and post stimulation behaviors within and across genotypes mitigate possible concerns relating to the genotype.

      (8) What is the role of GABA neurons in the SuM and how does this relate to their function and interaction with glutamate neurons? In Supp 8, GABA neuron activation also modulates locomotion and in Fig 7 there is an effect on immobility, so this seems pretty important for the overall interpretation and should probably be mentioned in the abstract.

      Thank you for noting these interesting findings. We added text to the abstract to highlight these findings. Possible roles of GABAergic neurons in SuM extend beyond the scope of the current study, particularly since SuM neurons have been shown to release both GABA and glutamate (Li Y, Bao H, Luo Y, et al. 2020; Root DH, Zhang S, Barker DJ et al. 2018). GABAergic neurons regulate dentate gyrus (Ajibola MI, Wu JW, Abdulmajeed WI, Lien CC 2021), REM sleep (Billwiller F, Renouard L, Clement O, Fort P, Luppi PH 2017), and novelty processing (Chen S, He L, Huang AJY, Boehringer R et al. 2020). The population of exclusively GABAergic vs dual neurotransmitter neurons in SuM requires further dissection to be understood. How they may relate to SUMVGLUT2+::POA neurons requires further investigation.

      Questions about figure presentation:

      (9) In Fig 3, why are heat maps shown as a single animal for the first couple and a group average for the others?

      Thank you for highlighting this point for further clarification. We modified the labels in the figure to help make clear which panels are from one animal across multiple trials and which are from multiple animals. In the ambush assay each animal had only one trial, to avoid habituation to the mock predator. Accordingly, we do not have multiple trials for each animal in this test. In contrast, the dunk assay (10 trials/animal) and the shock assay (5 trials/animal) had multiple trials for each animal. We present data from a representative animal when there are multiple trials per animal, along with the aggregate data.

      Why is the temporal resolution for J and K different even though the time scale shown is the same?

      Thank you for noticing this error carried forward from a prior draft of the figure so we could correct it. We replaced the image in 3J with a more correctly scaled heatmap.

      What is the evidence that these signal changes are not due to movement per se?

      Thank you for the question. There are two points of evidence. First, all the 465 nm excitation (Ca2+ dependent) data was collected in interleaved fashion with 415 nm (isosbestic) excitation data. The isosbestic signal is derived from GCaMP emission but is independent of Ca2+ binding (Martianova E, Aronson S, Proulx CD. 2019). This approach, time-division multiplexing, can correct the calcium-dependent signal for changes that are most often due to mechanical change. The second piece of evidence is experimental. Using multiple cohorts of mice, we examined whether the change in Ca2+ signal was correlated with movement. We used the threshold of velocity of movement seen following the ambush. We found no correlation between high velocity movements and Ca2+ signal (Figure 3K), including in cross-correlational analysis (Supplemental Figure 5). Based on these points together we conclude the change in the Ca2+ signal in SUMVGLUT2+::POA neurons is not due to movement-induced mechanical changes, and we find no correlation to movement unless a stressor is present, i.e. mock predator ambush or forced swim. Further, the stressors evoke very different locomotor responses: fleeing, jumping, or swimming.
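
      The isosbestic correction described above can be illustrated with a short sketch. A common way to apply it, assumed here and not necessarily the exact pipeline used in the manuscript, is to fit the 415 nm trace to the 465 nm trace by least squares and express the 465 nm signal relative to that fit. The function names and the sample values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def delta_f_over_f(signal_465, signal_415):
    """Motion-corrected dF/F using the isosbestic channel as the reference.

    Shared fluctuations (for example from fiber movement) appear in both
    channels and are absorbed by the fit; Ca2+-dependent changes appear
    only in the 465 nm channel and remain in the residual.
    """
    slope, intercept = fit_line(signal_415, signal_465)
    fitted = [slope * v + intercept for v in signal_415]
    return [(s - f) / f for s, f in zip(signal_465, fitted)]


if __name__ == "__main__":
    # Hypothetical traces: shared artifact in both channels, plus one
    # Ca2+ transient (index 3) present only in the 465 nm channel.
    iso = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05]
    ca = [2.0 * v + 0.5 for v in iso]
    ca[3] += 1.0
    print(delta_f_over_f(ca, iso))
```

      In this toy example the artifact shared by both channels is largely removed, while the transient that exists only in the 465 nm channel remains the largest value in the corrected trace.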

      (10) In Fig 4, the authors carefully code various behaviors in mice. While they pick a few and show them as bars, they do not show the distribution of behaviors in Cre- vs Cre+ mice before manipulation (to show they have similar behaviors) or how these behaviors shift categories in each group with stimulation. Which behaviors in each group are shifting to others across the stim and post-stim periods compared to pre-stim?

      This is an important point. We selected behaviors to highlight in Figure 4C-E because these behaviors are exhibited in response to stress (De Boer & Koolhaas, 2003; van Erp et al., 1994). For the highlighted behaviors (jumping, treading/digging, and grooming) we show baseline (pre-photostimulation), stimulation, and post-stimulation values for Cre+ and Cre- mice, with the values for each animal plotted. We show all nine behaviors as a heat map in Figure 4B. The panels show changes that may occur as a function of time and show changes induced by photostimulation.

      The heatmaps demonstrate that photostimulation of SUMVGLUT2+::POA neurons causes a suppression of walking, grooming, and immobile behaviors with an increase in jumping, digging/treading, and rapid locomotion. After stimulation stops, there is an increase in grooming and time immobile. The control mice show a range of behaviors with no shifts noted with the onset or termination of photostimulation.

      Of note, issues of statistics, genotype, and SABV are important here. For example, the hint that treading/digging may have a slightly different pre-stim basal expression, it seems important to first evaluate strain and sex differences before interpreting these data.

      We examined the effects of sex as a biological variable in the experiments reported in the manuscript and found no differences between males and females in any of the experiments where we had enough animals of each sex (minimum of 5 mice) for meaningful comparisons. We did this by comparing means and SEM of males and females within each group (e.g. Cre+ males vs Cre+ females, Cre- males vs Cre- females) and then conducting a t-test to see if there was a difference. For figures that show time as a variable (e.g. Figure 6C-E), we compared males and females with time and sex as main factors (including multiple comparisons if needed). We found no significant main effects or interactions between males and females. Because of this, and to maximize statistical power, we decided to keep males and females together in all the analyses presented in the manuscript. It is worth noting also that the core of the experimental design employed is a change in behavior caused by photostimulation. The mice are also the same strain, with the only difference being the modification to add an IRES and sequence for Cre behind the coding sequence of the Slc17A6 (VGLUT2) gene.

      (11) Why do the authors use 10 Hz stimulation primarily? is this a physiologically relevant stim frequency? They show that they get effects with 1 Hz, which can be quite different in terms of plasticity compared to 10 Hz.

      Thank you for raising this important question. Because tests like open field and forced swim are subject to habituation and cannot be run multiple times per animal, a single test frequency was needed for consistency across multiple experiments. The frequency of 10 Hz was selected because it falls within the range of reported firing rates for SuM neurons (Farrel et al., 2021; Pedersen et al., 2017) and based on the robust but submaximal effects seen in the real-time place preference assays. Identification of the native firing rates during the stress response would be ideal, but gathering this data for the identified population remains a daunting task.

      (12) In Fig 5A-F, it is unclear whether locomotion differences are playing a role. Entrances (which are low for both groups) are shown but distance traveled or velocity are not.

      In B, there is no color in the lower left panel. where are these mice spending their time? How is the entirety of the upper left panel brighter than the lower left? If the heat map is based on time distribution during the session, there should be more color in between blue and red in the lower left when you start to lose the red hot spots in the upper left, for example. That is, the mice have to be somewhere in apparatus. If the heat map is based on distance, it would seem the Cre- mice move less during the stim.

      We appreciate the opportunity to address this question, and the attention to detail the reviewer applied to our paper. In the real-time place preference test (RTPP), stimulation was only provided while the animal was on the stimulation side. Mice quickly leave the stimulation side of the arena, as seen in the supplemental video, particularly at the higher frequencies. Thus, the time stimulation is applied is quite low. The mice often retreat to a corner after entering the stimulation side during trials using higher frequency stimulation. Changing locomotor activity alone could drive changes in the number of entrances, but we did not find this. In regard to the heat map, the color scale is dynamically set for each of the paired examples that are pulled from a single trial. To maximize the visibility between the paired examples, the color scale does not transfer between the trials. As a result, in the example for 10 Hz the mouse spent a larger amount of time in the area corresponding to the lower right corner of the image and the maximum value of the color scale is assigned to that region. As seen in the supplemental video, mice often retreated to the corner of the non-stimulation side after entering the stimulation side. The control animal did not spend a concentrated amount of time in any one region, thus there is a lack of warmer colors. In contrast, in the baseline condition both Cre+ and Cre- mice spent time in areas distributed on both sides of the arena, as expected. As a result, the maximum value in the heat map is lower and more areas are coded in warmer colors, allowing for easier visual comparison between the pair. Using the scale for the 10 Hz pair across all pairs leads to mostly dark images. We considered ways to optimize visualization across and within pairs and focused on the within-pair comparison for visualization.
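
      The effect of setting the color scale per pair rather than globally can be sketched as follows. This is a toy illustration under stated assumptions, not the tracking software used for the figures: positions are binned into a square occupancy grid, and each grid is divided by a chosen maximum before being mapped to color. The function names, arena size, and sample positions are hypothetical.

```python
def occupancy(positions, n_bins, arena_size):
    """Count tracked samples per spatial bin; positions are (x, y) pairs
    with coordinates between 0 and arena_size."""
    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in positions:
        col = min(int(x / arena_size * n_bins), n_bins - 1)
        row = min(int(y / arena_size * n_bins), n_bins - 1)
        grid[row][col] += 1
    return grid


def grid_max(grid):
    """Largest bin count in the grid."""
    return max(max(row) for row in grid)


def scale(grid, vmax):
    """Map counts to the 0-1 range used by a color map, given a maximum."""
    return [[count / vmax for count in row] for row in grid]


if __name__ == "__main__":
    # Hypothetical mouse that explores evenly (baseline-like behavior).
    even = ([(2.5, 2.5)] * 25 + [(7.5, 2.5)] * 25
            + [(2.5, 7.5)] * 25 + [(7.5, 7.5)] * 25)
    # Hypothetical mouse that retreats to one corner (stimulation-like).
    corner = [(9.0, 9.0)] * 91 + [(1.0, 1.0)] * 3 + [(9.0, 1.0)] * 3 + [(1.0, 9.0)] * 3

    even_grid = occupancy(even, n_bins=2, arena_size=10.0)
    corner_grid = occupancy(corner, n_bins=2, arena_size=10.0)

    own = scale(even_grid, grid_max(even_grid))        # per-map scale
    shared = scale(even_grid, grid_max(corner_grid))   # scale borrowed from the corner map
    print(own, shared)
```

      With its own maximum, the evenly exploring animal fills the map with warm values; with the maximum borrowed from the corner-dwelling animal, the same data falls near the bottom of the scale, which is why a single shared scale yields mostly dark images.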

      (13) By starting with 1 hz, are the experimenters inducing LTD in the circuit? what would happen if you stop stimming after the first epoch? Would the behavioral effect continue? What does the heat map for the 1 hz stim look like?

      Relatedly, it is a lot of consistent stimulation over time and you likely would get glutamate depletion without a break in the stim for that long.

      Thank you for the opportunity to add clarity around this point regarding the trials in RTPP testing. Importantly, the trials were not carried out in order of increasing frequency of stimulation, as plotted. Rather, the order of trials was, to the extent possible with the number of mice, counterbalanced across the five conditions. Thus, possible contribution of effects of one trial on the next were minimized by altering the order of the trials.
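
      One simple way to counterbalance trial order across five conditions is a cyclic Latin square, in which every condition occupies every serial position equally often. The sketch below is an illustration of that idea only; it is an assumption that a scheme of this kind matches the one used, and the condition labels and mouse identifiers are placeholders.

```python
def latin_square_orders(conditions):
    """Cyclic Latin square: row r is the condition list rotated by r places,
    so each condition appears exactly once in each serial position."""
    k = len(conditions)
    return [[conditions[(row + col) % k] for col in range(k)] for row in range(k)]


def assign_orders(mouse_ids, conditions):
    """Give each mouse one row of the square, cycling through the rows."""
    orders = latin_square_orders(conditions)
    return {mouse: orders[i % len(orders)] for i, mouse in enumerate(mouse_ids)}


if __name__ == "__main__":
    # Placeholder labels for the five stimulation conditions.
    conditions = ["cond_1", "cond_2", "cond_3", "cond_4", "cond_5"]
    mice = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]
    for mouse, order in assign_orders(mice, conditions).items():
        print(mouse, order)
```

      When the number of mice is not a multiple of five, as in this example, the balance across positions is only approximate, which corresponds to counterbalancing "to the extent possible". A cyclic square balances serial position but not which condition immediately precedes which.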

      We have added a heat map for the 1 Hz condition to figure 5B.

      For experiments on RTPP the average stimulation time at 10Hz was less than 10 seconds per event. As a result, the data are unlikely to be affected by possible depletion of synaptic glutamate. For experiments using sustained stimulation (open field or light dark choice assays) we have no clear data to address if this might be a factor where 10Hz stimulation was applied for the entire trial.

      (14) In Fig 6, the authors show that the Cre- mice just don't do the task, so it is unclear what the utility of the rest of the figure is (such as the PR part). Relatedly, the pause is dependent on the activation, so isn't C just the same as D? In G and H, why is a subset of Cre+ mice shown?

      Why not all mice, including Cre- mice?

      Thank you for the opportunity to improve the clarity of this section. A central aspect of the experiments in Figure 6 is the aversiveness of SUMVGLUT2+::POA neuron photostimulation, as shown in Figure 5B-F. The aversion to photostimulation drives task performance in the negative reinforcer paradigm. The mice perform a task (active port activation) to terminate the negative reinforcer (photostimulation of SuMVGLUT2+::POA neurons). Accordingly, control mice are not expected to perform the task because SuMVGLUT2+::POA neurons are not activated and, thus the mice are not motivated to perform the task.

      A central point we aim to convey in this figure is that while SuMVGLUT2+::POA neurons are being stimulated, mice perform the operant task. They selectively activated the active port (Supplemental Figure 7). As expected, control mice activate the active port at a low level in the process of exploring the arena. This diminishes on subsequent trials as mice habituate to the arena (Figure 6D). The data in Figures 6 C and D are related but can be divergent. Each pause in stimulation requires a port activation on an FR1 test, but the number of port activations can exceed the number of pauses, which are 10 seconds long, if the animal continues to activate the port. Comparing data in Figures 6 C and D reveals that mice generally activated the port two to three times for each pause earned, with a trend towards greater efficiency on day 4, with more rewards and fewer activations.
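
      The distinction between port activations and pauses earned can be made concrete with a small sketch. It assumes, as a simplification that is not stated in the manuscript, that an activation made while a 10-second pause is already running neither earns a new pause nor extends the current one. The function name and the timestamps are hypothetical.

```python
def count_pauses(activation_times, pause_s=10.0):
    """Count pauses earned on an FR1 schedule.

    Assumption for illustration: an activation that falls inside an
    ongoing pause is recorded as an activation but earns nothing.
    """
    pauses = 0
    pause_end = float("-inf")
    for t in sorted(activation_times):
        if t >= pause_end:      # stimulation is on, so this activation pauses it
            pauses += 1
            pause_end = t + pause_s
    return pauses


if __name__ == "__main__":
    # Hypothetical activation timestamps in seconds.
    times = [0.0, 2.0, 5.0, 12.0, 13.0, 30.0]
    earned = count_pauses(times)
    print(len(times), "activations,", earned, "pauses,",
          len(times) / earned, "activations per pause")
```

      Under this assumption, repeated activations during a pause raise the activation count without raising the pause count, so the ratio of activations to pauses serves as a rough index of efficiency.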

      The purpose of the progressive ratio test is to examine if photostimulation of SuMVGLUT2+::POA neurons continues to drive behavior as the effort required to terminate the negative stimulus increases. As seen in Figures 6 G and H, the stimulation of SuMVGLUT2+::POA neurons remains highly motivating. In the 20-minute trial we did not find a break point even as the number of port activations required to pause the stimulation exceeded 50. We do not show the Cre- mice in Figure 6G and H because they did not perform the task, as seen in Figure 6F. For technical reasons in early trials, we have fully time-stamped data for rewards and port activations from only a subset of the Cre+ mice. Of note, this subset contains both the highest and lowest performing mice from the entire data set.

      Taken together, we interpret the results of the operant behavioral testing as demonstrating that SuMVGLUT2+::POA neuron activation is aversive, can drive performance of an operant task (as opposed to fixed escape behaviors), and is highly motivating.

      (15) In Fig 7, what does the GCaMP signal look like if aligned to the onset of immobility? It looks like since the hindpaw swimming is short and seems to precede immobility, and the increase in the signal is ramping up at the onset of hindpaw swimming, it may be that the calcium signal is aligned with the onset of immobility.

      What does it look like for swimming onset?

      In I, what is the temporal resolution for the decrease in immobility? Does it start prior to the termination of the stim, or does it require some elapsed time after the termination, etc?

      Thank you for the opportunity to address these points and improve the clarity of our interpretation of the data. Regarding aligning the Ca2+ signal from fiber photometry recordings to swimming onset and offset, it is important to note that the swimming bouts are not the same length. As a result, in the time prior to alignment to offset of behaviors, animals will have been swimming for different lengths of time. In Figure 7 C, we use the behavioral heat map to convey the behavioral average. Below we show the Ca2+ dependent signal aligned at the offset of hindpaw swim for an individual mouse (A) and for the total cohort (B). This alignment shows that the Ca2+ dependent signal declines corresponding to the termination of hindpaw swimming. Because these bouts last less than the total window shown, the data is largely included in Figure 7 C and D, which is aligned to onset. Due to the nuance of the difference in the alignment and the partial redundancy, we elected to include the requested alignment to swimming offset in the reply rather than in the primary figure.

      Author response image 1.

      Turning to the question regarding swimming onset, the animals started swimming immediately when placed in the water and maintained swimming and climbing behaviors until shifting behaviors as illustrated in Figure 7A and B. During this time the Ca2+-dependent signal was elevated but there is only one trial per animal. This question can perhaps be better addressed in the dunk assay presented in Figure 3C, F and G and Supplemental Figure 4 H and I. Here swimming started with each dunk and the Ca2+ signal increased.

      Regarding the question about Figure 7I, we scored entire periods (2 mins) in aggregate. We noted in videos of the behavior test that there was an abrupt decrease in immobility tightly corresponding to the end of stimulation. In a few animals this shift occurred approximately 15-20 s before the end of stimulation. This may relate to the depletion of neurotransmitter as suggested by the reviewer.

      Reviewer 3

      Major points

      (1) Results in Figure 1 suggested that SuM-Vglu2::POA projected not only POA but also to the diverse brain regions. We can think of two models which account for this. One is that homogeneous populations of neurons in SuM-Vglu2::POA have collaterals and innervated all the efferent targets shown in Figure 1. Another is to think of distinct subpopulations of neurons projecting subsets of efferent targets shown in Figure 1 as well as POA. It is suggested to address this by combining approaches taken in experiments for Figure 1 and Supplemental Figure 2.

      Thank you for raising this interesting point. We have attempted combining retroAAV injections into multiple areas that receive projections from SUMVGLUT2+::POA neurons. However, we have found the results unsatisfactory for separating the two models proposed. Using eYFP- and tdTomato-expressing retroAAVs we saw some overlapping expression in SuM. We are not able to conclude whether this indicates separate populations or partial labeling of a homogeneous population. A third option seems possible as well. There could be a mix of neurons projecting to different combinations of downstream targets. This seems particularly difficult to address using fluorophores. We are preparing to apply additional methodologies to this question, but it extends beyond the scope of this manuscript.

      (2) Since the authors drew a hypothetical model in which the diverse brain regions mediate the effect of SuM-Vglu2::POA activation in behavioral alterations at least in part, examination of the concurrent activation of those brain regions upon photoactivation of SuM-Vglu2::POA. This must help the readers to understand which neural circuits act upon the induction of active coping behavior under stress.

      Thank you for raising this important point. We agree that activating glutamatergic neurons should lead to activation of postsynaptic neurons in the target regions. Delineating this in vivo is less straightforward. Doing so requires much greater knowledge of the postsynaptic partners of SUMVGLUT2+::POA neurons. There are a number of issues that would need to be accounted for. Undertaking two-color photostimulation plus fiber photometry is possible but not technically trivial. Further, it is possible that we would measure Ca2+ signals in neurons that have no relevant input, or that local circuits in a region may shape the signal. We would also lack the temporal resolution to distinguish monosynaptic from polysynaptic connections. Thus, we would struggle to know whether a change in signal was due to the excitatory input from SuM or from a second region. At present, we remain unclear on how to pursue this question experimentally in a manner that is likely to generate clearly interpretable results.

      (3) In Figure 4, "active coping behaviors" must be called "behaviors relevant to the active behaviors" or "active coping-like behaviors", since those behaviors were in the absence of stressors to cope with.

      Thank you for the suggestion on how to clarify our terminology. We have adopted the active coping-like term.

      (4) For the Dunk test, it is suggested to describe the results and methods more in detail, since the readers would be new to it. In particular, the mice could change their behavior between dunks under this test, although they still showed immobility across trials as in Supplemental Figure 4I. Since neural activity during the test was summarized across trials as in Figure 3, it is critical to examine whether the behavior changes according to time.

      Thank you for identifying this opportunity to improve our manuscript. We have expanded the methods section with a detailed description of the dunk test.

      As for Supplemental Figure 4I, we apologize for the confusion. The purpose of this figure is to show that mice remained mobile for the entire 30-second dunk trial, and that this did not appreciably change over the 10 trials. We have revised this figure to plot both immobile and mobile time to achieve greater clarity on this point.

      Minor points

      Typos

      In Figure 1, please add a serotype of AAVs to make it compatible with other figures and their legends.

      In the main text and Figure 2K, the authors used MHb/LHb and mHb/lHb in a mixed fashion. Please make them unified.

      In the figure legend of Figure 6, change "SuMVGLUT2+::POA neurons drive" to "SuMVGLUT2+::POA neurons " in the title.

      In line 86, please change "Retro-AAV2-Nuc-flox(mCherry)-eGFP" to "AAV5-Nuc-flox(mCherry)eGFP".

      In line 80, please change "Positive controls" to "As positive controls, ".

      Thank you for taking the time and making the effort to identify and call these out. We have corrected them.

    1. Author Response

      The following is the authors’ response to the previous reviews

      The revised manuscript is much improved - many unclear points are now better explained. However, in our opinion, some issues could still be significantly improved.

      1. Statistics: none of us are experts in statistics but several things remain questionable in our opinion and if it were our study, we would consult with an expert:

      a) while we understand the authors note about N-chasing and p-hacking, we wonder how the number of N's was premeditated before obtaining the results. Why in 4M an N of 3 is sufficient while in 3E the N is >20 (and not mentioned). At the very least, we think it would be wise to be cautious when stating something as not-significant when it is clear (as in 4M) that the likelihood of it actually being statistically significant is quite large.

      b) In most analyses, the data is not only normalized by actin or some other measure but also to the first (i.e left side on the graph) condition, resulting in identical data points that equal '1' (in Figure 4 alone - C; I; K; M; and O) - while this might be scientifically sound, it should be mentioned (the specific normalization) and also note that this technique shadows any real variance that exists in the original data in this condition. consider exploring techniques to overcome this issue.

      c) In 3C, - if we understand the experiment, you want to convince us that the DIFFERENCE between eB2-FC compared to FC is larger in the control compared to the experiment. We are not absolutely sure that the statistical tools employed here are sufficient - which is why we would consult an expert.

      A) We are aware that many studies do not consistently quantify such experiments. For example, there are essentially no published examples of the signalling timelines of EphB2 receptors as in Fig. 5. When one strives to quantify such biochemical effects, an unquantified experiment stands out, and so perhaps we were too strict in trying to quantify as many experiments as possible, resulting in low n’s for some of them. We acknowledge that additional experiments on EPHB1 protein stability may reach significance. We have adjusted our text on lines 332-335 to point to this interesting trend, and slightly changed the conclusion to this section. Similarly, we commented on similar trends when describing Figs. 1E and 4G on lines 901 and 952.

      B) For the Western blot band intensity normalisation, we believe that our method is scientifically sound. Normally, when the replicate samples are loaded on one gel and blotted on the same membrane, the experimenter only needs to normalise the target band intensity to its cognate loading control band intensity for quantitation. However, we usually have a large number of samples from multiple experiments, carried out on different dates. For example, in Fig. 4B,C there are 7 biological replicates collected from 7 experiments and in Fig. 4D there are 10 protein samples. It is not possible for us to run all samples on the same gel. In addition, due to the combined effects of variance in transfer efficiency, the potency of antibodies, detection efficiency and the developing time for each blot, it is practically impossible to generate similar band intensity for each batch. Thus, we use normalisation of test bands to the loading control for individual experiments, and this analysis method is widely accepted by reputable journals with a focus on biochemical experiments (for example: PMID 37695914: Fig. 3 A,B,C; PMID 36282215: Fig. 3 B,C,D,E; PMID 33843588: Fig. 3 C,D,E,F,G,H). Since the value of the first sample on the plot is 1, which is a hypothetical value and does not meet the parametric test requirement, we performed a one-sample t-test when other samples are compared with the first sample (PMID 35243233 Fig. 6 A,B,C,D; https://www.graphpad.com/quickcalcs/oneSampleT1/, “A one sample t-test compares the mean with a hypothetical value. In most cases, the hypothetical value comes from theory. For example, if you express your data as 'percent of control', you can test whether the average differs significantly from 100.”). Thus, we believe that our normalisation and statistical methods are both correct with a large number of precedents.
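
      The normalisation and one-sample test described above can be sketched as follows. This is a minimal illustration only: the band-intensity ratios are hypothetical values, not data from the study, and the helper name `one_sample_t` is an assumption introduced here. It returns the t statistic and degrees of freedom; a p-value would then be read from a t distribution.

```python
import math
import statistics


def one_sample_t(values, hypothetical_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    standard_error = sd / math.sqrt(n)
    return (mean - hypothetical_mean) / standard_error, n - 1


# Hypothetical target/loading-control ratios, one control-treated pair per blot.
control = [0.80, 1.10, 0.95]
treated = [0.40, 0.66, 0.38]

# Normalise each blot to its own control lane. The control is then exactly 1
# on every blot, so it carries no variance and acts as a fixed reference.
fold_change = [t / c for t, c in zip(treated, control)]

# Compare the normalised treated values against the hypothetical value 1.
t_stat, dof = one_sample_t(fold_change, 1.0)
```

      Because every control value becomes 1 by construction, a two-sample test has no control variance to work with, which is why the comparison is made against a fixed value instead.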

      C) This comment refers to the cell collapse experiment shown in Fig. 3C for which the data are plotted in Fig. 3D. We stand by the statistical method used. There are two groups of cells (CTRLCRISPR and MYCBP2 CRISPR) and two treatments for each cell group (Fc control and eB2), thus we should use two-way ANOVA. Since we compared the cell retraction effects of Fc and eB2 on the two groups of cells, Sidak post hoc comparison is the right method to avoid errors introduced by multiple comparisons. Here is an example of an eLife article that used the same statistical method for similar comparisons: PMID 37830910, Fig. 1 H,I. To make the comparison easier, we grouped the experiments by cell type (CTRLCRISPR and MYCBP2 CRISPR) as opposed to by treatment. Below, the old version is on the right, and the new version is on the left. The conclusion is that eB2 induces less cell collapse in cells depleted of MYCBP2, when compared to the control cells. However, eB2 is still able to collapse cells lacking MYCBP2.

      Author response image 1.

      Revisiting these data, we noticed an error introduced when CC compiled the data used to generate Fig. 3D. The data were acquired from nine biological replicates per condition. CC used a mix of two methods for cell collapse rate calculation: the first method involved the sum of collapsed cells and all cells from multiple regions of one coverslip (biological replicate). The second method involved computing a collapse rate in each region which then was used to calculate the average collapse rate for the entire coverslip (technical replicate). Given the small cell numbers due to sparse culture conditions, we believe that the first method is a more conservative approach. We hence re-plotted all replicate data using the first method. This resulted in slightly different % collapse and p values. These were changed accordingly in the text and plot and do not affect the conclusion of this experiment.
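
      The difference between the two calculation methods can be illustrated with a short sketch. The counts below are hypothetical and the function names are assumptions made for illustration; they are not taken from the study.

```python
def pooled_collapse_rate(regions):
    """Method 1: sum collapsed cells and all cells over the whole coverslip."""
    collapsed = sum(c for c, _ in regions)
    total = sum(n for _, n in regions)
    return collapsed / total


def mean_of_region_rates(regions):
    """Method 2: compute a rate per region, then average the rates."""
    rates = [c / n for c, n in regions]
    return sum(rates) / len(rates)


# Hypothetical (collapsed, total) counts for three regions of one coverslip.
coverslip = [(1, 2), (3, 10), (2, 8)]

pooled = pooled_collapse_rate(coverslip)    # 6 / 20
averaged = mean_of_region_rates(coverslip)  # mean of 0.5, 0.3 and 0.25
```

      With sparse cultures, a region containing only two cells contributes as much to the second method as a region containing ten, whereas pooling weights every cell equally.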

      2) Thanks for the clarification that the interaction between the extracellular domain of EPHB2 and MYCBP2 might not occur directly - however, unless we missed it, this was not clearly stated in the text. It is an important point and also a cool direction for the future - to find the elusive co-receptor that actually helps EPHB2 and MYCBP2 form a complex.

      We now also refer to this in the results section on line 215.

      “Since EPHB2 is a transmembrane protein and MYCBP2 is localised in the cytosol, these experiments suggest that the interaction between the extracellular domain of EPHB2 and MYCBP2 might be indirect and mediated by other unknown transmembrane proteins.”

      3) The Hela CRISPR cell line is better explained in the response letter but still not sufficiently explained in the text for a non-expert reader. If the authors want any reader to comprehend this, we would strongly recommend adding a scheme.

      We now include a schematic outlining the CRISPR cell generation as Fig. 3A and its description on line 926.

      Author response image 2.

      4) To clarify some of our previous (and persisting) concerns about Figure 3D/E - it is true that a reduction in 25% of cell size is dramatic. But (if we understand correctly) your claim is that a reduction in 22% (this is a guess, as the actual numbers are not supplied) is significantly less than 25%. Even if it is, statistically speaking, significant, what is the physiological relevance of this very slight effect? In this experiment, the N was quite large, and we wonder if the images in D are representative - it would be nice to label the data points in E to highlight which images you used.

      We now mention the average cell area contraction measurements in the legend to Fig. 3F on line 935. We also tracked down the individual cells shown in Fig. 3E and they are now labelled as data points in blue in Fig. 3F. HeLa cell collapse is a simplified model of EPHB2 function and we do not know whether the difference between the behaviour of CTRLCRISPR and MYCBP2 CRISPR cells is physiologically significant and thus we prefer not to speculate on this.

      5) Figure 3F and other stripe assays - In the end, it is your choice how to quantify. We believe that quantifying area of overlap is a more informative and objective measurement that might actually benefit your analyses. That said, if you do keep the quantification as it is now, you have to define the threshold of what you mean by "cell/s (or an axon in 7A, where it is even more complicated, as are you alluding to primary, secondary, or even smaller branches) are RESIDING within the stripe". Is 1% overlap sufficient or do you need 10 or 50% overlap?

      We have now added this statement to the methods on line 745: “A cell was considered to be on an ephrin-B2 stripe when more than 50% of its nucleus was located on that stripe”. For the chick explant stripe assay, when measuring the length of an axon on a stripe, we measured only the main axons originating from the explants.

      For explant/stripe experiments in Fig. 7 AB, we now use the term “GFP-expressing neurite” rather than “branch”. This was already present in the results of the previous version, but the methods and legend needed to be brought up to date (lines 786 and 1008). We think that “branch” was a confusing term that was supposed to mean the same thing as “neurite” but came across as some indication of branching. We do not know whether the GFP+ neurites were primary or secondary extensions of explants, or in fact, whether some of them contained more than one axon. We also adjusted the method to reflect the fact that some stripes were used in conjunction with a single explant and added a reference to a previous study extensively using this method (Poliak et al., 2015) on line 778.

      6) We still don't get the link to the lysosomal degradation. Your data suggests that in your cells EPHB2 is primarily degraded by the lysosomal pathway and not proteasome. Any statement about MYCBP2 is not strongly supported by the data, in our opinion - Unless you develop some statistical measurement that shows that the effect of BafA1 is statistically different in MYCBP2 cells than in control cells. Currently, this is not the case and the link is therefore not warranted in our opinion.

      We generated a new version of Fig. 4K with average increase in EPHB2 levels in the presence of BafA1 and CoQ, compared to DMSO treated controls (see below). BafA1 and CoQ restored EPHB2 protein levels by 19% and 14% respectively in CtrlCRISPR cells, while the inhibitors restored EPHB2 protein levels by 40% and 35% respectively in MYCBP2 CRISPR cells.

      Author response image 3.

      For each of the 4 replicates, the increase in EPHB2 levels by BafA1 compared to DMSO is as follows:

      Author response table 1.

      These values are not significantly different between CtrlCRISPR cells and MYCBP2 CRISPR cells (p = 0.08, Student’s t-test). The same holds for the CoQ experiment. We now temper our conclusion for this experiment: Although the difference in percentage increase between CTRLCRISPR cells and MYCBP2CRISPR cells is not significant, this trend raises the possibility that the loss of MYCBP2 promotes EPHB2 receptor degradation through the lysosomal pathway (line 319). We also adjusted the section title (line 306).

      7) While the C. elegans part is now MUCH better explained - we are not sure we understand the additional insight. The fact that vab-1 and glo4 double mutants are additive as are vab1 and fsn1, suggest they act in parallel (if the mutants are NULL, and not if they are hypomorphs, if one wants to be accurate) - how this relates to your story is unclear. The vab1/rpm1 double mutant is still uninformative and incomplete. rpm1 phenotype is so severe that nothing would make it more severe. We read the Jin paper that the authors directed to - nothing makes the rpm1 phenotype more severe. Yes, some DOWNSTREAM elements make the rpm1 phenotype LESS severe - this is not something you were testing, to the best of our knowledge. Rather, you wanted to see if rpm1 mutant resulted in stabilization of vab1 and thus suppression of vab1 phenotype - we are just not sure the system is amenable to test (actually reject) your hypothesis that Vab1 is degraded by rpm1. Also, assuming we are talking about NULLs, the fact that the rpm1 phenotype is WAY stronger than the vab1 mutant, suggests that rpm1 functions via multiple routes, adding even more complexity to the system. Given these results, despite the much improved clarity, we are still not sure that the worm data adds new insight, rather than potentially confusing the reader.

      We realise that the genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network are complicated. However, we insist on keeping the data for the sake of its availability for future studies and completeness. We also think it is important for readers and the community to see these data, even if the authors and reviewers are not entirely in agreement about the importance/interpretation of experimental outcomes. It is our hope that the community will examine the results and draw their own conclusions.

      A few points of clarification:

      The C. elegans experiments were designed to test genetically if the vertebrate interactions between EPHB2 and MYCBP2 and its signalling network are conserved. We studied two kinds of interactions: (1) between vab-1 and RPM-1/MYCBP2 downstream proteins (GLO-4 and FSN-1) and (2) between vab-1 and rpm-1. For these studies, we used null alleles for vab-1, glo-4 and fsn-1 which is now noted on lines 440, 453, 475 and 859. Our findings are consistent with the VAB-1 Ephrin receptor functioning in parallel to known RPM-1 binding proteins. This is further supported by new data: vab-1; fsn-1 double mutants showed enhanced incidence of axon overextension defects using a second transgenic background, zdIs5 (Pmec-4::GFP), to visualize axon termination (Fig. 8F).

      This second transgenic background also allowed us to generate new data to address your concerns about phenotypic saturation in rpm-1 mutants. To do this, we used the zdIs5 (Pmec-4::GFP) genetic background, in which axon termination defects are not saturated in rpm-1 mutants (Fig. 8F) because they can be enhanced by other mutants such as cdc-42 and unc-33 (Fig. 7C, D, in Borgen et al. Development 144, 4658–4672 (2017), PMID 29084805). In this new background, we found that vab-1 loss of function fails to enhance the incidence of severe “hook” defects in rpm-1 mutants, which is an indication that the two genes function in the same pathway. Importantly, prior studies in this background also showed that mutants in the RPM-1 signalling network (e.g. fsn-1, glo-4 and ppm-2) do not enhance the incidence of severe “hook” defects as double mutants with rpm-1 compared to rpm-1 single mutants (Fig. 7B, ibid.).

      To reflect these ideas more clearly, we revised the Results section pertaining to C. elegans genetics (starting on line 418) and tempered our discussion (line 517). Basically, this section now says that we studied genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network. From these experiments we conclude that: (1) The enhancement of overextension defects in vab-1; glo-4 and vab-1; fsn-1 double mutants compared to single mutants indicates that VAB-1/EPHR functions in parallel to known RPM-1 binding proteins to facilitate axon termination, and (2) Since the vab-1; rpm-1 double mutants do not display an increased frequency or severity of overextension defects compared to rpm-1 single mutants, VAB-1/EPHR functions in the same genetic pathway as RPM-1/MYCBP2.

      The new genetic data included in this version were generated by Karla J. Opperman who is now included as a co-author.

      Further corrections:

      Author response image 4.

      Because of the errors associated with quantifications in Fig. 3D (see above), we reviewed other quantification methodologies and noticed another discrepancy that required a correction. In the hippocampal neuron growth cone collapse assay shown in the previous version of Fig. 7 D (left), the growth cones were classified into three groups: 1, fully collapsed; 2, hard to tell, but not fully collapsed; 3, fan-shape cones. Two different quantifications were performed as follows: (1), number of fully collapsed cones divided by the numbers of all growth cones; (2), number of fully collapsed cones divided by [number of fully collapsed cones + fan-shape cones]. CC erroneously used the second method to generate Fig. 7D.
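
      The two quantifications described above differ only in their denominator, as the following sketch shows. The growth cone counts are hypothetical and the function names are assumptions introduced for illustration.

```python
def collapse_fraction_all(fully_collapsed, ambiguous, fan_shaped):
    """Quantification (1): fully collapsed cones over all growth cones."""
    return fully_collapsed / (fully_collapsed + ambiguous + fan_shaped)


def collapse_fraction_excluding_ambiguous(fully_collapsed, ambiguous, fan_shaped):
    """Quantification (2): fully collapsed cones over collapsed + fan-shaped."""
    # The ambiguous group is accepted as an argument but deliberately ignored.
    return fully_collapsed / (fully_collapsed + fan_shaped)


# Hypothetical counts for one condition: groups 1, 2 and 3 as defined above.
counts = dict(fully_collapsed=12, ambiguous=8, fan_shaped=20)

fraction_all = collapse_fraction_all(**counts)
fraction_excluding = collapse_fraction_excluding_ambiguous(**counts)
```

      Dropping the ambiguous cones from the denominator can only raise the reported collapse fraction, so the first quantification is the more conservative of the two.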

      We think that the first method is more appropriate. Furthermore, since n=5 for the Fc and eB1-Fc conditions, but n=3 for the eB2-Fc condition, we decided to omit the eB2-Fc condition. The final plot for Figure 7D is the following:

      Author response image 5.

      Our conclusion still stands that exogenous FBD1 WT overexpression impaired the growth cone collapse mediated by EphB.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this paper, Steinemann et al. characterized the nature of stochastic signals underlying the trial-averaged responses observed in the lateral intraparietal cortex (LIP) of non-human primates (NHPs), while these performed the widely used random dot direction discrimination task. Ramp-up dynamics in the trial-averaged LIP responses were reported in numerous papers before. However, the temporal dynamics of these signals at the single-trial level have been subject to debate. Using large-scale neuronal recordings with Neuropixels in NHPs allows the authors to settle this debate rather compellingly. They show that drift-diffusion-like computations account well for the observed dynamics in LIP.

      Strengths:

      This work uses innovative technical approaches (Neuropixel recordings in behaving macaque monkeys). The authors tackle a vexing question that requires measurements of simultaneous neuronal population activity and hence leverage this advanced recording technique in a convincing way.

      They use different population decoding strategies to help interpret the results.

      They also compare how decoders relying on the data-driven approach using dimensionality reduction of the full neural population space compare to decoders relying on more traditional ways to categorize neurons that are based on hypotheses about their function. Intriguingly, although the functionally identified neurons are a modest fraction of the population, decoders that only rely on this fraction achieve comparable decoding performance to those relying on the full population. Moreover, decoding weights for the full population did not allow the authors to reliably identify the functionally identified subpopulation.

      Weaknesses:

      No major weaknesses beyond a few, largely clarification issues, detailed below.

      We thank Reviewer 1 (R1) for this summary. The revised manuscript incorporates R1’s suggestions, as detailed below.

      Reviewer #2 (Public Review):

      Steinemann, Stine, and their co-authors studied the noisy accumulation of sensory evidence during perceptual decision-making using Neuropixels recordings in awake, behaving monkeys. Previous work has largely focused on describing the neural underpinnings through which sensory evidence accumulates to inform decisions, a process which on average resembles the systematic drift of a scalar decision variable toward an evidence threshold. The additional order of magnitude in recording throughput permitted by the methodology adopted in this work offers two opportunities to extend this understanding. First, larger-scale recordings allow for the study of relationships between the population activity state and behavior without averaging across trials. The authors’ observation here of covariation between the trial-to-trial fluctuations of activity and behavior (choice, reaction time) constitutes interesting new evidence for the claim that neural populations in LIP encode the behaviorally-relevant internal decision variable. Second, using Neuropixels allows the authors to sample LIP neurons with more diverse response properties (e.g. spatial RF location, motion direction selectivity), making the important question of how decision-related computations are structured in LIP amenable to study. For these reasons, the dataset collected in this study is unique and potentially quite valuable.

      However, the analyses at present do not convincingly support two of the manuscript’s key claims: (1) that “sophisticated analyses of the full neuronal state space” and “a simple average of Tin^con neurons” yield roughly equivalent representations of the decision variable; and (2) that direction-selective units in LIP provide the samples of instantaneous evidence that these Tin^con neurons integrate. Supporting claim (1) would require results from sophisticated population analyses leveraging the full neuronal state space; however, the current analyses instead focus almost exclusively on 1D projections of the data. Supporting claim (2) convincingly would require larger samples of units overlapping the motion stimulus, as well as additional control analyses.

      We thank the reviewer (R2) for their careful reading of our paper and the many useful suggestions.

      As detailed below, the revised manuscript incorporates new control analyses, improved quantification, and statistical rigor, which now provide compelling support for key claim #1. We do not regard claim #2 as a key claim of the paper. It is an intriguing finding with solid support, worthy of dissemination and further investigation. We have clarified the writing on this matter.

      Specific shortcomings are addressed in further detail below:

      (1) The key analysis (the correlation between trial-by-trial activity fluctuations and behavior, presented in Figure 5) is opaque, and would be more convincing with negative controls. To strengthen the claim that the relationship between fluctuations in (a projection of) activity and fluctuations in behavior is significant/meaningful, some evidence should be brought that this relationship is specific - e.g. do all projections of activity give rise to this relationship (or not), or what level of leverage is achieved with respect to choice/RT when the trial-by-trial correspondence with activity is broken by shuffling.

      We do not understand why R2 finds the analysis opaque, but we are grateful for the lucid recommendations. The relationships between fluctuations in neural activity and behavior are indeed “specific” in the sense that R2 uses this term. In addition to the shuffle control, which destroys both relationships (Reviewer Figure 1), we performed additional control analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by establishing weight vectors that were either chosen from a standard normal distribution or obtained by permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions yield 𝑆rand(𝑡), for which the degree of leverage is effectively zero or greatly reduced. These analyses are summarized in a new Supplementary Figure S10. The bottom row of Figure S10 also addresses the question, “What degree of leverage and mediation would be expected for a theoretical decision variable?” This is accomplished by simulating decision variables using the drift-diffusion model fits in Figure 1c. The simulation is consistent with the leverage and (incomplete) mediation observed for the populations of Tin^con neurons. For details see Methods, Simulated decision variables and Leverage of single-trial activity on behavior.
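
      The construction of random coding directions can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for Figure S10: the firing rates and PC-1 weights are hypothetical, the function names are introduced here, and scaling the Gaussian weights to unit length is an assumption.

```python
import math
import random


def unit_norm(weights):
    """Scale a weight vector to unit length (an assumption of this sketch)."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


def gaussian_direction(n_neurons, rng):
    """Random CD with weights drawn from a standard normal distribution."""
    return unit_norm([rng.gauss(0.0, 1.0) for _ in range(n_neurons)])


def permuted_direction(pc1_weights, rng):
    """Random CD that keeps the PC-1 weight values but shuffles them across neurons."""
    shuffled = list(pc1_weights)
    rng.shuffle(shuffled)
    return shuffled


def project(rates, weights):
    """Weighted sum across neurons for each time bin, giving a 1D signal."""
    return [sum(r * w for r, w in zip(bin_rates, weights)) for bin_rates in rates]


rng = random.Random(0)
pc1_weights = unit_norm([0.9, 0.3, -0.2, 0.1])        # hypothetical PC-1 weights
rates = [[1.0, 2.0, 0.5, 0.0], [2.0, 2.0, 0.5, 1.0]]  # hypothetical: 2 time bins x 4 neurons

s_rand_gaussian = project(rates, gaussian_direction(4, rng))
s_rand_permuted = project(rates, permuted_direction(pc1_weights, rng))
```

      Permuting the PC-1 weights preserves their distribution exactly while breaking the assignment of weights to neurons, which is consistent with its description above as the more conservative control.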

      (2) The choice to perform most analysis on 1D projections of population activity is not wholly appropriate for this unique type of dataset, limiting the novelty of the findings, and the interpretation of similarity between results across choices of projection appears circular:

      We disagree with the characterization of our argument as circular, but R2 raises several important points that will probably occur to other careful readers. We address them as subpoints 2.1–2.4, below. Importantly, we are neither claiming nor assuming that the LIP population activity is one-dimensional. We have revised the paper to avoid giving this impression. We are also not claiming that the average of Tin neurons (or the 1D projections) explains all features of the LIP population, nor would we expect it to, given the diversity of response fields across the population. Our objective is to identify the specific dimension within population activity that captures the decision variable (DV), which has been characterized successfully as a one-dimensional stochastic process—that is, a scalar function of time. We have endeavored to clarify our thinking on this point in the revised manuscript (e.g., lines 97–98, 103–104).

      (2.1) The bulk of the analyses (Figure 2, Figure 3, part of Figure 4, Figure 5, Figure 6) operate on one of several 1D projections of simultaneously recorded activity. Unless the embedding dimension of these datasets really does not exceed 1 (dimensionality using e.g. participation ratio in each session is not quantified), it is likely that these projections elide meaningful features of LIP population activity.

      We now report the participation ratio (4.4 ± 0.4, mean ± s.e. across sessions), and we state that the first 3 PCs explain 67.1±3.1% of the variance of time- and coherence-dependent signals used for the PCA. We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. We do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.
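
      For readers unfamiliar with the participation ratio, it is commonly computed from the eigenvalues of the covariance matrix as the squared sum of the eigenvalues divided by the sum of their squares. The sketch below uses a hypothetical eigenvalue spectrum, not values from the study, and the function names are assumptions.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    return total ** 2 / sum(v * v for v in eigenvalues)


def variance_explained_by_top(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical spectrum: one dominant dimension and three weaker ones.
spectrum = [4.0, 2.0, 1.0, 1.0]

pr = participation_ratio(spectrum)                # 64 / 22
top3 = variance_explained_by_top(spectrum, 3)     # 7 / 8
flat = participation_ratio([1.0, 1.0, 1.0, 1.0])  # equal variance in 4 dimensions
```

      The ratio equals the number of dimensions when variance is spread evenly and approaches 1 when a single dimension dominates, so it serves as a soft count of the dimensions in use.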

      (2.2) Further, the observed similarity of results across these 1D projections may not be meaningful/interpretable. First, the rationale behind deriving Sramp was based on the ramping historically observed in Tin neurons during this task, so should be expected to resemble Tin.

      The Reviewer is correct that we would expect 𝑆ramp to resemble the ramping observed in Tin neurons. We refer to this approach as hypothesis-driven. It captures the drift component of drift-diffusion. It is true that the Tin^con neurons exhibit such ramps in their trial average firing rates, but this does not guarantee that the single-trial population firing rates would manifest as drift-diffusion. Indeed Latimer et al. (2015) concluded that the ramp-like averages comprise stepping from a low to a high firing rate on each trial at a random time. Therefore, while R2 is right to characterize the similarity of Tin^con to the ramp direction in trial-averaged activity as unsurprising, their similarity on single trials is not guaranteed.

      (2.3) Second, Tin comprises the largest fraction of the neuron groups sampled during most sessions, so SPC1 should resemble Tin too. The finding that decision variables derived from the whole population’s activity reduce essentially to the average of Tin neurons is thus at least in part ’baked in’ to the approach used for deriving the decision variables.

      This is incorrect. The Tin^con neurons constitute only 14.5% of the population, on average, across the sessions (see Table 1). This misunderstanding might contribute to R2’s concern about the importance of these neurons in shaping PC1. It is not simply because they are over-represented. Also, addressing R2’s concern about circularity, we would like to remind R2 that the selection of Tin neurons was based only on their spatial selectivity in the delayed saccade task. We do not see how it could be baked-in/guaranteed that a simple average of these neurons (i.e. zero degrees of freedom) yields dynamics and behavioral correlations that match those produced by dimensionality-reduction techniques that (𝑖) have degrees of freedom equal to the number of neurons and (𝑖𝑖) are blind to the neurons’ spatial selectivity. We have additionally modified what is now Supplementary Figure S13 (old Supplementary Figure S8), which portrays the mean accuracy of choice decoders trained on the neural activity of all neurons, only Tin neurons, all but the Tin neurons, and all but Tin and Min neurons, respectively. Figure S13 now highlights how much more readily choice can be decoded from the small population of Tin neurons than the remainder of the population.

      (2.4) The analysis presented in Figure S6 looks like an attempt to demonstrate that this isn’t the case, but is opaque. Are the magnitudes of weights assigned to units in Tin larger than in the other groups of units with preselected response properties? What is their mean weighting magnitude, in comparison with the mean weight magnitude assigned to other groups? What is the null level of correspondence observed between weight magnitude and assignment to Tin (e.g. a negative control, where the identities of units are scrambled)?

      The revised Figure S6—what is now Figure S9—displays more clearly that the weights assigned to Tconin and Tipsin neurons (purple & yellow, respectively) are larger in magnitude than those assigned to other neurons (gray). Author response table 1 shows a more detailed breakdown of the groups. Note that the length of the vector of weights is one. We are unsure what R2 means by “the null level of correspondence.” Perhaps it helps to know that the mean weight of the “other neurons” is close to zero for all four coding directions. However, it is the overlap of the weights and the relative abundance of non-Tin neurons that is more germane to the point we are making. To wit, knowing the weight (or percentile) of a neuron is a poor predictor that it belongs to the Tin category. This point is most clearly supported by the logistic regression (Fig. S9, bottom row). In other words, the large group of non-Tin neurons contributes substantially to all four coding directions examined in Figure S9. Thus, the similarity between Tin neurons and PC1 is not simply due to an over-representation of Tin neurons as suggested in item 2.3.

      Author response table 1.

      Mean weights assigned to neuron classes in four coding directions.

      (3) The principal components analysis normalization procedure is unclear, and potentially incorrect and misleading: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the typical choice of mean/standard deviation of activity in the full data window? This choice would specifically squash responses for units with a strong visual response, which distorts the covariance matrix, and thus the principal components that result. This kind of departure from the standard procedure should be clearly justified: what do the principal components look like when a standard procedure is used, and why was this insufficient/incorrect/unsuitable for this setting?

      We used the early window because it is a robust measure of overall excitability, but we now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice. We thank R2 for raising this concern.
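The change of normalization window can be illustrated with a minimal sketch. It assumes trial-averaged firing rates stored as a `(neurons, conditions, time)` array and pools all conditions and time points inside the 200–600 ms window to obtain each neuron's mean and standard deviation. The function name `zscore_by_window`, the array layout, and the pooling rule are hypothetical choices for illustration, not a description of the code used in the study.

```python
import numpy as np

def zscore_by_window(rates, times, window=(0.2, 0.6)):
    """Standardize each neuron with the mean and s.d. taken from a reference window.

    rates  : array (n_neurons, n_conditions, n_timepoints), trial-averaged rates
    times  : array (n_timepoints,), seconds from motion onset
    window : (start, stop) in seconds; 0.2-0.6 s is the epoch named in the text
    """
    rates = np.asarray(rates, dtype=float)
    times = np.asarray(times, dtype=float)
    in_window = (times >= window[0]) & (times <= window[1])
    # Assumption: pool conditions and time points within the window, per neuron.
    ref = rates[:, :, in_window].reshape(rates.shape[0], -1)
    mu = ref.mean(axis=1)
    sd = ref.std(axis=1)
    sd = np.where(sd > 0, sd, 1.0)  # guard against constant (e.g. silent) neurons
    return (rates - mu[:, None, None]) / sd[:, None, None]
```

Because the reference statistics come from the analysis epoch itself, a neuron with a strong early visual transient is no longer scaled by that transient, which is the concern raised in item 3.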

      (4) Analysis conclusions would generally be stronger with estimates of variability and control analyses: This applies broadly to Figures 2-6.

      We have added estimates of variability and control analyses where appropriate.

      Figure 2 shows examples of single-trial signals. The variability is addressed in Figure 3a and the new Supplementary Figure S5.

      Figure 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sublinearity claim using simulations.

      Figure 4 (i) We now indicate the s.e.m. of decoding accuracy (across sessions) by the shading in Figure 4a. (ii) The black symbols in new Supplementary Figure S8 show the mean±s.e.m. for all pairwise comparisons shown in Figure 4d & e. (iii) Supplementary Figure S8 also summarizes two control analyses that deploy random coding directions (CDs) in neuronal state space. The upper row of Fig S9 compares the observed cosine similarity (CoSim)—between the CD identified by the graph title and the other four CDs labeled along the abscissa—with values obtained with 1000 random CDs established by random permutations of the weight assignments. The brown symbols are the mean±sdev of the CoSim (N=1000). The error bars are smaller than the symbols. We use the cumulative distribution of CoSim under permutation to estimate p-values (p<0.001 for all comparisons). We used a similar approach to estimate the distribution of the analogous correlation statistics between signals rendered by random directions in state space (Figure S8, lower row). For additional details, please see Methods, Similarity of single-trial signals.
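The permutation control for cosine similarity can be sketched as follows. The sketch assumes each coding direction is a weight vector with one entry per neuron, builds random directions by permuting the entries of one vector, and reads a one-sided p-value from the resulting null distribution. The function names and the +1 correction in the p-value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def permutation_null(w_ref, w_other, n_perm=1000, seed=0):
    """CoSim between w_other and random CDs made by permuting the entries of w_ref."""
    rng = np.random.default_rng(seed)
    return np.array([cosine_similarity(rng.permutation(w_ref), w_other)
                     for _ in range(n_perm)])

def permutation_p_value(observed, null):
    """One-sided p-value; the +1 terms keep the estimate from being exactly zero."""
    null = np.asarray(null, dtype=float)
    return (1 + np.sum(null >= observed)) / (1 + null.size)
```

With 1000 permutations, the smallest p-value this estimator can return is 1/1001, so an observed value that exceeds every permuted value is reported as p < 0.001.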

      Figure 5: The rigor of all claims associated with this figure is adduced from two control analyses and a simulation. The first control breaks the trial-by-trial correspondence between neural signals and behavior (Reviewer Figure 1). The second control shows that neural activity does not have substantial leverage on behavior when projected onto random directions in state space (Supplementary Figure S10, top). Simulations of decision variables using parameters derived from the fits to the behavioral data (Figure 1) support a degree of leverage and mediation comparable to the values observed for 𝑆Tconin (Supplementary Figure S10, bottom). For additional details, please see Methods (Leverage of single-trial activity on behavior) and the reply to item 1, above.

      Figure 6: Panels c & d show estimates of variability across neurons and experimental sessions, respectively. The reported p-value is based on a permutation test (see Methods, Correlations between Min and Tconin). The correlations shown in panel e (heatmap) are derived from pooled data across sessions. The reported p-value is based on a permutation test (see Methods, Correlations between Min and Tconin).

      Reviewer #3 (Public Review):

      Summary:

      The paper investigates which aspects of neural activity in LIP of the macaque give rise to individual decisions (specificity of choice and reaction times) in single trials, by recording simultaneously from hundreds of neurons. Using a variety of dimensionality reduction and decoding techniques, they demonstrate that a population-based drift-diffusion signal, which relies on a small subset of neurons that overlap choice targets, is responsible for the choice and reaction time variability. Analysis of direction-selective neurons in LIP and their correlation with decision-related neurons (Tconin neurons) suggests that evidence integration occurs within area LIP.

      Strengths:

      This is an important and interesting paper, which resolves conflicting hypotheses regarding the mechanisms that underlie decision-making in single trials. This is made possible by exploiting novel technology (Primatepixels recordings), in conjunction with state-of-the-art analyses and well-established dynamic random dot motion discrimination tasks.

      General recommendations:

      (1) Please tone down causal language. You present compelling correlative evidence for the idea that LIP population activity encodes the drift-diffusion DV. We feel that claims beyond that (e.g., ”Single-trial drift-diffusion signals control the choice and decision time”) would require direct interventions, and are only partially supported by the current evidence. Further examples are provided in point 1) of Reviewer 1 below.

      We have adopted the recommendation to “tone down the causal language.” Throughout the manuscript, we strive to avoid conveying the false impression that the present findings provide causal support for the decision mechanism. However, other causal studies of LIP support causality in the random dot motion task (Hanks et al., 2006; Jeurissen et al., 2022). It is therefore justifiable to use terms that imply causality in statements intended to convey hypotheses about mechanism. We agree that we should not give the false impression that the present support for said mechanism is adduced from causal perturbations in this study, as there were none.

      (2) Please provide a commonly used, data-driven quantification of the dimensionality of the population activity – for example, using participation ratio or the number of PCs explaining 90 % of the variance. This will help readers evaluate the conclusions about the dimensionality of the data.

      Principal component analysis reveals a participation ratio of 4.4 ± 0.4 (mean ±s.e., across sessions), and the first 3 PCs explain 67.1 ± 3.1 percent of the variance. The dimensionality of the data is low, but greater than one. We state this in Methods (Principal Component Analysis) and in Results (Single-trial drift-diffusion signals approximate the decision variable, lines 200–201).
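For readers unfamiliar with the measure, the participation ratio is commonly defined as (Σλ)² / Σλ², where λ are the eigenvalues of the covariance matrix. The sketch below computes it, together with the fraction of variance captured by the leading PCs, for a generic `(samples, neurons)` matrix. The function names and the data layout are assumptions for illustration and do not reproduce the pipeline used for the reported values.

```python
import numpy as np

def _cov_eigenvalues(X):
    """Eigenvalues of the sample covariance of X (n_samples, n_neurons), descending."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # remove tiny negative round-off
    return eig[::-1]

def participation_ratio(X):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    eig = _cov_eigenvalues(X)
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

def variance_explained(X, n_pcs=3):
    """Fraction of total variance captured by the first n_pcs principal components."""
    eig = _cov_eigenvalues(X)
    return float(eig[:n_pcs].sum() / eig.sum())
```

The measure equals 1 when all variance lies along a single axis and equals the number of neurons when variance is spread evenly across all axes, so a value near 4 indicates low but not one-dimensional structure.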

      (3) Please justify the normalization procedure used for PCA: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the more common quantification of mean/standard deviation across the full data window? What do the first principal components look like when the latter procedure is used?

      We now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice.

      (4) Please provide estimates of variability for variance and autocorrelation in Fig. 3 (e.g., through bootstrapping). Further, simulations could substantiate the claim about the expected sub-linearity at later time points (Fig. 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewers for these helpful recommendations. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.
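A generic version of such a bootstrap is sketched below. It assumes the detrended single-trial signals are stored as a `(trials, time)` array and resamples trials with replacement to obtain a standard error for the across-trial variance at each time point. The function name and the number of resamples are hypothetical, and the smoothing and trial-selection steps described in Methods are not reproduced.

```python
import numpy as np

def bootstrap_variance(signals, n_boot=500, seed=0):
    """Across-trial variance at each time point, with a bootstrap standard error.

    signals : array (n_trials, n_timepoints) of detrended single-trial signals
    """
    signals = np.asarray(signals, dtype=float)
    rng = np.random.default_rng(seed)
    n_trials = signals.shape[0]
    var_hat = signals.var(axis=0, ddof=1)
    boot = np.empty((n_boot, signals.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n_trials, size=n_trials)  # resample trials with replacement
        boot[b] = signals[idx].var(axis=0, ddof=1)
    return var_hat, boot.std(axis=0, ddof=1)

# Usage: for unbounded diffusion, the variance should grow linearly with time.
rng = np.random.default_rng(3)
walks = np.cumsum(rng.normal(size=(400, 20)), axis=1)
variance, variance_se = bootstrap_variance(walks, n_boot=200, seed=4)
```

Deviations from linear growth that exceed these error bars are the ones that merit an explanation such as the stopping bound.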

      (5) Please add controls and estimates of variability for decoding across sessions in Fig. 4: what are the levels of within-trial correlation/cosine similarity for random coding directions? What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new supplementary Figure S8. It makes sense to show them there because there is no natural way to represent error on the heat maps in Figure 4, and Figure S8 concerns the comparison of the values in Figure 4d&e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods, Similarity of single-trial signals) and the new Supplementary Figure S8.

      (6) Please perform additional analysis to strengthen the claim from Fig. 6, that Min represents the integrand and not the integral. The analysis in Fig. 6d could be repeated with the integral (cumulative sum) of the single-trial Min signals. Does this yield an increase in leverage over time?

      The short answer is yes, in part. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like that of 𝑆Tconin(𝑡), increases as a function of time. The effect is present in all seven sessions that have both Mleftin and Mrightin neurons (all 𝑝 < 1e−10). However, as shown in panel b, the same integral fails to demonstrate more than a hint of leverage on RT. All correlations are barely negative, and the magnitude does not increase as a function of time. We suspect—but cannot prove—that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials. Moreover, the correlation between the Min difference and the Tin signal is less than 0.1 (heatmap, Fig. 6e), implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherry-picking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations of MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery.

      (7) Please describe the complete procedure for determining spatially-selective activity. E.g.: What response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, what was the spatial distribution of response fields relative to the choice targets across the population?

      We thank the reviewers for pointing out this oversight. We now explain this procedure in the Methods (lines 629–644):

      Neurons were classified post hoc as Tin by visual inspection of spatial heatmaps of neural activity acquired in the delayed saccade task. We inspected activity in the visual, delay, and perisaccadic epochs of the task. The distribution of target locations was guided by the spatial selectivity of simultaneously recorded neurons in the superior colliculus (see Stine 2023 for details). Briefly, after identifying the location of the SC response fields, we randomly presented saccade targets within this location and seven other, equally spaced locations at the same eccentricity. In monkey J we also included 1–3 additional eccentricities, spanning 5–16 degrees. Neurons were classified as Tin if they displayed a clear, spatially-selective response in at least one epoch to one of the two locations occupied by the choice targets in the main task. Neurons that switched their spatial selectivity in different epochs were not classified as Tin. The classification was conducted before the analyses of activity in the motion discrimination task. The procedure was meant to mimic those used in earlier single-neuron studies of LIP (e.g., Roitman & Shadlen 2002) in which the location of the choice targets was determined online by the qualitative spatial selectivity of the neuron under study. The Tconin neurons in the present study were highly selective for either the contralateral or ipsilateral choice target used in the RDM task (AUC = 0.89±0.01; 𝑝 < 0.05 for 97% of neurons, Wilcoxon rank sum test). Given the sparse sampling of saccade target locations, we are unable to supply a quantitative estimate of the center and spatial extent of the RFs.
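The AUC quoted above can be read as the probability that a response drawn from trials with one target exceeds a response drawn from trials with the other target, which equals the rank-sum (Mann-Whitney) U statistic divided by the number of trial pairs. A minimal sketch follows; the function name `auc` and the use of per-trial spike counts as input are assumptions for illustration.

```python
import numpy as np

def auc(x_pref, x_other):
    """P(sample from x_pref > sample from x_other), counting ties as one half."""
    x = np.asarray(x_pref, dtype=float)[:, None]
    y = np.asarray(x_other, dtype=float)[None, :]
    # Compare every trial in one group with every trial in the other group.
    return float(np.mean((x > y) + 0.5 * (x == y)))
```

A value of 0.5 indicates no selectivity, and values near 1 (or 0) indicate nearly complete separation of the two response distributions.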

      (8) Please clarify if a neuron could be classified as both Tin and Min. Or were these categories mutually exclusive?

      These categories are mutually exclusive. If a neuron has spatially-selective persistent activity, as defined by the method described above, it is classified as a Tin neuron and not as an Min neuron even if it also shows motion-selective activity during passive motion viewing. We now specify this in the Methods (lines 831–832).

      Reviewer #1 (Recommendations For The Authors):

      𝑅∗1.1a Causal language (Line 23-24): “population activity represents […] drift” and “we provide direct support for the hypothesis that drift-diffusion signal is the quantity responsible for the variability in choice and RT” reads at first sight as if the authors claim that they present evidence for a causal effect of LIP activity on choice. The authors are otherwise nuanced and careful to point out that their evidence is correlational. What seems to be meant is that the population activity/drift-diffusion signal ”approximates the DV that gives rise to the choices […]” (cf. line 399). I would recommend using such alternative phrasing to avoid confusion (and the typically strong reactions by readers against misleading causal statements).

      We have adopted the reviewer’s recommendation and have modified the text throughout to reduce causal language. See our response to General Recommendation 1.

      𝑅∗1.1b Relatedly, any discussion about the possibility of LIP being causally involved in evidence integration (e.g. lines 429-445 [Au: now 462–478]) should also comment on the possibility of a distributed representation of the decision variable given that neural correlates of the DV have been reported in several areas including PFC, caudate and FEF.

      We believe this is possible. However, we hope to avoid discussions about causality given that it is not a focus of the paper. Although it is somewhat tangential, we have shown elsewhere that LIP is causal in the sense that causal manipulations affect behavior, but it is also true that causality does not imply necessity, and similarly, lack of necessity does not imply “only correlation.” Regarding distributed representations, it is worth keeping in mind the cautionary counter-example furnished by the SC study (Stine et al., 2023). The firing rates measured by averaging over trials are similar in SC and LIP; both manifest as coherence and direction-dependent ramps, leading to the suggestion that they form a distributed representation of the decision variable. With single-trial resolution, we now know that LIP and SC exhibit distinct dynamics—drift-diffusion and bursting, respectively. It remains to be seen if single-trial resolution achievable by simultaneous Neuropixels recordings from prefrontal areas and LIP reveal shared or distinct dynamics.

      𝑅∗1.2 How was the spatially selective activity determined? The classification of Tin neurons is critical to this study - how was their spatial selectivity determined? Please describe this in similar detail as the description of direction selectivity on lines 681-690 [Au: now 824–832]. E.g.: what response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, and what was the spatial distribution of response fields relative to the choice targets across the population?

      We now explain the selection procedure in Methods (lines 629–644). Please see our reply to General Recommendation 7, above.

      𝑅∗1.3 Could a neuron be classified as both Tin and Min, or were these categories mutually exclusive? Please clarify. (This goes beyond the scope of the current study: but did the authors find evidence for topographic organization or clustering of these categories of neurons?)

      These categories are mutually exclusive. Please see our response to General Recommendation 8, above.

      𝑅∗1.4 Contrary to the statement on line 121, the trial averages in Fig. 2a, 2b show coherence dependency at the time of the saccade in saccade-aligned traces for the coding strategies, except for STin (fig. 2c). Is this a result of the choice for t1 (= 0.1s)? (The authors may want to change their statement on line 121.) Relatedly, do the population responses for the two coding strategies Sramp and SPC1 depend on the epoch used to derive weights for individual neurons?

      We have revised the description to accommodate R2’s observation. 𝑆ramp retains weak coherence-dependence before saccades towards the choice target contralateral to the recording site. This was true in four of the eight sessions. For 𝑆PC1, there is no longer a coherence dependency for the Tin choices, owing to the change in normalization method (see revised Figure 2b).

      We also corrected an error in the Methods section. Specifically, the ramp ends at 𝑡1 = 0.05 s before the time of the saccade, not 𝑡1 = 0.1 s. While we no longer emphasize the similarity of traces aligned to saccade, it is reasonable to find issue with the observation that they retain a dependency on coherence (𝑆ramp only) because, according to theory, traces associated with Tin choices should reach a common positive threshold at decision termination. That said, for the Ramp direction there may be a reason to expect this discrepancy from theory. The deterministic part of drift-diffusion includes an urgency signal that confers positive convexity to the deterministic drift. This accelerating nonlinearity is not captured by the ramp, and it is more prominent at longer decision times, thus low coherences. We do not share this interpretation in the revised manuscript, in part because retention of coherence dependency is present in only half the sessions (see Reviewer Figure 3). The correction to the definition of 𝑡1 also provides an opportunity to address R2’s final question (“Relatedly,…?”). This particular variation in 𝑡1 does not affect 𝑆ramp, and 𝑆PC1 no longer retains coherence dependency for Tin choices. Note that our choice of 𝑡0 and 𝑡1 is based on the empirical observation that the ramping activity in response averages of Tin neurons typically begins 200 ms after motion onset and ends 50–100 ms before initiation of the saccadic choice. The starting time (𝑡0) is also supported by the observation that the decoding accuracy of a choice-decoder begins to diverge from chance at this time (Figure 4a).

      𝑅∗1.5 It is intriguing that Sramp and SPC1 show dynamics that look so similar (fig. 2a, 2b). How do the weights assigned to each neuron in both strategies compare across the population?

      The weights assigned to each neuron are very similar across the two strategies as indicated by a cosine similarity (0.65 ± 0.04, mean ±s.e.m. across sessions).

      𝑅∗1.6 Tin neurons, which show dynamics closely resembling different coding directions (fig. 2) and the decoders do not have weights that can distinguish them from the rest of the population in each of these analyses (fig. S7). Is it fair to interpret these findings as evidence for broad decision-related co-variability in the recorded neural population in LIP?

      Yes, our results are consistent with this interpretation. However, it is worth reiterating that decoding performance drops considerably when Tin neurons are not included (see Supplementary Figure S13). Thus, this broad decision-related co-variability is present but weak.

      𝑅∗1.7 It is intriguing that the decoding weights of the different decoders did not allow the authors to reliably identify Tin neurons. Could this be, in part, due to the low dimensionality of the population activity and task that the animals are presumably overtrained on? Or do the authors expect this finding to hold up if the population activity and task were higher dimensional?

      Great question! We can only speculate, but it seems possible that a more complex, “higher dimensional” task could make it easier to identify Tin neurons. For example, a task with four choices instead of two may decrease correlations among groups of neurons with different response fields. We have added this caveat to the discussion (lines 459–461). One minor semantic objection: The animal has learned to perform a highly contrived task at low signal-to-noise. The animal is well-trained, not over-trained.

      𝑅∗1.8 Lines 135-137 [Au: now 141–142]: The similarity in the single trial traces from different coding strategies (fig. 2a-2c, left) is not as evident to me as the authors suggest. It might be worthwhile computing the correlation coefficients between individual traces for each pair of strategies and reporting the mean correlation to support the author’s point.

      We report the mean correlation between single-trial signals generated by the chosen dimensionality reduction methods in Figure 4e. We show the variability in this measure in Supplementary Figure S8. We have also adjusted the opacity of the single-trial traces in Figure 2, left.

      𝑅∗1.9 Minor/typos:

      -line 74: consider additionally citing Hyafil et al. 2023.

      -line 588: ”that were strongly correlated”?

      -line 615: ”were the actual drift-diffusion process were...”.

      -line 717: ”a causal influence” -> ”no causal influence”.

      Fig. 6: panel labels e vs d are swapped between the figure and caption.

      Fig. 3c: labels r1,3 & r2,3 are flipped.

      We have addressed all of these items. Thank you.

      Reviewer #2 (Recommendations For The Authors):

      𝑅∗2.1 (Figure 2) Determine whether restricting the analysis to 1D projections of the data is a suitable approach given the actual dimensionality of the datasets being analyzed:

      - Should show some quantification of the dimensionality of the recorded activity; could do this by quantifying the dimensionality of population activity in each session, e.g. with participation ratio or related measures (like # PCs to explain some high proportion of the variance, e.g. 90 %). If much of the variation is not described in 1 dimension, then the paper would benefit from some discussion/analysis of the signals that occupy the other dimensions.

      We now report the participation ratio (4.4 ± 0.4, mean ±s.e. across sessions), and we state that the first 3 PCs explain 67.1 ± 3.1% of the variance of the time- and coherence-dependent signals used for the PCA (mean ±s.e). We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. To reiterate our response above, we do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.

      The Reviewer is correct that our approach presupposes a linear embedding of the 1D decision variable in the population activity. In other words, a nonlinear representation of the 1D decision variable in population activity could have an embedding dimensionality greater than 1, and there may well be a non-linear method that reveals this representation. To test this possibility, we decoded choice on each trial from population activity using (1) a linear decoder (logistic classifier) or (2) a multi-layer neural network, which can exploit non-linearities. We found that, for each session, the two decoders performed similarly: the neural network outperforms the logistic decoder (barely) in just one session. The analysis suggests that the assumption of linear embedding of the decision variable is justified. We hope this analysis convinces the reviewer that “sophisticated analyses of the full neuronal state space” and “a simple average of [Tconin] neurons” do indeed yield roughly equivalent representations of the decision variable. We have included the results of this analysis in Supplementary Figure S12. See also item 2 of the Public response.

      𝑅∗2.2 (Figure 3) Add estimates of variability for variance and autocorrelation through time from single-trial signals:

      –   E.g. by bootstrapping. Would be helpful for making rigorous the discussion of when the deviation from the theory is outside what would be expected by chance, even if it doesn’t change the specific conclusions here.

      –   If possible, it would help (by simulations, or maybe an added reference if it exists) to substantiate the claim about the expected sub-linearity at later time-points (Figure 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewer for this helpful comment. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, §Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.

      𝑅∗2.3 (Figure 4) Add controls and estimates of variability for decoding across sessions:

      –   As a baseline - what is the level of within-trial correlation/cosine similarity when random coding directions are used?

      –   What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new Supplementary Figure S8. It makes sense to show them there because (i) there is no natural way to represent error on the heat maps in Figure 4, and (ii) S8 concerns the comparison of the values in Figure 4d & e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods: Similarity of single-trial signals) and the new Supplementary Figure S8. We also provide this information in response to Recommendation 5, above.

      𝑅∗2.4 (Figure 5) Add negative controls and significance tests to support claims about trends in leverage:

      –   What is the level of increase in leverage attained from random 1D projections of the data, or other projections where the prior would be no leverage?

      –   What is the range of leverage values fit for a simulated signal with a ground-truth of no trend?

      We have added two control analyses. In addition to a shuffle control, which destroys the relationship (Reviewer Figure 1), we performed additional analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by establishing weight-vectors that were either chosen from a Normal distribution or by permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions render 𝑆rand(𝑡). Specifically, the degree of leverage is effectively zero or very much reduced. These analyses are summarized in a new Supplementary Figure S10. The distributions of our test statistics (e.g., leverage on choice and RT) under the variants of the null hypothesis also support traditional metrics of statistical significance. Figure S10 (bottom row) also provides an approximate answer to the question: What degree of leverage and mediation would be expected for a theoretical decision variable? Briefly, we simulated 60,000 trials using the race model that best fits the behavioral data of monkey M. For any noise-free representation of a Markovian integration process, the leverage of an early sample of the DV on behavior would be mediated completely by later activity as the latter sample—up to the time of commitment—subsumes all variability captured by the earlier sample. We, therefore, generated 𝑆sim(𝑡) by first subsampling the simulated data to match the trial numbers of each session. To evaluate a DV approximated from the activity of 𝑁 Tconin neurons per session rather than the true DV represented by the entire population, we generated 𝑁 noisy instantiations of the signal for each of the subsampled, simulated trials. The noisy decision variable, 𝑆sim(𝑡), is the mean activity of these 𝑁 noise-corrupted signals. The simulation is consistent with the leverage and incomplete mediation observed for the populations of Tconin neurons. For additional details, see Methods (§Leverage of single-trial activity on behavior) and Supplementary Figure S10, caption. See also our response to item 1 of the Public Response.
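The last step of this simulation, averaging 𝑁 noise-corrupted copies of a simulated decision variable, can be sketched as follows. The sketch assumes independent additive Gaussian noise at every time point, which is a simplifying assumption made here for illustration; the race model, the noise model, and the parameter values used for Figure S10 are not reproduced, and the function name `noisy_population_dv` is hypothetical.

```python
import numpy as np

def noisy_population_dv(dv, n_neurons, noise_sd=1.0, seed=0):
    """Mean of n_neurons independently noise-corrupted copies of a simulated DV.

    dv : array (n_trials, n_timepoints), the noise-free simulated decision variable
    """
    rng = np.random.default_rng(seed)
    dv = np.asarray(dv, dtype=float)
    # Assumption: i.i.d. additive Gaussian noise for each neuron, trial, and time point.
    noise = rng.normal(0.0, noise_sd, size=(n_neurons,) + dv.shape)
    return (dv[None] + noise).mean(axis=0)
```

Under this assumption the residual noise in the averaged signal has standard deviation `noise_sd / sqrt(n_neurons)`, so a small population leaves an early sample with variability that later samples do not fully subsume, which is one way to obtain incomplete mediation.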

      𝑅∗2.5 The analysis is performed across several signed coherence levels, with data detrended for each signed coherence and choice to enable comparison of fluctuations relative to the relevant baseline; are results similar for the different coherences?

      The results are qualitatively similar for individual coherences. There is less power, of course, because there are fewer trials. The analyses cannot be performed for coherences ≥ 12.8% because there are not enough trials that satisfy the inclusion criteria (presence of left and right choice trials with RT ≤ 670 ms). Nonetheless, leverage on choice and RT is statistically significant for 27 of the 30 combinations of motion strengths < 12.8% × three signals (𝑆ramp, 𝑆PC1 and 𝑆Tin) × behavioral measures (RT and choice) (RT: all 𝑝 < 0.008, Fisher-z; choice: all 𝑝 < 0.05, t-test). The three exceptions are trials with 6.4% coherence rightward motion, which do not correlate significantly with RT on leftward choice trials. Reviewer Figure 4 shows the results of the leverage and mediation analyses, using only the 0% coherence trials.

      𝑅∗2.6 (Figure 6) Additional analysis to strengthen the claim that Min represents the integrand and not the integral:

      a. Repeating the analysis in Figure 6d with the integral (cumulative sum) of the single-trial Min signals and instead observing a significant increase in leverage over time would be strong evidence for this interpretation. If you again see no increase, then it suggests that the activity of these units (while direction selective) may not be strongly yoked to behavior. This scenario (no increasing leverage of the integral of Min on behavior through time) also raises an intriguing alternative possibility: that the noise driving the ’diffusion’ of drift-diffusion here may originate in the integrating circuit, rather than just reflecting the complete integration of noise in the stream of evidence itself.

      b. Repeating the analysis in Figure 6d with the projection of the M subspace onto its own first PC (e.g. take the union of units {Mrightin, Mleftin} [our ], do PCA just on those units’ single-trial activities, identify the first PC, and project those activities on that dimension to obtain SPC1-M).

      c. Ameliorating the sample-size limitation by relaxing the criteria for inclusion in Min - performing the same analyses shown, but including all units with visual RFs overlapping the motion stimulus, irrespective of their direction selectivity.

      a. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like , increases as a function of time. The effect is present in all seven sessions that have both and neurons (all 𝑝 < 1𝑒 − 10). However, as shown in panel b, the same integral fails to demonstrate more than a hint of leverage on RT (all correlations are negative) and the magnitude does not vary as a function of time. We suspect, but cannot prove, that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials and that the correlation between the Min difference and the signal is less than 0.1 over the heatmap in Fig. 6e, implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherry-picking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations of MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery. We also provide this information in response to Recommendation (6) above.

      b.  We tried the R’s suggestion to apply PCA to the union of Min neurons , , fully expecting PC1 to comprise weights of opposite sign for the right and left preferring neurons, but that is not what we observed. Instead, the direction selectivity is distributed over at least two PCs. We think this is a reflection of the prominence of other signals, such as the strong visual response and normalization signals (see Shushruth et al., 2018). In the spirit of the R’s suggestion, we also established an “evidence coding direction” using a regression strategy similar to the Ramp CD applied to the union of Min neurons. The strategy produced a coding direction with opposite signed weights dominating the right and left subsets. The projection of the neural data on this evidence CD yields a signal similar to the difference variable used in Fig. 6e (i.e., signals that are approximately constant firing rates vs time and scale as a function of signed coherence). These unintegrated signals exhibit weak leverage on choice and RT, consistent with Figure 6d. However, the integrated signal has leverage on choice but not RT, similar to the integral of the difference signal in Reviewer Figure 2.

      c.   We do not understand the motivation for this analysis. We could apply PCA or dPCA (or the regression approach, described above) to the population of units with RFs that overlap the motion stimulus, but it is hard to see how this would test the hypothesis that direction-selective neurons similar to those in area MT supply the momentary evidence. As mentioned, we have very few Min neurons (as few as two in session 3). Future experiments that place the motion stimulus in the periphery would likely increase the yield of Min neurons and would be better suited to study this question. As such, we do not see the integrand-like responses of Min neurons as a major claim of the paper. Instead, we view it as an intriguing observation that deserves follow-up in future experiments, including simultaneous recordings from populations of MT and LIP neurons (Min and Tin, of course). We have softened the language considerably to make it clear that future work will be needed to make strong claims about the nature of Min neurons.

      𝑅∗2.7 Other questions: Figure 2c is described as showing the average firing rate of units in Tconin on single trials, but must also incorporate some baseline subtraction (as the shown traces dip into negative firing rates). What baseline is subtracted? Are these residual signals, as described for later figures, or is a different method used? (Presumably, a similar procedure is used also for Figure 2a/b, given that all single-trial traces begin at 0.) Is the baseline subtraction justified? If the dataset really does reflect the decision variable with single-trial resolution, eliminating the baseline subtraction when visualizing single-trial activity might actually help to make the point clearer: trials which (for any reason) begin with a higher projection on the particular direction that furnishes the DV would be predicted to reach the decision bound, at any fixed coherence, more quickly than trials with a smaller projection onto this direction.

      We thank the reviewer for this comment. For each trial, the mean activity between 175 ms and 225 ms after motion onset was subtracted when generating the single-trial traces. The baseline subtraction was only applied for visualization to better portray the diffusion component in the signal. Unless otherwise indicated, all analyses are computed on non-baseline corrected data. We now describe in the caption of Figure 2 that “For visualization, single-trial traces were baseline corrected by subtracting the activity in a 50 ms window around 200 ms.” Examples of the raw traces used for all follow-up analyses are displayed in Reviewer Figure 6.
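
      As an illustration of the visualization step described above, the following is a minimal sketch of subtracting the mean activity in the 175 to 225 ms window from a single-trial trace. The function name, the sample times, and the choice to treat both window edges as inclusive are assumptions made for the example.

```python
def baseline_correct(times_ms, trace, lo_ms=175, hi_ms=225):
    """Subtract the mean of the samples falling in [lo_ms, hi_ms] from the trace.

    times_ms: sample times relative to motion onset, in milliseconds.
    trace:    single-trial signal sampled at those times.
    Inclusive window edges are an assumption of this sketch.
    """
    window = [v for t, v in zip(times_ms, trace) if lo_ms <= t <= hi_ms]
    baseline = sum(window) / len(window)
    return [v - baseline for v in trace]

# Hypothetical single-trial trace sampled every 25 ms.
times_ms = [150, 175, 200, 225, 250]
trace = [1.0, 2.0, 3.0, 4.0, 6.0]
corrected = baseline_correct(times_ms, trace)   # baseline here is 3.0
```

      Because the same constant is removed from every sample of a trial, the shape of the trace over time is unchanged and only its vertical offset shifts.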

      Reviewer #3 (Recommendations For The Authors):

      I only have a few comments to make the paper more accessible:

      𝑅∗3.1 I struggle to understand how the linear fitting from -1 to 1 was done. More detail about how the single cell single-trial activity was generated to possibly go from -1 to 1 or do I completely misunderstand the approach? I assume the data standardization does that job?

      We have rephrased and added clarifying detail to the section describing the derivation of the ramp signal in the Methods (Ramp direction).

      We applied linear regression to generate a signal that best approximates a linear ramp, on each trial, 𝑖, that terminates with a saccade to the choice-target contralateral to the hemisphere of the LIP recordings. The ramps are defined in the epoch spanning the decision time: each ramp begins at 𝑓𝑖(𝑡0) = −1, where 𝑡0 = 0.2 s after motion onset, and ends at 𝑓𝑖(𝑡1) = 1, where 𝑡1 = 𝑡sac − 0.05 s (i.e., 50 ms before saccade initiation). The ramps are sampled every 25 ms and concatenated using all eligible trials to construct a long saw-tooth function (see Supplementary Figure S2). The regression solves for the weights assigned to each neuron such that the weighted sum of the activity of all neurons best approximates the saw-tooth. We constructed a time series of standardized neural activity, sampled identically to the saw-tooth. The spike times from each neuron are represented as delta functions (rasters) and convolved with a non-causal 25 ms boxcar filter. The mean and standard deviation of all sampled values of activity were used to standardize the activity for each neuron (i.e., Z-transform). The coefficients derived by the regression establish the vector of weights that define 𝑆ramp. The algorithm ensures that the population signal 𝑆ramp(𝑡), but not necessarily individual neurons, has amplitudes ranging from approximately −1 to 1.
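
      The construction of the regression target can be sketched as follows. This is a minimal illustration, not the analysis code: times are in integer milliseconds, the function names are hypothetical, the Z-transform uses the population standard deviation as an assumption, and the regression step itself (solving for the neuron weights) is not shown.

```python
def ramp_target(t_sac_ms, t0_ms=200, pre_sac_ms=50, step_ms=25):
    """Linear ramp from -1 at t0 to +1 at t1 = t_sac - 50 ms, sampled every 25 ms.

    Assumes t_sac_ms > t0_ms + pre_sac_ms (an eligible trial). If t1 - t0 is
    not a multiple of the step, the last sample falls just short of +1.
    """
    t1_ms = t_sac_ms - pre_sac_ms
    times = range(t0_ms, t1_ms + 1, step_ms)
    return [-1.0 + 2.0 * (t - t0_ms) / (t1_ms - t0_ms) for t in times]

def sawtooth(saccade_times_ms):
    """Concatenate the per-trial ramps into one long saw-tooth target."""
    target = []
    for t_sac in saccade_times_ms:
        target.extend(ramp_target(t_sac))
    return target

def zscore(values):
    """Standardize one neuron's sampled activity (population SD, an assumption)."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

# Two hypothetical trials with saccades 450 ms and 500 ms after motion onset.
target = sawtooth([450, 500])
```

      Each trial contributes one ramp, so the target resets to −1 at the start of every trial; the weights would then be fit so that the weighted sum of the standardized activity tracks this target.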

      𝑅∗3.2 It is difficult to understand how the urgency signal is derived, to then generate fig S4.

      The urgency signal is estimated by averaging 𝑆𝑥(𝑡) at each time point relative to motion onset, using only the 0% coherence trials. We have clarified this in the caption of Supplementary Figure S4.
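
      A minimal sketch of this averaging is given below. It assumes each trial is stored as a pair of coherence and a signal sampled on a common time grid aligned to motion onset, and that a time point is averaged over whichever 0% coherence trials still have a sample there; the data layout and the handling of unequal trial lengths are assumptions of the example.

```python
def urgency_signal(trials):
    """Average the signal at each time point across 0% coherence trials only.

    trials: list of (coherence, signal) pairs, where signal is a list of
            samples aligned to motion onset on a shared time grid.
    Trials of different length contribute only to the time points they cover.
    """
    zero_coh = [signal for coherence, signal in trials if coherence == 0.0]
    n_samples = max(len(signal) for signal in zero_coh)
    averaged = []
    for i in range(n_samples):
        values = [signal[i] for signal in zero_coh if i < len(signal)]
        averaged.append(sum(values) / len(values))
    return averaged

# Hypothetical trials: two at 0% coherence, one at 12.8% (ignored).
trials = [(0.0, [1.0, 2.0, 3.0]),
          (0.0, [3.0, 4.0]),
          (0.128, [9.0, 9.0, 9.0])]
urgency = urgency_signal(trials)
```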

      Author response image 1.

      Shuffle control for Fig. 5. Breaking the within-trial correspondence between neural signal, 𝑆(𝑡), and choice suppresses leverage to near zero.

      Author response image 2.

      Leverage of the integrated difference signal on choice and RT. Traces are the average leverage across seven sessions. Same conventions as in Figure 5.

      Author response image 3.

      Trial-averaged 𝑆ramp activity during individual sessions. Same as Figure 2b for individual sessions for Monkey M (left) and Monkey J (right). The figure is intended to illustrate the consistency and heterogeneity of the averaged signals. For example, the saccade-aligned averages lose their association with motion strength before left (contra) choices in sessions 1, 2, 5, and 6 but retain the association in sessions 3, 4, 7, and 8.

      Author response image 4.

      Drift-diffusion signals have measurable leverage on choice and RT even when only 0%-coherence trials are included in the analysis.

      Author response image 5.

      Raw single-trial activity for three types of population averages. Representative single-trial activity during the first 300 ms of evidence accumulation using two motion strengths: 0% and 25.6% coherence toward the left (contralateral) choice target. Unlike in Figure 2 in the paper, single-trial traces are not baseline corrected by subtracting the activity in a 50 ms window around 200 ms. We highlight a number of trials with thick traces and these are the same trials in each of the rows.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses & incompletely supported claims:

      (1) A central mechanistic claim of the paper is that "DCP1a can regulate DCP2's cellular decapping activity by enhancing DCP2's affinity to RNA, in addition to bridging the interactions of DCP2 with other decapping factors. This represents a pivotal molecular mechanism by which DCP1a exerts its regulatory control over the mRNA decapping process." Similar versions of this claim are repeated in the abstract and discussion sections. However, this appears to be entirely at odds with the observation from in vitro decapping assays with immunoprecipitated DCP2 that showed DCP1 knockout does not significantly affect the enzymatic activity of DCP2 (Figures 2B-D; I note that there may be a very small change in DCP2 activity shown in panel C, but this may be due to slightly different amounts of immunoprecipitated DCP2 used in the assay, as suggested by panel D). If DCP1 pivotally regulates decapping activity by enhancing RNA binding to DCP2, why is no difference in decapping activity observed in the absence of DCP1?

      Furthermore, the authors show only weak changes in relative RNA levels immunoprecipitated by DCP2 with versus without DCP1 (~2-3 fold change; consistent with the Valkov 2016 NSMB paper, which shows what looks like only modest changes in RNA binding affinity for yeast Dcp2 +/- Dcp1). Is the argument that only a 2-3 fold change in RNA binding affinity is responsible for the sizable decapping defects and significant accumulation of deadenylated intermediates observed in cells upon Dcp1 depletion? (and if so, why is this the case for in-cell data, but not the immunoprecipitated in vitro data?)

      We appreciate the reviewer's thoughtful comments on our paper. The reviewer points out an apparent contradiction between the claim that DCP1a regulates DCP2's cellular decapping activity and the observation that knocking out DCP1a does not significantly affect DCP2's enzymatic activity in vitro. However, it is important to underscore the challenge of reconciling differences between in vitro and in vivo experiments in scientific research. Although in vitro systems provide a controlled environment, they have inherent limitations that often fail to capture the complexities of cellular processes. Our in vitro experiments used immunoprecipitated proteins to ensure the presence of relevant factors, but these experiments cannot fully replicate the precise stoichiometry and dynamic interactions present in a cellular environment. Furthermore, the limited volume in vitro can actually facilitate reactions that may not occur as readily in the complex and heterogeneous environment of a cell. Therefore, the lack of a significant difference in decapping activity observed in vitro does not necessarily negate the regulatory role of DCP1 in the cellular context. Rather, it underscores our previous oversight of DCP1's importance in the decapping process under in vitro conditions. The conclusions regarding DCP1's regulatory mechanisms remain valid and supported by the presented evidence, especially when considering the inherent differences between in vitro and in vivo experimental conditions. It is precisely because of these differences that we recognized our previous underestimation of DCP1's significance. Therefore, our subsequent experiments focused on elucidating DCP1's regulatory mechanisms in the decapping process.

      The authors acknowledge this apparent discrepancy between the in vitro DCP2 decapping assays and in-cell decapping data, writing: "this observation could be attributed to the inherent constraints of in vitro assays, which often fall short of faithfully replicating the complexity of the cellular environment where multiple factors and cofactors are at play. To determine the underlying cause, we postulated that the observed cellular decapping defect in DCP1a/b knockout cells might be attributed to DCP1 functioning as a scaffold." This is fair. They next show that DCP1 acts as a scaffold to recruit multiple factors to DCP2 in cells (EDC3, DDX6, PatL1, and PNRC1 and 2). However, while DCP1 is shown to recruit multiple cofactors to DCP2 (consistent with other studies in the decapping field, and primarily through motifs in the Dcp1 C-terminal tail), the authors ultimately show that *none* of these cofactors are actually essential for DCP2-mediated decapping in cells (Figures 3A-F). More specifically, the authors showed that the EVH1 domain was sufficient to rescue decapping defects in DCP1a/b knockout cells, that PNRC1 and PNRC2 were the only cofactors that interact with the EVH1 domain, and finally that shRNA-mediated PNRC1 or PNRC2 knockdown has no effect on in-cell decapping (Figures 3E and F). Therefore, based on the presented data, while DCP1 certainly does act as a scaffold, it doesn't seem to be the case that the major cellular decapping defect observed in DCP1a/b knockout is due to DCP1's ability to recruit specific cofactors to DCP2.

      The findings that none of the decapping cofactors recruited by DCP1 to DCP2 are essential for decapping in cells further underscore the complexity of the decapping process in vivo. This observation suggests that while DCP1's scaffolding function is crucial for recruiting cofactors, the decapping process likely involves additional layers of regulation that are not fully captured by our current understanding of DCP1. Furthermore, the reviewer mentions that the observed changes in RNA binding affinity (approximately 2-3 fold) in our in vitro experiments seem relatively modest. While these changes may appear insignificant in vitro, their cumulative impact in the dynamic cellular environment could be substantial. Even minor perturbations in RNA binding affinity can trigger cascading effects, leading to significant changes in decapping activity and the accumulation of deadenylated intermediates upon Dcp1 depletion. Cellular processes involve complex networks of interrelated events, and small molecular changes can result in amplified biological outcomes. The subtle molecular variations observed in vitro may translate into significant phenotypic outcomes within the complex cellular environment, underscoring the importance of DCP1a's regulatory role in the cellular decapping process.

      So as far as I can tell, the discrepancy between the in vitro (DCP1 not required) and in-cell (DCP1 required) decapping data, remains entirely unresolved. Therefore, I don't think that the conclusions that DCP1 regulates decapping by (a) changing RNA binding affinity (authors show this doesn't matter in vitro, and that the change in RNA binding affinity is very small) or (b) by bridging interactions of cofactors with DCP2 (authors show all tested cofactors are dispensable for robust in-cell decapping activity), are supported by the evidence presented in the paper (or convincingly supported by previous structural and functional studies of the decapping complex).

      We have addressed the reconciliation of differences between in vitro and in vivo experiments in the revised manuscript and emphasized the importance of considering cellular interactions when interpreting our findings.

      (2) Related to the RNA binding claims mentioned above, are the differences shown in Figure 3H statistically significant? Why are there no error bars shown for the MBP control? (I understand this was normalized to 1, but presumably, there were 3 biological replicates here that have some spread of values?). The individual data points for each replicate should be displayed for each bar so that readers can better assess the spread of data and the significance of the observed differences. I've listed these points as major because of the key mechanistic claim that DCP1 enhances RNA binding to DCP2 hinges in large part on this data.

      Thank you for your feedback. Regarding your comments on the statistical significance of the differences shown in Figure 3H and the absence of error bars for the MBP control, we will address these concerns in the revised manuscript. We’ll include individual data points for the three biological replicates and corresponding statistical analysis to more clearly demonstrate the data spread and significance of the observed differences.

      (3) Also related to point (1) above, the kinetic analysis presented in Figure 2C shows that the large majority of transcript is mostly decapped at the first 5-minute timepoint; it may be that DCP2-mediated decapping activity is actually different in vitro with or without DCP1, but that this is being missed because the reaction is basically done in less than 5 minutes under the conditions being assayed (i.e. these are basically endpoint assays under these conditions). It may be that if kinetics were done under conditions to slow down the reaction somewhat (e.g. lower Dcp2 concentration, lower temperatures), so that more of the kinetic behavior is captured, the apparent discrepancy between in vitro and in-cell data would be much less. Indeed, previous studies have shown that in yeast, Dcp1 strongly activates the catalytic step (kcat) of decapping by ~10-fold, and reduces the KM by only ~2 fold (Floor et al, NSMB 2010). It might be beneficial to use purified proteins here (only a Western blot is used in Figure 2D to show the presence of DCP2 and/or DCP1, but do these complexes have other, and different, components immunoprecipitated along with them?), if possible, to better control reaction conditions.

      This contradiction between the in vitro and in-cell decapping data undercuts one of the main mechanistic takeaways from the first half of the paper. This needs to be addressed/resolved with further experiments to better define the role of DCP1-mediated activation, or the mechanistic conclusions significantly changed or removed.

      We genuinely appreciate the reviewer’s insightful comments on the kinetic analysis presented in Figure 2C. Your astute observation regarding the potential influence of reaction duration on the interpretation of in vitro decapping activity, especially in the absence of DCP1, is well-received. The time-sensitive nature of our experiments, as you rightly pointed out, might not fully capture the nuanced kinetic behaviors. In addition, the DCP2 complex purified from cells could not be precisely quantified. In response to your suggestion, we attempted to purify human DCP2 protein from E. coli; however, regrettably, the purified protein failed to exhibit any enzymatic activity. This disparity may be attributed to species differences.

      Considering the reviewer’s valuable insights, our revised manuscript emphasized that purified DCP2 from cells exhibits activity regardless of the presence of DCP1. This adjustment aims to provide a clearer perspective on our findings and to better align with the nuances of our experimental design and the meticulous consideration of the results.

      (4) The second half of the paper compares the transcriptomic and metabolic profiles of DCP1a versus DCP1b knockouts to reveal that these target a different subset of mRNAs for degradation and have different levels of cellular metabolites. This is a great application of the DCP1a/b KO cells developed in this paper and provides new information about DCP1a vs b function in metazoans, which to my knowledge has not really been explored at all. However, the analysis of DCP1 function/expression levels in human cancer seems superficial and inconclusive: for example, the authors conclude that "...these findings indicate that DCP1a and DCP1b likely have distinct and non-redundant roles in the development and progression of cancer", but what is the evidence for this? I see that DCP1a and b levels vary in different cancer cell types, but is there any evidence that these changes are actually linked to cancer development, progression, or tumorigenesis? If not, these broader conclusions should be removed.

      Thank you to the reviewer for pointing out that such a description may be misleading. We have removed our previous broader conclusion and revised our sentences. To further explore the potential impact of DCP1a and DCP1b on cancer progression, we examined the association between the expression levels of DCP1a and DCP1b and progression-free interval (PFI). We have incorporated this information into our revised manuscript.

      (5) The authors used CRISPR-Cas9 to introduce frameshift mutations that result in premature termination codons in DCP1a/b knockout cells (verified by Sanger sequencing). They then use Western blotting with DCP1a or DCP1b antibodies to confirm the absence of DCP1 in the knockout cell lines. However, the DCP1a antibody used in this study (Sigma D5444) is targeted to the C-terminal end of DCP1a. Can the authors conclusively rule out that the CRISPR/Cas-generated mutations do not result in the production of truncated DCP1a that is just unable to be detected by the C-terminally targeted antibody? While it is likely the introduced premature termination codon in the DCP1a gene results in nonsense-mediated decay of the resulting transcript, this outcome is indeed supported by the knockout results showing large defects in cellular decapping which can be rescued by the addition of the EVH1 domain, it would be better to carefully validate the success of the DCP1a knockout and conclusively show no truncated DCP1a is produced by using N-terminally targeted DCP1a antibodies (as was the case for DCP1b).

      Thank you for your insightful comment regarding the validation of our DCP1a/b knockout cell line. We acknowledge your point about the C-terminal targeting of the Sigma D5444 antibody used in our Western blot analysis. We agree that we cannot definitively rule out the possibility of truncated DCP1a protein production solely based on the lack of full-length protein detection. To address this limitation, we used a commercially available N-terminally targeted DCP1a antibody (Aviva ARP39353_T100) in a Western blot analysis. This allows us to detect any truncated protein fragments remaining after the CRISPR-Cas9-generated frameshift mutation.

      Some additional minor comments:

      • More information would be helpful on the choice of DCP1 truncation boundaries; why was 1-254 chosen as one of the truncations?

      Thank you for the reviewer's comment and suggestion. The DCP1 1-254 truncation boundary was chosen based on the predicted structure from AlphaFoldDB (A0A087WT55). We will include this information in the revised manuscript.

      • Figure S2D is a pretty important experiment because it suggests that the observed deadenylated intermediates are in fact still capped; can a positive control be added to these experiments to show that removal of cap results in rapid terminator-mediated degradation?

      Unfortunately, due to our institution's current laboratory safety policies, we are unable to perform experiments involving the use of radioactive isotopes such as 32P. Therefore, while adding the suggested positive control experiment to demonstrate rapid RNA degradation upon decapping would further validate our interpretation, we regret that we cannot carry out this experiment at the moment. However, the observed deadenylated intermediates in Figure S2D match the predicted size of capped RNA fragments, and not the expected sizes of degradation products after decapping. Furthermore, previous literature has well-established that for these types of RNAs, decapping leads directly to rapid 5' to 3' exonuclease-mediated degradation, without producing stable deadenylated intermediates. Thus, we believe that the current data is sufficient to support our conclusion that the deadenylated intermediates retain the 5' cap structure.

      Reviewer #2 (Public Review):

      Weaknesses:

      The direct targets of DCP1a and/or DCP1b were not determined as the analysis was restricted to RNA-seq to assess RNA abundance, which can be a result of direct or indirect regulation by DCP1a/b.

      Thank you for raising this important point. In our study, we acknowledge that the use of RNA-seq to assess RNA abundance provides a broad overview of the regulatory impacts of DCP1a and DCP1b. This method captures changes in RNA levels that may arise from both direct and indirect regulatory actions of these proteins. While we did not directly determine the targets of DCP1a and DCP1b, the data obtained from our RNA-seq analysis serve as a foundational step for future targeted experiments, which could include techniques such as RIP-seq, to delineate the direct targets of DCP1a and DCP1b more precisely. We believe that our current findings contribute valuable information to the field and pave the way for these subsequent analyses.

      P-bodies appear to be larger in human cells lacking DCP1a and DCP1b but a lack of image quantification prevents this conclusion from being drawn.

      Thank you for the reviewer’s valuable feedback. We have addressed the reviewer’s concern regarding P-bodies' size in human cells lacking DCP1a and DCP1b. We have now performed image quantification and can confirm that P-bodies are indeed larger in these cells.

      The lack of details in the methodology and figure legends limit reader understanding.

      We acknowledge the reviewer's concerns regarding the level of detail provided in the methodology and figure legends. To address this, we are committed to enhancing both sections with additional details and clarifications in our revised manuscript. Thank you for bringing this to our attention.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) To me, the second half of the paper comparing DCP1a and DCP1b is in many ways distinct from the first half and could stand on its own as an interesting paper if this comparative analysis is explored a little deeper (maybe by validating some of the differences in decay observed for individual mRNAs targeted by DCP1a versus DCP1b, by measuring and comparing the decay rates of some individual transcripts under differential control by DCP1a vs b?), and revising the conclusions about links to cancer as mentioned above. I think these later comparative results in the paper present the most new and interesting data concerning DCP1 function in humans (especially since I think the mechanistic conclusions from the first half aren't well supported yet or are at least inconsistent), but when I read these later sections of the paper I struggle to understand the key takeaways from the transcriptomic and metabolomic data.

      Thank you for the reviewer's suggestions. Estimating the decay rates of individual transcripts within the transcriptomes of DCP1a_KO, DCP1b_KO, and wild type can provide insight into the direct targets of DCP1a or DCP1b. However, this requires either time-series RNA-seq or specialized sequencing technologies such as Precision Run-On sequencing (PRO-seq) or RNA Approach to Equilibrium Sequencing (RATE-Seq). Unfortunately, we lack the necessary dataset in our project to estimate the decay rates for the potential targets identified in our RNA-seq data. Despite this limitation, we acknowledge the potential of this approach in identifying the true targets of DCP1a and DCP1b and have included this idea in our discussion.

      (2) I think it would be helpful to add a little more descriptive or narrative language to the figure legends (I know some of them are already quite long!) so that readers can follow the general idea of the experiment through the figure legend as well as the main text; as written, the figure legends are mostly exclusively technical details, so it can be hard to parse what experiment is being carried out in some cases.

      Thank you for the reviewer’s suggestion. We will strive to improve the language of the figure legends so that they retain the technical details while clearly conveying the main idea of each experiment. We will ensure that the figure legends are more readable and comprehensible so that readers can more easily parse what experiment is being carried out.

      Reviewer #2 (Recommendations For The Authors):

      Suggestions for improved or additional experiments, data, or analyses:

      The use of RNA-seq to measure RNA abundance in DCP1a and/or b knockout cells can give some insight into both the indirect and direct effects of DCP1a/b on gene expression but cannot identify the direct targets of these genes. Rather, global analysis of RNA stability or capturing uncapped RNA decay intermediates would allow the authors to conclude they have identified direct targets of DCP1a and/or b. Without such analyses, the interpretation of these data should be scaled back to clearly state that RNA levels can be altered through indirect effects of DCP1a/b absence throughout the text.

      We appreciate the reviewer's suggestion. We have modified our sentences to emphasize that the dysregulated genes could be caused by both direct and indirect effects.

      A control/randomly generated gene list should be analyzed for GO terms to determine whether the enrichment of cancer-related pathways in the differentially expressed genes in the DCP1a/b knockout cells is meaningful.

      Thank you for the reviewer's comment. We shuffled our gene list and reperformed the pathway enrichment analysis in Figure 4C and 4D 1,000 times. We focused on the following cancer-related pathways: E2F targets, MTORC1 signaling, G2M checkpoint, MYC target V1, EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling pathways. We then calculated how many times the q-values obtained from the shuffled gene list were more significant than the q-value obtained from our real data. In four of the eight pathways (E2F targets, MTORC1 signaling, G2M checkpoint, and MYC target v1), none of the shuffled gene lists resulted in a q-value smaller than the real one. In the other four pathways (EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling pathways), the q-values were smaller than the real q-value 2, 11, 4, and 4 times out of the 1000 shuffles. Based on the shuffled results, we conclude that the transcriptome of DCP1a/b knockout cells is statistically enriched in these cancer-related pathways.
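
      The shuffling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `permutation_count`, its inputs, and the use of the mean score of pathway genes as the test statistic are all hypothetical stand-ins (the authors reran GSEA on each shuffled list and compared q-values). The sketch only shows the counting logic, that is, how often a shuffled gene list scores at least as extreme as the real one.

```python
import random


def permutation_count(scores, gene_set, n_shuffles=1000, seed=0):
    """Count shuffles whose statistic is at least as large as the observed one.

    scores   : dict mapping gene name -> differential expression score
               (hypothetical input format).
    gene_set : set of gene names belonging to one pathway.
    The statistic is the mean score of the pathway genes, a simple stand-in
    for the GSEA enrichment statistic (an assumption of this sketch).
    """
    rng = random.Random(seed)
    genes = list(scores)
    values = [scores[g] for g in genes]
    # Positions of the pathway genes in the gene list.
    members = [i for i, g in enumerate(genes) if g in gene_set]

    def stat(vals):
        return sum(vals[i] for i in members) / len(members)

    observed = stat(values)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = values[:]
        rng.shuffle(shuffled)  # break the gene-to-score association
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (hits + 1) / (n_shuffles + 1)
    return observed, hits, p_value
```

      In this sketch, `hits` plays the role of the x in the "x/1000" counts reported for each pathway: a pathway for which no shuffle matches the observed statistic returns `hits == 0`.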

      Author response image 1.

      Distribution of q-values resulting from the Gene Set Enrichment Analysis (GSEA) conducted on 1,000 shuffled gene lists for eight cancer-related pathways. The q-values derived from Figure 4C and 4D are indicated by red (DCP1a_KO) and blue (DCP1b_KO) dashed lines, respectively. Some q-values derived from Figure 4C are too small to be labeled on the plots, such as in E2F targets (q value: 5.87E-07), MTORC1 signaling (q values: 6.59E-07 and 1.58E-06 for DCP1a_KO and DCP1b_KO, respectively), MYC target V1 (q value: 0.004644174 for DCP1a_KO), etc. The numbers x/1000 indicate how often the shuffled q-values were smaller than the real q-value out of 1,000 permutations.

      Comparisons of the DCP1a and/or b knockout RNA-seq results should be done to published datasets such as those published by Luo et al., Cell Chemical Biology (2021) to determine whether there are common targets with DCP2 and validate the reported findings.

      Thank you for the reviewer’s suggestion. We compared the upregulated genes from DCP1a_KO, DCP1b_KO, and DCP1a/b_KO cell lines with the 91 targets of DCP2 identified by Luo et al. in Cell Chemical Biology (2021). Only EPPK1 overlapped between the potential DCP1b_KO targets and the DCP2 targets, and no genes overlapped between the potential DCP1a_KO targets and the DCP2 targets. However, three genes (TES, PAX6, and C18orf21) overlapped between the significantly upregulated DEGs of DCP1a/b_KO and the DCP2 targets. We have included this information in the discussion section.

      The RNA tethering assays are not clear and are difficult to interpret without further controls to delineate the polyadenylated and deadenylated species.

      Thank you for the reviewer’s feedback. We acknowledge that the reviewer might harbor some doubts regarding the outcomes of the RNA tethering assays. Nonetheless, this methodology is well-established and has also found extensive application across many studies. We are committed to enhancing the clarity of our experiment’s details and results within the figure legends and textual descriptions.

      The representative images of p-bodies clearly show that DCP1a/b KO cells have larger p-bodies than the wild-type cells. The authors should quantify p-body size in each image set as the current interpretation of the data is that there is no difference in size or number of p-bodies, but the data suggest otherwise.

      Thank you very much for the reviewer’s insightful comments and for drawing our attention to the need to quantify p-body sizes in DCP1a/b KO and wild-type cells. We agree with the reviewer’s assessment that the representative images suggest a difference in p-body size between DCP1a/b KO cells and wild-type cells, which we initially overlooked. We will revise our manuscript accordingly to include these findings, ensuring that our interpretation of the data aligns with the observed differences.

      Statistical analysis of the Figure 2C results should be included because the difference between the wild-type and Dcp1a/b KO cells with GFP-DCP2 looks significantly different but is interpreted in the text as not significant.

      Thank you for pointing out the need for a statistical analysis of the results shown in Figure 2C. We acknowledge that the visual difference between the wild-type and Dcp1a/b KO cells with GFP-DCP2 suggests a significant variation, which may not have been clearly communicated in our text. We will conduct the necessary statistical analysis to substantiate the observations made in Figure 2C. Furthermore, we would like to emphasize that our primary focus was to demonstrate that purified DCP2 within cells retains its activity even in the absence of DCP1. This critical point will be highlighted and clarified in the revised version of our manuscript to prevent any misunderstanding.

      Recommendations for improving the writing and presentation:

      Additional context including what is known about the role of dcp1 in decapping from the decades of work in yeast and other model organisms should be incorporated into the introduction and discussion sections.

      Thank you for the reviewer’s suggestion. We will incorporate additional context about the function and significance of DCP1 in decapping processes within our revised manuscript's introduction and discussion sections.

      Details should be provided within the figure legends and methods section on experimental approaches and the number of replicates and statistical analyses used throughout the manuscript. For example, it is not clear whether western blots or RNA-IP experiments were performed more than once as representative images are shown.

      Thank you for the reviewer’s suggestion. In the figure legends and methods section, we will provide more details about the experimental methods, number of replicates, and statistical analyses. Regarding the Western blots and RNA-IP experiments the reviewer mentioned, we performed multiple experiments and presented representative images in the manuscript. We will clarify this in the revised manuscript to eliminate potential confusion.

      The rationale for performing metabolic profiling is not clear.

      We appreciate the reviewer's thoughtful feedback. The rationale behind conducting metabolic profiling in our study is rooted in its efficacy as a valuable tool for deciphering the consequences of specific gene mutations, particularly those closely associated with phenotypic changes or final metabolic pathways. Our objective is to utilize metabolic profiling to unravel the distinct biofunctions of DCP1a and DCP1b. By employing this approach, we aim to gain insights into the intricate metabolic alterations that result from the absence of these genes, thereby enhancing our understanding of their roles in cellular processes. We recognize the necessity of clearly presenting this rationale and promise to bolster the articulation of these points in the revised version of our manuscript to ensure the clarity and transparency of our research motivation.

      Details in the methods section should be included for the CRISPR/Cas9-mediated gene editing validation. The Sanger sequencing results presented in Figure S1b should be explained. The entire western blot(s) should be shown in Figure S1A to give confidence the Dcp1a/b KO cells are not expressing truncated proteins and the epitopes of the antibodies used to detect Dcp1a/b should be described. The northern blot probes should be described and sequences included. The transcriptomics method should be detailed.

      Thank you for your feedback. In the revised manuscript we will detail the CRISPR/Cas9 gene editing validation, explain the Sanger sequencing results in Figure S1b, and show the full Western blot in Figure S1A to confirm that the Dcp1a/b knockout cells are not expressing truncated proteins. We will also describe the Northern blot probes used and detail the transcriptomics method, to ensure that our experimental procedures and results are clear and complete.

      A diagram showing the RNA tethering assays with labels corresponding to all blots/gels should be provided.

      Thank you for your suggestion. We will provide a diagram showing the RNA tethering assays with labels corresponding to all blots/gels in our revised manuscript. This will help readers better understand our experimental design and results.

      The statement, "This suggests that the disruption of the decapping process in DCP1a/b-knockout cells results in the accumulation of unprocessed mRNA intermediates" regarding the results of the RNA-seq assay is not supported by the evidence as RNA-seq does not measure RNA decay intermediates or RNA decay rates.

      Thank you for the reviewer’s comment. We agree that RNA-seq experiments do not directly measure RNA decay intermediates or RNA decay rates. Our statement could have caused confusion, and we have therefore removed this sentence from the manuscript.

      Minor corrections to the text and figures:

      Figure S6A is uninterpretable as presented.

      Thank you for the reviewer’s valuable feedback. We have taken note and made improvements. We have simplified Figure S6A to enhance its interpretability, hoping that the current version will make it easier for the readers to understand.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Original comment: There is no explanation for how this work could be a breakthrough in simulating gregarious feeding as is stated in the manuscript.

      Reviewer response: I think I understand where the authors are trying to take this next step. If the authors were to follow up on this study with the proposed implementation of inhalant/exhalant velocity profiles (or more preferably velocity/pressure fields), then that study would be a breakthrough in simulating such gregarious feeding. Based on what has been done within the present study, I think the term "breakthrough" is instead overly emphatic. An additional note on this. The authors are correct that incorporating additional models could be used to simulate a population (as has been successfully done for several Ediacaran taxa despite computational limitations), but it's not the only way. The authors might explore using periodic boundary conditions on the external faces of the flow domain. This could require only a single Olivooid model to assess gregarious impacts - see the abundant literature of modeling flow through solar array fields.

      We thank Reviewer 1 for the suggestion. Modeling gregarious feeding via periodic boundary conditions is certainly a practical approach when computational resources are limited, and the modeling of flow through solar array fields is an inspiring example. However, to realistically simulate gregarious feeding behavior on an uneven seabed and with an irregular spatial distribution of organisms, periodic boundary conditions alone may not be sufficient (see Author response image 1 for a simple example). We will continue to explore ways of simulating large-scale gregarious feeding.

      Author response image 1.

      An example of modeling gregarious feeding behavior on an uneven seabed.

      Original comment: The claim that olivooid-type feeding was most likely a prerequisite transitional form to jet-propelled swimming needs much more support or needs to be tailored to olivooids. This suggests that such behavior is absent (or must be convergent) before olivooids, which is at odds with the increasing quantities of pelagic life (whose modes of swimming are admittedly unconstrained) documented from Cambrian and Neoproterozoic deposits. Even among just medusozoans, ancestral state reconstruction suggests that they would have been swimming during the Neoproterozoic (Kayal et al., 2018; BMC Evolutionary Biology) with no knowledge of the mechanics due to absent preservation.

      Author response: Thanks for your suggestions. Yes, we agree with you that the ancestral swimming medusae may have appeared before the early Cambrian, even in the Neoproterozoic. However, discussions on the affinities of Ediacaran cnidarians are severely limited because of the lack of information concerning their soft anatomy. So, it is hard to detect the mechanics due to absent preservation. Olivooids found from the basal Cambrian Kuanchuanpu Formation can be reasonably considered as cnidarians based on their radial symmetry, external features, and especially the internal anatomies (Bengtson and Yue 1997; Dong et al. 2013; 2016; Han et al. 2013; 2016; Liu et al. 2014; Wang et al. 2017; 2020; 2022). The valid simulation experiment here was based on the soft tissue preserved in olivooids.

      Reviewer response: This response does not sufficiently address my earlier comment. While the authors are correct that individual Ediacaran affinities are an area of active research and that Olivooids can reasonably be considered cnidarians, this doesn't address the actual critique in my comment. Most (not all) Ediacaran soft-bodied fossils are considered to have been benthic, but pelagic cnidarian life is widely acknowledged to at least be present during later White Sea and Nama assemblages (and earlier depending on molecular clock interpretations). The authors have certainly provided support for the mechanics of this type of feeding being co-opted for eventual jet propulsion swimming in Olivooids. They have not provided sufficient justifications within the manuscript for this to be broadened beyond this group.

      Thanks for your sincere commentary. We of course agree that swimming cnidarians may have emerged before the lowermost Cambrian Fortunian Stage. See lines 16-129: “Ediacaran fossil assemblages with complex ecosystems consist of exceptionally preserved soft-bodied eukaryotes of enigmatic morphology, which their affinities are mostly unresolved (Tarhan et al., 2018, Integrative and Comparative Biology, 58 (4), 688–702; Evans et al., 2022, PNAS, 11(46), e220747511).” Olivooids undoubtedly belong to the cnidarians, as characterized by their external and internal biological structures. Limited by the fossil record, we can only speculate on the transition of ancestral cnidarians from a benthic to a swimming lifestyle on the basis of well-preserved fossils such as olivooids. The transition may have required processes such as increasing body size, thickening of the mesoglea, and degeneration of the periderm. These processes may have evolved independently or together. Moreover, the ecological behaviors of ancestral cnidarians may have evolved independently at different stages from the Ediacaran to the Cambrian. We therefore cannot provide further justification beyond olivooids.

      Original comment: L446: two layers of hexahedral elements is a very low number for meshing boundary layer flow

      Reviewer response: As the authors point out in the main text, these organisms are small (millimeters in scale) and certainly lived within the boundary layer range of the ocean. While the boundary layer is not the main point, it still needs to be accurately resolved as it should certainly affect the flow further towards the far field at this scale. I'm not suggesting the authors need to perfectly resolve the boundary layer or focus on using turbulence models more tailored to boundary layer flows (such as k-w), but the flow field still needs sufficient realism for a boundary bounded flow. The authors really should consider quantitatively assessing the number of hexahedral elements within their mesh refinement study.

      To address this concern, we ran another four simulations based on mesh4 within our mesh refinement study to assess the number of hexahedral elements (five layers and eight layers of hexahedral elements, each with different thicknesses of the boundary layer mesh, controlled by the thickness adjustment factor). The results have been added to Table supplement 2. As shown in the results, the number of layers of hexahedral elements does not seem to significantly influence the result, but the thickness of the boundary layer mesh can influence the maximum flow velocity of the contraction phase. However, the results of all the simulations were generally consistent, as shown in Author response image 2. A description of these results was added to the section “Mesh sensitivity analysis”.

      Author response image 2.

      Results of mesh refinement study of different boundary layer mesh parameters.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studied in isolation. Support for the main conclusions is generally solid, but could be strengthened by additional analyses.

      Strengths

      The paper has improved in revision, and the new intuitive summary statements added to the end of each results section are quite helpful.

      Weaknesses

      The concern about whether the results hold outside of the range in which neural responses are linear remains. This is particularly true given the discontinuity observed in the stability measure. I appreciate the concern (provided in the response to the first round of reviews) that studying nonlinear networks requires a lot of work. A more limited undertaking would be to test the behavior of a spiking network at a few key points identified by your linearization approach. Such tests could use relatively simple (and perhaps imperfect) measures of gain and stability. This could substantially enhance the paper, regardless of the outcome.

      We appreciate the reviewer’s concern. In our resubmission we explore whether networks whose dynamics operate outside the regime where linearization is possible continue to show our main result on the (dis)entanglement of stability and gain; the short answer is yes. To this end we have added a new section and figure to our main text.

      “Gain and stability in stochastically forced E – PV – SOM circuits

      To confirm that our results do not depend on our approach of a linearization around a fixed point, we numerically simulate similar networks as shown above (Figure 2) in which the E and PV populations receive slowly varying, large amplitude noise (Figure 6A). This leads to noisy rate dynamics sampling a large subspace of the full firing rate grid (r<sub>E</sub>,r<sub>P</sub>) and thus any linearization would fail to describe the network response. In this stochastically forced network we explore how adding an SOM modulation or a stimulus affects this subspace (Figure 6B). To quantify stability without linearization, we assume that a network is more stable the lower the mean and variance of E rates. This is because very stable networks can better quench input fluctuations [Kanashiro et al., 2017; Hennequin et al., 2018]. To quantify gain, we calculate the change in E rates when adding the stimulus, yet having identical noise realizations for stimulated and non-stimulated networks (Methods).

      For the disinhibitory network without feedback, a positive SOM modulation decreases stability due to increases of the mean and variance of E rates (Figure 6Ci) while the network gain increases (Figure 6Cii). As seen before (Figure 2A,B), stability and gain change in opposite directions in a disinhibitory circuit without feedback. Adding feedback PV → SOM and applying a negative SOM modulation increases both stability and gain and therefore disentangles the inverse relation also in a noisy circuit (Figure 6D-F). This gives numerical support that our results do not depend on the assumption of linearization.

      “Methods: Noisy input and numerical measurement of stability and gain

      We consider a temporally smoothed input process ξ<sub>X</sub> with white noise ζ (zero mean, standard deviation one): for populations X ∈ {E,P} with timescale τ<sub>ξ</sub> = 50 ms, σ<sub>X</sub> = 6 and fixed mean input I<sub>X</sub>. To quantify the stability of the network without linearization, we assume that a network is more stable if the mean and variance of excitatory rates are low. To quantify network gain, we freeze the white noise process ζ for the case of with and without stimulus presentation and calculate the difference of E rates at each time point, leading to a distribution of network gains (Figure 6Cii,Fii). Total simulation time is 1000 seconds.”
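
      The frozen-noise gain measurement described in this Methods excerpt can be sketched as follows. This is an illustrative sketch only, not the authors' model: the Ornstein-Uhlenbeck form of the smoothed noise, the single rate unit with a rectified quadratic transfer function (instead of the full E – PV – SOM circuit), the mean input of 10, the stimulus size, the short simulation time, and all function names are assumptions. Only τ<sub>ξ</sub> = 50 ms and σ<sub>X</sub> = 6 are taken from the text. The key idea shown is that the same noise realization is reused with and without the stimulus, so that the pointwise rate difference gives a distribution of gains.

```python
import math
import random


def ou_noise(n_steps, dt, tau, sigma, seed):
    """Temporally smoothed noise (assumed Ornstein-Uhlenbeck form),
    integrated with the Euler-Maruyama scheme; stationary std is ~sigma."""
    rng = random.Random(seed)  # fixed seed = "frozen" noise realization
    xi = 0.0
    trace = []
    for _ in range(n_steps):
        xi += -xi * dt / tau + sigma * math.sqrt(2.0 * dt / tau) * rng.gauss(0.0, 1.0)
        trace.append(xi)
    return trace


def simulate_rate(noise, mean_input, stimulus, dt, tau_r=0.01):
    """Toy single-population rate unit (hypothetical stand-in for E rates)."""
    r = 0.0
    rates = []
    for xi in noise:
        drive = max(mean_input + xi + stimulus, 0.0)   # rectification
        r += dt / tau_r * (-r + 0.01 * drive ** 2)     # quadratic transfer
        rates.append(r)
    return rates


def frozen_noise_gain(seed=1, t_total=5.0, dt=0.001, stimulus=1.0):
    """Rate difference at each time point, with vs. without the stimulus,
    using one identical noise realization for both runs."""
    n_steps = round(t_total / dt)
    noise = ou_noise(n_steps, dt, tau=0.05, sigma=6.0, seed=seed)
    baseline = simulate_rate(noise, mean_input=10.0, stimulus=0.0, dt=dt)
    stimulated = simulate_rate(noise, mean_input=10.0, stimulus=stimulus, dt=dt)
    return [s - b for s, b in zip(stimulated, baseline)]
```

      Because both runs share the noise trace, any difference between them is attributable to the stimulus alone; with a zero stimulus the two runs coincide and every gain value is zero.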

      We decided against using a spiking network because sufficiently asynchronous spiking network dynamics can still obey a linearized mean field theory (if the fluctuations in population firing rates are small). In our new analysis the firing rate deviations from the time averaged firing rate are sizable, making a linearization ineffective.

      In summary, based on our additional analysis of recurrent circuits with noisy inputs, we conclude that our results also hold in fluctuating networks, without the need to assume linearization around a stable fixed point.

      Reviewer #2 (Public Review):

      Summary:

      Bos and colleagues address the important question of how two major inhibitory interneuron classes in the neocortex differentially affect cortical dynamics. They address this question by studying Wilson-Cowan-type mathematical models. Using a linearized fixed point approach, they provide convincing evidence that the existence of multiple interneuron classes can explain the counterintuitive finding that inhibitory modulation can increase the gain of the excitatory cell population while also increasing the stability of the circuit’s state to minor perturbations. This effect depends on the connection strengths within their circuit model, providing valuable guidance as to when and why it arises.

      Overall, I find this study to have substantial merit. I have some suggestions on how to improve the clarity and completeness of the paper.

      Strengths:

      (1) The thorough investigation of how changes in the connectivity structure affect the gain-stability relationship is a major strength of this work. It provides an opportunity to understand when and why gain and stability will or will not both increase together. It also provides a nice bridge to the experimental literature, where different gain-stability relationships are reported from different studies.

      (2) The simplified and abstracted mathematical model has the benefit of facilitating our understanding of this puzzling phenomenon. (I have some suggestions for how the authors could push this understanding further.) It is not easy to find the right balance between biologically-detailed models vs simple but mathematically tractable ones, and I think the authors struck an excellent balance in this study.

      We thank the reviewer for their support of our work.

      Weaknesses:

      (1) The fixed-point analysis has potentially substantial limitations for understanding cortical computations away from the steady-state. I think the authors should have emphasized this limitation more strongly and possibly included some additional analyses to show that their conclusions extend to the chaotic dynamical regimes in which cortical circuits often live.

      In the response to Reviewer 1 we have included model analyses that address the limitations of linearization. Rather than use a chaotic model, which would require significant effort, we opted for a stochastically forced network, where the sizable fluctuations in rate dynamics preclude linearization.

      (2) The authors could have discussed – even somewhat speculatively – how VIP interneurons fit into this picture. Their absence from this modelling framework stands out as a missed opportunity.

      We agree that including VIP neurons into the framework would be an obvious and potentially interesting next step. At this point we only include them as potential modulators of SOM neurons. Modeling their dynamics without them receiving inputs from E, PV, or SOM neurons would be uninteresting. However, including them properly into the circuit would be outside the scope of the paper.

      (3) The analysis is limited to paths within this simple E, PV, SOM circuit. This misses more extended paths (like thalamocortical loops) that involve interactions between multiple brain areas. Including those paths in the expansion in Eqs. 11-14 (Fig. 1C) may be an important consideration.

      We agree that our pathway expansion can be used to study more than just the E – PV – SOM circuit. However, properly investigating full thalamocortical loops should be done in a subsequent study.

      Comments on revisions:

      I think the authors have done a reasonable job of responding to my critiques, and the paper is in pretty good shape. (Also, thanks for correctly inferring that I meant VIP interneurons when I had written SST in my review! I have updated the public review accordingly.)

      I still think this line of research would benefit substantially from considering dynamic regimes including chaotic ones. I strongly encourage the authors to consider such an extension in future work.

      Please see our response above to Reviewer 1.

      Reviewer #3 (Public Review):

      Summary:

      Bos et al study a computational model of cortical circuits with excitatory (E) and two subtypes of inhibition parvalbumin (PV) and somatostatin (SOM) expressing interneurons. They perform stability and gain analysis of simplified models with nonlinear transfer functions when SOM neurons are perturbed. Their analysis suggests that in a specific setup of connectivity, instability and gain can be untangled, such that SOM modulation leads to both increases in stability and gain, in contrast to the typical direction in neuronal networks where increased gain results in decreased stability.

      Strengths:

      - Analysis of the canonical circuit in response to SOM perturbations. Through numerical simulations and mathematical analysis, the authors have provided a rather comprehensive picture of how SOM modulation may affect response changes.

      - Shedding light on two opposing circuit motifs involved in the canonical E-PV-SOM circuitry - namely, direct inhibition (SOM → E) vs disinhibition (SOM → PV → E). These two pathways can lead to opposing effects, and it is often difficult to predict which one results from modulating SOM neurons. In simplified circuits, the authors show how these two motifs can emerge and depend on parameters like connection weights.

      - Suggesting potentially interesting consequences for cortical computation. The authors suggest that certain regimes of connectivity may lead to untangling of stability and gain, such that increases in network gain are not compromised by decreasing stability. They also link SOM modulation in different connectivity regimes to versatile computations in visual processing in simple models.

      We thank the reviewer for their support of our work.

      Weaknesses

      Computationally, the analysis is solid, but it’s very similar to previous studies (del Molino et al, 2017). Many studies in the past few years have done the perturbation analysis of a similar circuitry with or without nonlinear transfer functions (some of them listed in the references). This study applies the same framework to SOM perturbations, which is a useful computational analysis, in view of the complexity of the high-dimensional parameter space.

      Link to biology: the most interesting result of the paper with regard to biology is the suggestion of a regime in which gain and stability can be modulated in an unconventional way - however, it is difficult to link the results to biological networks:

      - A general weakness of the paper is a lack of direct comparison to biological parameters or experiments. How different experiments can be reconciled by the results obtained here, and what new circuit mechanisms can be revealed? In its current form, the paper reads as a general suggestion that different combinations of gain modulation and stability can be achieved in a circuit model equipped with many parameters (12 parameters). This is potentially interesting but not surprising, given the high dimensional space of possible dynamical properties. A more interesting result would have been to relate this to biology, by providing reasoning why it might be relevant to certain circuits (and not others), or to provide some predictions or postdictions, which are currently missing in the manuscript.

      - For instance, a nice motivation for the paper at the beginning of the Results section is the different results of SOM modulation in different experiments - especially between L23 (inhibition) and L4 (disinhibition). But no further explanation is provided for why such a difference should exist, in view of their results and the insights obtained from their suggested circuit mechanisms. How the parameters identified for the two regimes correspond to different properties of different layers?

      Please see our answer to the previous round of revision.

      - One of the key assumptions of the model is nonlinear transfer functions for all neuron types. In terms of modelling and computational analysis, a thorough analysis of how and when this is necessary is missing (an analysis similar to what has been attempted in Figure 6 for synaptic weights, but for cellular gains). A discussion of this, along with the former analysis to know which nonlinearities would be necessary for the results, is needed, but currently missing from the study. The nonlinearity is assumed for all subtypes because it seems to be needed to obtain the results, but it’s not clear how the model would behave in the presence or absence of them, and whether they are relevant to biological networks with inhibitory transfer functions.

      Please see our answer to the previous round of revision.

      - Tuning curves are simulated for an individual orientation (same for all), not considering the heterogeneity of neuronal networks with multiple orientation selectivity (and other visual features) - making the model too simplistic.

      Please see our answer to the previous round of revision.

      Reviewer #1 (Recommendations For The Authors):

      Introduction, first paragraph, last sentence: suggest “sense,” → “sense” (no comma)

      Introduction, second paragraph, first sentence: suggest “is been” → “has been”

      Introduction, very end of next to last paragraph: clarify “modulate the circuit”

      Figure 1 legend: can you make the “Change ...” in the legend for 1D clearer - e.g. “strengthen SOM → E connections and eliminate SOM → P connections”.

      Paragraph immediately below Figure 1: In sentence starting “Specifically ...” can you relate the cases described here back to the equation in Figure 1C?

      Sentence right below equation 2: This sentence does not separate the network gain from the cellular gain as clearly as it could.

      Page 7, second full paragraph: sentence starting “Therefore, with ...” could be split into two or otherwise made clearer.

      Sentence starting “Furthermore” right below Figure 5 has an extra comma

      We thank the reviewer for their additional comments and have made the corresponding changes in the manuscript.

      Reviewer #3 (Recommendations For The Authors):

      There is a long part in the reply letter discussing the link to biology - but the revised manuscript doesn’t seem to reflect that.

      The information in the reply letter discussing the link to biology has been added at multiple points in the discussion. In the section ‘division of labor between PV and SOM neurons’ we mention Ferguson and Cardin 2020, in the section ‘impact of SOM neuron modulation on tuning curves’ we discuss Phillips and Hasenstaub 2016, and in the section ‘limitations and future directions’ we mention Tobin et al., 2023.

      The writing can be improved - for example, see below instances:

      P. 7: Intuitively, the inverse relationship follows for inhibitory and disinhibitory pathways (and their mixture) because the firing rate grid (heatmap) does not depend on how the SOM neurons inhibit the E - PV circuit.

      P.8: We first remark that by adding feedback E connections onto SOM neurons, changes in SOM rates can now affect the underlying heatmaps in the (rE, rP) grid.

      Not clear how "rates can affect the heatmaps". It’s too colloquial and not scientifically rigorous or sound.

      We added further explanations at the respective places in the manuscript to improve the writing.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to Reviewer 1

      Thank you for your recognition of our revised work.

      Response to Reviewer 2

      It would be useful to have a demonstration of where this model outperforms SaProt systematically, and a discussion about what the success of this model teaches us given there is a similar, previously successful model, SaProt.

      As two concurrent works, ProtSSN and SaProt employ different methods to incorporate the structure information of proteins. Generally speaking, for two deep learning models that were developed around the same time, it is challenging to conclude that one model is systematically superior to the other. Nonetheless, on DTm and DDG (the two low-throughput datasets that we constructed), ProtSSN demonstrates better empirical performance than SaProt.

      Moreover, ProtSSN is more efficient in both training and inference compared to SaProt. In terms of training cost, SaProt uses 40 million protein structures for pretraining (requiring 64 A100 GPUs for three months), whereas ProtSSN requires only about 30,000 crystal structures from the CATH database (trained on a single 3090 GPU for two days). Despite SaProt’s significantly higher training cost, its pretrained version does not exhibit superior performance on low-throughput datasets such as DTm, DDG, and Clinvar. Furthermore, the high training cost limits many users from retraining or fine-tuning the model for specific needs or datasets.

      Regarding the inference cost, ProtSSN requires only one embedding computation for a wild-type protein, regardless of the number of mutants (n). In contrast, SaProt computes a separate embedding and score for each mutant. For instance, when evaluating the scoring performance on ProteinGym, ProtSSN only needs 217 inferences, while SaProt needs more than 2M inferences. This inference speed is important in practice, such as high-throughput design and screening.
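      The inference-cost argument above can be illustrated with a minimal sketch. It assumes, hypothetically, that a model returns one matrix of per-position amino-acid log-probabilities for the wild-type sequence and that each mutant is scored by a log-odds ratio read from that matrix. The names `score_mutants` and `wt_log_probs` are illustrative only and are not taken from ProtSSN or SaProt; the uniform probabilities stand in for a real model output.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def score_mutants(wt_seq, wt_log_probs, mutants):
    """Score many mutants from ONE wild-type forward pass (hypothetical setup).

    wt_log_probs[i][j] is the assumed log-probability of amino acid j at
    position i, computed once for the wild-type protein.
    Each mutant is a list of substitutions such as "A2G" (1-based position).
    """
    scores = []
    for mutant in mutants:
        total = 0.0
        for sub in mutant:
            wt_aa, pos, mut_aa = sub[0], int(sub[1:-1]) - 1, sub[-1]
            if wt_seq[pos] != wt_aa:
                raise ValueError(f"{sub} does not match the wild-type sequence")
            row = wt_log_probs[pos]
            # Log-odds of the mutant residue versus the wild-type residue.
            total += row[AA_INDEX[mut_aa]] - row[AA_INDEX[wt_aa]]
        scores.append(total)
    return scores


# Toy example: uniform log-probabilities stand in for a real model output.
wt_seq = "MAG"
uniform = [math.log(1.0 / len(AMINO_ACIDS))] * len(AMINO_ACIDS)
wt_log_probs = [list(uniform) for _ in wt_seq]
toy_scores = score_mutants(wt_seq, wt_log_probs, [["A2G"], ["M1L", "G3A"]])
```

      In this sketch the expensive step (producing `wt_log_probs`) happens once per protein, and every additional mutant costs only a few table lookups, which is the scaling behaviour the response describes.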

      Please remove the reference to previous methods as "few shot". This typically refers to their being trained on experimental data, not their using MSAs. A "few shot" model would be ProteinNPT.

      Our definition of "few-shot" follows that of ESM-1v [1]. The concept originates from providing a certain number of examples as input to GPT-3 [2]. In the context of protein deep learning models, the MSA serves as the set of examples for the wild-type protein.

      Also, Reviewer 1 uses the concept in the same way. 

      “Readers should note that methods labelled as "few-shot" in comparisons do not make use of experimental labels, but rather use sequences inferred as homologous; these sequences are also often available even if the protein has never been experimentally tested.”

      In the main text, we also included this definition as well as the reference of ESM-1v in lines 457-458.

      “We extend the evaluation on ProteinGym v0 to include a comparison of our zero-shot ProtSSN with few-shot learning methods that leverage MSA information of proteins (Meier et al., 2021).”

      (1) Meier J, Rao R, Verkuil R, et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 2021.

      (2) Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.

      Furthermore, I don't think it is fair to state that your method is not comparable to these models -- one can run an MSA just as one can predict a structure. A fairer comparison would be to highlight particular assays for which getting an MSA could be challenging -- Tranception did this by showing that they outperform EVE when MSAs are shallow.

      We recognize that there are often differences in the definitions and classifications of various methodologies. Here, we follow the definitions provided by ProteinGym. As it is the most comprehensive large-scale open benchmark in the community, we believe its classification scheme should be widely accepted. All classifications are available on the official website of ProteinGym (https://proteingym.org/benchmarks), which categorizes methods into PLMs, Structure-based models, and Alignment-based models. For example, GEMME is classified as an alignment-based model, and MSA Transformer is considered a hybrid model combining alignment and PLM features.

      We believe that methodologies with different inputs and architectures can lead to inherent unfairness. Also, it is generally believed that models including evolutionary relationships tend to outperform end-to-end models due to the extra information and efforts involved during the training phase. Some empirical evidence and discussions are in the ablation studies of retrieval factors in Tranception [3]. Moreover, the choice of MSA search parameters can introduce uncertainty, which could have positive or negative impacts. 

      We showcase the impact of MSA depth on model performance with an additional analysis below. Author response image 1 visualizes the Spearman’s correlation between the scores of each model and the number of MSA sequences on 217 ProteinGym assays, where each point represents one of the 217 assays. The summary correlation of each model with respect to all assays is reported in Author response table 1. These results demonstrate no clear correlation between MSA depth and model performance, even for MSA-based models.

      Author response image 1.

      Scatter plots of the number of MSA sequences and Spearman’s correlation.

      Author response table 1.

      Spearman’s correlation between the number of MSA sequences and the model’s performance.
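      For readers who wish to reproduce this kind of check, the following is a minimal sketch of a rank correlation between MSA depth and per-assay performance. The function names and the five data points are hypothetical and are not values from the analysis above; Spearman's rho is computed here as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers: MSA depth and model performance for five assays.
msa_depth = [120, 4500, 980, 30000, 15]
performance = [0.41, 0.45, 0.52, 0.48, 0.39]
rho = spearman(msa_depth, performance)
```

      A value of rho near zero across all assays would correspond to the "no clear correlation" conclusion stated above; the toy numbers here are chosen only to exercise the function.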

      (3) Notin P, Dias M, Frazer J, et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. International Conference on Machine Learning, 2022.

      The authors state that DTm and DDG are conceptually appealing because they come from low-throughput assays with lower experimental noise and are also mutations that are particularly chosen to represent the most interesting regions of the protein. I agree with the conceptual appeal but I don't think these claims have been demonstrated in practice. The cited comparison with Frazer as a particularly noisy source of data I think is particularly unconvincing: ClinVar labels are not only rigorously determined from multiple sources of evidence, Frazer et al demonstrates that these labels are actually more reliable than experiment in some cases. They also state that ProteinGym data doesn't come with environmental conditions, but these can be retrieved from the papers the assays came from. The paper would be strengthened by a demonstration of the conceptual benefit of these new datasets, say a comparison of mutations and signal for a protein that may be in one of these datasets vs ProteinGym.

      In the work by Frazer et al. [4], they mentioned that

      "However, these technologies do not easily scale to thousands of proteins, especially not to combinations of variants, and depend critically on the availability of assays that are relevant to or at least associated with human disease phenotypes." 

      It points out that the results of high-throughput experiments are usually based on the design of specific genes (such as BRCA1 and TP53) and cannot be easily extended to thousands of other genes. At the same time, due to the complexity of the experiment, there may be problems with reproducibility or deviations from clinical relevance.

      This statement aligns with our perspective that high-throughput experiments inherently involve a significant amount of noise and error. It is important to clarify that the noise we discuss here arises from the limitations of high-throughput experiments themselves, instead of from the reliability of the data sources, such as systematic errors in experimental measurements. This latter issue is a complex problem common to all wetlab experiments and falls outside the scope of our study.

      Under this premise, low-throughput datasets like DTm and DDG can be considered to have less noise than high-throughput datasets, as they have undergone manual curation. As for your suggestion, while it is valuable, we were unfortunately unable to identify proteins in DTm and DDG that align with those in ProteinGym after a careful search. Thus, we are unable to conduct this comparative experiment at this stage.

      (4) Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature, 2021.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      I would like to express my appreciation for the authors' dedication to revising the manuscript. It is evident that they have thoughtfully addressed numerous concerns I previously raised, significantly contributing to the overall improvement of the manuscript.

      Response: We appreciate the reviewers’ recognition of our efforts in revising the manuscript.

      My primary concern regarding the authors' framing of their findings within the realm of habitual and goal-directed action control persists. I will try to explain my point of view and perhaps clarify my concerns. While acknowledging the historical tendency to equate procedural learning with habits, I believe a consensus has gradually emerged among scientists, recognizing a meaningful distinction between habits and skills or procedural learning. I think this distinction is crucial for a comprehensive understanding of human action control. While these constructs share similarities, they should not be used interchangeably. Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses).

      Response: We would like to clarify that, contrary to the reviewer’s assertion of a scientific consensus on this matter, the discussion surrounding the similarities and differences between habits and skills remains an ongoing and unresolved topic of interest among scientists (Balleine and Dezfouli, 2019; Du and Haith, 2023; Graybiel and Grafton, 2015; Haith and Krakauer, 2018; Hardwick et al., 2019; Kruglanski and Szumowska, 2020; Robbins and Costa, 2017). We absolutely agree with the reviewer that “Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses)”. But so can habits. Some researchers also highlight the intentional/goal-directed nature of habits (e.g., Du and Haith, 2023, “Habits are not automatic” (preprint), or Kruglanski and Szumowska, 2020, “Habitual behavior is goal-driven”: “definitions of habits that include goal independence as a foundational attribute of habits are begging the question; they effectively define away, and hence dispose of, the issue of whether habits are goal-driven” (p. 1258)). Therefore, there is no clear consensus concerning the concept of habit.

      While we acknowledge the meaningful distinctions between habits and skills, we also recognize a substantial body of literature supporting the overlap between these concepts (cited in our manuscript), particularly at the neural level. The literature clearly indicates that both habits and skills are mediated by subcortical circuits, with a progressive disengagement of cognitive control hubs in frontal and cingulate cortices as repetition evolves. We do not use these concepts interchangeably. Instead, we simply present evidence supporting the assertion that our trained app sequences meet several criteria for their habitual nature.

      Our choice of Balleine and Dezfouli (2018)'s criteria stemmed from the comprehensive nature of their definitions, which effectively synthesized insights from various researchers (Mazar and Wood, 2018; Verplanken et al., 1998; Wood, 2017, etc). Importantly, their list highlights the positive features of habits that were previously overlooked. However, these authors still included a controversial criterion ("habits as insensitive to changes in their relationship to their individual consequences and the value of those consequences"), even though they acknowledged the problems of using outcome devaluation methods and of relying on a null effect. According to Kruglanski and Szumowska (2020), this criterion is highly problematic as “If, by definition, habits are goal-independent, then any behavior found to be goal-dependent could not be a habit on sheer logical grounds” (p. 1257). In their definition, “habitual behavior is sensitive to the value of the reward (i.e., the goal) it is expected to mediate and is sensitive to the expectancy of goal attainment (i.e., obtainment of the reward via the behavior)” (p. 1265). In fact, some recent analyses of habitual behavior do not use devaluation or revaluation as a criterion (Du and Haith, 2023). This article, for example, ascertains habits using different criteria and provides supporting evidence for trained action sequences being understood as skills, with both goal-directed and habitual components.

      In the discussion of our manuscript, we explicitly acknowledge that the app sequences can be considered habitual or goal-directed in nature and that this terminology does not alter the fact that our overtrained sequences exhibit clear habitual features.

      Watson et al. (2022) aptly detailed my concerns in the following statements: "Defining habits as fluid and quickly deployed movement sequences overlaps with definitions of skills and procedural learning, which are seen by associative learning theorists as different behaviors and fields of research, distinct from habits."

      "...the risk of calling any fluid behavioral repertoire 'habit' is that clarity on what exactly is under investigation and what associative structure underpins the behavior may be lost." I strongly encourage the authors, at the very least, to consider Watson et al.'s (2022) suggestion: "Clearer terminology as to the type of habit under investigation may be required by researchers to ensure that others can assess at a glance what exactly is under investigation (e.g., devaluation-insensitive habits vs. procedural habits)", and to refine their terminology accordingly (to make this distinction clear). I believe adopting clearer terminology in these respects would enhance the positioning of this work within the relevant knowledge landscape and facilitate future investigations in the field.

      Response: We would like to highlight that we have indeed followed Watson et al. (2022)’s recommendations on focusing on other features/criteria of habits rather than on the outcome devaluation/contingency degradation paradigm, which has been more controversial in the human literature. Our manuscript clearly aligns with Watson et al. (2022)’s recommendations: “there are many other features of habits that are not captured by the key metrics from outcome devaluation/contingency degradation paradigms such as the speed at which actions are performed and the refined and invariant characteristics of movement sequences (Balleine and Dezfouli, 2019). Attempts are being made to develop novel behavioral tasks that tap into these positive features of habits, and this should be encouraged as should be tasks that are not designed to assess whether that behavior is sensitive to outcome devaluation, but capture the definition of habits through other measures”.

      Regarding the authors' use of Balleine and Dezfouli's (2018) criteria to frame recorded behavior as habitual, as well as to acknowledge the study's limitations, it's important to highlight that while the authors labelled the fourth criterion (which they were not fulfilling) as "resistance to devaluation," Balleine and Dezfouli (2018) define it as "insensitive to changes in their relationship to their individual consequences and the value of those consequences." In my understanding, this definition is potentially aligned with the authors' re-evaluation test, namely, it is conceptually adequate for evaluating the fourth criterion (which is the most accepted in the field and probably the one that differentiates habits from skills). Notably, during this test, participants exhibited goal-directed behavior.

      The authors characterized this test as possibly assessing arbitration between goal-directed and habitual behavior, stating that participants in both groups "demonstrated the ability to arbitrate between prior automatic actions and new goal-directed ones." In my perspective, there is no justification for calling it a test of arbitration. Notably, the authors inferred that participants were habitual before the test based on some criteria, but then transitioned to goal-directed behavior based on a different criterion. While I agree with the authors' comment that: "Whether the initiation of the trained motor sequences in experiment 3 (arbitration) is underpinned by an action-outcome association (or not) has no bearing on whether those sequences were under stimulus-response control after training (experiment 1)." they implicitly assert a shift from habit to goal-directed behavior without providing evidence that relies on the same probed mechanism. Therefore, I think it would be more cautious to refer to this test as solely an outcome revaluation test. Again, the results of this test, if anything, provide evidence that the fourth criterion was tested but not met, suggesting participants have not become habitual (or at least undermines this option).

      Response: In our previously revised manuscript, we duly acknowledged that the conventional (perhaps nowadays considered outdated) goal devaluation criterion was not met, primarily due to constraints in designing the second part of the study. We did cite evidence from another similar study that had used devaluation app-trained action sequences to demonstrate habitual qualities (but the reviewer ignored this).

      The reviewer points out that we did use a manipulation of goal revaluation in one of the follow-up tests conducted (although this was not a conventional goal revaluation test inasmuch as it was conducted in a novel context). In this test, please note that we used 2 manipulations: monetary and physical effort. Although we did show that subjects, including OCD patients, were apparently goal-directed in the monetary reward manipulation, this was not so clear when goal re-evaluation involved the physical effort expended. In this effort manipulation, participants were less goal-oriented and OCD patients preferred to perform the longer, familiar sequence rather than the shorter, novel one, thus exhibiting significantly greater habitual tendencies compared to controls. Hence, we cannot decisively conclude that the action sequence is goal-directed as the reviewer is arguing. In fact, the evidence is equivocal and may reflect both habitual and goal-directed qualities in the performance of this sequence, consistent with recent interpretations of skilled/habitual sequences (Du and Haith, 2023). Relying solely on this partially met criterion to conclude that the app-trained sequences are goal-directed, and therefore not habitual, would be an inaccurate assessment for several reasons: 1) the action sequences did satisfy all other criteria for being habitual; 2) this approach would rest on a problematic foundation for defining habits, as emphasized by Kruglanski & Szumowska (2020); and 3) it would succumb to the pitfall of subscribing to a zero-sum game perspective, as cautioned by various researchers, including the review by Watson et al. (2022) cited by the referee, thus oversimplifying the nuanced nature of human behavior.

      While we have previously complied with the reviewer’s suggestion on relabelling our follow-up test as a “revaluation test” instead of an “arbitration test”, we have now explicitly removed all mentions of the term “arbitration” (which seems to raise concerns) throughout the manuscript. As the reviewer has suggested, we now use a more refined terminology by explicitly referring to the measured behavior as "procedural habits". We have also extensively revised the discussion section of our manuscript to incorporate the reviewer’s viewpoint. We hope that these adjustments enhance the clarity and accuracy of our manuscript, addressing the concerns raised during this review process.

      In essence, this is an ontological and semantic matter that does not alter our findings in any way. Whether the sequences are considered habitual or goal-directed does not change our findings that 1) both groups displayed equivalent procedural learning and automaticity attainment; 2) OCD patients exhibit greater subjective habitual tendencies via self-reported questionnaires; 3) patients who had elevated compulsivity and habitual self-reported tendencies engaged significantly more with the motor habit-training app, practiced more and reported symptom relief at the end of the study; 4) these particular patients also show an augmented inclination to attribute higher intrinsic value to familiar actions, a possible mechanism underlying compulsions.

      Reviewer #2 (Recommendations For The Authors):

      A few more small comments (with reference to the point numbers indicated in the rebuttal):

      (14) I am not entirely sure why the suggested analysis is deemed impractical (i.e., why it cannot be performed by "pretending" participants received the points they should have received according to their performance). This can further support (or undermine) the idea of effect of reward on performance rather than just performance on performance.

      Response: We have now conducted this analysis, generating scores for each trial of practices after day 20, when participants no longer gained points for their performance. This analysis assesses whether participants’ trial-wise behavioral changes exhibit a similar pattern following simulated relative increases or decreases in scores, as if they had been receiving points at this stage. Note that this analysis has fewer trials available, around 50% fewer on average.

      Before presenting our results, we wish to emphasize the importance of distinguishing between the effects of performance on performance and the effects of reward on performance. In response to a reviewer's suggestion, we assessed the former in the first revision of our manuscript. We normalized the movement time variable and evaluated how normalized behavioral changes responded to score increments and decrements. The results from the original analyses were consistent with those from the normalized data.

      Regarding the phase where participants no longer received scores, we believe this phase primarily helps us understand the impact of 'predicted' or 'learned' rewards on performance. Once participants have learned the simple association between faster performance and larger scores, they can be expected to continue exhibiting the reward sensitivity effects described in our main analysis. We consider it is not feasible to assess the effects of performance on performance during the reward removal phase, which occurs after 20 days. Therefore, the following results pertain to how the learned associations between faster movement times and scores persist in influencing behavior, even when explicit scores are no longer displayed on the screen.

      Results: The main results of the effect of reward on behavioral changes persist, supporting that relative increases or decreases in scores (real or imagined/inferred) modulate behavioral adaptations trial-by-trial in a consistent manner across both cohorts. The direction of the effects of reward is the same as in the main analyses presented in the manuscript: larger mean behavioral changes (smaller std) following ∆R- . First, concerning changes in “normalized” movement time (MT) trial-by-trial, we conducted a 2 x 2 factorial analysis of the centroid of the Gaussian distributions with the same factors Reward, Group and Bin. This analysis demonstrated a significant main effect of Reward (P = 2e-16), but not of Group (P = 0.974) or Bin (P = 0.281). There were no significant interactions between factors. The main Reward effect can be observed in the top panel of the figure below. The same analysis applied to the spread (std) of the Gaussian distributions revealed a significant main effect of Reward (P = 0.000213), with no additional main effects or interactions.

      Author response image 1.

      Next, conducting the same 2 x 2 factorial analyses on the centroid and spread of the Gaussian distributions fitted to the Consistency data, we also obtained a robust significant main effect of Reward. For the centroid variable, we obtained a significant main effect of Reward (P = 0.0109) and Group (P = 0.0294), while Bin and the factor interactions were non-significant. See the top panel of the figure below.

      On the other hand, Reward also modulated significantly the spread of the Gaussian distributions fitted to the Consistency data, P = 0.00498. There were no additional significant main effects or interactions. See the bottom panel in the figure below.

      Note that here the factorial analysis was performed on the logarithmic transformation of the std.

      Author response image 2.

      (16) I find this result interesting and I think it might be worthwhile to include it in the paper.

      Response: We have now included this result in our revised manuscript (page 28)

      (18) I referred to this sentence: "The app preferred sequence was their preferred putative habitual sequence while the 'any 6' or 'any 3'-move sequences were the goal-seeking sequences." In my understanding, this implies one choice is habitual and another indicates goal-directedness.

      One last small comment:
      In the Discussion it is stated: "Moreover, when faced with a choice between the familiar and a new, less effort-demanding sequence, the OCD group leaned toward the former, likely due to its inherent value. These insights align with the theory of goal-direction/habit imbalance in OCD (Gillan et al., 2016), underscoring the dominance of habits in particular settings where they might hold intrinsic value."

      This could equally be interpreted as goal-directed behavior, so I do not think there is conclusive support for this claim.

      Response: The choice of the familiar/trained sequence, as opposed to the 'any 6' or 'any 3'-move sequences, cannot be explicitly considered goal-directed: firstly, because the app familiar sequences were associated with less monetary reward (in the any-6 condition), and secondly, because participants would clearly need more effort and time to perform them. Even though these were automatic, it would still be much easier and faster to simply tap one finger sequentially 6 times (any-6) or 3 times (any-3). Therefore, the choice for the app-sequence would not be optimal/goal-directed. In this sense, that choice aligns with the current theory of goal-direction/habit imbalance of OCD. We found that OCD patients prefer to perform the trained app sequences in the physical effort manipulation (any-3 condition). While this, on the one hand, cannot be explicitly considered a goal-directed choice, we agree that there is another possible goal involved here, which links to the intrinsic value associated with the familiar sequence. In this sense the action could potentially be considered goal-directed. This highlights the difficulty of this concept of value and agrees with: 1) Hommel and Wiers (2017): “Human behavior is commonly not driven by one but by many overlapping motives . . . and actions are commonly embedded into larger-scale activities with multiple goals defined at different levels. As a consequence, even successful satiation of one goal or motive is unlikely to also eliminate all the others” (p. 942) and 2) Kruglanski & Szumowska (2020)’s account that “habits that may be unwanted from the perspective of an outsider and hence “irrational” or purposeless, may be highly wanted from the perspective of the individual for whom a habit is functional in achieving some goal” (p. 1262) and therefore habits are goal-driven.

      References:

      Balleine BW, Dezfouli A. 2019. Hierarchical Action Control: Adaptive Collaboration Between Actions and Habits. Front Psychol 10:2735. doi:10.3389/fpsyg.2019.02735

      Du Y, Haith A. 2023. Habits are not automatic. doi:10.31234/osf.io/gncsf

      Graybiel AM, Grafton ST. 2015. The Striatum: Where Skills and Habits Meet. Cold Spring Harb Perspect Biol 7:a021691. doi:10.1101/cshperspect.a021691

      Haith AM, Krakauer JW. 2018. The multiple effects of practice: skill, habit and reduced cognitive load. Current Opinion in Behavioral Sciences 20:196–201. doi:10.1016/j.cobeha.2018.01.015

      Hardwick RM, Forrence AD, Krakauer JW, Haith AM. 2019. Time-dependent competition between goal-directed and habitual response preparation. Nat Hum Behav 1–11. doi:10.1038/s41562-019-0725-0

      Hommel B, Wiers RW. 2017. Towards a Unitary Approach to Human Action Control. Trends Cogn Sci 21:940–949. doi:10.1016/j.tics.2017.09.009

      Kruglanski AW, Szumowska E. 2020. Habitual Behavior Is Goal-Driven. Perspect Psychol Sci 15:1256–1271. doi:10.1177/1745691620917676

      Mazar A, Wood W. 2018. Defining Habit in Psychology In: Verplanken B, editor. The Psychology of Habit: Theory, Mechanisms, Change, and Contexts. Cham: Springer International Publishing. pp. 13–29. doi:10.1007/978-3-319-97529-0_2

      Robbins TW, Costa RM. 2017. Habits. Current Biology 27:R1200–R1206. doi:10.1016/j.cub.2017.09.060

      Verplanken B, Aarts H, van Knippenberg A, Moonen A. 1998. Habit versus planned behaviour: a field experiment. Br J Soc Psychol 37 ( Pt 1):111–128. doi:10.1111/j.2044-8309.1998.tb01160.x

      Watson P, O’Callaghan C, Perkes I, Bradfield L, Turner K. 2022. Making habits measurable beyond what they are not: A focus on associative dual-process models. Neurosci Biobehav Rev 142:104869. doi:10.1016/j.neubiorev.2022.104869

      Wood W. 2017. Habit in Personality and Social Psychology. Pers Soc Psychol Rev 21:389–403. doi:10.1177/1088868317720362

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their insightful feedback, which has substantially improved our manuscript. Following their suggestions, we have undertaken the following major revisions:

      a. Concerning data transformation, we have adjusted the methodology in Figures 2 and 3. Instead of normalizing c-Fos density to the whole brain c-Fos density as initially described, we now normalize to the c-Fos density of the corresponding brain region in the control group.

      b. We have substituted the PCA approach with hierarchical clustering in Figures 2 and 3.

      c. In the discussion section, we added a subsection on study limitations, focusing on the variations in drug administration routes and anesthesia depth.

      Enclosed are our detailed responses to each of the reviewers' comments.

      Reviewer #1:

      1a. The addition of the EEG/EMG is useful, however, this information is not discussed. For instance, there are differences in EEG/EMG between the two groups (only Ket significantly increased delta/theta power, and only ISO decreased EMG power). These results should be discussed as well as the limitation of not having physiological measures of anesthesia to control for the anesthesia depth.

      1b. The possibility that the differences in fos observed may be due to the doses used should be discussed.

      1c. The possibility that the differences in fos observed may be due to the kinetics of the anesthetics used should be discussed.

      Thank you for your suggestions. In the revised manuscript we now discuss the EEG/EMG results, the limitation of lacking physiological measures to control for anesthesia depth, and the possibility that the observed differences in c-Fos expression arise from the doses or the kinetics of the anesthetics used (Lines 308-331, also shown below).

      Lines 308-331: "...Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Supplementary Figure 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression. Although the difference in EEG power between the ISO group and the home cage control was not significant, the increase in EEG power observed in the ISO group was similar to that of KET (0.47 ± 0.07 vs 0.59 ± 0.10), suggesting that both agents may induce loss of consciousness in mice. Regarding EMG power, ISO showed a significant decrease in EMG power compared to its control group. In contrast, the KET group showed a lesser reduction in EMG power (ISO: -1.815 ± 0.10; KET: -0.96 ± 0.21), which may partly explain the higher overall c-Fos expression levels in the KET group. This is consistent with previous studies where ketamine doses up to 150 mg/kg increase delta power while eliciting a wakefulness-like pattern of c-Fos expression across the brain [1]. Furthermore, the observed differences in c-Fos expression may arise in part from the dosages, routes of administration, and their distinct pharmacokinetic profiles. This variation is compounded by the lack of detailed physiological monitoring, such as blood pressure, heart rate, and respiration, affecting our ability to precisely assess anesthesia depth. Future studies incorporating comprehensive physiological monitoring and controlled dosing regimens are essential to further elucidate these relationships and refine our understanding of the effects of anesthetics on brain activity."

      1. Lu J, Nelson LE, Franks N, Maze M, Chamberlin NL, Saper CB: Role of endogenous sleep-wake and analgesic systems in anesthesia. J Comp Neurol 2008, 508(4):648-662.

      2b. I am confused because Fig 2C seems to show significant decrease in %fos in the hypothalamus, midbrain and cerebellum after KET, while the author responded that " in our analysis, we did not detect regions with significant downregulation when comparing anesthetized mice with controls." Moreover the new figure in the rebuttal in response to reviewer 2 suggests that Ket increases Fos in almost every single region (green vs blue) which is not the conclusion of the paper.

      Your concern regarding the apparent discrepancy is well-founded. The inconsistency arose due to an inappropriate data transformation, which affected the interpretation. We have now rectified this by adjusting the data transformation in Figures 2 and 3. Specifically, we have recalculated the log relative c-Fos density values relative to the control group for each brain region. This revision has resolved the issue, confirming that our analysis did not detect any regions with significant downregulation in the anesthetized mice compared to controls. We have also updated the results, discussion, and methods sections of Figures 2 and 3 to accurately reflect these changes and ensure consistency with our findings.
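As an illustration only (this is not the authors' code), the per-region transformation described above can be sketched as follows. The sketch assumes that group means are compared and that a base-10 logarithm is used; neither detail is stated in the response. The region names and density values are hypothetical.

```python
import math

def log_relative_density(treated, control):
    """Return log10(treated mean / control mean) for each brain region.

    `treated` and `control` map a region name to a list of per-animal
    c-Fos densities. Base 10 and the use of group means are assumptions.
    """
    result = {}
    for region, values in treated.items():
        treated_mean = sum(values) / len(values)
        control_mean = sum(control[region]) / len(control[region])
        result[region] = math.log10(treated_mean / control_mean)
    return result

# Hypothetical densities, not data from the study.
ket = {"PVT": [400.0, 440.0], "ACA": [90.0, 110.0]}
saline = {"PVT": [40.0, 44.0], "ACA": [100.0, 100.0]}

scores = log_relative_density(ket, saline)
```

A positive score means the region is more active than in its own control, zero means no change, and a negative score means less. Because each region is compared only with itself in the control group, a large increase elsewhere in the brain cannot push a region's score below zero.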

      Author response image 1.

      Figure 2. Whole-brain distributions of c-Fos+ cells induced by ISO and KET. (A) Hierarchical clustering was performed on the log relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. Numerical labels correspond to distinct clusters within the dendrogram. (B) Silhouette values plotted against the ratio of tree height for ISO and KET, indicating relatively higher Silhouette values at 0.5 (dashed line), which is associated with optimal clustering. (C) The number of clusters identified in each treatment condition at different ratios of the dendrogram tree height, with a cut-off level of 0.5 corresponding to 4 clusters for both ISO and KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression in the clustered brain regions. The order and abbreviations of the brain regions and the numerical labels correspond to those in Figure 2A. The red box denotes the cluster with the highest mean Z score in comparison to other clusters. CTX: cortex; TH: thalamus; HY: hypothalamus; MB: midbrain; HB: hindbrain.

      Author response image 2.

      Figure 3. Similarities and differences in ISO and KET activated c-Fos brain areas. (A) Hierarchical clustering was performed on the log-transformed relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. (B) Silhouette values are plotted against the ratio of tree height from the hierarchical clustered dendrogram in Figure 3A. (C) The relationship between the number of clusters and the tree height ratio of the dendrogram for ISO and KET, with a cut-off ratio of 0.5 resulting in 3 clusters for ISO and 5 for KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression within the identified brain region clusters. The arrangement, abbreviations of the brain regions, and the numerical labels are in accordance with Figure 3A. The red boxes highlight brain regions that rank within the top 10 percent of Z score values. The white boxes denote brain regions with a Z score less than -2.

      1. There are still critical misinterpretations of the PCA analysis. For instance, it is mentioned that " KET is associated with the activation of cortical regions (as evidenced by positive PC1 coefficients in MOB, AON, MO, ACA, and ORB) and the inhibition of subcortical areas (indicated by negative coefficients) " as well as " KET displays cortical activation and subcortical inhibition, whereas ISO shows a contrasting preference, activating the cerebral nucleus (CNU) and the hypothalamus while inhibiting cortical areas. To reduce inter-individual variability." These interpretations are in complete contradiction with the answer 2b above that there was no region that had decreased Fos by either anesthetic.

      Thank you for bringing this to our attention. In response to your concerns, we have made significant revisions to our data analysis. We have updated our input data to incorporate log-transformed relative c-Fos density values, normalized against the control group for each brain region, as illustrated in Figures 2 and 3. Instead of PCA, we have applied this updated data to hierarchical clustering analysis. The results of these analyses are consistent with our original observation that neither anesthetic led to a decrease in Fos expression in any region.
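A minimal sketch of complete-linkage hierarchical clustering on Euclidean distances, with the tree cut at a fraction of its total height, is given below for readers unfamiliar with the procedure. It is written in plain Python for transparency and is not the authors' implementation; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be used. The interpretation of the "cut-off ratio" as a fraction of the final merge height is an assumption, and the input vectors are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(points):
    """Agglomerative clustering. Returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def cut_tree(points, ratio):
    """Apply only the merges at or below ratio * (height of the final merge)."""
    merges = complete_linkage(points)
    threshold = ratio * merges[-1][2]
    labels = list(range(len(points)))
    for a, b, height in merges:
        if height <= threshold:
            target = labels[a[0]]
            for idx in b:
                labels[idx] = target
    return labels

# Hypothetical feature vectors, one per brain region (not data from the study).
regions = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 2.0], [5.0, 5.0]]

labels_half = cut_tree(regions, 0.5)
labels_low = cut_tree(regions, 0.3)
```

Lowering the ratio yields more, smaller clusters, which is why the response reports silhouette values across ratios before settling on a cut-off.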

      1. I still do not understand the rationale for the use of that metric. The use of a % of total Fos makes the data for each region dependent on the data of the other regions which wrongly leads to the conclusion that some regions are inhibited while they are not when looking at the raw data. Moreover, the interdependence of the variable (relative density) may affect the covariance structure which the PCA relies upon. Why not using the PCA on the logarithm of the raw data or on a relative density compared to the control group on a region-per-region basis instead of the whole brain?

      Thank you for your insightful suggestion. Following your advice, we have revised our approach and now utilize the logarithm of the relative density compared to the control group on a region-by-region basis. We attempted PCA analyses using the logarithm of the raw data, the logarithm of the Z-score, and the logarithm of the relative density compared to control, but none yielded distinct clusters.
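The reviewer's point about interdependence can be made concrete with a small hypothetical example (invented numbers, not data from the study): when each region is expressed as a share of the whole-brain total, a region whose raw density rises can still appear to fall.

```python
def share_of_total(densities):
    """Express each region's density as a fraction of the summed density."""
    total = sum(densities.values())
    return {region: value / total for region, value in densities.items()}

# Hypothetical c-Fos densities for three regions.
control = {"A": 10.0, "B": 10.0, "C": 10.0}
treated = {"A": 12.0, "B": 60.0, "C": 60.0}

control_share = share_of_total(control)
treated_share = share_of_total(treated)

# Region-by-region ratio to control, which does not depend on other regions.
relative_to_control = {r: treated[r] / control[r] for r in control}
```

Region A increases by 20% in raw terms, yet its share of the total drops from one third to about 9% because B and C increase far more. The region-by-region ratio to control stays above 1 for A, so it does not suggest inhibition.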

      Author response image 3.

      As a result, we employed hierarchical cluster analysis. We then examined the Z-scores of the log-transformed relative c-Fos densities (Figures 2E and 3E) to assess expression levels across clusters. Our analysis revealed that neither ISO nor KET treatments led to a significant suppression of c-Fos expression in the 53 brain regions examined. In the ISO group alone, there were 10 regions that demonstrated relative suppression (Z-score < -2, indicated by white boxes) as shown in Figure 3.
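The Z < -2 criterion can be sketched as below. This is an illustrative reading rather than the authors' procedure: it assumes the Z-score is taken across regions within one treatment and uses the population standard deviation, neither of which is specified in the response. Region names and values are hypothetical.

```python
import statistics

def z_scores(values_by_region):
    """Z-score each region's value against the mean and SD across all regions."""
    values = list(values_by_region.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    return {region: (v - mean) / sd for region, v in values_by_region.items()}

# Hypothetical log relative densities: nine regions at 1.0 and one at -1.0.
log_rel = {f"R{i}": 1.0 for i in range(1, 10)}
log_rel["R10"] = -1.0

z = z_scores(log_rel)
relatively_suppressed = [region for region, score in z.items() if score < -2]
```

A region flagged this way is low relative to the other regions in the same treatment. That is a different statement from saying its c-Fos density fell below control, which matches the response's distinction between "relative suppression" and significant suppression.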

      Fig. 2B: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions and the use of the line? The line connecting randomly organized regions is meaningless and confusing.

      Thank you for your suggestion. We have discontinued the use of PCA calculations and have removed this figure.

      Fig 6A. The correlation matrices are difficult to interpret because of the low resolution and arbitrary order of brain regions. I recommend using hierarchical clustering and/or a combination of hierarchical clustering and anatomical organization (e.g. PMID: 31937658). While it is difficult to add the name of the regions on the graph I recommend providing supplementary figures with large high-resolution figures with the name of each brain region so the reader can actually identify the correlation between specific brain regions and the whole brain.

      Rationale for Metric Choice: Note that I do not dispute the choice of the log which is appropriate, it is the choice of using the relative density that I am questioning.

      Thank you for your constructive feedback. In line with your suggestion, we have implemented hierarchical clustering combined with anatomical organization as per the referenced literature. Additionally, we have updated the vector diagrams in Figure 6A to present them with greater clarity.

      Furthermore, we have revised our network modular division method based on cited literature recommendations. We used hierarchical clustering with correlation coefficients to segment the network into modules, illustrated in Figure 6—figure supplement 1. Due to the singular module structure of the KET network and the sparsity of intermodular connections in the home cage and saline networks, the assessment of network hub nodes did not employ within-module degree Z-score and participation coefficients, as these measures predominantly underscore the importance of connections within and between modules. Instead, we used degree, betweenness centrality, and eigenvector centrality to detect the hub nodes, as detailed in Figure 6—figure supplement 2. With this new approach, the hub node for the KET condition changed from SS to TeA. Corresponding updates have been made to the results section for Figure 6, as well as to the related discussions and the abstract of our paper.
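The first step of the hub analysis, building a thresholded correlation network and counting each region's connections, can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it keeps only the r > 0.82 criterion and omits both the P < 0.05 test and the betweenness and eigenvector centrality measures. The per-animal values are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def build_network(data, r_min):
    """Keep an edge between two regions when their correlation exceeds r_min."""
    edges = {}
    for a, b in combinations(sorted(data), 2):
        r = pearson_r(data[a], data[b])
        if r > r_min:
            edges[(a, b)] = r
    return edges

def degree(edges, nodes):
    """Number of edges attached to each node."""
    counts = {node: 0 for node in nodes}
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

# Hypothetical log c-Fos densities for six animals per region.
data = {
    "TeA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "SS": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "PVT": [2.0, 4.1, 6.0, 8.1, 9.9, 12.0],
    "LC": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}

edges = build_network(data, r_min=0.82)
degrees = degree(edges, data)
```

In this toy network three regions covary tightly and share a degree of 2, while the fourth is disconnected. Degree alone cannot rank the three connected regions, which is one reason to combine it with other centrality measures, as the response does.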

      Author response image 4.

      Figure 6. Generation of anesthetics-induced networks and identification of hub regions. (A) Heatmaps display the correlations of log c-Fos densities within brain regions (CTX, CNU, TH, HY, MB, and HB) for various states (home cage, ISO, saline, KET). Correlations are color-coded according to Pearson's coefficients. The brain regions within each anatomical category are organized by hierarchical clustering of their correlation coefficients. (B) Network diagrams illustrate significant positive correlations (P < 0.05) between regions, with Pearson’s r exceeding 0.82. Edge thickness indicates correlation magnitude, and node size reflects the number of connections (degree). Node color denotes betweenness centrality, with a spectrum ranging from dark blue (lowest) to dark red (highest). The networks are organized into modules consistent with the clustering depicted in Supplementary Figure 8 (Figure 6—figure supplement 1).

      Author response image 5.

      Figure 6—figure supplement 1. Hierarchical clustering of brain regions under various conditions: home cage, ISO, saline, and KET. (A) Heatmaps show the relative distances among brain regions assessed in naive mice. Modules were identified by sectioning each dendrogram at a 0.7 threshold. (B) Silhouette scores plotted against the dendrogram tree height ratio for each condition, with optimal cluster definition indicated by a dashed line at a 0.7 ratio. (C) The number of clusters formed at different cutoff levels. At a ratio of 0.7, ISO and saline treatments result in three clusters, whereas home cage and KET conditions yield two clusters. (D) The mean Pearson's correlation coefficient (r) was computed from interregional correlations displayed in Figure 6A. Data were analyzed using one-way ANOVA with Tukey’s post hoc test, ***P < 0.001.

      Author response image 6.

      Figure 6—figure supplement 2. Hub region characterization across different conditions: home cage (A), ISO (B), saline (C), and KET (D) treatments. Brain regions are sorted by degree, betweenness centrality, and eigenvector centrality, with each metric presented in separate bar graphs. Bars to the left of the dashed line indicate the top 20% of regions by rank, highlighting the most central nodes within the network. Red bars signify regions that consistently appear within the top rankings for both degree and betweenness centrality across the metrics.

      1. I am still having difficulties understanding Fig. 3.

      Panel A: The lack of identification for the dots in panel A makes it impossible to understand which regions are relevant.

      Panel B: what is the metric that the up/down arrow summarizes? Fos density? Relative density? PC1/2?

      Panel C: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions?

      Thank you for your patience and for reiterating your concerns regarding Figure 3.

      a. In Panel A, we have substituted the original content with a display of hierarchical clustering results, which now clearly marks each brain region. This change aids readers in identifying regions with similar expression patterns and facilitates a more intuitive understanding of the data.

      b. Acknowledging that our analysis did not reveal any significantly inhibited brain regions, we have decided to remove the previous version of Panel B from the figure.

      c. We have discontinued the use of PCA calculations and have removed this panel to avoid any confusion it may have caused. Our revised analysis focuses on hierarchical clustering, the results of which are presented in the updated figures.

      Reviewer #2:

      1. Aside from issues with their data transformation (see below), (a) I think they have some interesting Fos counts data in Figures 4B and 5B that indicate shared and distinct activation patterns after KET vs. ISO based anesthesia. These data are far closer to the raw data than PC analyses and need to be described and analyzed in the first figures long before figures with the more abstracted PC analyses. In other words, you need to show the concrete raw data before describing the highly transformed and abstracted PC analyses. (b) This gets to the main point that when selecting brain areas for follow up analyses, these should be chosen based on the concrete Fos counts data, not the highly transformed and abstracted PC analyses.

      Thank you for your suggestions.

      a. We have added the original c-Fos cell density distribution maps for Figures 2, 3, 4, and 5 in Supplementary Figures 2 and 3 (also shown below). To maintain consistency across the document, we have updated both the y-axis label and the corresponding data in Figures 4B and 5B from 'c-Fos cell count' to 'c-Fos density'.

      b. The analyses in Figures 2 and 3 include all brain regions. Figures 4 and 5 present the brain regions with significant differences as shown in Figure 3—figure supplement 1.

      Author response image 7.

      Figure 2—figure supplement 1. The c-Fos density in 53 brain areas for different conditions (home cage, n = 6; ISO, n = 6; saline, n = 8; KET, n = 6 mice). Each point represents the c-Fos density in a specific brain region, denoted on the y-axis with both abbreviations and full names. Data are shown as mean ± SEM. Brain regions are categorized into 12 brain structures, as indicated on the right side of the graph.

      Author response image 8.

      Figure 3—figure supplement 1. c-Fos density visualization across 201 distinct brain regions under various conditions. The graph depicts the c-Fos density levels for each condition, with data presented as mean and standard error. Brain regions with statistically significant differences are featured in Figures 4 and 5. Brain regions are organized into major anatomical subdivisions, as indicated on the left side of the graph.

      1. Now, the choice of data transformation for Fos counts is the most significant problem. First, the authors show in the response letter that not using this transformation (region density/brain density) leads to no clustering. However, they also showed the region-densities without transformation (which we appreciate) and it looks like overall Fos levels in the control group Home (ISO) are a magnitude (~10-fold) higher than those in the control group Saline (KET) across all regions shown. This large difference seems unlikely to be due to a biologically driven effect and seems more likely to be due to a technical issue, such as differences in staining or imaging between experiments. Was the Homecage-ISO experiment or at least the Fos labeling and imaging performed at the same time as for the Saline-Ketamine experiment? Please state the answer to this question in the Results section one way or the other.

      a. “Home (ISO) are a magnitude (~10-fold) higher than those in the control group saline (KET) across all regions shown.” We believe you are referring to the roughly 10-fold higher expression in the saline group (blue) compared to the home cage group (gray) (Supplementary Figures 2 and 3). Indeed, we observed that the total number of c-Fos cells in the home cage group is significantly lower than in the saline group. This difference may be due to reduced sleep during the light-on period (ZT 6-ZT 7.5) in the saline mice or to the pain and stress response caused by intraperitoneal injection of saline. We have explained this discrepancy in the discussion section (Lines 308-317, also shown below).

      “…Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression…”

      b. Drug administration and tissue collection for both Homecage-ISO and Saline-Ketamine groups were consistently scheduled at 13:00 and 14:30, respectively. Four mice were administered drugs and had tissues collected each day, with two from the experimental group and two from the control group, to ensure consistent sampling. The 4% PFA fixation time, sucrose dehydration time, primary and secondary antibody concentrations and incubation times, staining, and imaging parameters and equipment (exposure time for VS120 imaging was fixed at 100ms) were all conducted according to a unified protocol.

      We have included the following statement in the results section (Lines 81-83): “Sample collection for all mice was uniformly conducted at 14:30 (ZT7.5), and the c-Fos labeling and imaging were performed using consistent parameters throughout all experiments.”

      1. Second, they need to deal with this large difference in overall staining or imaging for these two (Home/ISO and Saline/KET) experiments more directly; their current normalization choice does not really account for the large overall differences in mean values and variability in Fos counts (e.g. due to labeling and imaging differences).

      3a. I think one option (not perfect but I think better than the current normalization choice) could be z-scoring each treatment to its respective control. They can analyze these z-scored data first, and then in later figures show PC analyses of these data and assess whether the two treatments separate on PC1/2. And if they don't separate, then they don't separate, and you have to go with these results.

      3b. Alternatively, they need to figure out the overall intensity distributions from the different runs (if that is the main reason for the markedly different counts) and adjust their thresholds for Fos-positive cell detection based on this. I would expect that the saline and HC groups should have similar levels of activation, so they could use these as the 'control' group to determine a Fos-positive intensity threshold that gets applied to the corresponding 'treatment' group.

      3c. If neither 3a nor 3b is an option then they need to show the outcomes of their analysis when using the untransformed data in the main figures (the untransformed data plots in their responses to reviewer are currently not in the main or supplementary figs) and discuss these as well.

      a. Thank you very much for your valuable suggestion. We conducted PCA analysis on the ISO and KET data after Z-scoring them with their respective control groups and did not find any significant separation.

      Author response image 9.

      As mentioned in our response to reviewer #1, we have reprocessed the raw data. Firstly, we divided the ISO and KET data by their respective control brain regions and then performed a logarithmic transformation to obtain the log relative c-Fos density. The purpose of this is to eliminate the impact of baseline differences and reduce variability. We then performed hierarchical clustering, and finally, we Z-scored the log relative c-Fos density data. The aim is to facilitate comparison of ISO and KET on the same data dimension (Figure 2 and 3).

      b. We appreciate your concerns regarding the detection thresholds for Fos-positive cells. The enclosed images, extracted from supplementary figures for Figures 4 and 5, demonstrate notable differences in c-Fos expression between saline and home cage groups in specific brain regions. These regions exhibit a discernible difference in staining intensity, with the saline group showing enhanced c-Fos expression in the PVH and PVT regions compared to the home cage group. An examination of supplementary figures for Figures 4 and 5 shows that c-Fos expression in the home cage group is consistently lower than in the saline group. This comparative analysis confirms that the discrepancies in c-Fos levels are not due to varying detection thresholds.

      Author response image 10.

      c. We have added the corresponding original data graphs to Supplementary Figures 2 and 3, and discussed the potential reasons for the significant differences between these groups in the discussion section (also shown below).

      Lines 308-317: “…Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression…”

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Thank you and the reviewers for further providing constructive comments and suggestions on our manuscript. On behalf of all the co-authors, I have enclosed a revised version of the above referenced paper. Below, I have merged similar public reviews and recommendations (if applicable) from each reviewer and provided point-by-point responses.

      Reviewer #1:

      People can perform a wide variety of different tasks, and a long-standing question in cognitive neuroscience is how the properties of different tasks are represented in the brain. The authors develop an interesting task that mixes two different sources of difficulty, and find that the brain appears to represent this mixture on a continuum, in the prefrontal areas involved in resolving task difficulty. While these results are interesting and in several ways compelling, they overlap with previous findings and rely on novel statistical analyses that may require further validation.

      Strengths

      1. The authors present an interesting and novel task for combining the contributions of stimulus-stimulus and stimulus-response conflict. While this mixture has been measured in the multi-source interference task (MSIT), this task provides a more graded mixture between these two sources of difficulty.

      2. The authors do a good job triangulating regions that encode conflict similarity, looking for the conjunction across several different measures of conflict encoding. These conflict measures use several best-practice approaches towards estimating representational similarity.

      3. The authors quantify several salient alternative hypotheses and systematically distinguish their core results from these alternatives.

      4. The question that the authors tackle is important to cognitive control, and they make a solid contribution.

      The authors have addressed several of my concerns. I appreciate the authors implementing best practices in their neuroimaging stats.

      I think that the concerns that remain in my public review reflect the inherent limitations of the current work. The authors have done a good job working with the dataset they've collected.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we have removed the Stroop/Simon-only and the Stroop+Simon models, revised our conclusion and modified the misleading phrases.

      We have provided detailed responses to your comments below.

      1. The evidence from this previous work for mixtures between different conflict sources makes the framing of 'infinite possible types of conflict' feel like a strawman. The authors cite classic work (e.g., Kornblum et al., 1990) that develops a typology for conflict which is far from infinite. I think few people would argue that every possible source and level of difficulty will have to be learned separately. This work provides confirmatory evidence that task difficulty is represented parametrically (e.g., consistent with the n-back, MOT, and random dot motion literature).

      notes for my public concerns.

      In their response, the authors say:

      'If each combination of the Stroop-Simon combination is regarded as a conflict condition, there would be infinite combinations, and it is our major goal to investigate how these infinite conflict conditions are represented effectively in a space with finite dimensions.'

      I do think that this is a strawman. The paper doesn't make a strong case that this position ('infinite combinations') is widely held in the field. There is previous work (e.g., n-back, multiple object tracking, MSIT, dot motion) that has already shown parametric encoding of task difficulty. This paper provides confirmatory evidence, using an interesting new task, that demands are parametric, but does not provide a major theoretical advance.

      Response: We agree that our previous wording may have seemed exaggerated. While the number is not “infinite”, recent research does suggest that cognitive control shows domain-specificity across various “domains”, including conflict types (Egner, 2008), sensory modalities (Yang et al., 2017), task-irrelevant stimuli (Spape et al., 2008), and task sets (Hazeltine et al., 2011), to name a few.

      These findings collectively support the notion that cognitive control is context-specific (Braem et al., 2014). That is, cognitive control can be tuned and associated with different (and potentially large numbers of) contexts. Recently, Kikumoto and Mayr (2020) demonstrated that combinations of stimulus, rule and response in the same task formed separable, conjunctive representations. They further showed that these conjunctive representations facilitate performance. This is in line with the idea that each stimulus-location combination in the present task may be represented separately in a domain-specific manner. Moreover, domain-general task representations can also become domain-specific with learning, which further increases the number of domain-specific conjunctive representations (Mill et al., 2023). In line with the domain-specific account of cognitive control, we referred to the “infinite combinations” in our previous response to emphasize the extreme case of domain-specificity. However, recognizing that the term “infinite” may lead to ambiguity, we have replaced it with phrases such as “a large number of” and “hugely varied” in our revised manuscript.

      We thank the reviewer for highlighting the potential connection of our work to existing literature that showed the parametric encoding of task difficulty (e.g., Dagher et al., 1999; Ritz & Shenhav, 2023). For instance, Ritz and Shenhav (2023) parametrically manipulated target difficulty based on consistent ratios of dot color, and found that the difficulty was encoded in the caudal part of the dorsal anterior cingulate cortex. Analogously, in our study, the “difficulty” pertains to the behavioral congruency effect that we modulated within the spatial Stroop and Simon dimensions. Notably, we did identify univariate effects in the right dmPFC and IPS associated with the difficulty in the Simon dimension. This parametric effect may lend support to our cognitive space hypothesis, although we exercised caution in interpreting its significance due to the absence of a clear brain-behavioral relevance in these regions. We have added the connection of our work to prior literature in the discussion. The parametric encoding of conflict also mirrors prior research showing the parametric encoding of task demands (Dagher et al., 1999; Ritz & Shenhav, 2023).

      However, our analyses extend beyond solely testing the parametric encoding of difficulty. Instead, we focused on the multivariate representation of different conflict types, which we believe is independent of the univariate parametric encoding. Unlike the univariate encoding that relies on the strength within one dimension, the multivariate representation of conflict types incorporates both the spatial Stroop and Simon dimensions. Furthermore, we found that similar difficulty levels did not yield similar conflict representations, as indicated by the low similarity between the spatial Stroop and Simon conditions, despite both showing a similar level of congruency effect (Fig. S1). Additionally, we observed an interaction between conflict similarity and difficulty (i.e., congruency, Fig. 4B/D), such that the conflict similarity effect was more pronounced when conflict was present. Therefore, we believe that our findings make a contribution to the literature beyond the difficulty effect.

      Reference:

      Egner, T. (2008). Multiple conflict-driven control mechanisms in the human brain. Trends in Cognitive Sciences, 12(10), 374-380. https://doi.org/10.1016/j.tics.2008.07.001

      Yang, G., Nan, W., Zheng, Y., Wu, H., Li, Q., & Liu, X. (2017). Distinct cognitive control mechanisms as revealed by modality-specific conflict adaptation effects. Journal of Experimental Psychology: Human Perception and Performance, 43(4), 807-818. https://doi.org/10.1037/xhp0000351

      Spapé, M. M., & Hommel, B. (2008). He said, she said: Episodic retrieval induces conflict adaptation in an auditory Stroop task. Psychonomic Bulletin & Review, 15(6), 1117-1121. https://doi.org/10.3758/PBR.15.6.1117

      Hazeltine, E., Lightman, E., Schwarb, H., & Schumacher, E. H. (2011). The boundaries of sequential modulations: Evidence for set-level control. Journal of Experimental Psychology: Human Perception and Performance, 37(6), 1898-1914. https://doi.org/10.1037/a0024662

      Braem, S., Abrahamse, E. L., Duthoo, W., & Notebaert, W. (2014). What determines the specificity of conflict adaptation? A review, critical analysis, and proposed synthesis. Frontiers in Psychology, 5, 1134. https://doi.org/10.3389/fpsyg.2014.01134

      Kikumoto, A., & Mayr, U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19), 10603-10608. https://doi.org/10.1073/pnas.1922166117

      Mill, R. D., & Cole, M. W. (2023). Neural representation dynamics reveal computational principles of cognitive task learning. bioRxiv. https://doi.org/10.1101/2023.06.27.546751

      Dagher, A., Owen, A. M., Boecker, H., & Brooks, D. J. (1999). Mapping the network for planning: A correlational PET activation study with the Tower of London task. Brain, 122(10), 1973-1987. https://doi.org/10.1093/brain/122.10.1973

      Ritz, H., & Shenhav, A. (2023). Orthogonal neural encoding of targets and distractors supports multivariate cognitive control. bioRxiv. https://doi.org/10.1101/2022.12.01.518771

      1. (Public Reviews) The degree of Stroop vs Simon conflict is perfectly negatively correlated across conditions. This limits their interpretation of an integrated cognitive space, as they cannot separately measure Stroop and Simon effects. The authors' control analyses have limited ability to overcome this task limitation. While these results are consistent with parametric encoding, they cannot adjudicate between combined vs separated representations.

      (Recommendations) I think that it is still an issue that the task's two features (Stroop and Simon conflict) are perfectly correlated. This fundamentally limits their ability to measure the similarity in these features. The authors provide several control analyses, but I think these are limited.

      Response: We need to acknowledge that the spatial Stroop and Simon components in the five conflict conditions were not “perfectly” correlated, with r = –0.89. This leaves some room for the preliminary model comparison to adjudicate between these models. However, it’s essential to note that conclusions based on these results must be tempered. In line with the reviewer’s observation, we agree that the high correlation between the two conflict sources posed a potential limitation on our ability to independently investigate the contribution of spatial Stroop and Simon conflicts. Therefore, in addition to the limitation we have previously acknowledged, we have now further revised our conclusion and adjusted our expressions accordingly.
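The reported correlation can be illustrated with a minimal sketch. It assumes that the five conflict types are equally spaced on a quarter circle and that the spatial Stroop and Simon components are the vertical and horizontal coordinates; the equal spacing and the variable names are assumptions made for illustration, not details stated in this response.

```python
import numpy as np

# Assumption: five conflict types equally spaced on a quarter circle,
# from pure Simon (0 degrees) to pure spatial Stroop (90 degrees).
angles = np.deg2rad(np.linspace(0.0, 90.0, 5))
stroop = np.sin(angles)  # assumed spatial Stroop component (vertical axis)
simon = np.cos(angles)   # assumed Simon component (horizontal axis)

# Pearson correlation between the two components across the five conditions
r = np.corrcoef(stroop, simon)[0, 1]
print(f"r = {r:.2f}")
```

Under these assumptions the two components are strongly but not perfectly anti-correlated, in line with the value quoted above.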

      Specifically, we now regard the parametric encoding of cognitive control not as direct evidence of the cognitive space view but as preliminary evidence that led us to propose this hypothesis, which requires further testing. Notably, we have also modified the title from “Conflicts are represented in a cognitive space to reconcile domain-general and domain-specific cognitive control” to “Conflicts are parametrically encoded: initial evidence for a cognitive space view to reconcile the debate of domain-general and domain-specific cognitive control”. We also revised the conclusion to read: In sum, we showed that the cognitive control can be parametrically encoded in the right dlPFC and guides cognitive control to adjust goal-directed behavior. This finding suggests that different cognitive control states may be encoded in an abstract cognitive space, which reconciles the long-standing debate between the domain-general and domain-specific views of cognitive control and provides a parsimonious and more broadly applicable framework for understanding how our brains efficiently and flexibly represent multiple task settings.

      (From Recommendations) The authors perform control analyses that test Stroop-only and Simon-only models. However, these analyses use a totally different similarity metric that is based on set intersection rather than geometry. This metric had limited justification or explanation, and it's not clear whether these models fit worse because of the similarity metric. Even here, the Simon-only model fit better than the Stroop+Simon model. The dimensionality analyses may reflect the 1D manipulation by the authors (i.e., perfectly correlated Stroop and Simon effects).

      Response: The Jaccard measure is the most suitable method we can conceive of for assessing the similarity between two conflicts when establishing the Stroop-only and Simon-only models, achieved by projecting them onto the vertical or horizontal axes, respectively (Author response image 1A). This approach offers two advantages. First, the Jaccard similarity combines both similarity (as reflected by the numerator) and distance (reflected by the difference between denominator and numerator) without bias towards either. Second, the Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the cosine similarity is identical to the denominator in the Jaccard similarity (both are the radius of the circle, Author response image 1B).
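The equivalence described above can be checked numerically with a minimal sketch. It assumes that each conflict is a vector of equal length (the radius) on a circle, with the Simon and spatial Stroop components as the horizontal and vertical coordinates; the angles 22.5° and 67.5°, the radius of 1, and all function names are hypothetical choices for illustration.

```python
import numpy as np

RADIUS = 1.0  # assumed common vector length (radius of the circle)

def conflict_vector(theta_deg):
    """Return (Simon, Stroop) coordinates of a conflict at a polar angle."""
    t = np.deg2rad(theta_deg)
    return RADIUS * np.array([np.cos(t), np.sin(t)])

def jaccard_1d(a, b):
    """Jaccard similarity of two non-negative lengths: intersection / union."""
    lo, hi = min(a, b), max(a, b)
    return 1.0 if hi == 0 else lo / hi

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

c1, c2 = conflict_vector(22.5), conflict_vector(67.5)

# Stroop-only and Simon-only models: compare projections onto a single axis.
stroop_only = jaccard_1d(c1[1], c2[1])
simon_only = jaccard_1d(c1[0], c2[0])

# Two-dimensional case: project conflict 1 onto conflict 2 (length e) and
# divide by the length of conflict 2 projected onto itself (f = radius).
e = float(c1 @ c2 / np.linalg.norm(c2))
f = float(np.linalg.norm(c2))
jaccard_2d = e / f

print(stroop_only, simon_only, jaccard_2d, cosine(c1, c2))
```

Because every conflict vector has the same length in this sketch, `e / f` coincides with the cosine of the angle between the two conflicts, whereas the single-axis Jaccard values depend only on one coordinate.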

      Author response image 1.

      Definition of Jaccard similarity. (A) Two conflicts (1 and 2) are projected onto the spatial Stroop axis in the Stroop-only model and onto the Simon axis in the Simon-only model. In each model, the Jaccard similarity is the ratio of the shorter projected vector (the intersection) to the longer projected vector (the union). Letters a-d denote the vectors projected from the two conflicts onto the two axes. Blue and red colors indicate the conflict conditions. (B) According to the cosine similarity model, the similarity is defined as e/g, where e is the projection of conflict 1 onto conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined as e/f, where f is the projection of conflict 2 onto itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.

      Therefore, we believe that the model comparisons between cosine similarity model and the Stroop/Simon-Only models were equitable. However, we acknowledge the reviewer’s and other reviewers’ concerns about the correlation between spatial Stroop and Simon conflicts, which reduces the space to one dimension (1d) and limits our ability to distinguish between the Stroop-only and Simon-only models, as well as between Stroop+Simon and cosine similarity models. While these distinctions are undoubtedly important for understanding the geometry of the cognitive space, we recognize that they go beyond the major objective of this study, that is, to differentiate the cosine similarity model from domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models in our revised manuscript.

      Something that raised additional concerns are the RSMs in the key region of interest (Fig S5). The pure stroop task appears to be represented very differently from all of the conditions that include simon conflict.

      Together, I think these limitations reflect the structure of the task and research goals, not the statistical approach (which has been meaningfully improved).

      Response: We thank the reviewer for pointing this out. It is essential to clarify that our conclusions were based on the significant similarity modulation effect identified in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seeming disparities in the values shown here. Moreover, to specifically test the differences between the within-Stroop condition and the other within-conflict conditions, we conducted a mixed-effect model analysis including only trial pairs from the same conflict type. In this analysis, the primary predictor was the cross-condition difference (0 for the within-Stroop condition and 1 for the other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent (t = 1.22, p = .23) or the congruent (t = 1.06, p = .29) trials. Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference. We have added this note in the revised figure caption for Figure S5.

      Author response image 2.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seeming disparities in the values of Stroop and other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (B) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.
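The cell-averaging step mentioned in the caption can be sketched as follows. This is an illustrative reduction of a trial-by-trial similarity matrix to a condition-by-condition summary, using synthetic data; the function name, the 20-trial toy matrix, and the exclusion of same-trial pairs are assumptions, not the authors' pipeline.

```python
import numpy as np

def summarize_rsm(rsm, labels):
    """Average a trial-by-trial RSM into a condition-by-condition matrix.

    Same-trial (diagonal) entries are excluded from the averages.
    """
    rsm = np.array(rsm, dtype=float)   # copy so the input is not modified
    np.fill_diagonal(rsm, np.nan)      # drop trivial self-correlations
    labels = np.asarray(labels)
    conds = np.unique(labels)
    out = np.empty((conds.size, conds.size))
    for i, ci in enumerate(conds):
        for j, cj in enumerate(conds):
            block = rsm[np.ix_(labels == ci, labels == cj)]
            out[i, j] = np.nanmean(block)
    return conds, out

# Synthetic example: 5 hypothetical conflict types x 4 trials, 50 voxels.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(1, 6), 4)
patterns = rng.standard_normal((labels.size, 50))
rsm = np.corrcoef(patterns)            # 20 x 20 Pearson similarity matrix
conds, summary = summarize_rsm(rsm, labels)
print(summary.shape)
```

Applied separately to incongruent and congruent trial pairs, a reduction of this kind would yield two 5×5 summary matrices comparable in layout to panel A.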

      Minor:

      • In the analysis of similarity_orientation, the df is very large (~14000). Here, and throughout, the df should be reflective of the population of subjects (i.e., be less than the sample size).

      Response: The large degrees of freedom (df) in our analysis stem from the fact that we utilized a mixed-effect linear model, incorporating all data points (a total of 400×35=14000). In mixed-effect models, the df is determined by subtracting the number of fixed effects (in our case, 7) from the total number of observations. Notably, this is in line with previous studies that have reported the df in this manner (e.g., Iravani et al., 2021; Schmidt & Weissman, 2015; Natraj et al., 2022).
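The arithmetic behind the reported df, following the rule stated in this response (total observations minus the number of fixed effects), can be written out as a short sketch. The variable names are illustrative only.

```python
# Values taken from the response above
n_points_per_subject = 400
n_subjects = 35
n_fixed_effects = 7

# Total number of observations entering the mixed-effect model
n_obs = n_points_per_subject * n_subjects

# Residual degrees of freedom under the stated rule
df_resid = n_obs - n_fixed_effects

print(n_obs, df_resid)
```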

      Reference:

      Iravani, B., Schaefer, M., Wilson, D. A., Arshamian, A., & Lundström, J. N. (2021). The human olfactory bulb processes odor valence representation and cues motor avoidance behavior. Proceedings of the National Academy of Sciences, 118(42), e2101209118. https://doi.org/10.1073/pnas.2101209118

      Schmidt, J. R., & Weissman, D. H. (2016). Congruency sequence effects and previous response times: Conflict adaptation or temporal learning? Psychological Research, 80, 590–607. https://doi.org/10.1007/s00426-015-0681-x

      Natraj, N., Silversmith, D. B., Chang, E. F., & Ganguly, K. (2022). Compartmentalized dynamics within a common multi-area mesoscale manifold represent a repertoire of human hand movements. Neuron, 110(1), 154-174. https://doi.org/10.1016/j.neuron.2021.10.002.

      • It would improve readability if there was more didactic justification for why analyses are done a certain way (e.g., justifying the Jaccard metric). This will help less technically-savvy readers.

      Response: We appreciate the reviewer’s suggestion. However, considering that the Stroop/Simon-only models in our design may not be a valid approach for distinguishing the contributions of the Stroop/Simon components, we have decided not to include the Jaccard metrics in our revised manuscript.

      In addition, to improve readability, we have moved Figure S4 to the main text (labeled as Figure 7), and added the domain-general/domain-specific schematics in Figure 8.

      Author response image 3.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled from 0 (lowest similarity) to 1 (highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.
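The three model matrices described in this caption can be sketched at the level of the five conflict types. The congruency effects below are made-up placeholders, the equal angular spacing is an assumption, and the min-max style rescaling is only one plausible reading of "scaled from 0 to 1".

```python
import numpy as np

# Hypothetical group-averaged congruency effects (ms) for the five
# conflict types; placeholders, not values from the study.
congruency_effect = np.array([60.0, 55.0, 50.0, 45.0, 40.0])

# Domain-general model: absolute difference in congruency effects,
# rescaled so that 0 is the lowest and 1 the highest similarity
# (assumed rescaling).
abs_diff = np.abs(congruency_effect[:, None] - congruency_effect[None, :])
domain_general = 1.0 - abs_diff / abs_diff.max()

# Domain-specific model: only identical conflict types are similar.
domain_specific = np.eye(5)

# Cognitive space (cosine) model: similarity is the cosine of the angle
# between conflict types, assumed equally spaced over 90 degrees.
angles = np.deg2rad(np.linspace(0.0, 90.0, 5))
cognitive_space = np.cos(angles[:, None] - angles[None, :])

print(np.round(cognitive_space, 2))
```

In the full analysis each 5×5 matrix would be expanded to the trial-pair level before entering the regression; the sketch stops at the condition level.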

      Reviewer #2:

      This study examines the construct of "cognitive spaces" as they relate to neural coding schemes present in response conflict tasks. The authors use a novel experimental design in which different types of response conflict (spatial Stroop, Simon) are parametrically manipulated. These conflict types are hypothesized to be encoded jointly, within an abstract "cognitive space", in which distances between task conditions depend only on the similarity of conflict types (i.e., where conditions with similar relative proportions of spatial-Stroop versus Simon conflicts are represented with similar activity patterns). Authors contrast such a representational scheme for conflict with several other conceptually distinct schemes, including a domain-general, domain-specific, and two task-specific schemes. The authors conduct a behavioral and fMRI study to test which of these coding schemes is used by prefrontal cortex. Replicating the authors' prior work, this study demonstrates that sequential behavioral adjustments (the congruency sequence effect) are modulated as a function of the similarity between conflict types. In fMRI data, univariate analyses identified activation in left prefrontal and dorsomedial frontal cortex that was modulated by the amount of Stroop or Simon conflict present, and representational similarity analyses (RSA) that identified coding of conflict similarity, as predicted under the cognitive space model, in right lateral prefrontal cortex.

      This study tackles an important question regarding how distinct types of conflict might be encoded in the brain within a computationally efficient representational format. The ideas postulated by the authors are interesting ones and the statistical methods are generally rigorous.

      Response: We would like to express our sincere appreciation for the reviewer’s positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we excluded the StroopOnly, SimonOnly and Stroop+Simon models, and added the schematic of domain-general/specific model RSMs. We have provided detailed responses to your comments below.

      The evidence supporting the authors' claims, however, is limited by confounds in the experimental design and by lack of clarity in reporting the testing of alternative hypotheses within the method and results.

      1. Model comparison

      The authors commendably performed a model comparison within their study, in which they formalized alternative hypotheses to their cognitive space hypothesis. We greatly appreciate the motivation for this idea and think that it strengthened the manuscript. Nevertheless, some details of this model comparison were difficult for us to understand, which in turn has limited our understanding of the strength of the findings.

      The text indicates the domain-general model was computed by taking the difference in congruency effects per conflict condition. Does this refer to the "absolute difference" between congruency effects? In the rest of this review, we assume that the absolute difference was indeed used, as using a signed difference would not make sense in this setting. Nevertheless, it may help readers to add this information to the text.

      Response: We apologize for any confusion. The “difference” here indeed refers to the “absolute difference” between congruency effects. We have now clarified this by adding the word “absolute” accordingly.

      "Therefore, we defined the domain-general matrix as the absolute difference in their congruency effects indexed by the group-averaged RT in Experiment 2."

      Regarding the Stroop-Only and Simon-Only models, the motivation for using the Jaccard metric was unclear. From our reading, it seems that all of the other models (the cognitive space model, the domain-general model, and the domain-specific model) effectively use a Euclidean distance metric. (Although the cognitive space model is parameterized with cosine similarities, these similarity values are proportional to Euclidean distances because the points all lie on a circle. And, although the domain-general model is parameterized with absolute differences, the absolute difference is equivalent to Euclidean distance in 1D.) Given these considerations, the use of Jaccard seems to differ from the other models, in terms of parameterization, and thus potentially also in terms of underlying assumptions. Could the authors help us understand why this distance metric was used instead of Euclidean distance? Additionally, if Jaccard must be used, then because this metric seems to be non-standard in RSA, it would likely be helpful for many readers to have a little more explanation of how it was calculated.

      Response: We believe that the Jaccard similarity measure is consistent with the Cosine similarity measure. The Jaccard similarity is calculated as the intersection divided by the union. To define the similarity of two conflicts in the Stroop-only and Simon-only models, we first project them onto the vertical or horizontal axes, respectively (as shown in Author response image 1A). The Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the Jaccard similarity is identical to the denominator in the cosine similarity (both are the radius of the circle, Author response image 1B).

      However, it is important to note that a cosine similarity cannot be defined when conflicts are projected onto spatial Stroop or Simon axis simultaneously. Therefore, we used the Jaccard similarity in the previous version of our manuscript.

      Author response image 4.

      Definition of Jaccard similarity. (A) Two conflicts (1 and 2) are projected onto the spatial Stroop axis in the Stroop-only model and onto the Simon axis in the Simon-only model. In each model, the Jaccard similarity is the ratio of the shorter projected vector (the intersection) to the longer projected vector (the union). Letters a-d denote the vectors projected from the two conflicts onto the two axes. Blue and red colors indicate the conflict conditions. (B) According to the cosine similarity model, the similarity is defined as e/g, where e is the projection of conflict 1 onto conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined as e/f, where f is the projection of conflict 2 onto itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.

      However, we agree with the reviewer’s and other reviewers’ concern that the correlation between spatial Stroop and Simon conflicts makes it difficult to distinguish the Stroop+Simon model from the cosine similarity model. While distinguishing them is essential to understand the detailed geometry of the cognitive space, it is beyond our major purpose, that is, to distinguish the cosine similarity model from the domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models from our revised manuscript.

      When considering parameterizing the Stroop-Only and Simon-Only models with Euclidean distances, one concern we had is that the joint inclusion of these models might render the cognitive space model unidentifiable due to collinearity (i.e., the sum of the Stroop-Only and Simon-Only models could be collinear with the cognitive space model). Could the authors determine whether this is the case? This issue seems to be important, as the presence of such collinearity would suggest to us that the design is incapable of discriminating those hypotheses as parameterized.

      Response: We acknowledge that our design does not allow for a complete differentiation between the parallel encoding (StroopOnly+SimonOnly) model and the cognitive space model, given their high correlation (r = 0.85). However, it is important to note that the StroopOnly+SimonOnly model introduces more free parameters, making its model fit poorer than that of the cognitive space model.

      Additionally, the cognitive space model also shows high correlations with the StroopOnly and SimonOnly models (both rs = 0.66). It is crucial to emphasize that our study’s primary goal does not involve testing the parallel encoding hypothesis (through the StroopOnly+SimonOnly model). As a result, we have chosen to remove the model comparison results with the StroopOnly, SimonOnly and StroopOnly+SimonOnly models. In contrast, the cognitive space model shows lower correlations with the purely domain-general (r = −0.16) and domain-specific (r = 0.46) models.

      2. Issue of uniquely identifying conflict coding

      We certainly appreciate the efforts that authors have taken to address potential confounders for encoding of conflict in their original submission. We broach this question not because we wish authors to conduct additional control analyses, but because this issue seems to be central to the thesis of the manuscript and we would value reading the authors' thoughts on this issue in the discussion.

      To summarize our concerns, conflict seems to be a difficult variable to isolate within aggregate neural activity, at least relative to other variables typically studied in cognitive control, such as task-set or rule coding. This is because it seems reasonable to expect that many more nuisance factors covary with conflict -- such as univariate activation, level of cortical recruitment, performance measures, arousal --- than in comparison with, for example, a well-designed rule manipulation. Controlling for some of these factors post-hoc through regression is commendable (as authors have done here), but such a method will likely be incomplete and can provide no guarantees on the false positive rate.

      Relatedly, the neural correlates of conflict coding in fMRI and other aggregate measures of neural activity are likely of heterogeneous provenance, potentially including rate coding (Fu et al., 2022), temporal coding (Smith et al., 2019), modulation of coding of other more concrete variables (Ebitz et al., 2020, 10.1101/2020.03.14.991745; see also discussion and reviews of Tang et al., 2016, 10.7554/eLife.12352), or neuromodulatory effects (e.g., Aston-Jones & Cohen, 2005). Some of these origins would seem to be consistent with "explicit" coding of conflict (conflict as a representation), but others would seem to be more consistent with epiphenomenal coding of conflict (i.e., conflict as an emergent process). Again, these concerns could apply to many variables as measured via fMRI, but at the same time, they seem to be more pernicious in the case of conflict. So, if authors consider these issues to be germane, perhaps they could explicitly state in the discussion whether adopting their cognitive space perspective implies a particular stance on these issues, how they interpret their results with respect to these issues, and if relevant, qualify their conclusions with uncertainty on these issues.

      Response: We appreciate the reviewer’s insightful comments regarding the representation and process of conflict.

      First, we agree that conflict is not simply a pure feature like a stimulus but often arises from the interaction (e.g., dimension overlap) between two or more aspects. For example, in the manual Stroop, conflict emerges from the inconsistent semantic information between color naming and word reading. Similarly, other higher-order cognitive processes such as task set also reflect the relationship between concrete aspects. For instance, in a face/house categorization task, the task set is the association between face/house and the responses. When studying these higher-order processes, it is often impossible to completely isolate them from bottom-up features. Therefore, methods like representational similarity analysis and regression models are among the limited tools available to attempt to dissociate these concrete factors from conflict representation. While not perfect, this approach has been suggested and utilized in practice (Freund et al., 2021).

      Second, we agree that conflict can be both a representation and an emerging process. These two perspectives are not necessarily contradictory. According to David Marr’s influential three-level theory (Marr, 1982), representation is the algorithm of the process to achieve a goal based on the input. Therefore, a representation can refer to not only a static stimulus (e.g., the visual representation of an image), but also a dynamic process. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the abstract task profiles can be progressively constructed as a representation in our brain (Kikumoto & Mayr, 2020).

      We have incorporated this discussion into the manuscript:

      "Recently an interesting debate has arisen concerning whether cognitive control should be considered as a process or a representation (Freund, Etzel, et al., 2021). Traditionally, cognitive control has been predominantly viewed as a process. However, the study of its representation has gained more and more attention. While it may not be as straightforward as the visual representation (e.g., creating a mental image from a real image in the visual area), cognitive control can have its own form of representation. An influential theory, Marr’s (1982) three-level model proposed that representation serves as the algorithm of the process to achieve a goal based on the input. In other words, representation can encompass a dynamic process rather than being limited to static stimuli. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the representation of task profiles can be progressively constructed with time in the brain (Kikumoto & Mayr, 2020)."

      Reference:

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638. https://doi.org/10.1016/j.tics.2021.03.011

      Marr, D. C. (1982). Vision: A computational investigation into human representation and information processing. New York: W.H. Freeman.

      Kikumoto, A., & Mayr, U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19), 10603-10608. https://doi.org/10.1073/pnas.1922166117

      3. Interpretation of measured geometry in 8C

      We appreciate the inclusion of the measured similarity matrices of area 8C, the key area the results focus on, to the supplemental, as this allows for a relatively model-agnostic look at a portion of the data. Interestingly, the measured similarity matrix seems to mismatch the cognitive space model in a potentially substantive way. Although the model predicts that the "pure" Stroop and Simon conditions will have maximal self-similarity (i.e., the Stroop-Stroop and Simon-Simon cells on the diagonal), these correlations actually seem to be the lowest, by what appears to be a substantial margin (particularly the Stroop-Stroop similarities). What should readers make of this apparent mismatch? Perhaps authors could offer their interpretation on how this mismatch could fit with their conclusions.

      Response: We thank the reviewer for bringing this to our attention. It is essential to clarify that our conclusions were based on the significant similarity modulation effect observed in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seeming disparities in the values shown here. Moreover, to specifically address the potential differences between the within-Stroop condition and the other within-conflict conditions, we fitted a mixed-effect model. In this analysis, the primary predictor was the cross-condition difference (0 for the within-Stroop condition and 1 for the other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent (t = 1.22, p = .23) or the congruent (t = 1.06, p = .29) trials. Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference.

      We have added this note in the revised figure caption for Figure S5.

      Author response image 5.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seeming disparities between the values of the Stroop cell and the other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (B) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.

      1. It would likely improve clarity if all of the competing models were displayed as summarized RSA matrices in a single figure, similar to (or perhaps combined with) Figure 7.

      Response: We appreciate the reviewer’s suggestion. We have now incorporated the domain-general and domain-specific models into Figure 7 (now Figure 8).

      Author response image 6.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of the behavioral congruency effect, but scaled to the range from 0 (lowest similarity) to 1 (highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      1. Because this model comparison is key to the main inferences in the study, it might also be helpful for most readers to move all of these RSA model matrices to the main text, instead of in the supplemental.

      Response: We thank the reviewer for this suggestion. We have moved Fig. S4 to the main text, where it is now labeled Figure 7.

      1. It may be worthwhile to check how robust the observed brain-behavior association (Fig 4C) is to the exclusion of the two datapoints with the lowest neural representation strength measure, as these points look like they have high leverage.

      Response: We recalculated the Pearson correlation after excluding the two points and found that the exclusion had little effect on the result (r = 0.50, p = .003, compared with the original r = 0.52, p = .001).
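      The robustness check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names (pearson_r, drop_lowest_x) and the ten data points are hypothetical, and the values are not the study's data. The sketch drops the two points with the lowest x value (the neural representation strength) and recomputes the correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_lowest_x(x, y, k=2):
    """Remove the k points with the smallest x values, keeping the original order."""
    keep = sorted(sorted(range(len(x)), key=lambda i: x[i])[k:])
    return [x[i] for i in keep], [y[i] for i in keep]


# Hypothetical values for illustration only; these are NOT the study's data.
neural_beta = [-0.9, -0.7, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.8]
behav_beta = [-0.5, -0.2, 0.0, 0.3, 0.1, 0.4, 0.2, 0.5, 0.3, 0.6]

r_all = pearson_r(neural_beta, behav_beta)
x_trim, y_trim = drop_lowest_x(neural_beta, behav_beta, k=2)
r_trim = pearson_r(x_trim, y_trim)
print(f"r (all points) = {r_all:.2f}; r (two lowest-x points removed) = {r_trim:.2f}")
```

      Comparing the two coefficients side by side shows how much the association depends on the candidate high-leverage points.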

      Additionally, we found that the two axes were mistakenly swapped in Fig. 4C. We have corrected this error in the revised manuscript. The correlation results are not affected by this correction.

      Author response image 7.

      Fig. 4. The conflict type effect. (A) Brain regions surviving the Bonferroni correction (p < 0.0001) across the regions (criterion 1). Labeled regions are those meeting criterion 2. (B) Different encoding of conflict type in the incongruent versus congruent conditions. * Bonferroni corrected p < .05. (C) The brain-behavior correlation of the right 8C (criterion 3). The x-axis shows the beta coefficient of the conflict type effect from the RSA, and the y-axis shows the beta coefficient obtained from the behavioral linear model using the conflict similarity to predict the CSE in Experiment 2. (D) Illustration of the different encoding strength of conflict type similarity in incongruent versus congruent conditions of right 8C. The y-axis is derived from the z-scored Pearson correlation coefficient, consistent with the RSA methodology. See Fig. S4B for a plot with the raw Pearson correlation measurement. l = left; r = right.

      Reviewer #3:

      Yang and colleagues investigated whether information on two task-irrelevant features that induce response conflict is represented in a common cognitive space. To test this, the authors used a task that combines the spatial Stroop conflict and the Simon effect. This task reliably produces a beautiful graded congruency sequence effect (CSE), where the cost of congruency is reduced after incongruent trials. The authors measured fMRI to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts. They applied univariate, multivariate, and connectivity analyses to fMRI data to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts. They further directly assessed the dimensionality of represented conflict space.

      The authors identified the right dlPFC (right 8C), which shows 1) stronger encoding of graded similarity of conflicts in incongruent trials and 2) a positive correlation between the strength of conflict similarity type and the CSE on behavior. The dlPFC has been shown to be important for cognitive control tasks. As the dlPFC did not show a univariate parametric modulation based on the higher or lower component of one type of conflict (e.g., having more spatial Stroop conflict or less Simon conflict), it implies that dissimilarity of conflicts is represented by a linear increase or decrease of neural responses. Therefore, the similarity of conflict is represented in multivariate neural responses that combine two sources of conflict.

      The strength of the current approach lies in the clear effect of parametric modulation of conflict similarity across different conflict types. The authors employed a clever cross-subject RSA that counterbalanced and isolated the targeted effect of conflict similarity, decorrelating orientation similarity of stimulus positions that would otherwise be correlated with conflict similarity. A pattern of neural response seems to exist that maps different types of conflict, where each type is defined by the parametric gradation of the yoked spatial Stroop conflict and the Simon conflict on a similarity scale. The similarity of patterns increases in incongruent trials and is correlated with CSE modulation of behavior.

      The main significance of the paper lies in the evidence supporting the use of an organized "cognitive space" to represent conflict information as a general control strategy. The authors thoroughly test this idea using multiple approaches and provide convincing support for their findings. However, the universality of this cognitive strategy remains an open question.

      (Public Reviews) Taken together, this study presents an exciting possibility that information requiring high levels of cognitive control could be flexibly mapped into cognitive map-like representations that both benefit and bias our behavior. Further characterization of the representational geometry and generalization of the current results look promising ways to understand representations for cognitive control.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and for providing constructive comments. In response to your suggestions, we have acknowledged the potential limitations of the design and the cross-subject RSA approach, and incorporated the open questions into the discussion. Please find our detailed responses below.

      The task presented in the study involved two sources of conflict information through a single salient visual input, which might have encouraged the utilization of a common space.

      Response: We agree that the unified visual input in our design may have facilitated the utilization of a common space. However, we believe the stimuli are not necessarily unified in the construction of the common space. To further test the potential interaction between the concrete stimulus setting and the cognitive space representation, it is necessary to use varied stimuli in future research. We have left this as an open question in the discussion:

      Can we effectively map any sources of conflict with completely different stimuli into a single space?

      The similarity space was analyzed at the level of between-individuals (i.e., cross-subject RSA) to mitigate potential confounds in the design, such as congruency and the orientation of stimulus positions. This approach makes it challenging to establish a direct link between the quality of conflict space representation and the patterns of behavioral adaptations within individuals.

      Response: By setting the variables as random effects at the subject level, we have extracted the individual effects that incorporate both the group-level fixed effects and individual-level random effects. We believe this approach yields results that are at least as reliable as effects calculated from individual data only. First, the linear mixed-effect (LME) model includes all the individual data, which form the basis for estimating the random effects. Therefore, the individual effects derived from this approach inherently reflect the individual-specific effects. To support this notion, we have included a simulation script (accessible in the online file “simulation_LME.mlx” at https://osf.io/rcq8w) to demonstrate the strong consistency between the two approaches (see Author response image 8). In this simulation, we generated random data (Y) for 35 subjects, each containing 20 repeated measurements across 5 conditions. To streamline the simulation, we included only one predictor (X), which was treated as both a fixed and a random effect at the subject level. We applied two methods to calculate the individual beta coefficients. The first involved extracting individual beta coefficients from the LME model by summing the fixed effect with the subject-specific random effect. The second method entailed conducting a regression analysis using data from each subject to obtain the slope. We tested their consistency by calculating the Pearson correlation between the derived beta coefficients. This simulation was repeated 100 times.

      Author response image 8.

      The consistent individual beta coefficients between the mixed-effect model and the individual regression analysis. A) The distribution of the Pearson correlation between the two methods across the 100 simulation runs. B) An example from the simulation showing the highly correlated results from the two methods. Each data point indicates a subject (n=35).
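      The logic of this comparison can be sketched in a few lines. The sketch below is not the authors' “simulation_LME.mlx” script and does not fit a full LME model. As a simplified stand-in, it shrinks each subject's OLS slope toward the group mean with an empirical-Bayes weight, which approximates the subject-level estimates of a random-slope model. All names and parameter values (a fixed slope of 1.0, a slope SD of 0.5, a noise SD of 1.0, a standard-normal predictor, and 100 observations per subject, i.e., 20 measurements × 5 conditions) are assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ols_slope(x, y):
    """OLS slope of y on x (with intercept) and the slope's sampling variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, (rss / (n - 2)) / sxx


def simulate_once(rng, n_subjects=35, n_obs=100,
                  fixed_slope=1.0, slope_sd=0.5, noise_sd=1.0):
    """One simulated data set; returns r between OLS and shrunken slopes."""
    slopes, variances = [], []
    for _ in range(n_subjects):
        true_slope = fixed_slope + rng.gauss(0.0, slope_sd)  # subject-specific slope
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        y = [true_slope * a + rng.gauss(0.0, noise_sd) for a in x]
        b, v = ols_slope(x, y)
        slopes.append(b)
        variances.append(v)
    grand_mean = sum(slopes) / n_subjects
    between_var = sum((b - grand_mean) ** 2 for b in slopes) / (n_subjects - 1)
    # Method-of-moments estimate of the random-slope variance (floored at 0).
    tau2 = max(between_var - sum(variances) / n_subjects, 0.0)
    # Empirical-Bayes shrinkage: an assumed stand-in for LME subject estimates.
    shrunk = [grand_mean + tau2 / (tau2 + v) * (b - grand_mean)
              for b, v in zip(slopes, variances)]
    return pearson_r(slopes, shrunk)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
correlations = [simulate_once(rng) for _ in range(100)]
mean_r = sum(correlations) / len(correlations)
print(f"mean r across {len(correlations)} runs = {mean_r:.3f}")
```

      With many observations per subject, the shrinkage weights are close to 1 and nearly equal across subjects, which is why the two sets of slopes are expected to agree closely in a design like this one.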

      Second, the potential difference between the two methods lies in the fact that the LME model also takes the group-level variance into account, such as the dissociable variances of the conflict similarity and orientation across subject groups. This enabled us to extract relatively cleaner conflict similarity effects for each subject, which we believe can be better linked to the individual behavioral adaptations. Moreover, we have extracted the behavioral adaptation scores (i.e., the similarity modulation effect on CSE) using a similar LME approach. Conducting the behavioral analysis solely on individual data would have been less reliable, given the limited sample size of individual data (~32 points per subject). This also motivated us to maintain consistency by extracting individual neural effects using LME models.

      Furthermore, it remains unclear at which cognitive stages during response selection such a unified space is recruited. Can we effectively map any sources of conflict into a single scale? Is this unified space adaptively adjusted within the same brain region? Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks? These questions remain open for future studies to address.

      Response: We appreciate the reviewer’s constructive open questions. We respond to each of them based on our current understanding.

      1) It remains unclear at which cognitive stages during response selection such a unified space is recruited.

      We anticipate that the cognitive space is recruited to guide the transference of behavioral CSE at two critical stages. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to control execution, where the switch from one control state to another follows a path within the cognitive space. It is worth noting that future studies aiming to address this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG, to provide more precise insights into the temporal dynamics of cognitive space recruitment.

      2) Can we effectively map any sources of conflict into a single scale?

      It is possible that various sources of conflict can be mapped onto the same space based on their similarity, even if finding such an operationally defined similarity may be challenging. However, our results may offer two approaches to infer the similarity between two conflicts. One is to examine their congruency sequence effect (CSE), with a stronger CSE suggesting greater similarity. The other is to test their representational similarity within the dorsolateral prefrontal cortex.

      3) Is this unified space adaptively adjusted within the same brain region?

      We do not have an answer to this question. We showed that the cognitive space does not change with time (Note S3). What is adjusted is the control demand needed to resolve the conflict conditions that change quickly from trial to trial. It is nonetheless an interesting question whether the cognitive space may be altered, for example, when the mental state changes significantly. If so, we can further test whether the change of cognitive space also occurs within the right dlPFC.

      4) Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks?

      Our understanding of this comment is that the amount of conflict refers to the number of conflict sources. Based on our current findings, the dimensions of the space are indeed defined by how many different conflict sources are included. However, this would require that the different conflict sources be orthogonal. If some sources share some aspects, the cognitive space may collapse to a lower dimension. We have incorporated the first question into the discussion:

      Moreover, we anticipate that the representation of cognitive space is most prominently involved at two critical stages to guide the transference of behavioral CSE. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to control execution, where the switch from one control state to another follows a path within the cognitive space. However, we were unable to fully distinguish between these two stages due to the low temporal resolution of fMRI signals in our study. Future research seeking to delve deeper into this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG.

      We have included the other questions into the manuscript as open questions, calling for future research.

      Several interesting questions remain to be answered. For example, is the dimension of the unified space across conflict-inducing tasks solely determined by the number of conflict sources? Can we effectively map any sources of conflict with completely different stimuli into a single space? Is the cognitive space geometry modulated by the mental state? If yes, what brain regions mediate the change of cognitive space?

      Minor comments:

      • The original comment about out-of-sample predictions to examine the continuity of the space was a suggestion for testing neural representations, not behavior (I apologize for the lack of clarity). Given the low dimensionality of the conflict space shown by the participation ratio, we expect that linear separability exists only among specific combinations of conditions. For example, the pair of conflicts 1 and 5 together is not linearly separable from conflicts 2 and 3. But combined with other results, this is already implied.

      Response: We apologize for the misunderstanding. Performing a prediction analysis using the extensive RSM in our study does present certain challenges, primarily due to its substantial size (1400×1400) and the intricate nature of the mixed-effect linear model. In our efforts to simplify the prediction process by excluding random effects, we did observe a correlation between the predicted and original values, albeit with a relatively small Pearson correlation coefficient (r = 0.024, p < .001). This small correlation can be attributed to two key factors. First, the exclusion of data points impacts not only the conflict similarity regressor but also other regressors within the model, thereby diminishing the predictive power. Second, the large number of data points in the model heightens the risk of overfitting, subsequently reducing the model’s capacity for generalization and increasing the likelihood of unreliable predictions. Given these potential problems, we have opted not to include this prediction in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review): 

      The reviewer retained most of their comments from the previous reviewing round. In order to meet these comments and to further examine the dynamic nature of threat omission-related fMRI responses, we now re-analyzed our fMRI results using the single trial estimates. The results of these additional analyses are added below in our response to the recommendations for the authors of reviewer 1. However, we do want to reiterate that there was a factually incorrect statement concerning our design in the reviewer’s initial comments. Specifically, the reviewer wrote that “25% of shocks are omitted, regardless of whether subjects are told that the probability is 100%, 75%, 50%, 25%, or 0%.” We want to repeat that this is not what we did. 100% trials were always reinforced (100% reinforcement rate); 0% trials were never reinforced (0% reinforcement rate). For all other instructed probability levels (25%, 50%, 75%), the stimulation was delivered in 25% of the trials (25% reinforcement rate). We have elaborated on this misconception in our previous letter and have added this information more explicitly in the previous revision of the manuscript (e.g., lines 125-129; 223-224; 486-492).   

      Reviewer #1 (Recommendations For The Authors): 

      I do not have any further recommendations, although I believe an analysis of learning-related changes is still possible with the trial-wise estimates from unreinforced trials. The authors' response does not clarify whether they tested for interactions with run, and thus the fact that there are main effects does not preclude learning. I kept my original comments regarding limitations, with the exception of the suggestion to modify the title. 

      We thank the reviewer for this recommendation. In line with their suggestion, we have now reanalyzed our main ROI results using the trial-by-trial estimates we obtained from the first-level omission>baseline contrasts. Specifically, we extracted beta-estimates from each ROI and entered them into the same Probability x Intensity x Run LMM we used for the relief and SCR analyses. Results from these analyses (in the full sample) were similar to our main results. For the VTA/SN model, we found main effects of Probability (F = 3.12, p = .04) and Intensity (F = 7.15, p < .001) (in the model where influential outliers were rescored to 2SD from the mean). There was no main effect of Run (F = 0.92, p = .43) and no Probability x Run interaction (F = 1.24, p = .28). Had the experienced contingency interfered with the instructions, there should have been a Probability x Run interaction (with the effect of Probability only being present in the first runs). Since we did not observe such an interaction, our results indicate that even though some learning might still have taken place, the main effect of Probability remained present throughout the task.

      There is an important side note regarding these analyses: For the first level GLM estimation, we concatenated the functional runs and accounted for baseline differences between runs by adding run-specific intercepts as regressors of no-interest. Hence, any potential main effect of run was likely modeled out at first level. This might explain why, in contrast to the rating and SCR results (see Supplemental Figure 5), we found no main effect of Run. Nevertheless, interaction effects should not be affected by including these run-specific intercepts.

      Note that when we ran the single-trial analysis for the ventral putamen ROI, the effect of Intensity became significant (F = 3.89, p = .02). Results did not change for either the NAc or the vmPFC ROI.

      Reviewer #2 (Public Review): 

      Comments on revised version: 

      I want to thank the authors for their thorough and comprehensive work in revising this manuscript. I agree with the authors that learning paradigms might not be a necessity when it comes to study the PE signals, but I don't particularly agree with some of the responses in the rebuttal letter ("Furthermore, conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted."). This is of course correct description for the conditioning paradigm, but the same can be said for an instructed design: the aversive outcome was either delivered or not. That being said, adopting the instructed design itself is legitimate in my opinion. 

      We thank the reviewer for this comment. We have now modified the phrasing of this argument to clarify our reasoning (see lines 102-104: “First, these only included one level of aversive outcome: the electrical stimulation was either delivered at a fixed intensity, or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task.”).  

      The reason why we mentioned that “the aversive outcome is either delivered or omitted” is because in most contemporary conditioning paradigms only one level of aversive US is used. In these cases, it is therefore not possible to investigate the effect of US Intensity. In our paradigm, we included multiple levels of aversive US, allowing us to assess how the level of aversiveness influences threat omission responding. It is indeed true that each level was delivered or not. However, our data clearly (and robustly across experiments, see Willems & Vervliet, 2021) demonstrate that the effects of the instructed and perceived unpleasantness of the US (as operationalized by the mean reported US unpleasantness during the task) on the reported relief and the omission fMRI responses are stronger than the effect of instructed probability.  

      My main concern, which the authors spent quite some length in the rebuttal letter to address, still remains about the validity for different instructed probabilities. Although subjects were told that the trials were independent, the big difference between 75% and 25% would more than likely confuse the subjects, especially given that most of us would fall prey to the Gambler's fallacy (or the law of small numbers) to some degree. When the instruction and subjective experience collides, some form of inference or learning must have occurred, making the otherwise straightforward analysis more complex. Therefore, I believe that a more rigorous/quantitative learning modeling work can dramatically improve the validity of the results. Of course, I also realize how much extra work is needed to append the computational part but without it there is always a theoretical loophole in the current experimental design. 

      We agree with the reviewer that some learning may have occurred in our task. However, we believe the most important question in relation to our study is: to what extent did this learning influence our manipulations of interest?  

      In our reply to reviewer 1, we already showed that a re-analysis of the fMRI results using the trial-by-trial estimates of the omission contrasts revealed no Probability x Run interaction, suggesting that – overall – the probability effect remained stable over the course of the experiment. However, inspired by the alternative explanation that was proposed by this reviewer, we now also assessed the role of the Gambler’s fallacy in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more after more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  

      The results of these additional analyses are added in a supplemental note (see supplemental note 6), and referred to in the main text (see lines 231-236: “Likewise, a post-hoc trial-by-trial analysis of the omission-related fMRI activations confirmed that the Probability effect for the VTA/SN activations was stable over the course of the experiment (no Probability x Run interaction) and remained present when accounting for the Gambler’s fallacy (i.e., the possibility that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced) (see supplemental note 6). Overall, these post-hoc analyses further confirm the PE-profile of omission-related VTA/SN responses”).

      Addition to supplemental material (pages 16-18)

      Supplemental Note 6: The effect of Run and the Gambler’s Fallacy 

      A question that was raised by the reviewers was whether omission-related responses could be influenced by dynamical learning or the Gambler’s Fallacy, which might have affected the effectiveness of the Probability manipulation.  

      Inspired by this question, we exploratorily assessed the role of the Gambler’s Fallacy and the effects of Run in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  
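      The two lag regressors can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names (overall_lag, lag_per_probability) and the six-trial example sequence are hypothetical. Each trial is described by whether a stimulation was delivered and by its instructed probability level, and the lag counts the trials that have passed since the last stimulation (or since the start of the experiment).

```python
def overall_lag(stimulated):
    """Trials since the last stimulation, counted across all trials.

    A value of 0 means the previous trial was a stimulation trial
    or that the trial is the first one of the experiment.
    """
    lags = []
    since_last = 0
    for was_stimulated in stimulated:
        lags.append(since_last)
        since_last = 0 if was_stimulated else since_last + 1
    return lags


def lag_per_probability(stimulated, probability):
    """Same count, but tracked separately within each probability level."""
    since_last = {}
    lags = []
    for was_stimulated, level in zip(stimulated, probability):
        lags.append(since_last.get(level, 0))
        since_last[level] = 0 if was_stimulated else since_last.get(level, 0) + 1
    return lags


# Hypothetical six-trial sequence for illustration only.
stimulated = [False, True, False, False, True, False]
probability = [25, 75, 25, 75, 25, 25]

print(overall_lag(stimulated))                       # [0, 1, 0, 1, 2, 0]
print(lag_per_probability(stimulated, probability))  # [0, 0, 1, 0, 2, 0]
```

      In the example, the third trial has an overall lag of 0 because the second trial was stimulated, but a lag of 1 within the 25% level because the only earlier 25% trial was an omission.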

      The new models including these regressors for each omission response type (i.e., omission-related activations for each ROI, relief, and omission-SCR) were specified as follows:   

      (1) For the overall lag:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Overall-lag + (1|Subject).  

      (2) For the lag per probability level:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Lag-per-probability : Probability + (1|Subject).

      Where US-unpleasantness scores were mean-centered across participants; “*” represents main effects and interactions, and “:” represents an interaction (without main effect). Note that we only included an interaction for the lag-per-probability model to estimate separate lag-parameters for each probability level.  

      The results of these analyses are presented in the tables below. Overall, we found that adding these lag-regressors to the model did not alter our main results. That is: for the VTA/SN, relief and omission-SCR, the main effects of Probability and Intensity remained. Interestingly, the overall-lag-effect itself was significant for VTA/SN activations and omission SCR, indicating that VTA/SN activations were larger when more time had passed since the last stimulation (beta = 0.19), whereas SCR were smaller when more time had passed (beta = -0.03). This pattern is reminiscent of the Perruchet effect, namely that the explicit expectancy of a US increases over a run of non-reinforced trials (in line with the gambler’s fallacy effect) whereas the conditioned physiological response to the conditional stimulus declines (in line with an extinction effect, Perruchet, 1985; McAndrew, Jones, McLaren, & McLaren, 2012). Thus, the observed dissociation between the VTA/SN activations and omission SCR might similarly point to two distinctive processes where VTA/SN activations are more dependent on a consciously controlled process that is subjected to the gambler’s fallacy, whereas the strength of the omission SCR responses is more dependent on an automatic associative process that is subjected to extinction. Importantly, however, even though the temporal distance to the last stimulation had these opposing effects on VTA/SN activations and omission SCRs, the main effects of the probability manipulation remained significant for both outcome variables. This means that the core results of our study still hold.   

      Next to the overall-lag effect, the lag-per-probability regressor was only significant for the vmPFC. A follow-up of the beta estimates of the lag-per-probability regressors for each probability level revealed that vmPFC activations increased with increasing temporal distance from the stimulation, but only for the 50% trials (beta = 0.47, t = 2.75, p < .01), and not the 25% (beta = 0.25, t = 1.49, p = .14) or the 75% trials (beta = 0.28, t = 1.62, p = .10).

      Author response table 1.

      F-statistics and corresponding p-values from the overall lag model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the probability effect was p = .06. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      Author response table 2.

      F-statistics and corresponding p-values from the lag per probability level model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the Intensity x Run interaction was p = .05. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      As the authors mentioned in the rebuttal letter, "selecting participants only if their anticipatory SCR monotonically increased with each increase in instructed probability 0% < 25% < 50% < 75% < 100%, N = 11 participants", only ~1/3 of the subjects actually showed strong evidence for the validity of the instructions. This further raises the question of whether the instructed design, due to the interference of false instruction and the dynamic learning among trials, is solid enough to test the hypothesis.

      We agree with the reviewer that a monotonic increase in anticipatory SCR with increasing probability instructions would provide the strongest evidence that the manipulation worked. However, it is well known that SCR is a noisy measure, and so the chances to see this monotonic increase are rather small, even if the underlying threat anticipation increases monotonically. Furthermore, between-subject variation is substantial in physiological measures, and it is not uncommon to observe, e.g., differential fear conditioning in one measure, but not in another (Lonsdorf & Merz, 2017). It is therefore not so surprising that ‘only’ 1/3 of our participants showed the perfect pattern of monotonically increasing SCR with increasing probability instructions. That being said, it is also important to note that not all participants were considered for these follow-up analyses because valid SCR data was not always available.

      Specifically, N = 4 participants were identified as anticipation non-responders (i.e. participants with a smaller average SCR to the clock on 100% than on 0% trials; pre-registered criterion) and were excluded from the SCR-related analyses, and N = 1 participant had missing data due to technical difficulties. This means that only 26 (and not 31) participants were considered for the post hoc analyses. Taking this information into account, 21 out of 26 participants (approximately 80%) showed stronger anticipatory SCR following 75% instructions compared to 25% instructions, and 11 out of 26 participants (approximately 40%) even showed the monotonic increase in their anticipatory SCR (see supplemental figure 4). Furthermore, although anticipatory SCR gradually decreased over the course of the experiment, there was no Run x Probability interaction, indicating that the effect of the instructions remained stable throughout the task (see supplemental figure 3).
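
      The two criteria discussed above can be sketched as follows. This is an illustrative sketch only: the function names and the example SCR values are hypothetical, and only the counts (21 and 11 out of 26) come from the text.

```python
LEVELS = (0, 25, 50, 75, 100)  # instructed probabilities in percent


def is_strictly_monotonic(mean_scr_by_level):
    """True if mean anticipatory SCR increases with every probability step."""
    values = [mean_scr_by_level[level] for level in LEVELS]
    return all(a < b for a, b in zip(values, values[1:]))


def responds_to_instruction(mean_scr_by_level):
    """Weaker criterion: larger SCR after 75% than after 25% instructions."""
    return mean_scr_by_level[75] > mean_scr_by_level[25]


# Hypothetical participant: passes the weak criterion but not the strict
# one, because noise reverses a single step (25% > 50%).
noisy = {0: 0.10, 25: 0.22, 50: 0.20, 75: 0.31, 100: 0.40}

# Proportions reported in the text, out of 26 analysed participants.
pct_weak = round(100 * 21 / 26)    # participants with 75% > 25%
pct_strict = round(100 * 11 / 26)  # participants with a full monotonic increase
```

      The example shows why the strict criterion is demanding: a single noisy reversal among four steps is enough to fail it, even when the overall trend is clearly increasing.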

      Reviewer #2 (Recommendations For The Authors):

      A more operational approach might be to break the trials into different sections along the timeline and examine how much the results might have been affected across time. I expect the manipulation checks would hold for the first one or two runs and the authors then would have good reasons to focus on the behavioral and imaging results for those runs. 

      This recommendation resembles the recommendation by reviewer 1. In our reply to reviewer 1, we showed the results of a re-analysis of the fMRI data using the trial-by-trial estimates of the omission contrasts, which revealed no Probability x Run interaction, suggesting that, overall, the probability effect remained (more or less) stable over the course of the experiment. For a more in-depth discussion of the results of this additional analysis, we refer to our answer to reviewer 1.

      Reviewer #3 (Public Review): 

      Comments on revised version: 

      The authors were extremely responsive to the comments and provided a comprehensive rebuttal letter with a lot of detail to address the comments. The authors clarified their methodology, and rationale for their task design, which required some more explanation (at least for me) to understand. Some of the design elements were not clear to me in the original paper. 

      The initial framing for their study is still in the domain of learning. The paper starts off with a description of extinction as the prime example of when threat is omitted. This could lead a reader to think the paper would speak to the role of prediction errors in extinction learning processes. But this is not their goal, as they emphasize repeatedly in their rebuttal letter. The revision also now details how using a conditioning/extinction framework doesn't suit their experimental needs. 

      We thank the reviewer for pointing out this potential cause of confusion. We have now rewritten the starting paragraph of the introduction to more closely focus on prediction errors, and only discuss fear extinction as a potential paradigm that has been used to study the role of threat omission PE for fear extinction learning (see lines 40-55). We hope that these adaptations are sufficient to prevent any false expectations. However, as we have mentioned in our previous response letter, not talking about fear extinction at all would also not make sense in our opinion, since most of the knowledge we have gained about threat omission prediction errors to date is based on studies that employed these paradigms.  

      Adaptation in the revised manuscript (lines 40-55):  

      “We experience pleasurable relief when an expected threat stays away1. This relief indicates that the outcome we experienced (“nothing”) was better than we expected it to be (“threat”). Such a mismatch between expectation and outcome is generally regarded as the trigger for new learning, and is typically formalized as the prediction error (PE) that determines how much there can be learned in any given situation2. Over the last two decades, the PE elicited by the absence of expected threat (threat omission PE) has received increasing scientific interest, because it is thought to play a central role in learning of safety. Impaired safety learning is one of the core features of clinical anxiety4. A better understanding of how the threat omission PE is processed in the brain may therefore be key to optimizing therapeutic efforts to boost safety learning. Yet, despite its theoretical and clinical importance, research on how the threat omission PE is computed in the brain is only emerging.  

      To date, the threat omission PE has mainly been studied using fear extinction paradigms that mimic safety learning by repeatedly confronting a human or animal with a threat predicting cue (conditional stimulus, CS; e.g. a tone) in the absence of a previously associated aversive event (unconditional stimulus, US; e.g., an electrical stimulation). These (primarily non-human) studies have revealed that there are striking similarities between the PE elicited by unexpected threat omission and the PE elicited by unexpected reward.”

      It is reasonable to develop a new task to answer their experimental questions. By no means is there a requirement to use a conditioning/extinction paradigm to address their questions. As they say, "it is not necessary to adopt a learning paradigm to study omission responses", which I agree with.  But the authors seem to want to have it both ways: they frame their paper around how important prediction errors are to extinction processes, but then go out of their way to say how they can't test their hypotheses with a learning paradigm.

      Part of their argument that they needed to develop their own task "outside of a learning context" goes as follows: 

      (1) "...conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted. As a result, the magnitude-related axiom cannot be tested." 

      (2) "....in conditioning tasks people generally learn fast, rendering relatively few trials on which the prediction is violated. As a result, there is generally little intra-individual variability in the PE responses" 

      (3) "...because of the relatively low signal to noise ratio in fMRI measures, fear extinction studies often pool across trials to compare omission-related activity between early and late extinction, which further reduces the necessary variability to properly evaluate the probability axiom" 

      These points seem to hinge on how tasks are "generally" constructed. However, there are many adaptations to learning tasks:

      (1) There is no rule that conditioning can't include different levels of aversive outcomes following different cues. In fact, their own design uses multiple cues that signal different intensities and probabilities. Saying that conditioning "generally only include one level of aversive outcome" is not an explanation for why "these paradigms are not tailored" for their research purposes. There are also several conditioning studies that have used different cues to signal different outcome probabilities. This is not uncommon, and in fact is what they use in their study, only with an instruction rather than through learning through experience, per se.

      (2) Conditioning/extinction doesn't have to occur fast. Just because people "generally learn fast" doesn't mean this has to be the case. Experiments can be designed to make learning more challenging or take longer (e.g., partial reinforcement). And there can be intra-individual differences in conditioning and extinction, especially if some cues have a lower probability of predicting the US than others. Again, because most conditioning tasks are usually constructed in a fairly simplistic manner doesn't negate the utility of learning paradigms to address PE axioms.

      (3) Many studies have tracked trial-by-trial BOLD signal in learning studies (e.g., using parametric modulation). Again, just because other studies "often pool across trials" is not an explanation for these paradigms being ill-suited to study prediction errors. Indeed, most computational models used in fMRI are predicated on analyzing data at the trial level. 

      We thank the reviewer for these remarks. The “fear conditioning and extinction paradigms” that we were referring to in this paragraph were the ones that have been used to study threat omission PE responses in previous research (e.g., Raczka et al., 2011; Thiele et al. 2021; Lange et al. 2020; Esser et al., 2021; Papalini et al., 2021; Vervliet et al. 2017). These studies have mainly used differential/multiple-cue protocols where either one (or two) CS+ and one CS- are trained in an acquisition phase and extinguished in the next phase. Thus, in these paradigms: (1) only one level of aversive US is used; and (2) as safety learning develops over the course of extinction, there are relatively few omission trials during which “large” threat omission PEs can be observed (e.g. from the 24 CS+ trials that were used during extinction in Esser et al., the steepest decreases in expectancy – and thus the largest PE – were found in the first 6 trials); and (3) there was never absolute certainty that the stimulation will no longer follow. Some of these studies have indeed estimated the threat omission PE during the extinction phase based on learning models, and have entered these estimates as parametric modulators to CS-offset regressors. This is very informative. However, the exact model that was used differed per study (e.g. Rescorla-Wagner in Raczka et al. and Thiele et al.; or a Rescorla-Wagner-Pearce-Hall hybrid model in Esser et al.). We wanted to analyze threat omission responses without commitment to a particular learning model. Thus, in order to examine how threat omission responses vary as a function of probability-related expectations, a paradigm that has multiple probability levels is recommended (e.g. Rutledge et al., 2010; Ojala et al., 2022).
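
      To illustrate point (2), the textbook Rescorla-Wagner update can be simulated over an extinction phase. This is a minimal sketch under stated assumptions: the initial expectancy and learning rate are hypothetical values chosen for illustration and do not come from any of the cited studies.

```python
def rescorla_wagner_extinction(initial_expectancy, learning_rate, n_trials):
    """Simulate US expectancy over extinction trials (US always omitted).

    Returns the prediction error on each trial, computed as
    outcome - expectancy, with outcome = 0 for an omitted US.
    """
    expectancy = initial_expectancy
    prediction_errors = []
    for _ in range(n_trials):
        outcome = 0.0  # the US is omitted on every extinction trial
        prediction_error = outcome - expectancy
        prediction_errors.append(prediction_error)
        # Expectancy moves toward the outcome by a fraction of the error.
        expectancy += learning_rate * prediction_error
    return prediction_errors


# Hypothetical parameters: full expectancy at the start of extinction,
# learning rate 0.3, and 24 CS+ trials.
pes = rescorla_wagner_extinction(initial_expectancy=1.0, learning_rate=0.3, n_trials=24)
magnitudes = [abs(pe) for pe in pes]
```

      Under this model the omission PE shrinks geometrically, so the large PEs are concentrated in the first few trials, and how many trials remain informative depends on the individual learning rate.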

      The reviewer rightfully pointed out that conditioning paradigms (more generally) can be tailored to fit our purposes as well. Still, when doing so, the same adaptations as we outlined above need to be considered: i.e. include different levels of US intensity; different levels of probability; and conditions with full certainty about the US (non)occurrence. In our attempt to keep the experimental design as simple and straightforward as possible, we decided to rely on instructions for this purpose, rather than to train 3 (US levels) x 5 (reinforcement levels) = 15 different CSs. It is certainly possible to train multiple CSs of varying reinforcement rates (e.g. Grings et al. 1971, Ojala et al., 2022). However, given that US expectation on each trial would primarily depend on the individual learning processes of the participants, using a conditioning task would make it more difficult to maintain experimental control over the level of US expectation elicited by each CS. As a result, this would likely require more extensive training, and thus prolong the study procedure considerably. Furthermore, even though previous studies have trained different CSs for different reinforcement rates, most of these studies have only used one level of US. Thus, in order not to complicate our task too much, we decided to rely on instructions rather than to train CSs for multiple US levels (in addition to multiple reinforcement rates).

      We have tried to clarify our reasoning in the revised version of the manuscript (see introduction, lines 100-113):  

      “The previously discussed fear conditioning and extinction studies have been invaluable for clarifying the role of the threat omission PE within a learning context. However, these studies were not tailored to create the varying intensity and probability-related conditions that are required to systematically evaluate the threat omission PE in the light of the PE axioms. First, these only included one level of aversive outcome: the electrical stimulation was either delivered or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task. As a result, the magnitude-related axiom could not be tested. Second, as safety learning progressively developed over the course of extinction learning, the most informative trials to evaluate the probability axiom (i.e. the trials with the largest PE) were restricted to the first few CS+ offsets of the extinction phase, and the exact number of these informative trials likely differed across participants as a result of individually varying learning rates. This limited the experimental control and necessary variability to systematically evaluate the probability axiom. Third, because CS-US contingencies changed over the course of the task (e.g. from acquisition to extinction), there was never complete certainty about whether the US would (not) follow. This precluded a direct comparison of fully predicted outcomes. Finally, within a learning context, it remains unclear whether brain responses to the threat omission are in fact responses to the violation of expectancy itself, or whether they are the result of subsequent expectancy updating.”

      Again, the authors are free to develop their own task design that they think is best suited to address their experimental questions. For instance, if they truly believe that omission-related responses should be studied independent of updating. The question I'm still left puzzling is why the paper is so strongly framed around extinction (the word appears several times in the main body of the paper), which is a learning process, and yet the authors go out of their way to say that they can only test their hypotheses outside of a learning paradigm. 

      As we have mentioned before, the reason why we refer to extinction studies is because most evidence on threat omission PE to date comes from fear extinction paradigms.  

      The authors did address other areas of concern, to varying extents. Some of these issues were somewhat glossed over in the rebuttal letter by noting them as limitations. For example, the issue with comparing 100% stimulation to 0% stimulation, when the shock contaminates the fMRI signal. This was noted as a limitation that should be addressed in future studies, bypassing the critical point. 

      It is unclear to us what the reviewer means with “bypassing the critical point”. We argued in the manuscript that the contrast we initially specified and preregistered to study axiom 3 (fully predicted outcomes elicit equivalent activation) could not be used for this purpose, as it was confounded by the delivery of the stimulation. Because 100% trials always included the stimulation and 0% trials never included stimulation, there was no way to disentangle activations related to full predictability from activations related to the stimulation as such.

      Reviewer #3 (Recommendations For The Authors): 

      I'm not sure the new paragraph explaining why they can't use a learning task to test their hypotheses is very convincing, as I noted in my review. Again, it is not a problem to develop a new task to address their questions. They can justify why they want to use their task without describing (incorrectly in my opinion) that other tasks "generally" are constructed in a way that doesn't suit their needs. 

      For an overview of the changes we made in response to this recommendation, we refer to our reply to the public review.   

      We look forward to your reply and are happy to provide answers to any further questions or comments you may have.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We would like to first thank the Editor as well as the three reviewers for their enthusiasm and for conducting another careful evaluation of our manuscript. We appreciate their thoughtful and constructive comments and suggestions. Some concerns regarding experimental design, data analysis, and over-interpretation of our findings remain unresolved after the initial revision. Here we endeavor to address these remaining concerns through further refinement of our writing and by including these concerns in the discussion section. We hope our response can better explain the rationale of our experimental design and data interpretation. In addition, we acknowledge the limitations of our present study, so that it will benefit future investigations into this topic. Our detailed responses are provided below.

      Reviewer #1 (Public Review):

      This study examines whether the human brain uses a hexagonal grid-like representation to navigate in a non-spatial space constructed by competence and trustworthiness. To test this, the authors asked human participants to learn the levels of competence and trustworthiness for six faces by associating them with specific lengths of bar graphs that indicate their levels in each trait. After learning, participants were asked to extrapolate the location from the partially observed morphing bar graphs. Using fMRI, the authors identified brain areas where activity is modulated by the angles of morphing trajectories in six-fold symmetry. The strength of this paper lies in the question it attempts to address. Specifically, the question of whether and how the human brain uses grid-like representations not only for spatial navigation but also for navigating abstract concepts, such as social space, and guiding everyday decision-making. This question is of emerging importance.

      I acknowledge the authors' efforts to address the comments received. However, my concerns persist:

      Thanks very much again for the re-evaluation and comments. Please find our revision plans to each comment below.

      (1) The authors contend that shorter reaction times correlated with increased distances between individuals in social space imply that participants construct and utilize two-dimensional representations. This method is adapted from a previous study by Park et al. Yet, there is a fundamental distinction between the two studies. In the prior work, participants learned relationships between adjacent individuals, receiving feedback on their decisions, akin to learning spatial locations during navigation. This setup leads to two different predictions: If participants rely on memory to infer relationships, recalling more pairs would be necessary for distant individuals than for closer ones. Conversely, if participants can directly gauge distances using a cognitive map, they would estimate distances between far individuals as quickly as for closer ones. Consequently, as the authors suggest, reaction times ought to decrease with increasing decision value, which, in this context, corresponds to distances. However, the current study allowed participants to compare all possible pairs without restricting learning experiences, rendering the application of the same methodology for testing two-dimensional representations inappropriate. In this study, the results could be interpreted as participants not forming and utilizing two-dimensional representations.

      We apologize for not being clear enough about our task design. We have made relevant changes in the methodology section of the manuscript to make it clearer. The reviewer’s concern is that participants learned about all the pairs in the comparison task, which makes the distance effect invalid. We would like to clarify that during all the memory test tasks (the comparison task, the collect task and the recall task outside and inside the scanner), participants never received feedback on whether their responses were correct or not. Therefore, the comparison task in our study is similar to the previous study by Park et al. (2021). Participants did not have access to correct responses for all possible pairs of comparison prior to or during this task, so they would need to make inferences based on memory retrieval.

      (2) The confounding of visual features with the value of social decision-making complicates the interpretation of this study's results. It remains unclear whether the observed grid-like effects are due to visual features or are genuinely indicative of value-based decision-making, as argued by the authors. Contrary to the authors' argument, this issue was not present in the previous study (Constantinescu et al.). In that study, participants associated specific stimuli with the identities of hidden items, but these stimuli were not linked to decision-making values (i.e., no image was considered superior to another). The current study's paradigm is more akin to that of Bao et al., which the authors mention in the context of RSA analysis. Indeed, Bao et al. controlled the length of the bars specifically to address the problem highlighted here. Regrettably, in the current paradigm, this conflation remains inseparable.

      We’d like to thank the reviewer for facilitating the discussion on the question of ‘social space’ vs. ‘sensory space’. The task in the scanner did not require value-based decision making. It is akin to both the Bao et al. (2019) study and the Constantinescu et al. (2016) study in the sense that all three tasks ask participants to imagine moving along a trajectory in an abstract, non-physical space, and the trajectory is grounded in a sensory cue. Participants were trained to associate the sensory cue with abstract (social/nonsocial) concepts. We think that the paradigm is a relatively faithful replication of the study by Constantinescu et al. Nonetheless, we agree that a design similar to that of Bao et al. (2019), which controls for sensory confounds, or a value-based decision-making task in the scanner similar to that of Park et al. (2021), would be better suited to address this concern, and we have included this limitation in the discussion section.

      (3) While the authors have responded to comments in the public review, my concerns noted in the Recommendation section remain unaddressed. As indicated in my recommendations, there are aspects of the authors' methodology and results that I find difficult to comprehend. Resolving these issues is imperative to facilitate an appropriate review in subsequent stages.

      Considering that the issues raised in the previous comments remain unresolved, I have retained my earlier comments below for review.

      We apologize for not addressing the recommendations properly. Please find our detailed responses and plans for revision below.

      I have some comments. I hope that these can help.

      (1) While the explanation of Fig.4A-C is lacking in both the main text and figure legend, I am not sure if I understand this finding correctly. Did the authors find the effects of hexagonal modulation in the medial temporal gyrus and lingual gyrus correlate with the individual differences in the extent to which their reaction times were associated with the distances between faces when choosing a better collaborator? If so, I am not sure what argument the authors try to draw from these findings. Do the authors argue that these brain areas show hexagonal modulation, which was not supported in the previous analysis (Fig.3)? What is the level of correlation between these behavioral measures and the grid consistency effects in the vmPFC and EC, where the authors found actual grid-like activity? How do the authors interpret this finding? More importantly, how does this finding associate with other findings and the argument of the study?

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. The exploratory analysis reported in Figure 4 uses whole-brain analysis to examine: 1) whether there is any correlation between the strength of the grid-like representation of the social value map and behavioral indicators of map-like representation; and 2) whether there is any correlation between the strength of the grid-like representation of this social value map and participants’ social traits.

      To be more specific, for the behavioral indicator, we used the distance effect in the reaction time of the comparison task outside the scanner. We interpreted stronger distance effect as a behavioral index of having better internal map-like representation. We interpreted stronger grid consistency effect as a neural index of better representation of the 2D social space. Therefore, we’d like to see if there exists correlation between behavioral and neural indices of map-like representation.

      To achieve this goal, behavioral indicators were entered as covariates in the second-level analysis of the GLM testing the grid consistency effect (GLM2). Figure 3 shows results from GLM2 without the covariates. Figure 4 shows clusters whose neural indices of map-like representation covaried with the behavioral indices and survived multiple-comparison correction. Indeed, in these regions, the grid consistency effect was not significant at the group level (and is therefore not shown in Figure 3). We tried to interpret this finding in our discussion (line 374-289 for temporal lobe correlation, line 395-404 for precuneus correlation).

      Finally, we would like to point out that including the covariates in GLM2 did not change the results in Figure 3: the clusters in Figure 3 still survive correction. Meanwhile, these clusters in Figure 3 did not show a correlation with behavioral indicators of map-like representation.

      Author response image 1.

      (2) There are no behavioral results provided. How accurately did participants perform each of the tasks? How are the effects of grid consistency associated with the level of accuracy in the map test?

      Why did participants perform the recall task again outside the scanner?

      We will endeavor to improve the signposting of the corresponding figures in the main text. For the behavioral results, we reported the statistics in the section “Participants construct social value map after associative learning of avatars and corresponding characteristics” in the main text, and the plots are shown in Figure 1. In particular, Figure 1F shows the accuracy of the tasks in training, as well as of the recall task in the scanner. For the correlation, we did not find a significant correlation between behavioral accuracy and the grid consistency effect. We will make this clearer in the results section.

      (3) The methods did not explain how the grid orientation was estimated and what the regressors were in GLM2. I don't think equations 2 and 3 are quite right.

      For the grid orientation estimation method, we provided detailed description in the Supplementary methods 2.2.2. We will add links to this section in the main text.

      Equations 2 and 3 describe how the parametric regressors entered into GLM2 were formed and provide the prerequisites for the calculation of grid orientations. Equation 2 was the result of directly applying the angle addition and subtraction theorems, so it should be correct. We will try to make the rationale clearer in the supplementary text.
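
      For readers unfamiliar with this type of analysis, the role of the angle addition and subtraction theorems can be illustrated with the approach commonly used in the grid-code fMRI literature. This is a sketch of that general approach, not a reproduction of Equations 2 and 3 of the manuscript; the function names are hypothetical, and the assumption is that a first GLM provides weights for cos(6θ) and sin(6θ) from which the orientation φ is derived.

```python
import math


def estimate_grid_orientation(beta_cos, beta_sin):
    """Grid orientation (radians) from the weights of the cos(6*theta)
    and sin(6*theta) regressors; the result lies in (-pi/6, pi/6]."""
    return math.atan2(beta_sin, beta_cos) / 6.0


def aligned_regressor(theta, phi):
    """Six-fold modulation of trajectory angle theta relative to phi."""
    return math.cos(6.0 * (theta - phi))


def aligned_regressor_expanded(theta, phi):
    """Same quantity, expanded with the angle subtraction theorem."""
    return (math.cos(6.0 * theta) * math.cos(6.0 * phi)
            + math.sin(6.0 * theta) * math.sin(6.0 * phi))


# Hypothetical example: a true orientation of 10 degrees.
true_phi = math.radians(10.0)
phi_hat = estimate_grid_orientation(math.cos(6 * true_phi), math.sin(6 * true_phi))

aligned = aligned_regressor(true_phi, phi_hat)                         # on-grid direction
misaligned = aligned_regressor(true_phi + math.radians(30), phi_hat)  # 30 degrees off-grid
```

      The expanded form shows why the cosine and sine regressors are sufficient in the estimation step: any six-fold modulation with unknown phase is a linear combination of the two.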

      (4) With the increase in navigation distances, more grid cells would activate. Therefore, in theory, the activity in the entorhinal cortex should increase with the Euclidean distances, which has not been found here. I wonder if there was enough variability in the Euclidean distances that can be captured by neural correlates. This would require including the distributions of Euclidean distances according to their trajectory angles. Regarding how Fig.1E is generated, I don't understand what this heat map indicates. Additionally, it needs to be confirmed if the grid effects remain while controlling for the Euclidean distances of navigation trajectories.

      We did not specifically control for the trajectory length; we only ensured that the distribution of trajectory directions was uniform. We have included a figure of the distribution of Euclidean distances in Figure S9 and the distribution of trajectory directions in Figure S8.

      Author response image 2.

      As for Figure 1E, we aimed to reproduce the findings from Figure 1F in Constantinescu et al. (2016), where they showed that participants progressively refined the locations of the outcomes through training. We divided the space into 15×15 subregions, computed the amount of time spent in each subregion, and plotted the result as Figure 1E. A brighter color in Figure 1E indicates a greater amount of time spent in the corresponding subregion. Note that all these timing indices were computed as a percentage of the total time spent in the explore task in a given session. If participants were well-acquainted with the space and avatars, they would spend more time at the avatars (brighter color at avatar locations) in the review session compared to the learning session.
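
      The computation behind such a heat map can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the unit-square coordinate range, and the representation of the data as (x, y, dwell time) samples are hypothetical and not taken from our analysis code.

```python
def occupancy_map(samples, n_bins=15, lo=0.0, hi=1.0):
    """Fraction of total time spent in each cell of an n_bins x n_bins grid.

    samples -- iterable of (x, y, dwell_time) tuples, with x and y
               assumed to lie in [lo, hi]
    Returns a list of rows; grid[row][col] indexes (y bin, x bin).
    """
    grid = [[0.0] * n_bins for _ in range(n_bins)]
    width = (hi - lo) / n_bins
    total = 0.0
    for x, y, dwell in samples:
        # Clamp so that a position exactly at the upper edge falls in the last bin.
        col = min(int((x - lo) / width), n_bins - 1)
        row = min(int((y - lo) / width), n_bins - 1)
        grid[row][col] += dwell
        total += dwell
    if total > 0:
        grid = [[cell / total for cell in row] for row in grid]
    return grid


# Hypothetical session: 3 s near one corner, 1 s near the opposite corner.
samples = [(0.05, 0.05, 2.0), (0.05, 0.05, 1.0), (0.95, 0.95, 1.0)]
grid = occupancy_map(samples)
```

      Normalizing by the total time in the session makes maps from sessions of different duration comparable.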

      As for the effect of distances on grid-like representation, we did not include the distance as a parametric modulator in grid consistency effect GLM (GLM2) due to insufficient trials in each bin (6-8 trials). But there is side evidence that could potentially rule out this confound. In the distance representation analysis, we did not find distance representation in any of the clusters that have significant grid-like representation (regions in Figure 2).

      Reviewer #2 (Public Review):

      Summary:

      In this work, Liang et al. investigate whether an abstract social space is neurally represented by a grid-like code. They trained participants to 'navigate' around a two-dimensional space of social agents characterized by the traits warmth and competence, then measured neural activity as participants imagined navigating through this space. The primary neural analysis consisted of three procedures: 1) identifying brain regions exhibiting the hexagonal modulation characteristic of a grid-like code, 2) estimating the orientation of each region's grid, and 3) testing whether the strength of the univariate neural signal increases when a participant is navigating in a direction aligned with the grid, compared to a direction that is misaligned with the grid. From these analyses, the authors find the clearest evidence of a grid-like code in the prefrontal cortex and weaker evidence in the entorhinal cortex.

      Strengths:

      The work demonstrates the existence of a grid-like neural code for a socially-relevant task, providing evidence that such coding schemes may be relevant for a variety of two-dimensional task spaces.

      Weaknesses:

      In the revised manuscript, the authors soften their claims about finding a grid code in the entorhinal cortex and provide additional caveats about limitations in their findings. It seems that the authors and reviewers are in agreement about the following weaknesses, which were part of my original review: Claims about a grid code in the entorhinal cortex are not well-supported by the analyses presented. The whole-brain analysis does not suggest that the entorhinal cortex exhibits hexagonal modulation; the strength of the entorhinal BOLD signal does not track the putative alignment of the grid code there; multivariate analyses do not reveal any evidence of a grid-like representational geometry.

      In the authors' response to reviews, they provide additional clarification about their exploratory analyses examining whether behavior (i.e., reaction times) and individual difference measures (i.e., social anxiety and avoidance) can be predicted by the hexagonal modulation strength in some region X, conditional on region X having a similar estimated grid alignment with some other region Y. My guess is that readers would find it useful if some of this language were included in the main text, especially with regard to an explanation regarding the rationale for these exploratory studies.

      Thank you very much again for your careful re-evaluation and suggestions. We have tried to improve our writing and incorporate the suggestions in the new revision.

      Reviewer #3 (Public Review):

      Liang and colleagues set out to test whether the human brain uses distance and grid-like codes in social knowledge using a design where participants had to navigate in a two-dimensional social space based on competence and warmth during an fMRI scan. They showed that participants were able to navigate the social space and found distance-based codes as well as grid-like codes in various brain regions, and the grid-like code correlated with behavior (reaction times).

      On the whole, the experiment is designed appropriately for testing for distance-based and grid-like codes, and is relatively well powered for this type of study, with a large amount of behavioral training per participant. They revealed that a number of brain regions correlated positively or negatively with distance in the social space, and found grid-like codes in the frontal polar cortex and posterior medial entorhinal cortex, the latter in line with prior findings on grid-like activity in entorhinal cortex. The current paper seems quite similar conceptually and in design to previous work, most notably Park et al., 2021, Nature Neuroscience.

      (1) The authors claim that this study provides evidence that humans use a spatial / grid code for abstract knowledge like social knowledge.

      These data do not specifically add anything new to this argument. As with almost all studies that test for a grid code in a similar "conceptual" space (not only the current study), the problem is that, when the space is not a uniform, square/circular, two-dimensional space, there is no reason the code will be perfectly grid-like, i.e., show six-fold symmetry. In real-world scenarios of social space (as well as navigation and semantic concepts), the space must be higher dimensional, or at least more than two dimensional. It is unclear if this generalizes to larger spaces where not all parts of the space are relevant. Modelling work from Tim Behrens' lab (e.g., Whittington et al., 2020) and Bradley Love's lab (e.g., Mok & Love, 2019) has shown/argued this to be the case. In experimental work, such as in mazes from the Mosers' labs (e.g., Derdikman et al., 2009) or trapezoid environments from the O'Keefe lab (Krupic et al., 2015), there are distortions in mEC cells, which would not pass as grid cells in terms of the six-fold symmetry criterion.

      The authors briefly discuss the limitations of this at the very end but do not really say how this speaks to the goal of their study and the claim that social space or knowledge is organized as a grid code and if it is in fact used in the brain in their study and beyond. This issue deserves to be discussed in more depth, possibly referring to prior work that addressed this, and raise the issue for future work to address the problem - or if the authors think it is a problem at all.

      Thanks very much again for your careful re-evaluation and comments. We have tried to incorporate some of the suggested papers into our discussion. In summary, we agree that there is more to six-fold symmetric code that can be utilized to represent “conceptual space”. We think that the next step for a stronger claim would be to find the representation of more spontaneous non-spatial maps.

      References

      Bao, X., Gjorgieva, E., Shanahan, L. K., Howard, J. D., Kahnt, T., & Gottfried, J. A. (2019). Grid-like Neural Representations Support Olfactory Navigation of a Two-Dimensional Odor Space. Neuron, 102(5), 1066-1075 e1065. https://doi.org/10.1016/j.neuron.2019.03.034

      Constantinescu, A. O., O'Reilly, J. X., & Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292), 1464-1468. https://doi.org/10.1126/science.aaf0941

      Park, S. A., Miller, D. S., & Boorman, E. D. (2021). Inferences on a multidimensional social hierarchy use a grid-like code. Nat Neurosci, 24(9), 1292-1301. https://doi.org/10.1038/s41593-021-00916-3

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Summary:

      This paper presents a compelling and comprehensive study of decision-making under uncertainty. It addresses a fundamental distinction between belief-based (cognitive neuroscience) formulations of choice behavior and reward-based (behavioral psychology) accounts. Specifically, it asks whether active inference provides a better account of planning and decision making, relative to reinforcement learning. To do this, the authors use a simple but elegant paradigm that includes choices about whether to seek both information and rewards. They then assess the evidence for active inference and reinforcement learning models of choice behavior, respectively. After demonstrating that active inference provides a better explanation of behavioral responses, the neuronal correlates of epistemic and instrumental value (under an optimized active inference model) are characterized using EEG. Significant neuronal correlates of both kinds of value were found in sensor and source space. The source space correlates are then discussed sensibly, in relation to the existing literature on the functional anatomy of perceptual and instrumental decision-making under uncertainty.

      We are deeply grateful for your careful review of our work and your suggestions. Your insights have helped us identify areas where we can strengthen the arguments and clarify the methodology. We hope to apply the idea of active inference to our future work, emphasizing the integrity of perception and action.

      Reviewer #1 (Recommendations For The Authors):

      Many thanks for attending to my previous suggestions. I think your presentation is now much clearer and nicely aligned with the active inference literature.

      There is one outstanding issue. I think you have overinterpreted the two components of epistemic value in Equation 8. The two components that you have called the value of reducing risk and the value of reducing ambiguity are not consistent with the normal interpretation. These two components are KL divergences that measure the expected information gain about parameters and states respectively.

      If you read the Schwartenbeck et al paper carefully, you will see that the first (expected information gain about parameters) is usually called novelty, while the second (expected information gain about states) is usually called salience.

      This means you can replace "the value of reducing ambiguity" with "novelty" and "the value of reducing risk" with "salience".

      For your interest, "risk" and "ambiguity" are alternative ways of decomposing expected free energy. In other words, you can decompose expected free energy into (negative) expected information gain and expected value (as you have done). Alternatively, you can rearrange the terms and express expected free energy as risk and ambiguity. Look at the top panel of Figure 4 in:

      https://www.sciencedirect.com/science/article/pii/S0022249620300857

      I hope that this helps.
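
      The equivalence of the two decompositions mentioned above can be checked numerically for a small discrete model. The sketch below is an illustration only: it covers beliefs about states (the salience term), omits information gain about parameters (novelty), assumes exact Bayesian posteriors, and uses made-up numbers for the likelihood, the state beliefs, and the outcome preferences.

```python
import numpy as np

# Hypothetical two-state, two-outcome model (all numbers are made up).
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])            # A[o, s] = P(o | s)
q_s = np.array([0.6, 0.4])            # Q(s): predicted states under a policy
log_c = np.log(np.array([0.9, 0.1]))  # ln P(o): prior preferences over outcomes

q_o = A @ q_s                         # Q(o): predicted outcomes
joint = A * q_s                       # joint[o, s] = P(o | s) Q(s)
post = joint / q_o[:, None]           # Q(s | o): exact posterior over states

# Decomposition 1: negative expected information gain and expected value.
info_gain = np.sum(joint * (np.log(post) - np.log(q_s)))
expected_value = q_o @ log_c
g_epistemic_form = -info_gain - expected_value

# Decomposition 2: risk (KL from preferred outcomes) plus ambiguity
# (expected entropy of the likelihood).
risk = np.sum(q_o * (np.log(q_o) - log_c))
ambiguity = -np.sum(q_s * np.sum(A * np.log(A), axis=0))
g_risk_form = risk + ambiguity
```

      Under these assumptions the two expressions agree to numerical precision, which is why risk and ambiguity are a rearrangement of expected free energy rather than sub-components of epistemic value.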

      We deeply thank you for your recommendations about the interpretation of the epistemic value in Equation 8. We have now corrected them to Novelty and Salience:

      In addition, in order to avoid terminology conflicts with active inference and to describe these two different uncertainties, we replaced Ambiguity in the article with Novelty, referring to the uncertainty that can be reduced by sampling, and replaced Risk with Variability, referring to the uncertainty inherent in the environment (variance).

      Reviewer #2 (Public Review):

      Summary:

      Zhang and colleagues use a combination of behavioral, neural, and computational analyses to test an active inference model of exploration in a novel reinforcement learning task.

      Strengths:

      The paper addresses an important question (validation of active inference models of exploration). The combination of behavior, neuroimaging, and modeling is potentially powerful for answering this question.

      I appreciate the addition of details about model fitting, comparison, and recovery, as well as the change in some of the methods.

      We are deeply grateful for your careful review of our work and your suggestions. We are also very sorry that, in our last response, we did not address several of your suggestions appropriately in our manuscript. We hope to respond to these suggestions properly in this revision. Thank you for your contribution to ensuring the scientific rigor and reproducibility of the work.

      The authors do not cite what is probably the most relevant contextual bandit study, by Collins & Frank (2018, PNAS), which uses EEG.

      The authors cite Collins & Molinaro as a form of contextual bandit, but that's not the case (what they call "context" is just the choice set). They should look at the earlier work from Collins, starting with Collins & Frank (2012, EJN).

      We deeply thank you for your comments. We have now added the relevant citations to the manuscript (line 46):

      “These studies utilized different forms of multi-armed bandit tasks, e.g., the restless multi-armed bandit tasks (Daw et al., 2006; Guha et al., 2010), risky/safe bandit tasks (Tomov et al., 2020; Fan et al., 2023; Payzan-LeNestour et al., 2013), contextual multi-armed bandit tasks (Collins & Frank, 2018; Schulz et al., 2015; Collins & Frank, 2012)”

      Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876-879.

      Guha, S., Munagala, K., & Shi, P. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1), 1-50.

      Tomov, M. S., Truong, V. Q., Hundia, R. A., & Gershman, S. J. (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nature Communications, 11(1), 2371.

      Fan, H., Gershman, S. J., & Phelps, E. A. (2023). Trait somatic anxiety is associated with reduced directed exploration and underestimation of uncertainty. Nature Human Behaviour, 7(1), 102-113.

      Payzan-LeNestour, E., Dunne, S., Bossaerts, P., & O’Doherty, J. P. (2013). The neural representation of unexpected uncertainty during value-based decision making. Neuron, 79(1), 191-201.

      Collins, A. G., & Frank, M. J. (2018). Within-and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proceedings of the National Academy of Sciences, 115(10), 2502-2507.

      Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2015, April). Exploration-exploitation in a contextual multi-armed bandit task. In International conference on cognitive modeling (pp. 118-123).

      Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035.

      Placing statistical information in a GitHub repository is not appropriate. This needs to be in the main text of the paper. I don't understand why the authors refer to space limitations; there are none for eLife, as far as I'm aware.

      We deeply thank you for your comments. We calculated the average t-value of the brain regions with significant results over the significant time, and added the t-value results to the main text and supplementary materials.

      In answer to my question about multiple comparisons, the authors have added the following: "Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations." I'm sorry, but this does not make sense. Either the authors are doing multiple comparisons, in which case multiple comparison correction is relevant, or they are doing a single test on the extended timeseries, in which case they need to report that. There exist tools for this kind of analysis (e.g., Gershman et al., 2014, NeuroImage). I'm not suggesting that the authors should necessarily do this, only that their statistical approach should be coherent. As a reference point, the authors might look at the aforementioned Collins & Frank (2018) study.

      We deeply thank you for your comments. We have now replaced all our results with the results after false discovery rate correction and added relevant descriptions (lines 357-358):

      “The significant results after false discovery rate (FDR) (Benjamini & Hochberg, 1995; Gershman et al., 2014) correction are shown in shaded regions. Additional regression results can be found in Supplementary Materials.”

      Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.

      Gershman, S. J., Blei, D. M., Norman, K. A., & Sederberg, P. B. (2014). Decomposing spatiotemporal brain patterns into topographic latent sources. NeuroImage, 98, 91-102.
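
      For reference, the Benjamini-Hochberg step-up procedure cited above can be sketched as follows. The p-values are made up for illustration, and how the correction was applied in the manuscript (for example, across time points, sources, or both) is not stated here, so that part is left as an assumption.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of tests declared significant at FDR level q
    using the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with (i / m) * q.
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank that meets its threshold
        reject[order[:k + 1]] = True    # reject every test up to that rank
    return reject

# Hypothetical p-values, e.g. one per time point.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(p_vals, q=0.05)
```

      In this toy example only the two smallest p-values survive, although five of the ten are below 0.05 uncorrected.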

      After FDR correction, our results have changed slightly. We have updated our Results and Discussion section.

      It should be acknowledged that the changes in these results may reflect a certain degree of error in our data (perhaps because the EEG data are too noisy or because of the average template we used, ‘fsaverage’). Therefore, we added relevant discussion in the Discussion section (lines 527-529):

      “It should be acknowledged that our EEG-based regression results are somewhat unstable, and the brain regions with significant regression are inconsistent before and after FDR correction. In future work, we should collect more precise neural data to reduce this instability.”

      I asked the authors to show more descriptive comparison between the model and the data. Their response was that this is not possible, which I find odd given that they are able to use the model to define a probability distribution on choices. All I'm asking about here is to show predictive checks which build confidence in the model fit. The additional simulations do not address this. The authors refer to figures 3 and 4, but these do not show any direct comparison between human data and the model beyond model comparison metrics.

      We deeply thank you for your comments. We now compare the participants’ behavioral data and the model’s predictions trial by trial (Figure 5). We can clearly see the participants’ behavioral strategies in different states and trials and the model’s prediction accuracy. We have added the discussion related to Figure 5 (lines 309-318):

      “Figure 5 shows the comparison between the active inference model and the behavioral data, where we can see that the model fits the participants’ behavioral strategies well. In the “Stay-Cue” choice, participants tended to ask the ranger and rarely chose not to ask. When the context was unknown, participants chose the “Safe” option or the “Risky” option very randomly, and they did not show any aversion to variability. When given “Context 1”, where the “Risky” option gave participants a high average reward, participants almost exclusively chose the “Risky” option, which provided more information in the early trials and was found to provide more rewards in the later rounds. When given “Context 2”, where the “Risky” option gave participants a low average reward, participants initially chose the “Risky” option and then tended to choose the “Safe” option. We can see that participants still occasionally chose the “Risky” option in the later trials of the experiment, which the model does not capture. This may be due to the influence of forgetting. Participants chose the “Risky” option again to establish an estimate of the reward distribution.”
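
      A trial-by-trial comparison of this kind can be sketched as follows. This is a generic predictive check on simulated data, not the analysis behind Figure 5: the function name, the array layout, and the simulated choice probabilities are assumptions.

```python
import numpy as np

def predictive_check(choices, model_probs):
    """Compare observed choices with model-predicted choice probabilities.

    choices     : (n_participants, n_trials), 1 if "Risky" was chosen, else 0
    model_probs : same shape, model probability of choosing "Risky"
    """
    choices = np.asarray(choices, dtype=float)
    model_probs = np.asarray(model_probs, dtype=float)
    observed = choices.mean(axis=0)       # per-trial proportion choosing "Risky"
    predicted = model_probs.mean(axis=0)  # per-trial mean predicted probability
    # Fraction of single choices matching the model's more likely option.
    accuracy = np.mean((model_probs > 0.5) == (choices == 1))
    return observed, predicted, accuracy

# Simulated example: preference for "Risky" declines over 40 trials.
rng = np.random.default_rng(2)
n_participants, n_trials = 30, 40
p_risky = np.tile(np.linspace(0.8, 0.2, n_trials), (n_participants, 1))
choices = rng.random((n_participants, n_trials)) < p_risky
observed, predicted, accuracy = predictive_check(choices, p_risky)
```

      Plotting `observed` against `predicted` across trials, separately for each context, gives the kind of direct comparison between human data and model that the reviewer requested.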

      Reviewer #2 (Recommendations For The Authors):

      In the supplement, there are missing references ("[?]").

      Thank you very much for pointing this out. We have now fixed this error.

      Reviewer #3 (Public Review):

      Summary:

      This paper aims to investigate how the human brain represents different forms of value and uncertainty that participate in active inference within a free-energy framework, in a two-stage decision task involving contextual information sampling, and choices between safe and risky rewards, which promotes shifting between exploration and exploitation. They examine neural correlates by recording EEG and comparing activity in the first vs second half of trials and between trials in which subjects did and did not sample contextual information, and perform a regression with free-energy-related regressors against data "mapped to source space."

      Strengths:

      This two-stage paradigm is cleverly designed to incorporate several important processes of learning, exploration/exploitation and information sampling that pertain to active inference. Although scalp/brain regions showing sensitivity to the active-inference related quantities do not necessarily suggest what role they play, they are illuminating and useful as candidate regions for further investigation. The aims are ambitious, and the methodologies impressive. The paper lays out an extensive introduction to the free energy principle and active inference to make the findings accessible to a broad readership.

      Weaknesses:

      In its revised form the paper is complete in providing the important details. Though not a serious weakness, it is important to note that the high lower-cutoff of 1 Hz in the bandpass filter, included to reduce the impact of EEG noise, would remove from the EEG any sustained, iteratively updated representation that evolves with learning across trials, or choice-related processes that unfold slowly over the course of the 2-second task windows.
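
      The effect of the lower cutoff can be illustrated with a toy signal. The sketch below uses an idealized FFT-based high-pass filter, which is an assumption made for simplicity and is not the filter used in the study. The sampling rate and component frequencies are also made up.

```python
import numpy as np

def fft_highpass(signal, fs, cutoff_hz):
    """Idealized (brick-wall) high-pass filter: zero all frequency bins
    below cutoff_hz. Real EEG pipelines use FIR/IIR filters instead."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

fs = 250.0                       # assumed sampling rate in Hz
t = np.arange(5000) / fs         # 20 seconds of data
slow = np.sin(2 * np.pi * 0.25 * t)        # slow process with a 4-second period
fast = 0.5 * np.sin(2 * np.pi * 10.0 * t)  # 10 Hz oscillation
filtered = fft_highpass(slow + fast, fs, cutoff_hz=1.0)
```

      After filtering, only the 10 Hz component remains. Any process that evolves over the 2-second task window or across trials falls below 1 Hz and is removed in the same way.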

      We are deeply grateful for your careful review of our work and your suggestions. We are very sorry that we did not modify our filter frequency (it would be a lot of work to modify it). Thank you very much for pointing this out. We noticed the shortcoming of the high lower-cutoff of 1 Hz in the bandpass filter. We will carefully consider the filter frequency when preprocessing data in future work. Thank you very much!

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a new and valuable theoretical account of spatial representational drift in the hippocampus. The evidence supporting the claims is convincing, with a clear and accessible explanation of the phenomenon. Overall, this study will likely attract researchers exploring learning and representation in both biological and artificial neural networks.

      We would like to ask the reviewers to consider elevating the assessment due to the following arguments. As noted in the original review, the study bridges two different fields (machine learning and neuroscience), and does not only touch a single subfield (representational drift in neuroscience). In the revision, we also analysed data from four different labs, strengthening the evidence and the generality of the conclusions.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors start from the premise that neural circuits exhibit "representational drift" -- i.e., slow and spontaneous changes in neural tuning despite constant network performance. While the extent to which biological systems exhibit drift is an active area of study and debate (as the authors acknowledge), there is enough interest in this topic to justify the development of theoretical models of drift.

      The contribution of this paper is to claim that drift can reflect a mixture of "directed random motion" as well as "steady state null drift." Thus far, most work within the computational neuroscience literature has focused on the latter. That is, drift is often viewed to be a harmless byproduct of continual learning under noise. In this view, drift does not affect the performance of the circuit nor does it change the nature of the network's solution or representation of the environment. The authors aim to challenge the latter viewpoint by showing that the statistics of neural representations can change (e.g. increase in sparsity) during early stages of drift. Further, they interpret this directed form of drift as "implicit regularization" on the network.

      The evidence presented in favor of these claims is concise. Nevertheless, on balance, I find their evidence persuasive on a theoretical level -- i.e., I am convinced that implicit regularization of noisy learning rules is a feature of most artificial network models. This paper does not seem to make strong claims about real biological systems. The authors do cite circumstantial experimental evidence in line with the expectations of their model (Khatib et al. 2022), but those experimental data are not carefully and quantitatively related to the authors' model.

      We thank the reviewer for pushing us to present stronger experimental evidence. We have now analysed data from four different labs. Two of those are novel analyses of existing data (Karlsson et al., Jercog et al.). All datasets show the same trend: increasing sparsity and increasing information per cell. We think that the results, presented in the new Figure 3, allow us to make a stronger claim about real biological systems.

      To establish the possibility of implicit regularization in artificial networks, the authors cite convincing work from the machine-learning community (Blanc et al. 2020, Li et al., 2021). Here the authors make an important contribution by translating these findings into more biologically plausible models and showing that their core assumptions remain plausible. The authors also develop helpful intuition in Figure 4 by showing a minimal model that captures the essence of their result.

      We are glad that these translation efforts are appreciated.

      In Figure 2, the authors show a convincing example of the gradual sparsification of tuning curves during the early stages of drift in a model of 1D navigation. However, the evidence presented in Figure 3 could be improved. In particular, 3A shows a histogram displaying the fraction of active units over 1117 simulations. Although there is a spike near zero, a sizeable portion of simulations have greater than 60% active units at the end of the training, and critically the authors do not characterize the time course of the active fraction for every network, so it is difficult to evaluate their claim that "all [networks] demonstrated... [a] phase of directed random motion with the low-loss space." It would be useful to revise the manuscript to unpack these results more carefully. For example, a histogram of log(tau) computed in panel B on a subset of simulations may be more informative than the current histogram in panel A.

      The previous figure 3A was indeed confusing. In particular, it lumped together many simulations without proper curation. We redid this figure (now Figure 4), and added supplementary figures (Figures S1, S2) to better explain our results. It is now clear that the simulations with a large number of active units were either due to non-convergence, slow timescale of sparsification or simulations featuring label noise in which the fraction of active units is less affected. Regarding the log(tau) calculation, while it could indeed be an informative plot, it could not be calculated in a simple manner for all simulations. This is because learning curves are not always exponential, but sometimes feature initial plateaus (see also Saxe et al 2013, Schuessler et al 2020). We added a more detailed explanation of this limitation in the methods section, and we believe the current figure exemplifies the effect in a satisfactory manner.

      Reviewer #2 (Public Review):

      Summary:

      In the manuscript "Representational drift as a result of implicit regularization" the authors study the phenomenon of representational drift (RD) in the context of an artificial network that is trained in a predictive coding framework. When trained on a task for spatial navigation on a linear track, they found that a stochastic gradient descent algorithm led to a fast initial convergence to spatially tuned units, but then to a second very slow, yet directed drift which sparsified the representation while increasing the spatial information. They finally show that this separation of timescales is a robust phenomenon and occurs for a number of distinct learning rules.

      Strengths:

      This is a very clearly written and insightful paper, and I think people in the community will benefit from understanding how RD can emerge in such artificial networks. The mechanism underlying RD in these models is clearly laid out and the explanation given is convincing.

      We thank the reviewer for the support.

      Weaknesses:

      It is unclear how this mechanism may account for the learning of multiple environments.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      The process of RD through this mechanism also appears highly non-stationary, in contrast to what is seen in familiar environments in the hippocampus, for example.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid phase, consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Public Review):

      Summary:

      Single-unit neural activity tuned to environmental or behavioral variables gradually changes over time. This phenomenon, called representational drift, occurs even when all external variables remain constant, and challenges the idea that stable neural activity supports the performance of well-learned behaviors. While a number of studies have described representational drift across multiple brain regions, our understanding of the underlying mechanism driving drift is limited. Ratzon et al. propose that implicit regularization - which occurs when machine learning networks continue to reconfigure after reaching an optimal solution - could provide insights into why and how drift occurs in neurons. To test this theory, Ratzon et al. trained a feedforward network to perform the oft-utilized linear track behavioral paradigm and compared the changes in hidden layer units to those observed in hippocampal place cells recorded in awake, behaving animals.

      Ratzon et al. clearly demonstrate that hidden layer units in their model undergo consistent changes even after the task is well-learned, mirroring representational drift observed in real hippocampal neurons. They show that the drift occurs across three separate measures: the active proportion of units (referred to as sparsification), spatial information of units, and correlation of spatial activity. They continue to address the conditions and parameters under which drift occurs in their model to assess the generalizability of their findings.
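
      The three measures named above can be sketched for tuning curves on a linear track as follows. The definitions here are common choices and are assumptions, not necessarily those used in the manuscript: a unit counts as active if its tuning curve exceeds a threshold anywhere, spatial information follows the standard Skaggs formula (bits per spike), and drift in tuning is summarized by the Pearson correlation between a unit's tuning curves at two time points.

```python
import numpy as np

def active_fraction(rate_maps, threshold=0.0):
    """Fraction of units whose tuning curve exceeds `threshold` anywhere.
    rate_maps has shape (n_units, n_positions)."""
    return float(np.mean(rate_maps.max(axis=1) > threshold))

def spatial_information(rate_map, occupancy):
    """Skaggs spatial information of one unit, in bits per spike."""
    p = occupancy / occupancy.sum()
    mean_rate = np.sum(p * rate_map)
    if mean_rate <= 0:
        return 0.0  # a silent unit carries no spatial information
    ratio = rate_map / mean_rate
    valid = ratio > 0  # bins with zero rate contribute nothing
    return float(np.sum(p[valid] * ratio[valid] * np.log2(ratio[valid])))

def map_correlation(map_a, map_b):
    """Pearson correlation between two tuning curves of the same unit."""
    return float(np.corrcoef(map_a, map_b)[0, 1])

# Toy tuning curves over 8 positions with uniform occupancy (made-up values).
occupancy = np.ones(8)
rate_maps = np.array([
    [0, 0, 0, 4, 0, 0, 0, 0],  # sharply tuned unit
    [1, 1, 1, 1, 1, 1, 1, 1],  # untuned unit
    [0, 0, 0, 0, 0, 0, 0, 0],  # silent unit
], dtype=float)

frac = active_fraction(rate_maps)
info_tuned = spatial_information(rate_maps[0], occupancy)
info_flat = spatial_information(rate_maps[1], occupancy)
shifted = np.roll(rate_maps[0], 1)  # same field moved by one position
corr_same = map_correlation(rate_maps[0], rate_maps[0])
corr_shift = map_correlation(rate_maps[0], shifted)
```

      In this toy case the sharply tuned unit carries 3 bits per spike and the untuned unit carries none, which illustrates how sparser, more localized tuning raises information per cell.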

      However, the generalizability results are presented primarily in written form: additional figures are warranted to aid in reproducibility.

      We added figures, and a Github with all the code to allow full reproducibility.

      Last, they investigate the mechanism through which sparsification occurs, showing that the flatness of the manifold near the solution can influence how the network reconfigures. The authors suggest that their findings indicate a three-stage learning process: 1) fast initial learning followed by 2) directed motion along a manifold which transitions to 3) undirected motion along a manifold.

      Overall, the authors' results support the main conclusion that implicit regularization in machine learning networks mirrors representational drift observed in hippocampal place cells.

      We thank the reviewer for this summary.

      However, additional figures/analyses are needed to clearly demonstrate how different parameters used in their model qualitatively and quantitatively influence drift.

      We now provide additional figures regarding parameters (Figures S1, S2).

      Finally, the authors need to clearly identify how their data supports the three-stage learning model they suggest.

      Their findings promise to open new fields of inquiry into the connection between machine learning and representational drift and generate testable predictions for neural data.

      Strengths:

      (1) Ratzon et al. make an insightful connection between well-known phenomena in two separate fields: implicit regularization in machine learning and representational drift in the brain. They demonstrate that changes in a recurrent neural network mirror those observed in the brain, which opens a number of interesting questions for future investigation.

      (2) The authors do an admirable job of writing to a large audience and make efforts to provide examples to make machine learning ideas accessible to a neuroscience audience and vice versa. This is no small feat and aids in broadening the impact of their work.

      (3) This paper promises to generate testable hypotheses to examine in real neural data, e.g., that drift rate should plateau over long timescales (now testable with the ability to track single-unit neural activity across long time scales with calcium imaging and flexible silicon probes). Additionally, it provides another set of tools for the neuroscience community at large to use when analyzing the increasingly high-dimensional data sets collected today.

      We thank the reviewer for these comments. Regarding the hypotheses, these are partially confirmed in the new analyses we provide of data from multiple labs (new Figure 3 and Table 3) - indicating that prolonged exposure to the environment leads to more stationarity.

      Weaknesses:

      (1) Neural representational drift and directed/undirected random walks along a manifold in ML are well described. However, outside of the first section of the main text, the analysis focuses primarily on the connection between manifold exploration and sparsification without addressing the other two drift metrics: spatial information and place field correlations. It is therefore unclear if the results from Figures 3 and 4 are specific to sparseness or extend to the other two metrics. For example, are these other metrics of drift also insensitive to most of the Feedforward Network parameters as shown in Figure 3 and the related text? These concerns could be addressed with panels analogous to Figures 3a-c and 4b for the other metrics and will increase the reproducibility of this work.

      We note that the results from figures 3 and 4 (original manuscript) are based on abstract tasks, while in figure 2 there is a contextual notion of spatial position. Spatial position metrics are not applicable to the abstract tasks as they are simple random mapping of inputs, and there isn’t necessarily an underlying latent variable such as position. This transition between task types is better explained in the text now. In essence the spatial information and place field correlation changes are simply signatures of the movements in parameter space. In the abstract tasks their change becomes trivial, as the spatial information becomes strongly correlated with sparsity and place fields are simply the activity vectors of units. These are guaranteed to change as long as there are changes in the activity statistics. We present here the calculation of these metrics averaged over simulations for completeness.

      Author response image 1.

      (A) PV correlation between training time points averaged over 362 simulations. (B) Mean SI of units normalized to first time step, averaged over 362 simulations. Red line shows the average time point of loss convergence, the shaded area represents one standard deviation.
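      The two metrics named in this caption can be sketched as follows. This is an illustration only, not the authors' code: it assumes that PV correlation is the Pearson correlation between population vectors at matching input bins, averaged over bins, and that SI follows the standard Skaggs formula in bits per spike. The function names and the uniform-occupancy default are hypothetical choices made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def pv_correlation(maps_a, maps_b):
    """Mean over bins of the correlation between population vectors.

    maps_a[u][p] is the activity of unit u at bin p for time point A,
    maps_b likewise for time point B (assumed layout).
    """
    n_bins = len(maps_a[0])
    corrs = []
    for p in range(n_bins):
        va = [unit[p] for unit in maps_a]
        vb = [unit[p] for unit in maps_b]
        corrs.append(pearson(va, vb))
    return sum(corrs) / n_bins


def spatial_information(rate_map, occupancy=None):
    """Skaggs-style spatial information (bits per spike) of one unit."""
    n = len(rate_map)
    if occupancy is None:
        occupancy = [1.0 / n] * n  # assumption: uniform occupancy
    mean_rate = sum(p * r for p, r in zip(occupancy, rate_map))
    if mean_rate == 0:
        return 0.0
    si = 0.0
    for p, r in zip(occupancy, rate_map):
        if r > 0:  # bins with zero rate contribute nothing
            si += p * (r / mean_rate) * math.log2(r / mean_rate)
    return si
```

      Under these assumptions a unit that is active in a single bin out of four carries 2 bits, while a unit with a flat rate map carries none, which is why SI tends to rise as activity becomes sparser.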

      (2) Many caveats/exceptions to the generality of findings are mentioned only in the main text without any supporting figures, e.g., "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify" (lines 116-117). Supporting figures are warranted to illustrate which findings are "qualitatively different" from the main model, which are not different from the main model, and which of the many parameters mentioned are important for reproducing the findings.

      We have now added figures (S1, S2) that show exactly this. We also added a GitHub repository to allow full reproduction.

      (3) Key details of the model used by the authors are not listed in the methods. While they are mentioned in reference 30 (Recanatesi et al., 2021), they need to be explicitly defined in the methods section to ensure future reproducibility.

      The details of the simulation are described in the Methods section. We also added a GitHub repository to allow full reproducibility.

      (4) How different states of drift correspond to the three learning stages outlined by the authors is unclear. Specifically, it is not clear where the second stage ends, and the third stage begins, either in real neural data or in the figures. This is compounded by the fact that the third stage - of undirected, random manifold exploration - is only discussed in relation to the introductory Figure 1 and is never connected to the neural network data or actual brain data presented by the authors. Are both stages meant to represent drift? Or is only the second stage meant to mirror drift, while undirected random motion along a manifold is a prediction that could be tested in real neural data? Identifying where each stage occurs in Figures 2C and E, for example, would clearly illustrate which attributes of drift in hidden layer neurons and real hippocampal neurons correspond to each stage.

      Thanks for this comment, which urged us to better explain these concepts.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Recommendations for the authors:

      The reviewers have raised several concerns. They concur that the authors should address the specific points below to enhance the manuscript.

      (1) The three different phases of learning should be clearly delineated, along with how they are determined. It remains unclear in which exact phase the drift is observed.

      This is now clearly explained in the new Table 1 and Figure 4C. Note that the different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      (2) The term "sparsification" of unit activity is not fully clear. Its meaning should be more explicitly explained, especially since, in the simulations, a significant number of units appear to remain active (Fig. 3A).

      We now define precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.
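      To make the distinction concrete, here is a minimal sketch of how two such measures could be computed from a matrix of hidden-unit activations. The definitions used are assumptions for illustration and may differ from those in the authors' Methods: "fraction active units" is taken as the share of units that respond to at least one input, and "active fraction" as the share of all (input, unit) entries that are non-zero.

```python
def fraction_active_units(acts, tol=0.0):
    """Share of units with activity above tol for at least one input.

    acts[i][u] is the activity of unit u for input i (assumed layout).
    """
    n_units = len(acts[0])
    active = sum(
        1 for u in range(n_units) if any(row[u] > tol for row in acts)
    )
    return active / n_units


def active_fraction(acts, tol=0.0):
    """Share of all (input, unit) entries with activity above tol."""
    total = sum(len(row) for row in acts)
    nonzero = sum(1 for row in acts for a in row if a > tol)
    return nonzero / total


# Hypothetical activations: 3 inputs x 4 units.
example = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
```

      In this made-up example half of the units are active at all, yet only a third of the entries are non-zero, so the second measure can fall while the first stays constant.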

      (3) While the study primarily focuses on one aspect of representational drift (the proportion of active units), it should also explore other features traditionally associated with representational drift, such as spatial information and the correlation between place fields.

      This absence of features is related to the abstract nature of some of the tasks simulated in our paper. In our original submission the transition between a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We now clarified the motivation for this transition.

      Both the initial simulation and the new experimental data analysis include spatial information (Figures 2,3). The following simulations (Figure 4) with many parameter choices use more abstract tasks, for which the notion of correlation between place cells and spatial information loses its meaning as there is no spatial ordering of the inputs, and every input is encountered only once. Spatial information becomes strongly correlated with the inverse of the active fraction metric. The correlation between place cells is also directly linked to increase in sparseness for these tasks.

      (4) There should be a clearer illustration of how labeling noise influences learning dynamics and sparsification.

      This was indeed confusing in the original submission. We removed the simulations with label noise from Figure 4, and added a supplementary figure (S2) illustrating the different effects of label noise.

      (5) The representational drift observed in this study's simulations appears to be nonstationary, which differs from in vivo reports. The reasons for this discrepancy should be clarified.

      We added experimental results from three additional labs demonstrating a change in activity statistics (i.e. increase in spatial information and increase in sparseness) over a long period of time. We suggest that such a change long after the environment is already familiar is an indication of the second phase, and stress that this change seems to saturate at some point, and that most drift papers start collecting data after this saturation, hence this effect was missed in previous in vivo reports. Furthermore, these effects are becoming more abundant with the advent of new calcium imaging methods, as the older electrophysiological recording methods did not usually allow recording of large numbers of cells for long periods of time. The new Table 3 surveys several experimental papers, emphasizing the degree of familiarity with the environment.

      (6) A distinctive feature of the hippocampus is its ability to learn different spatial representations for various environments. The study does not test representational drift in this context, a topic of significant interest to the community. Whether the authors choose to delve into this is up to them, but it should at least be discussed more comprehensively, as it's only briefly touched upon in the current manuscript version.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (7) The methods section should offer more details about the neural nets employed in the study. The manuscript should be explicit about the terms "hidden layer", "units", and "neurons", ensuring they are defined clearly and not used interchangeably.

      We changed the usage of these terms to be more coherent and made our code publicly available. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      In addition, each reviewer has raised both major and minor concerns. These are listed below and should be addressed where possible.

      Reviewer #1 (Recommendations For The Authors):

      I recommend that the authors edit the text to soften their claims. For example:

      In the abstract "To uncover the underlying mechanism, we..." could be changed to "To investigate, we..."

      Agree. Done

      On line 21, "Specifically, recent studies showed that..." could be changed to "Specifically, recent studies suggest that..."

      Agree. Done

      On line 100, "All cases" should probably be softened to "Most cases" or more details should be added to Figure 3 to support the claim that every simulation truly had a phase of directed random motion.

      The text was changed in accordance with the reviewer’s suggestion. In addition, the figure was changed and only includes simulations in which we expected unit sparsity to arise (without label noise). We also added explanations and supplementary figures for label noise.

      Unless I missed something obvious, there is no new experimental data analysis reported in the paper. Thus, line 159 of the discussion, "a phenomenon we also observed in experimental data" should be changed to "a phenomenon that was recently reported in experimental data."

      We thank the reviewer for drawing our attention to this. We have now analyzed data from three other labs, two of which are novel analyses on existing data. All four datasets show the same trends of sparseness with increasing spatial information. The new Figure 3 and text now describe this.

      On line 179 of the Discussion, "a family of network configurations that have identical performance..." could be softened to "nearly identical performance." It would be possible for networks to have minuscule differences in performance that are not detected due to stochastic batch effects or limits on machine precision.

      The text was changed in accordance with the reviewer’s suggestion.

      Other minor comments:

      Citation 44 is missing the conference venue, please check all citations are formatted properly.

      Corrected.

      In the discussion on line 184, the connection to remapping was confusing to me, particularly because the cited reference (Sanders et al. 2020) is more of a conceptual model than an artificial network model that could be adapted to the setting of noisy learning considered in this paper. How would an RNN model of remapping (e.g. Low et al. 2023; Remapping in a recurrent neural network model of navigation and context inference) be expected to behave during the sparsifying portion of drift?

      We now clarified this section. The conceptual model of Sanders et al includes a specific prediction (Figure 7 there) which is very similar to ours - a systematic change in robustness depending on duration of training. Regarding the Low et al model, using such mechanistic models is an exciting avenue for future research.

      Reviewer #2 (Recommendations For The Authors):

      I only have two major questions.

      (1) Learning multiple representations: Memory systems in the brain typically must store many distinct memories. Certainly, the hippocampus, where RD is prominent, is involved in the ongoing storage of episodic memories. But even in the idealized case of just two spatial memories, for example, two distinct linear tracks, how would this learning process look? Would there be any interference between the two learning processes or would they be largely independent? Is the separation of time scales robust to the number of representations stored? I understand that to answer this question fully probably requires a research effort that goes well beyond the current study, but perhaps an example could be shown with two environments. At the very least the authors could express their thoughts on the matter.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (2) Directed drift versus stationarity: I could not help but notice that the RD illustrated in Fig.2D is not stationary in nature, i.e. the upper right and lower left panels are quite different. This appears to contrast with findings in the hippocampus, for example, Fig.3e-g in (Ziv et al, 2013). Perhaps it is obvious that a directed process will not be stationary, but the authors note that there is a third phase of steady-state null drift. Is the RD seen there stationary? Basically, I wonder if the process the authors are studying is relevant only as a novel environment becomes familiar, or if it is also applicable to RD in an already familiar environment. Please discuss the issue of stationarity in this context.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid, phase consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Recommendations For The Authors):

      Most of my general recommendations are outlined in the public review. A large portion of my comments regards increasing clarity and explicitly defining many of the terms used which may require generating more figures (to better illustrate the generality of findings) or modifying existing figures (e.g., to show how/where the three stages of learning map onto the authors' data).

      Sparsification is not clearly defined in the main text. As I read it, sparsification is meant to refer to the activity of neurons, but this needs to be clearly defined. For example, lines 262-263 in the methods define "sparseness" by the number of active units, but lines 116-117 state: "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify." If the fraction of active units (defined as "sparseness") did not change, what does it mean that the activity of the units "sparsified"? If the authors mean that the spatial activity patterns of hidden units became more sharply tuned, this should be clearly stated.

      We have now defined precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.

      Likewise, it is unclear which of the features the authors outlined - spatial information, active proportion of units, and spatial correlation - are meant to represent drift. The authors should clearly delineate which of these three metrics they mean to delineate drift in the main text rather than leave it to the reader to infer. While all three are mentioned early on in the text (Figure 2), the authors focus more on sparseness in the last half of the text, making it unclear if it is just sparseness that the authors mean to represent drift or the other metrics as well.

      The main focus of our paper is on the non-stationarity of drift. Namely that features (such as these three) systematically change in a directed manner as part of the drift process. This is in The new analyses of experimental data show sparseness and spatial information.

      The focus on sparseness in the second half of the paper is because we move to more abstract tasks. These are also easy to study in the more abstract tasks in the second part of the paper. In our original submission the transition from a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We have now clarified the motivation for this transition.

      It is not clear if a change in the number of active units alone constitutes "drift", especially since Geva et al. (2023) recently showed that both changes in firing rate AND place field location drive drift, and that the passage of time drives changes in activity rate (or # cells active).

      Our work did not deal with purely time-dependent drift, but rather focused on experience-dependence. Furthermore, Geva et al study the stationary phase of drift, where we do not expect a systematic change in the total number of cells active. They report changes in the average firing rate of active cells in this phase, as a function of time - which does not contradict our findings.

      "hidden layer", "units", and "neurons" seem to be used interchangeably in the text (e.g., line 81-85). However, this is confusing in several places, in particular in lines 83-85 where "neurons" is used twice. The first usage appears to refer to the rate maps of the hidden layer units simulated by the authors, while the second "neurons" appears to refer to real data from Ziv 2013 (ref 5). The authors should make it explicit whether they are referring to hidden layer units or actual neurons to avoid reader confusion.

      We changed the usage of these terms to be more coherent. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      The authors should clearly illustrate which parts of their findings support their three-phase learning theory. For example, does 2E illustrate these phases, with the first tenth of training time points illustrating the early phase, time 0.1-0.4 illustrating the intermediate phase, and 0.4-1 illustrating the last phase? Additionally, they should clarify whether the second and third stages are meant to represent drift, or is it only the second stage of directed manifold exploration that is considered to represent drift? This is unclear from the main text.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Line 45 - It appears that the acronym ML is not defined above here anywhere.

      Added.

      Line 71: the ReLU function should be defined in the text, e.g., sigma(x) = x if x > 0 else 0.

      Added.
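      For reference, the reviewer's definition of the ReLU can be written directly in code. This is a sketch of the standard function only; the helper `is_active` and its zero threshold are hypothetical additions that illustrate why a ReLU unit can be counted as exactly inactive.

```python
def relu(x):
    """Rectified linear unit: sigma(x) = x if x > 0 else 0."""
    return x if x > 0 else 0.0


def is_active(pre_activations):
    """True if a unit produces non-zero output for at least one input.

    pre_activations holds the unit's summed input for each stimulus
    (hypothetical helper, not taken from the manuscript).
    """
    return any(relu(x) > 0 for x in pre_activations)
```

      Because the output is exactly zero for every non-positive input, a unit whose pre-activation is never positive is silent for all inputs.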

      106-107: Figures (or supplemental figures) to demonstrate how most parameters do not influence sparsification dynamics are warranted. As written, it is unclear what "most parameters" mean - all but noise scale. What about the learning rule? Are there any interactions between parameters?

      We now removed the label noise from Figure 4, and added two supplementary figures to clearly explain the effect of parameters. Figure 4 itself was also redone to clarify this issue.

      2F middle: should "change" be omitted for SI?

      The panel was replaced by a new one in Figure 3.

      116-119: A figure showing how results differ for label noise is warranted.

      This is now done in Figure S1, S2.

      124: typo, The -> the

      Corrected.

      127-129: This conclusion statement is the first place in the text where the three stages are explicitly outlined. There does not appear to be any support or further explanation of these stages in the text above.

      We now explain this earlier at the end of the Introduction section, along with the new Table 1 and marking on Figure 4C.

      132-133 seems to be more of a statement and less of a prediction or conclusion - do the authors mean "the flatness of the loss landscape in the vicinity of the solution predicts the rate of sparsification?"

      We thank the reviewer for this observation. The sentence was rephrased:

      Old: As illustrated in Fig. 1, different solutions in the zero-loss manifold might vary in some of their properties. The specific property suggested from theory is the flatness of the loss landscape in the vicinity of the solution.

      New: As illustrated in Fig. 1, solutions in the zero-loss manifold have identical loss, but might vary in some of their properties. The authors of [26] suggest that noisy learning will slowly increase the flatness of the loss landscape in the vicinity of the solution.

      135: typo, it's -> its

      Corrected.

      Line 135-136 "Crucially, the loss on the entire manifold is exactly zero..." This appears to contradict the Figure 4A legend - the loss appears to be very high near the top and bottom edges of the manifold in 4A. Do the authors mean that the loss along the horizontal axis of the manifold is zero?

      The reviewer is correct. The manifold mentioned in the sentence is indeed the horizontal axis. We changed the text and the figure to make it clearer.

      Equation 6: This does not appear to agree with equation 2 - should there be an E_t term for an expectation function?

      Corrected.

      Line 262-263: "Sparseness means that a unit has become inactive for all inputs." This should also be stated explicitly as the definition of sparseness/sparsification in the main text.

      We now define precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Weaknesses:

      The comparison of affinity predictions derived from AlphaFold2 and H3-opt models, based on molecular dynamics simulations, should have been discussed in depth. In some cases, there are huge differences between the estimations from H3-opt models and those from experimental structures. It seems that the authors obtained average differences of the real delta, instead of average differences of the absolute value of the delta. This can be misleading, because high negative differences might be compensated by high positive differences when computing the mean value. Moreover, it would have been good for the authors to disclose the trajectories from the MD simulations.

      Thanks for your careful checks. We fully understand your concerns about the large differences when calculating affinity. To understand the source of these huge differences, we carefully analyzed the trajectories of the input structures during MD simulations. We found that the antigen-antibody complex shifted as it transitioned from NVT to NPT during pre-equilibration, even when restraints were used to hold the protein structure. To address this issue, we consulted the solution provided on Amber's mailing list (http://archive.ambermd.org/202102/0298.html) and modified the ATOMS_MOLECULE item of the topology file of the simulation system to merge the antigen-antibody complexes into one molecule. As a result, the number of SOLVENT_POINTERS was also adjusted. Finally, we re-ran all MD simulations and calculated the affinities of all complexes.

      We have changed “Afterwards, a 25000-step NVT simulation with a time step of 1 fs was performed to gradually heat the system from 0 K to 100 K. A 250000-step NPT simulation with a time step of 2 fs was carried out to further heat the system from 100 K to 298 K.” to “Afterwards, a 400-ps NVT simulation with a time step of 2 fs was performed to gradually heat the system from 0 K to 298 K (0–100 K: 100 ps; 100–298 K: 200 ps; hold 298 K: 100 ps), and a 100-ps NPT simulation with a time step of 2 fs was performed to equilibrate the density of the system. During heating and density equilibration, we constrained the antigen-antibody structure with a restraint value of 10 kcal·mol⁻¹·Å⁻².” and added the following sentence to the Method section of our revised manuscript: “The first 50 ns restrains the non-hydrogen atoms of the antigen-antibody complex, and the last 50 ns restrains the non-hydrogen atoms of the antigen, with a constraint value of 10 kcal·mol⁻¹·Å⁻²”
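      The arithmetic behind the revised heating protocol can be checked with a short sketch. It is not an Amber input file and not the authors' script: the linear ramps between the stated temperatures are an assumption, and the function names are hypothetical.

```python
def n_steps(duration_ps, dt_fs):
    """Number of integration steps for a run of duration_ps at time step dt_fs."""
    return round(duration_ps * 1000 / dt_fs)


def target_temperature(t_ps):
    """Target temperature (K) at time t_ps of the 400-ps NVT heating stage.

    Segments from the quoted text: 0-100 K over 100 ps, 100-298 K over
    200 ps, then hold at 298 K for 100 ps. Linear ramps are assumed.
    """
    if t_ps < 0 or t_ps > 400:
        raise ValueError("time outside the 400-ps heating stage")
    if t_ps <= 100:
        return 100.0 * t_ps / 100.0
    if t_ps <= 300:
        return 100.0 + (298.0 - 100.0) * (t_ps - 100.0) / 200.0
    return 298.0


nvt_steps = n_steps(400, 2)  # revised NVT heating stage
npt_steps = n_steps(100, 2)  # revised NPT density equilibration
```

      With a 2-fs time step the revised stages correspond to 200,000 and 50,000 steps. For comparison, the original protocol (25,000 steps at 1 fs, then 250,000 steps at 2 fs) covered 25 ps and 500 ps.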

      In addition, we have corrected the calculation of mean deltas using absolute values and have demonstrated that the average affinities of structures predicted by H3-OPT were closer to those of experimentally determined structures than values obtained through AF2. These results have been updated in the revised manuscript. However, significant differences still exist between the estimations of H3-OPT models and those derived from experimental structures in a few cases. We found that antibodies moved away from antigens both in AF2 and H3-OPT predicted complexes during simulations, resulting in RMSDbackbone (RMSD of antibody backbone) exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently resulted in notable differences in affinity calculations. Thus, we removed three samples (PDBID: 4qhu, 6flc, 6plk) from the benchmark because these predicted structures moved away from the antigen structure during MD simulations, resulting in huge energy differences from the native structures.
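      The reviewer's point about signed versus absolute deltas can be illustrated with a small sketch. The affinity values below are invented for the example and are not taken from the manuscript; the function names are hypothetical.

```python
def mean_signed_delta(predicted, reference):
    """Mean of (predicted - reference); errors of opposite sign cancel."""
    deltas = [p - r for p, r in zip(predicted, reference)]
    return sum(deltas) / len(deltas)


def mean_absolute_delta(predicted, reference):
    """Mean of |predicted - reference|; errors cannot cancel."""
    deltas = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(deltas) / len(deltas)


# Made-up binding energies (kcal/mol) for two complexes.
predicted = [-40.0, -10.0]
reference = [-25.0, -25.0]
```

      Here one prediction is 15 kcal/mol too low and the other 15 kcal/mol too high, so the signed mean is zero even though each prediction is off by 15 kcal/mol, which the absolute mean reports.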

      Author response table 1.

      We also appreciate your reminder, and we have calculated all RMSDbackbone during production runs (SI Fig. 5).

      Author response image 1.

      Reviewer #3 (Public Review):

      Weaknesses:

      The proposed method lacks a confidence score or a warning to help guide users in moderate to challenging cases.

      We are sorry for our mistake. We have updated our GitHub code and added the following sentences to the Method section to clarify how we train this confidence score module: “Confidence score prediction module

      We apply an MSE loss for confidence prediction; the label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of the Adam optimizer are set to 1 × 10⁻⁵ and 1 × 10⁻⁴, respectively.”
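      A minimal sketch of the training target described in the quoted passage is given below. It only illustrates a per-residue Cα deviation and an MSE loss on already-aligned coordinates; it is not the H3-OPT implementation. How the deviation is mapped to the 0 to 100 score is not stated in the text and is left out, and the function names and coordinates are hypothetical.

```python
import math


def ca_deviation(pred_ca, ref_ca):
    """Per-residue distance (Å) between predicted and reference Cα atoms.

    Both inputs are lists of (x, y, z) tuples that are assumed to be
    structurally aligned already.
    """
    return [math.dist(p, r) for p, r in zip(pred_ca, ref_ca)]


def mse_loss(predicted_error, label_error):
    """Mean squared error between predicted and observed per-residue errors."""
    n = len(label_error)
    return sum((p - t) ** 2 for p, t in zip(predicted_error, label_error)) / n


# Hypothetical two-residue loop.
labels = ca_deviation([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                      [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)])
loss = mse_loss([0.0, 4.0], labels)
```

      In an actual training loop this loss would be minimized with an optimizer such as Adam, using the learning rate and weight decay quoted in the passage.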

      Reviewer #2 (Recommendations For The Authors):

      I would strongly suggest that the authors deepen their discussion on the affinity prediction based on Molecular Dynamics. In particular, why do the authors think that some structures exhibit huge differences between the predictions from the experimental structure and those predicted by H3-opt? Also, please compute the mean deltas using the absolute value and not the real value; the latter can be extremely misleading and hide very high differences in different directions that compensate each other when averaging.

      I would also advice to include graphical results of the MD trajectories, at least as Supp. Material.

      We thank you for your feedback and fully understand your concerns. We found the source of these huge differences and solved the problem by changing the MD simulation method. We then recalculated all affinities and corrected the calculation of the mean deltas to use absolute values. The RMSDbackbone values were also measured during production runs to support accurate affinity predictions (SI Fig. 5). There are still large differences between the estimations of H3-OPT models and those from experimental structures in some cases. We found that antibodies moved away from antigens in both AF2 and H3-OPT predicted complexes during simulations, resulting in RMSDbackbone exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently to notable differences in affinity calculations. Thus, we removed three samples (PDB IDs: 4qhu, 6flc, 6plk) from the benchmark.
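
      The reviewer's point about signed versus absolute deltas can be illustrated with a minimal sketch. The affinity values below are hypothetical and are not taken from the manuscript; the function names are likewise only illustrative.

```python
from statistics import mean

def mean_signed_delta(predicted, reference):
    """Average of (predicted - reference); opposite-sign errors cancel."""
    return mean(p - r for p, r in zip(predicted, reference))

def mean_absolute_delta(predicted, reference):
    """Average of |predicted - reference|; every error counts."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))

# Hypothetical binding affinities (kcal/mol), for illustration only.
reference = [-10.0, -12.0, -9.0, -11.0]
predicted = [-4.0, -18.0, -14.0, -6.0]

signed = mean_signed_delta(predicted, reference)      # errors of +6, -6, -5, +5
absolute = mean_absolute_delta(predicted, reference)  # magnitudes 6, 6, 5, 5

print(f"mean signed delta:   {signed:.2f}")
print(f"mean absolute delta: {absolute:.2f}")
```

      In this toy case the signed mean is zero even though every prediction is off by at least 5 kcal/mol, which is the compensation effect the reviewer warns about.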

      Thanks again for your professional advice.

      Reviewer #3 (Recommendations For The Authors):

      (1) I am pleased with most of the answers provided by the authors to the first review. In my humble opinion, the new manuscript has greatly improved. However, I think some answers to the reviewers are worth including in the main text or supporting information for the benefit of general readers. In particular, the requested statistics (i.e. p-values for Cα-RMSD values across the modeling approaches, p-values and error bars in Fig 5a and 5b, etc.) should be introduced in the manuscript.

      We sincerely appreciate your advice. We have added the statistical values to Fig. 4 and Fig. 5 of our manuscript.

      Author response image 2.

      Author response image 3.

      (2) Similarly, the authors state in the answers that "we have trained a separate module to predict the confidence score of the optimized CDR-H3 loops". That sounds like a great improvement to H3-OPT! However, I couldn't find any reference to that new module in the revised version of the manuscript, nor in the available GitHub code. That is the reason I maintain the weakness "The proposed method lacks a confidence score".

      We are very sorry for our careless mistake, and we thank you for the reminder. We have updated our GitHub code and added the following sentences to the Methods section to clarify how we train this confidence score module:

      “Confidence score prediction module

      We apply an MSE loss for confidence prediction; the label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of the Adam optimizer are set to 1 × 10^-5 and 1 × 10^-4, respectively.”
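
      As a rough illustration of the training target described in the quoted passage, the sketch below computes a per-residue Cα deviation as the label and an MSE loss against a set of predicted errors. It assumes the two structures are already superposed, uses hypothetical coordinates and model outputs, and deliberately leaves out how the error is mapped onto the 0 to 100 confidence scale, which the response does not specify. It is not the H3-OPT implementation.

```python
import math

def per_residue_ca_deviation(pred_ca, ref_ca):
    """Distance (in Å) between matching Cα atoms.

    Assumes both structures have already been aligned (superposed).
    """
    return [math.dist(p, r) for p, r in zip(pred_ca, ref_ca)]

def mse_loss(predicted_errors, label_errors):
    """Mean squared error between predicted and true per-residue errors."""
    n = len(label_errors)
    return sum((p - l) ** 2 for p, l in zip(predicted_errors, label_errors)) / n

# Hypothetical Cα coordinates for a three-residue loop (illustration only).
pred_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
ref_ca = [(0.0, 0.0, 0.0), (3.8, 3.0, 4.0), (7.6, 0.0, 1.0)]

labels = per_residue_ca_deviation(pred_ca, ref_ca)

# Hypothetical outputs of a confidence module, expressed as predicted errors in Å.
model_output = [0.5, 4.0, 1.0]
loss = mse_loss(model_output, labels)

print("label errors (Å):", labels)
print("MSE loss:", loss)
```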

      (3) I acknowledge all the efforts made to solve new mutant/designed nanobody structures. Judging from the solved structures, mutants Y95F and Q118N seem critical to either crystallographic or dimerization contacts stabilizing the CDR-H3 loop, hence preventing the formation of crystals. Clearly, solving a molecular structure is a challenge, so including the following comment in the manuscript is relevant for readers to correctly assess the magnitude of the validation: "The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, compared with the best template. The CDR-H3 lengths of these nanobodies are both 17. According to our classification strategy, these nanobodies belong to Sub1. The confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM."

      We appreciate your kind recommendations and have revised “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT, only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively.” to “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT (LengthCDR-H3 = 17), only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively (the confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM).” In addition, we have added the following sentence to the legend of Figure 4 to ensure that readers can appropriately evaluate the significance and reliability of our validations: “The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, compared with the best template.”

      (4) As pointed out in the first review, I think the work https://doi.org/10.1021/acs.jctc.1c00341 is worth acknowledging in section "2.2 Molecular dynamics (MD) simulations could not provide accurate CDR-H3 loop conformations" of the supplementary material, as it constitutes a clear reference (and probably one of the few) for the MD simulations that the authors aim to perform. Similarly, the work https://doi.org/10.3390/molecules28103991 introduces an earlier benchmark of AI algorithms for predicting antibody and nanobody structures that readers may find interesting to contrast with the present work. Indeed, this latter reference is used by the authors to answer a reviewer comment.

      Thanks a lot for your valuable comments. We have added these references at the appropriate places in our manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      Summary:

      Huang and colleagues present a method for approximation of linkage disequilibrium (LD) matrices. The problem of computing LD matrices is the problem of computing a correlation matrix. In the cases considered by the authors, the number of rows (n), corresponding to individuals, is small compared to the number of columns (m), corresponding to the number of variants. Computing the correlation matrix has cubic time complexity, O(nm^2), which is prohibitive for large samples. The authors approach this using three main strategies:

      1. they compute a coarsened approximation of the LD matrix by dividing the genome into variant-wise blocks over which statistics are effectively averaged;

      2. they use a trick to get the coarsened LD matrix from a coarsened genomic relatedness matrix (GRM), which, with O(n^2 m) time complexity, is faster when n << m;

      3. they use the Mailman algorithm to improve the speed of basic linear algebra operations by a factor of log(max(m,n)). The authors apply this approach to several datasets.
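
      The second strategy can be sketched in a few lines. The example below is a minimal illustration, not the X-LD implementation: it assumes genotypes standardized per variant with the population variance, an LD matrix defined as X^T X / n, a relatedness matrix defined as X X^T / m, and a toy genotype table invented for the purpose. Under those assumptions the mean squared element of the m × m LD matrix equals the mean squared element of the n × n relatedness matrix, because both reduce to the same trace. Diagonal terms are included in both averages, and the Mailman speed-up is not shown.

```python
def standardize_columns(genotypes):
    """Scale each variant (column) to mean 0 and population variance 1."""
    n, m = len(genotypes), len(genotypes[0])
    scaled = [[0.0] * m for _ in range(n)]
    for j in range(m):
        column = [row[j] for row in genotypes]
        mu = sum(column) / n
        sd = (sum((g - mu) ** 2 for g in column) / n) ** 0.5
        for i in range(n):
            scaled[i][j] = (genotypes[i][j] - mu) / sd
    return scaled

def mean_sq_ld_variant_side(x):
    """Mean squared correlation over all m*m variant pairs; cost O(n m^2)."""
    n, m = len(x), len(x[0])
    total = 0.0
    for j in range(m):
        for k in range(m):
            r = sum(x[i][j] * x[i][k] for i in range(n)) / n
            total += r * r
    return total / (m * m)

def mean_sq_ld_sample_side(x):
    """Same quantity from the n*n relatedness matrix X X^T / m; cost O(n^2 m)."""
    n, m = len(x), len(x[0])
    total = 0.0
    for a in range(n):
        for b in range(n):
            k_ab = sum(x[a][j] * x[b][j] for j in range(m)) / m
            total += k_ab * k_ab
    return total / (n * n)

# Toy genotype table: 4 individuals (rows) by 6 variants (columns), hypothetical.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [1, 1, 0, 2, 0, 1],
    [2, 0, 1, 1, 2, 0],
    [0, 2, 1, 0, 1, 1],
]

x = standardize_columns(genotypes)
via_ld = mean_sq_ld_variant_side(x)
via_grm = mean_sq_ld_sample_side(x)
print(via_ld, via_grm)
```

      The two loops visit m^2 and n^2 pairs respectively, which is where the saving comes from when n << m.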

      Strengths:

      The authors demonstrate that their proposed method performs in line with theoretical explanations.

      The coarsened LD matrix is useful for describing global patterns of LD, which do not necessarily require variant-level resolution.

      They provide an open-source implementation of their software.

      Weaknesses:

      The coarsened LD matrix is of limited utility outside of analyzing macroscale LD characteristics. The method still essentially has cubic complexity--albeit the factors are smaller and Mailman reduces this appreciably. It would be interesting if the authors were able to apply randomized or iterative approaches to achieve more fundamental gains. The algorithm remains slow when n is large and/or the grid resolution is increased.

      Thanks for your positive and accurate evaluation! We acknowledge the weakness and have included the following sentences in the Discussion.

      “The weakness of the proposed method is obvious: the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of data such as the UK Biobank (Bycroft et al., 2018), the proposed method may not be adequate, and more advanced methods, such as a randomized implementation of the proposed method, are needed.”

      Reviewer #2 (Public Review)

      Summary:

      In this paper, the authors point out that the standard approach of estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of O(nm^2), where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic-relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of O(n^2 m). Since in most datasets, n<<m, this can lead to major computational improvements. They have produced software written in C++ to implement this algorithm, which they call X-LD. Using the output of their method, they estimate the LD decay and the mean extended LD for various subpopulations from the 1000 Genomes Project data.

      Strengths:

      Generally, for computational papers like this, the proof is in the pudding, and the authors appear to have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper is Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X-LD estimates of LD compare to estimates based on the standard approach using PLINK. They appear to have very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and as expected X-LD is faster than PLINK as long as it is evaluating LD for more than 8000 SNPs.

      Weakness:

      While the X-LD software appears to work well, I had a hard time following the manuscript enough to make a very good assessment of the work. This is partly because many parameters used are not defined clearly or at all in some cases. My best effort to intuit what the parameters meant often led me to find what appeared to be errors in their derivation. As a result, I am left worrying if the performance of X-LD is due to errors cancelling out in the particular setting they consider, making it potentially prone to errors when taken to different contexts.

      Thanks for your critical reading and evaluation. We apologize for the typos, which have been corrected, and the parameters are now clearly defined (see Eq 1 and Table 1). In addition, we have included more detailed mathematical steps, which explain how the LD decay regression is constructed and how its interpretation follows (see the detailed derivation steps between Eq 3 and Eq 4).

      Impact:

      I feel like there is value in the work that has been done here if there were more clarity in the writing. Currently, LD calculations are a costly step in tools like LD score regression and Bayesian prediction algorithms, so a more efficient way to conduct these calculations would be useful broadly. However, given the difficulty I had following the manuscript, I was not able to assess when the authors’ approach would be appropriate for an extension such as that.

      See our replies below in responding to your more detailed questions.

      Reviewer #1 (Recommendations For The Authors)

      There are numerous linguistic errors throughout, making it challenging to read.

      It is unclear how the intercepts were chosen in Figure S2. Since theory only gives you the slopes, it seems like it would make more sense to choose the intercept such that it aligns with the empirical results in some way.

      Thanks for your critical evaluation. We apologize for the typos; we have read the manuscript through and clarified the text as much as possible. In addition, we have included Table 1, which introduces the mathematical symbols used in the paper.

      In Figure S2, the two algorithms being compared have different software implementations, PLINK vs X-LD. Their real performance depended not only on the time complexity of the algorithms (right-side y-axis), but also on how the software was coded. PLINK is known for its excellent programming. If we had programmed as well as Chris Chang, the performance of X-LD would have been even better and would have approached the ratio m/n. However, even with less skilled programming, X-LD outperformed PLINK.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for the chance to review your manuscript. It looks like compelling work that could be improved by greater detail. Providing the level of detail necessary may require creating a Supplementary Note that does a lot of hand-holding for readers like me who are mathematically literate but who don’t have the background that you do. Then you can refer readers to the Supplement if they can’t follow your work.

      We have fixed the problems and style issues as far as we can.

      Regarding the weakness section in the public review, here are a few examples of where I got confused, though this list is not exhaustive.

      1) Consider Equation 1 (line 100), which I believe must be incorrect. Imagine that g consists of two SNPs on different chromosomes with correlation rho. Then ell_g (which is defined as the average squared elements of the correlation matrix) would be

      ell_g = 1/4 (1 + 1 + rho^2 + rho^2) = (1+rho^2)/2.

      But ell_1=1 and ell_2=1 and ell_12=rho^2 (The average squared elements of the chromosome-specific correlation matrices and the cross-chromosome correlation matrix, respectively). So

      sum(ell_i)+sum(ell_ij) = 1 + 1 + rho^2 + rho^2 = (1+rho^2)*2.

      I believe your formulas would hold if you defined your LD values as the sum of squared correlations instead of the mean, but then I don’t know if the math in the subsequent sections holds. I think this problem also holds for Eq 2 and therefore makes Eqs 3 and 4 difficult to interpret.
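
      The reviewer's two-SNP counterexample can be checked numerically. The sketch below uses an arbitrary rho and shows that the unweighted sum of block means does not equal the genome-wide mean, whereas weighting each block by its share of SNP pairs does. The weighting shown is one natural way to reconcile the two quantities and is an assumption here, not necessarily the form of the authors' corrected Eq 1.

```python
def mean_sq(matrix):
    """Mean of the squared elements of a matrix given as a list of rows."""
    values = [v * v for row in matrix for v in row]
    return sum(values) / len(values)

rho = 0.3  # arbitrary cross-chromosome correlation, for illustration only

# Genome-wide correlation matrix for two SNPs on different chromosomes.
R = [[1.0, rho],
     [rho, 1.0]]
ell_g = mean_sq(R)  # (1 + rho^2) / 2

# Chromosome-specific blocks (one SNP each) and the cross-chromosome block.
ell_1 = mean_sq([[1.0]])
ell_2 = mean_sq([[1.0]])
ell_12 = mean_sq([[rho]])

# Unweighted sum, counting the cross block twice (ell_12 and ell_21).
unweighted = ell_1 + ell_2 + 2 * ell_12  # 2 * (1 + rho^2)

# Weighting each block by its fraction of all SNP pairs (assumed correction).
m1, m2 = 1, 1
m = m1 + m2
weighted = ((m1 / m) ** 2 * ell_1
            + (m2 / m) ** 2 * ell_2
            + 2 * (m1 * m2 / m ** 2) * ell_12)

print(ell_g, unweighted, weighted)
```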

      Thanks for your attentive review and invaluable suggestions. We acknowledge the typo in calculating the mean in Eq 1, resulting in difficulties in understanding the equations. We sincerely apologize for this oversight. To address this issue and ensure clarity in the interpretation of Eq 3 and Eq 4, we have provided more detailed explanations (see the derivation between Eq 3 and Eq 4).

      2) I didn’t know what the parameters are in Equation 3. The vector ell needs to be defined. Is it the vector of ell_i for each chromosomal segment i? I’m also confused by the definition of m_i, which is defined on line 113 as the “SNP number of the i-th chromosome.” Do the authors mean the number of SNPs on the i-th chromosomal segment? If so, it wasn’t clear to me how Eq 2 and Eq 3 imply Eq 4. Further, it wasn’t clear to me why E(b1) quantifies the average LD decay of the genome. I’m used to seeing plots of average LD as a function of distance between SNPs to calculate this, though I’m admittedly not a population geneticist, so maybe this is standard. Standard or not, readers deserve to have their hands held a bit more through this either in the text or in a Supplementary Note.

      Thanks for your insightful feedback. When we were writing this paper, our actual focus was Eq 3, which establishes the relationship between chromosomal LD and the reciprocal of the chromosome length (Fig 6A), with length represented by the number of SNPs; that is, the correlation between ell_i and 1/m_i.

      We asked friends who are population geneticists, and they anticipated the correlation between chromosomal LD (ell) and 1/m. The rationale is simple given the basics of population genetics. A long chromosome experiences more recombination, which weakens LD between a pair of loci. In particular, for a pair of loci, D_t = D_0 (1-c)^t, where D_t is the LD at generation t, D_0 is the LD at generation 0, and c is the recombination fraction. As recombination hotspots are nearly evenly distributed along the genome, as reported in Science 2019;363:eaau8861, the chromosome will be broken into the pattern shown in Author response image 1 (Fig 1C, newly added). Along the diagonal there are tight LD blocks, which will vanish over time as predicted by the D_t equation, and loci far away from each other will not be in LD unless LD is raised by, for example, population structure. Ideally, we assume that the diagonal blocks have an average size of m×m, that the average LD of a SNP with other SNPs inside a diagonal block (red) is l_u, and that the off-diagonal average LD (light red) is l_uv. This logic is implicit but is employed in, for example, LD score regression and PRS refinement using LD structure.
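
      The decay equation D_t = D_0 (1-c)^t quoted above is easy to explore numerically. The sketch below uses hypothetical values for D_0, c, and t, chosen only to contrast a tightly linked pair of loci with an unlinked pair.

```python
def ld_after(d0, c, t):
    """LD after t generations: D_t = D_0 * (1 - c)**t.

    d0 is the initial LD, c the recombination fraction between the two loci.
    """
    return d0 * (1.0 - c) ** t

d0 = 0.25          # hypothetical initial LD
generations = 100  # hypothetical number of generations

near = ld_after(d0, c=0.001, t=generations)  # tightly linked pair
far = ld_after(d0, c=0.5, t=generations)     # unlinked pair (c = 0.5)

print(f"tightly linked: {near:.4f}")
print(f"unlinked:       {far:.3e}")
```

      Tightly linked loci retain most of their LD after many generations, while LD between unlinked loci is effectively gone, which is the intuition behind the tight diagonal blocks and the weak off-diagonal LD described in the response.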

      Author response image 1.

      But how to estimate chromosomal LD (ell) is the overwhelming part, as our friends said! So Figure 6A is logically anticipated by a seasoned population geneticist, but it has never been realized because the computation is a nightmare. Often, such signature patterns should have been used as showcases when releasing new reference data, such as HapMap. However, to our knowledge, this signature linear relationship has never been illustrated in those reference data.

      If you further ask a population geneticist whether any chromosome will deviate from this line (Fig 6A), the answer will most likely be chromosome 6 because of the tight LD in the HLA region. However, it is chromosome 11, because it has the most completely sequenced centromere. Chr 11 is a surprise! With a T2T-sequenced population, Chr 11 will not deviate much, we predict.

      However, because we doubted whether readers would appreciate this point, we shifted our focus to the efficient computation of LD, which is more likely to be understood. We acknowledge the lack of clarity in the notation definitions and the absence of the derivation for the interpretation of b1 and b0 in the LD decay regression. So we have added a table to explain the notation (see Table 1) and provided additional derivations, which explain how the LD decay regression was derived (see the derivation between Eq 3 and Eq 4). Figure 1C illustrates the underlying assumption about LD.

      The technique to bridge Eq 2~3 to Eq 4 is called “building interpretation”. It was once one of the core tasks of population genetics or statistical genetics, and a classical example is Haseman-Elston regression (Behavior Genetics, 1972, 2:3-19). As the field moves towards a data-driven style, the culture becomes “shut up, calculate”. Finding an interpretation for a regression is a vanishing craftsmanship, and people often end up with unclear results!

      3) In line 135, it’s not clear to me what is meant by . If it is , then wouldn’t the resulting matrix be a matrix of zeros since is zero everywhere except the lower off-diagonal? So maybe it is ? But then later in that line, you say that the square of this matrix is the sum of several terms of the form . Are these the scalar elements of the G matrix? But then the sum is a scalar, which can’t be true since is a matrix.

      Thanks for your attentive review. We indeed confused the definition of matrices and their elements, and should refer to the stacked off-diagonal elements of matrix . So, is a vector for variable – the relationship between sample i and j. We assume the reviewer use R software, then corresponds to mean .

      See the text between Eq 5 and Eq 6.

      “We extract two vectors , which stacks the off-diagonal elements of , and , which takes the diagonal elements of .”

      In addition, , so the ground truth is that , but not zero.

      To clarify these math symbols, we replace G with K, so as to be consistent with our other works (see Table 1).

      To derive the means and the sampling variances for and , the Eq 7 can be established by some modifications on the Delta method as exampled in Appendix I of Lynch and Walsh’s book (Lynch and Walsh, 1998). We added this sentence near Eq 7 in the main text.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Please make corrections as suggested by reviewer 1 to improve the manuscript. Specifically, reviewer 1 suggests making changes to p values in Figure 5, and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates. The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper. Additionally, the poor specificity of linopirdine for KCNQ should be pointed out in the limitations.

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      Summary of recommendations from the three reviewers:

      Please make corrections as suggested by reviewer 1 to improve the manuscript.

      Specifically, reviewer 1 suggests making changes to p values in Figure 5,

      As a team, we have decided to keep p values. Here is our rationale:

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value=0.7 allows me to know that the authors performed an analysis comparing these conditions and found the means not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value being able to assess the data by seeing all p-values more than we value avoiding the distraction of non-significant ones.

      and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      We cited original papers in that area and changed the terminology for M current. I kept KCNQ when referring to the channel protein or abundance.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity… and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I separated the long paragraph into two. I also removed extraneous information in that section. It now reads:

      Previous work by our group and others demonstrated that cholinergic stimulation leads to a decrease in M current and increases the excitability of sympathetic motor neurons at young ages (67-71). The molecular determinants of the M current are channels formed by KCNQ2 and KCNQ3 in these neurons (70, 76, 77). Thus, Figure 6A shows a voltage response (measured in current-clamp mode) and a consecutive M current recording (measured in voltage-clamp mode) in the same neuron upon stimulation of cholinergic type 1 muscarinic receptors. It illustrates the temporal correlation between the decrease in M current and the increase in excitability and firing of APs. This strong dependence led us to hypothesize that aging decreases M current, leading to a depolarized RMP and hyperexcitability (Figure 6B). For these experiments, we measured the RMP and evoked activity using perforated patch, followed by the amplitude of M current using a whole-cell voltage clamp in the same cell. We also measured the membrane capacitance as a proxy for cell size. Interestingly, M current density was smaller by 29% in middle-aged (7.5 ± 0.7 pA/pF) and by 55% in old (4.8 ± 0.7 pA/pF) compared to young (10.6 ± 1.5 pA/pF) neurons (Figure 6C-D). The average capacitance was similar in young (30.8 ± 2.2 pF), middle-aged (27.4 ± 1.2 pF), and old (28.8 ± 2.3 pF) neurons (Figure 6E), suggesting that aging is not associated with changes in cell size of sympathetic motor neurons, and supporting the hypothesis that aging alters the levels of M current. Next, we tested the effect on the abundance of the channels mediating M current. Contrary to our expectation, we observed that KCNQ2 protein levels were 1.5 ± 0.1-fold higher in old compared to young neurons (Figure 6F-G). Unfortunately, we did not find an antibody to consistently detect KCNQ3 channels. We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.
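
      The percentage reductions quoted in the passage follow directly from the reported mean current densities. The short sketch below reproduces that arithmetic; the mean values are taken from the text, whereas the single-cell current and capacitance used to illustrate the density calculation are hypothetical.

```python
def current_density(current_pa, capacitance_pf):
    """Current density in pA/pF: current amplitude normalized to cell capacitance."""
    return current_pa / capacitance_pf

def percent_reduction(reference, value):
    """Percentage by which `value` is smaller than `reference`."""
    return 100.0 * (reference - value) / reference

# Mean M current densities reported in the text (pA/pF).
young, middle_aged, old = 10.6, 7.5, 4.8

middle_drop = percent_reduction(young, middle_aged)
old_drop = percent_reduction(young, old)

# Hypothetical single cell: 318 pA of M current in a 30 pF neuron.
example_density = current_density(318.0, 30.0)

print(f"middle-aged vs young: {middle_drop:.0f}% smaller")
print(f"old vs young:         {old_drop:.0f}% smaller")
print(f"example density:      {example_density:.1f} pA/pF")
```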

      B. and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I am not sure I understand the request regarding the correlation of KCNQ with AP firing rate. I divided the long paragraph.

      The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper.

      Indeed, total KCNQ2 protein abundance increases while M current decreases. We do not claim in our work that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our current working hypothesis is that the reduction in M current is caused by changes in trafficking, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. I have modified the description in the results section and discussion to clarify this concept. We also note that the discussion section contains a paragraph discussing this discrepancy.

      Additionally, the poor specificity of linopirdine for KCNQ should be pointed out in the limitations.

      Thank you for the suggestion. I have added the following sentences to the Limitations section. It reads: “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      We apologize for this omission. After reviewing this comment, I realized I did not respond to the Major points in the section of the Recommendations for the authors from Reviewer 3. We missed that entire section. Our previous responses addressed the Public review of Reviewer 3. When doing so, we did not separate the sentences, omitting the request to perform the experiment in slices.

      The proposed experiments would require an upright microscope coupled to an electrophysiology rig; unfortunately, we do not have the equipment to do these experiments. We agree that our findings need to be tested in intact preparations to understand how the hyperactivity of sympathetic motor neurons affects systemic responses and the control of organ function. This is a crucial step to move the field forward. Our laboratory is trying to find the appropriate experimental design to address this problem. We believe we must go beyond redoing these experiments in slices.

      Reviewer #1 (Recommendations For The Authors):

      (1) The significance values greater than p < 0.05 do not add anything and distract focus from the results that are meaningful. Fig. 5 is a good example. What does p = 0.7 mean? Or p = 0.6? Does this help the reader with useful information?

      We thank Reviewer 1 for raising this question. We have attempted different versions of how we report p values, as we want to make sure to address rigor and transparency in reporting data.

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value=0.7 allows me to know that the authors performed an analysis comparing these conditions and found the means not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value being able to assess the data by seeing all p-values more than we value avoiding the distraction of non-significant ones.

      (2) Fig. 1 is not informative and should be removed.

      Although we agree with the reviewer that this figure is not informative, it was created to guide the reader in identifying the problem addressed in our manuscript in the physiological context. Our colleagues who read the first drafts of the manuscript recommended this, so we prefer to keep the figure.

      (3) The emphasis on a particular muscarinic agonist favored by many ion channel physiologists, oxotremorine, is not meaningful (lines 192, 198). The important point is stimulation of muscarinic AChRs, which physiologically are stimulated by acetylcholine. The particular muscarinic agonist used is unimportant. Unless mandated by eLife, "cholinergic type 1 muscarinic receptors" are usually referred to as M1 mAChRs, or even better is "Gq-coupled M1 mAChRs." I don't think that Kruse and Whitten, 2021 were the first to demonstrate the increase in excitability of sympathetic neurons from stimulation of M1 mAChRs. Please try and cite in a more scholarly fashion.

      A) We have modified lines 192 and 198, removing the mention of oxotremorine.

      B) We have modified the nomenclature used to refer to cholinergic type 1 muscarinic receptors.

      C) We cited references on the role of M current on sympathetic motor neuron excitability.

      (4) The authors may want to use the term "M current" (after defining it) as the current produced by KCNQ2&3-containing channels in sympathetic neurons, and reserve "KCNQ" or "Kv7" currents as those made by cloned KCNQ/Kv7 channels in heterologous systems. A reason for this is to exclude currents from KCNQ1-containing channels, which most definitely do not contribute to the "KCNQ" current in these cells. I am not mandating this, but rather suggesting it to conform with the literature.

      Thank you for the suggestion. I have modified the text to use the term M current. I maintained the use of KCNQ only when referring to the KCNQ channel, such as in the section describing the abundance of KCNQ2.

      (5) The section in the text on "Aging reduces KCNQ current" is confusing. Can the authors describe their results and their interpretation more directly?

      (6) Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case?? What about KCNQ3? It would be very enlightening if the authors would just quantify the ratio of KCNQ2:KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves (see Shapiro et al., JNS, 2000; Selyanko et al., J. Physiol., Hadley et al., Br. J. Pharm., 2001 and a great many more). It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      We have divided this paragraph into sections.

      A. Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case??

      Our interpretation is that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 channels. We do not claim that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our working hypothesis is that the reduction in M current is caused by changes in the trafficking, degradation, posttranslational modifications, or cofactors of KCNQ2 or KCNQ3 channels. We have modified the description in the Results section to clarify this concept: “We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.”

      B. What about KCNQ3?

      Unfortunately, we did not find an antibody to detect KCNQ3 channels. I have added a sentence to state this.

      C. KCNQ2: KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves.

      Our laboratory is working to understand in depth the mechanism behind the changes in M current and its regulation by mAChRs at young and old ages. However, given the complexity of the question, that work is part of a separate study. We think pharmacology experiments alone are insufficient to address this complexity, as we describe in the next answer.

      D. It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      As mentioned, our laboratory is working to understand in depth the mechanism behind M current and its regulation at young and old ages. Our preliminary data show that M currents recorded in old neurons respond to mAChR activation in one of two ways: 1) they do not respond (blue line), or 2) they show a smaller and slower current inhibition than young neurons (red line). These data illustrate the complexity of the mechanism behind the M current in old neurons, where changes in basal PIP2 levels, phospholipid metabolism, KCNQ2/3 trafficking and degradation, and M current pharmacology need to be addressed together for a proper interpretation. Showing only one part of this set of experiments in this article may lead to misinterpretation of the results.

      Author response image 1.

      (7) Why do the authors use linopirdine instead of XE-991? Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      A. Why do the authors use linopirdine instead of XE-991?

      We used linopirdine because our experimental goal was to observe the recovery phase during washout. The main difference between the effects of XE991 and linopirdine on Kv7.2/3 lies in the recovery phase: in expression systems at -30 mV, currents recover by 30% after 10 min following XE991 treatment, compared with 93.4% following linopirdine (Greene DL et al., 2017, J Pharmacol Exp Ther). After validation of KCNQ2/3 inhibition by linopirdine (IC50 value of 2.4 µM), we considered linopirdine the most appropriate drug for our experiments.

      Unfortunately, we were not able to observe a recovery in our experiments. The limited recovery after washout may be associated with the membrane potential of our conditions (-60 to -50 mV).

      B. Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so.

      We understand the reviewer's concern. The specificity of XE-991 and linopirdine is not absolute. Linopirdine has been reported to activate TRPV1 channels (EC50 = 115 µM; Neacsu and Babes, 2010, J Pharmacol Sci) or nicotinic acetylcholine receptors and GABA-induced Cl- currents (EC50 = 7.6 µM and 8.1 µM, respectively; Lamas et al., 1997, Eur J Neurosci).

      To clarify this limitation in the article, we have added the following sentence in the section Limitations and Conclusions. “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”
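
      To put the 25 µM concentration mentioned by the reviewer next to the potency values quoted above, here is a minimal sketch of the expected fractional effect at each target. It assumes a simple one-site concentration-response relation with a Hill coefficient of 1, which is not stated in the response; the function name and dictionary keys are hypothetical labels.

```python
# Illustrative only: fractional effect of a drug at concentration c,
# assuming a one-site Hill relation. The Hill coefficient of 1 is an
# assumption, not a value reported in the response.
def hill_fraction(conc_um: float, half_max_um: float, hill: float = 1.0) -> float:
    """Return c^n / (c^n + K^n), with K the IC50 or EC50 in µM."""
    return conc_um**hill / (conc_um**hill + half_max_um**hill)

# Half-maximal concentrations (µM) quoted in the response.
half_max = {
    "KCNQ2/3 (IC50)": 2.4,
    "TRPV1 (EC50)": 115.0,
    "nAChR (EC50)": 7.6,
    "GABA-induced Cl- current (EC50)": 8.1,
}

LINOPIRDINE_UM = 25.0  # concentration cited in the reviewer's comment

effects = {name: hill_fraction(LINOPIRDINE_UM, k) for name, k in half_max.items()}

for name, frac in effects.items():
    print(f"{name}: {frac:.2f}")
```

      Under these assumptions, the effect on KCNQ2/3 would be close to saturating at 25 µM, while the nicotinic and GABA-related targets would also be substantially engaged, which is consistent with the limitation acknowledged in the added sentence.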

      C. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      Thank you for pointing this out. We have added information for both retigabine and linopirdine in the Methods section; both were missing.

      (8) Can the authors use a more scientific explanation of RTG action than "activating KCNQ channels?" For instance, RTG induces both a negative-shift in the voltage-dependance of activation and a voltage-independent increase in the open probability, both of which differing in detail between KCNQ2 and KCNQ3 subunits. The authors are free to use these exact words. Thus, the degree of "activation" is very dependent upon voltage at any voltages negative to the saturating voltages for channel activation.

      We have modified the text to reflect your suggestion. Thank you.

      (9) Methods: did the authors really use "poly-l-lysine-coated coverslips?" Almost all investigators use poly-D-lysine as a coating for mammalian tissue-culture cells and more substantial coatings such as poly-D-lysine + laminin or rat-tail collagen for peripheral neurons, to allow firm attachment to the coverslip.

      That is correct. We used poly-L-lysine-coated coverslips. Sympathetic motor neurons do not adhere to poly-D-lysine.

      (10) As a suggestion, sampling M-type/KCNQ/Kv7 current at 2 kHz is not advised, as this is far faster than the gating kinetics of the channels. Were the signals filtered?

      Signals were not filtered. Currents were sampled at 2 kHz. Our conditions are not far from those reported by others: some groups sample at 10 kHz or even 50 kHz, and others do not report the sampling frequency.

      Reviewer #2:

      Weaknesses:

      None, the revised version of the manuscript has addressed all my concerns.

      We are very appreciative and glad that our responses satisfied your previous concerns.

      Reviewer #3:

      The main weakness is that this study is a descriptive tabulation of changes in the electrophysiology of neurons in culture, and the effects shown are correlative rather than establishing causality.

      In the previous revision, Reviewer 3 wrote: “It is difficult to know from the data presented whether the changes in KCNQ channels are in fact directly responsible for the observed changes in membrane excitability.” And suggested the “use of blockers and activators to provide greater relevance.”

      Following this recommendation, we performed the experiments shown in Fig. 8. In young neurons, linopirdine depolarized the membrane potential and promoted action potential firing. In contrast, in old neurons, retigabine repolarized the membrane potential and stopped action potential firing. This new set of experiments suggests that the age-related electrophysiological changes in old neurons are associated with changes in M current, which is the main finding of our article.

      If Reviewer 3 refers to establishing causality between aging and a reduction in M current, I would like to emphasize that our laboratory is working toward a better understanding of the molecular mechanism by which aging affects M current; however, that work will be part of a different article. One of our attempts was to reverse aging with rapamycin, but the previous recommendation was to remove those experiments.

      … but the specifics of the effects and relevance to intact preparations are unclear.

      Additional experiments in slice cultures would provide greater significance on the potential relevance of the findings for intact preparations.

      I apologize for missing this point in the previous revision. The proposed experiments would require an upright microscope coupled to an electrophysiology rig. Unfortunately, I do not have the equipment to do these experiments.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to reviewer’s comments

      Reviewer #2 (Public Review):

      Summary: 

      The manuscript focuses on comparison of two PLP-dependent enzyme classes that perform amino acyl decarboxylations. The goal of the work is to understand the substrate specificity and factors that influence catalytic rate in an enzyme linked to theanine production in tea plants.

      Strengths: 

      The work includes x-ray crystal structures of modest resolution of the enzymes of interest. These structures provide the basis for design of mutagenesis experiments to test hypotheses about substrate specificity and the factors that control catalytic rate. These ideas are tested via mutagenesis and activity assays, in some cases both in vitro and in plants. 

      Weaknesses:

      Although improved in a revision, the manuscript could be more clear in explaining the contents of the x-ray structures and how the complexes studied relate to the reactant and product complexes. The manuscript could also be more concise, with a discussion section that is largely redundant with the results and lacking in providing scholarly context from the literature to help the reader understand how the current findings fit in with work to characterize other PLP-dependent enzymes or protein engineering efforts. Some of the figures lack sufficient clarity and description. Some of the claims about the health benefits of tea are not well supported by literature citations.

      Thank you for your insightful comments on our manuscript and your recognition of the strengths of our study. We understand your concerns about the weaknesses mentioned, and we have addressed them in the revised manuscript. We acknowledge that the discussion section needed to be more concise and better placed in context, and we have revised it by removing the redundant content. We also acknowledge your comments concerning the clarity and description of some figures. We have revisited and revised these figures so that they are clear and adequately described. Lastly, concerning the claims about the health benefits of tea, we understand your concern about the lack of supporting citations. We have ensured that such claims are backed by valid literature or, where necessary, have omitted them.

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 21: Alanine Decarboxylase should not be capitalized.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (2) Line 31: Grammatical error. Also not clear what "evolution analysis" means here. Revise to "Structural comparisons led us to..."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (3) Line 34: Revise to "Combining a double mutant of CsAlaDC"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (4) Line 35: Change word order to "increased theanine production 672%"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (5) Line 37: meaning unclear. Revise to "provides a route to more efficient biosynthesis of theanine."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (6) Line 44: I'm not sure that the "health effects" of tea have been proven in placebo controlled studies. And the references provided (2-4 and 5) do not describe original research articles supporting these claims. I would suggest removing these statements from the introduction and at later points in the manuscript.

      Thank you for your thoughtful feedback and suggestions. Based on your suggestion, we have removed these statements: "The popularity of tea is determined by its favorable flavor and numerous health benefits (2-4). The flavor and health-beneficial effects of tea are conferred by the abundant secondary metabolites, including catechins, caffeine, theanine, volatiles, etc (5)." As for the subsequent statement, "It has also many health-promoting functions, including neuroprotective effects, enhancement of immune functions, and potential anti-obesity capabilities, among others," the cited literature substantiates this conclusion.

      (7) Line 58: insert "the" between provided and basis

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (8) Line 100: Not clear what this phrase means, "As expected, CsSerDC was closer to AtSerDC" Please clarify - closer to what?

      We apologize for any confusion caused by the unclear phrasing. When referring to "CsSerDC was closer to AtSerDC," we intended to convey that CsSerDC exhibits a higher degree of sequence homology with AtSerDC than it does with the other enzymes evaluated in our investigation. However, because a 1.29% difference between 86.21% and 84.92% in amino acid similarity is not statistically significant (Figure 1B and Supplementary table 1 in the original manuscript), we have deleted the relevant descriptions in the revised manuscript.

      (9) Line 112: "were constructed into" makes no sense. It would be better to say the genes for the proteins of interest were inserted into the overexpression plasmid.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (10) Line 115: missing the word "the" between generated and recombinant

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (11) Line 121: catalyze not catalyzed

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (12) Lines 129 and 130: The reported Km values are really large - in the mM range. Do these values make sense in terms of the available concentrations of the substrates inside the cell?

      The content of alanine in tea plant roots ranges from 0.28 to 4.18 mg/g DW (Yu et al., 2021; Cheng et al., 2017). Correspondingly, the physiological concentration of alanine is 3.14 mM to 46.92 mM, in tea plant roots. The content of serine in plants ranges from 0.014 to 17.6 mg/g DW (Kumar et al., 2017). Correspondingly, the physiological concentration of serine is 0.13 mM to 167.48 mM in plants. Therefore, in this study, the Km values are within the range of available substrate concentrations inside the cell.

      Yu, Y. et al. (2021) Glutamine synthetases play a vital role in high accumulation of theanine in tender shoots of albino tea germplasm "Huabai 1". J. Agric. Food Chem. 69 (46),13904-13915.

      Cheng, S. et al. (2017) Studies on the biochemical formation pathway of the amino acid L-theanine in tea (Camellia sinensis) and other plants. J. Agric. Food Chem. 65 (33), 7210-7216.

      Kumar, V. et al. (2017) Differential distribution of amino acids in plants. Amino Acids. 49(5), 821-869.
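
      The conversion in the response to comment (12) can be reproduced with the short sketch below. Two inputs are assumptions not stated in the response: the standard molar masses of alanine and serine (89.09 and 105.09 g/mol), and an equivalence of 1 g dry weight to 1 mL of volume, which is what the reported millimolar ranges appear to imply. The function and variable names are hypothetical.

```python
# Molar masses in g/mol (standard values, not given in the response).
MOLAR_MASS = {"alanine": 89.09, "serine": 105.09}

def content_to_mM(mg_per_g_dw: float, molar_mass: float, ml_per_g_dw: float = 1.0) -> float:
    """Convert an amino acid content (mg per g dry weight) to mM.

    ml_per_g_dw is the volume assigned to each gram of dry weight;
    the default of 1 mL/g is an assumption inferred from the reported values.
    """
    mmol_per_g = mg_per_g_dw / molar_mass    # mg / (g/mol) gives mmol
    mol_per_l = mmol_per_g / ml_per_g_dw     # mmol/mL is numerically mol/L
    return mol_per_l * 1000.0                # mol/L to mM

# Content ranges (mg/g DW) quoted in the response.
content_mg_per_g = {"alanine": (0.28, 4.18), "serine": (0.014, 17.6)}

ranges = {
    name: tuple(content_to_mM(v, MOLAR_MASS[name]) for v in bounds)
    for name, bounds in content_mg_per_g.items()
}

for name, (low, high) in ranges.items():
    print(f"{name}: {low:.2f} to {high:.2f} mM")
```

      Because the result scales inversely with the assumed volume per gram of dry weight, a different hydration assumption would shift both ranges proportionally.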

      (13) Line 211: it is unclear what the phrase "as opposed to wild-type" means. Please clarify.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We intended to communicate that wild-type CsAlaDC and AtSerDC demonstrate decarboxylase activity, whereas the mutated proteins have lost decarboxylation activity. We have clarified this point in the revised version of the manuscript.

      (14) Line 222: residues not residue

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (15) Line 227 and Figure 4B: It is not clear what the different sequence logos mean in this part of the figure. The caption is too brief and not helpful. And the sentences describing this figure panel are also not sufficiently clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have provided a more detailed explanation of this section in the revised manuscript and added additional annotations in the figure caption to provide further clarity.

      (16) Lines 233 and 234: "in the substrate specificity" is awkwardly worded. I would revise to "in selective binding of the appropriate substrate."

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have meticulously revised the description of this section.

      (17) Line 243: a word is missing in this sentence - but I can't figure out the intended meaning or what the missing word is. Rephrase to improve clarity.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: " These findings indicate the essential role of Phe106 in the selective binding of alanine for CsAlaDC. "

      (18) Line 255: The "expression system...was carried out" is not correct. I would say the expression system was used - but you probably also want to rearrange the sentences to more directly say what it was used for. Later, the word "the" is also missing.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: "To further verify that Phe106 of CsAlaDC and Tyr111 of AtSerDC were key amino acid residues determining its substrate recognition in planta, we employed the Nicotiana benthamiana transient expression system. "

      (19) Line 273: use "understand" instead of "elucidate" and instead of "we proposed a prediction test:" say "we designed a test of the prediction that..."

      Thank you very much for your careful reading of the manuscript. We have revised this sentence to: “In light of this observation, we postulated a hypothesis:”

      (20) Line 301: I don't think "effectuate" is a word. Replace with something else.

      Thank you very much for your careful reading of the manuscript. We have revised the sentence as: " The biosynthetic pathway of theanine in tea plants comprises two consecutive enzymatic steps: alanine decarboxylase facilitates the decarboxylation of alanine to generate EA, while theanine synthetase catalyzes the condensation reaction between EA and Glu to synthesize theanine. "

      (21) Line 307: replace "activity" with "ability"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (22) Line 322: I didn't find the discussion very useful. Much of it is simply a recap of the results - which is not necessary. The structural comparisons are overly descriptive without providing appropriate rationale or topic sentence structure so that the reader understands why certain details are emphasized. I think the manuscript would be much stronger if this section were not included or integreted more concisely into the results section where appropriate.

      Thank you for your constructive comments. We understand your concerns about the discussion section of our manuscript and acknowledge that it was redundant with the Results. In response, we have revised this section to eliminate unnecessary repetition of the results.

      (23) Line 369: "an amino acid devoid of the hydroxyl moiety present in Lys" - what does this mean? Lys does not have a hydroxyl functional group. Please correct so that the sentence makes sense.

      Thank you very much for your careful reading of the manuscript. This sentence states that the amino acid occupying the corresponding position in CsAlaDC is Phe, which lacks one hydroxyl functional group as compared to Lys. We have made modifications to the sentence as follows: "In contrast, the equivalent position in CsAlaDC is occupied by Phe, an amino acid lacking the hydroxyl group. This substitution enhances the hydrophobic nature of the substrate-binding pocket. "

      (24) Line 370: "This structural nuance portends a predisposition for CsAlaDC to select the comparatively hydrophobic amino acid alanine as its suitable substrate." This sentence also makes no sense - please revise to use simpler language so the meaning is more clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: " Consequently, CsAlaDC demonstrates a unique predilection, selectively binding Ala (an amino acid with comparatively hydrophobic properties) as its preferred substrate."

      (25) Lines 376-384: This section makes several references to "catalytic rings." I have no idea what this term means? If the authors mean a loop structure in the enzyme - please use the term "loop"

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (26) Line 396-397: The authors reference data that is not shown in the manuscript. Either show the data in the results section or do not mention.

      Thank you for your insightful comment regarding the unshown data referenced in the manuscript. We have included Supplementary figure 9 in the revised manuscript to display this data.

      (27) Line 445-446: what is "mutation technology" - if the authors mean site-directed mutagenesis - please use the simpler and more recognizable terminology.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: "Based on the findings of this study, site-directed mutagenesis can be employed to modify enzymes involved in theanine synthesis. This modification enhances the capacity of bacteria, yeast, model plants, and other organisms to synthesize theanine, thereby facilitating its application in industrial theanine production."

      Reviewer #3 (Public Review):

      In the manuscript titled "Structure and Evolution of Alanine/Serine Decarboxylases and the Engineering of Theanine Production," Wang et al. solved and compared the crystal structures of Alanine Decarboxylase (AlaDC) from Camellia sinensis and Serine Decarboxylase (SerDC) from Arabidopsis thaliana. Based on this structural information, the authors conducted both in vitro and in vivo functional studies to compare enzyme activities using site-directed mutagenesis and subsequent evolutionary analyses. This research has the potential to enhance our understanding of amino acid decarboxylase evolution and the biosynthetic pathway of the plant specialized metabolite theanine, as well as to further its potential applications in the tea industry.

      Thank you very much for taking the time to review this manuscript. We appreciate all your insightful comments.

      Reviewer #3 (Recommendations For The Authors):

      The additional material added by the authors addresses some of the previously raised questions and enhances the manuscript's quality. However, certain critical issues we pointed out earlier remain unaddressed. Some of the new data also raises new questions. To provide readers with more comprehensive data, the authors should include additional quantitative data and convert the data presented in the reviewer's comments into supplemental figure format.

      Thank you for acknowledging the improvements in the revised manuscript and providing further valuable feedback. We understand your concern about the critical issues that have not been fully addressed and the new questions raised by some of the newly added data. We have strived to address these issues with additional analysis and clarification in our subsequent revision. Regarding your suggestion for more quantitative data and converting the data mentioned in the reviewer's comments into a supplemental figure format, we agree that this would provide a more comprehensive view of the results. We have reformatted the relevant data into supplemental figures to enhance the clarity and accessibility of information. We are grateful for the time and effort you have dedicated to improving our manuscript.

      * Page 5 & Figure 1B

      "As expected, CsSerDC was most closed to AtSerDC, which implies that they shared similar functions. However, CsAlaDC is relatively distant from CsSerDC."

      : In Figure 1B, CsSerDC and AtSerDC are in different clades, and this figure does not show that the two enzymes are closest. To provide another quantitative comparison, please provide a matrix table showing amino acid sequence similarities as a supplemental table. 

      Comment: I don't believe that a 1.29% difference between 86.21% and 84.92% in amino acid similarity is statistically significant. Although the authors have rephrased the original sentence, it's improbable that this small 1.29% difference can explain the observed distinction.

      Many thanks. We have carefully considered your comments. Indeed, the 1.29% difference in amino acid similarity cannot reflect the functional difference between the AlaDC and SerDC proteins. We have deleted the relevant descriptions in the revised manuscript.

      * Page 6, Figure 2, Page 23 (Methods)

      "The supernatants were purified with a Ni-Agarose resin column followed by size-exclusion chromatography."

      : What kind of SEC column did the authors use? Can the authors provide the SEC elution profile comparison results and size standard curve?

      Comment: The authors should include the SEC elution profiles as a supplemental figure or incorporate them as a panel in Figure 2. Furthermore, they should provide a description of the oligomeric state of each protein in this experiment. Additionally, there is a significant difference between CsSerDC (65.38 mL) and CsAlaDC (74.37 mL) elution volumes. Can this difference be explained structurally? In comparison to the standard curve of molecular weight provided by the authors, it appears that these proteins are at least homo-tetramers, which contradicts the description in the text. This should be re-evaluated and clarified.  

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have included the SEC elution profile in Supplemental figure 1A and added descriptions of the oligomeric states of the proteins in the revised manuscript. CsSerDC eluted at 65.38 mL, corresponding to a molecular weight of 292 kDa, which is about five times that of the monomeric protein (54.7 kDa). However, in the absence of a CsSerDC crystal structure, it remains uncertain whether the protein forms a pentamer. AtSerDC eluted at 72.25 mL, corresponding to a molecular weight of 155 kDa, which is 3.3 times that of the monomer (47.3 kDa). CsAlaDC eluted at 74.37 mL, corresponding to a molecular weight of 127 kDa, which is 2.7 times that of the monomer (47.3 kDa). The elution profiles suggest that AtSerDC and CsAlaDC potentially exist as homotrimers. This observation contradicts our subsequent structural findings, in which the proteins are dimeric. A plausible explanation is that the proteins are not ideally spherical. In that case, the hydrodynamic radius could exceed the value expected from the actual mass, leading to an overestimation of the molecular weight by size-exclusion chromatography [ref].

      References:

      Burgess, R. R. (2018) A brief practical review of size exclusion chromatography: Rules of thumb, limitations, and troubleshooting. Protein Expression and Purification. 150, 81-85.

      Erdner J. M., et al. (2006) Size-Exclusion Chromatography Using Deuterated Mobile Phases. Journal of Chromatography A. 1129(1):41–46.
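
      The apparent subunit numbers quoted in the SEC response follow from dividing each apparent molecular weight by the corresponding monomer mass. The sketch below reproduces that arithmetic using only the values given in the response; the names are hypothetical, and the calibration from elution volume to molecular weight is not reproduced because the standard curve is not given here.

```python
# (apparent molecular weight from SEC in kDa, monomer mass in kDa),
# both taken from the response text.
sec_estimates = {
    "CsSerDC": (292.0, 54.7),
    "AtSerDC": (155.0, 47.3),
    "CsAlaDC": (127.0, 47.3),
}

def apparent_subunits(apparent_kda: float, monomer_kda: float) -> float:
    """Ratio of SEC-derived apparent mass to monomer mass."""
    return apparent_kda / monomer_kda

ratios = {name: apparent_subunits(*vals) for name, vals in sec_estimates.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} x monomer")
```

      Non-integer ratios such as these are what one would expect if the apparent mass is inflated by a non-spherical shape, which is the explanation the response offers for the discrepancy with the dimeric structures.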

      * Page 6 & Page 24 (Methods)

      "The 100 μL reaction mixture, containing 20 mM substrate (Ala or Ser), 100 mM potassium phosphate, 0.1 mM PLP, and 0.025 mM purified enzyme, was prepared and incubated at standard conditions (45 °C and pH 8.0 for CsAlaDC, 40 °C and pH 8.0 for AtSerDC for 30 min)."

      (1) The enzymatic activities of CsAlaDC and AtSerDC were measured at two different temperatures (45 and 40 °C), but their activities were directly compared. Is there a reason for experimenting at different temperatures?

      (2) Enzyme activities were measured at temperatures above 40 °C, which is not a physiologically relevant temperature and may affect the stability or activity of the proteins. At the very least, the authors should provide temperature-dependent protein stability data (e.g., CD spectra analysis) or, if possible, temperature-dependent enzyme activities, to show that their experimental conditions are suitable for studying the activities of these enzymes.

      Comment: I appreciate the authors for including temperature-dependent enzyme activity data in their study. However, it remains puzzling that plant enzymes were tested at a physiologically irrelevant temperature of 40 and 45 degrees Celsius. Additionally, it may not be appropriate to directly compare enzyme activity measurements at different temperatures. Furthermore, the data at 45 degrees in panel A appears to be an outlier, which contrasts with the overall trend observed in the graph.

      We appreciate your point regarding the testing temperatures for plant enzymes, and we recognize the importance of conducting experiments under physiologically relevant conditions. Our intent in working at these elevated temperatures was to assess the thermal stability of the enzymes, which can be a valuable characteristic in certain applications, such as industrial production processes; these temperatures do not necessarily reflect physiological conditions. Our findings indicate that CsAlaDC exhibits its peak activity at 45 °C. This result agrees with previously reported data [Bai, P. et al. (2021) figure 4e], which strengthens our confidence in the reliability of our experimental outcomes.

      Author response image 1.

      Relative activity of CsAlaDC at different temperatures.

      * Pages 6-7 & Table 1

      (1) Use the correct notation for Km and Vmax. Also, the authors show kinetic parameters and use multiple units (e.g., mmol/L or mM for Km).

      (2) When comparing the catalytic efficiency of enzymes, kcat/Km (or Vmax/Km) is generally used. The authors present a comparison of catalytic activity from results to conclusion. A clarification of what results are being compared is needed.

      Comment: The authors are still comparing catalytic efficiency solely based on the Vmax values. As previously suggested, it would be advisable to calculate kcat/Km and employ it for comparing catalytic efficiencies. Furthermore, based on the data provided by the authors, I conducted a rough calculation of these catalytic efficiencies and did not observe a significant difference, which contrasts with the authors' statement, "These findings indicated that the catalytic efficiency of CsAlaDC is considerably lower than that of both CsSerDC and AtSerDC." This discrepancy requires clarification.  

      We want to express our sincere appreciation for your meticulous review and constructive suggestions. We understand the importance of comparing catalytic efficiencies using kcat/Km values rather than relying solely on Vmax values. Following your suggestion, we calculated kcat/Km to reanalyze our results. The computed kcat/Km values for CsSerDC and AtSerDC are 152.7 s⁻¹ M⁻¹ and 184.6 s⁻¹ M⁻¹, respectively. For CsAlaDC, the calculated kcat/Km is 55.7 s⁻¹ M⁻¹. Therefore, the catalytic efficiency of CsSerDC and AtSerDC is approximately three times that of CsAlaDC. What we intended to convey was that the Vmax of CsAlaDC is lower than that of CsSerDC and AtSerDC. Our description in the manuscript was not accurate, and we have corrected it in the revised version.
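
      The "approximately three times" comparison can be checked directly from the kcat/Km values quoted above. The following sketch computes the fold difference relative to CsAlaDC; the helper for deriving kcat/Km from separate kcat and Km values is included only for illustration, since the underlying kcat and Km values are not given in this response. All names are hypothetical.

```python
def catalytic_efficiency(kcat_per_s: float, km_molar: float) -> float:
    """kcat/Km in s^-1 M^-1, given kcat in s^-1 and Km in M (illustrative helper)."""
    return kcat_per_s / km_molar

# kcat/Km values (s^-1 M^-1) as reported in the response.
kcat_over_km = {"CsSerDC": 152.7, "AtSerDC": 184.6, "CsAlaDC": 55.7}

# Fold difference of each enzyme relative to CsAlaDC.
reference = kcat_over_km["CsAlaDC"]
fold_over_alaDC = {name: value / reference for name, value in kcat_over_km.items()}

for name, fold in fold_over_alaDC.items():
    print(f"{name}: {fold:.1f}-fold relative to CsAlaDC")
```

      With the reported values, the two serine decarboxylases come out at roughly 2.7-fold and 3.3-fold the efficiency of CsAlaDC, in line with the approximate threefold difference stated in the response.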

      * Pages 9 & 10

      "This result suggested this Tyr is required for the catalytic activity of CsAlaDC and AtSerDC."

      : The author's results are interesting, but it is recommended to perform the experiments in a specific order. First, experiments should determine whether mutagenesis affects the protein's stability (e.g., CD, as discussed earlier), and second, whether mutagenesis affects ligand binding (e.g., ITC, SPR, etc.), before describing how site-directed mutagenesis alters enzyme activity. In particular, the authors' hypothesis would be much more convincing if they could show that the ligand binding affinity is similar between WT and mutants.

      Comments: While it is appreciated that you have included CD and UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Thank you for your valuable feedback and suggestions. I agree that providing quantitative data would lend more support to our findings and better address the proposed binding affinity.

      It is generally acknowledged that proteins complexed with PLP exhibit a yellow hue, and the ligand PLP forms a Schiff base structure with the ε-amino group of a lysine residue in the protein, with maximum absorbance around 420 nm. However, during our protein purification process, we observed that the purified protein retained its yellow coloration, even when PLP wasn't introduced into the purification buffer. Subsequent absorbance measurements revealed that the protein exhibited absorbance within the aforementioned wavelength (420 nm) (the experimental results are shown in the following figures), implying an inherent presence of the PLP ligand within the protein. This could have resulted from binding with PLP during the protein's expression in E. coli. Consequently, due to this inseparability between the protein and the ligand, obtaining quantitative data through experimental means becomes unfeasible.

      Author response image 2.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (Y336F). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y341F).

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we agree that it is an excellent idea. We have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

      * Page 10

      "The results showed that 5 mM L-DTT reduced the relative activity of CsAlaDC and AtSerDC to 22.0% and 35.2%, respectively"

      : The authors primarily use relative activity to compare WT and mutants. Can the authors specify the exact experiments, units, and experimental conditions? Is it Vmax or catalytic efficiency? If so, under what specific experimental conditions?

      Response: "However, due to the unknown mechanism of DTT inhibition on protein activity, we have removed this part of the content in the revised manuscript."

      Comment: I believe this requires a more comprehensive explanation rather than simply removing it from the text.  

      Although we have observed that DTT is capable of inhibiting enzyme activity, at present, we are unable to offer a comprehensive explanation for the inhibitory effect of DTT on enzyme activity in terms of its structural and catalytic mechanisms. Further research is required to elucidate the mechanism of action of DTT. It is worth noting, however, that our study does not emphasize investigating the specific inhibitory mechanisms of DTT on enzyme activity. Furthermore, the existing findings do not provide an adequate explanation for the observed phenomenon, leading us to exclude this particular aspect from the content.

      * Pages 10-12

      : The identification of 'Phe106 in CsAlaDC' and 'Tyr111 in AtSerDC,' along with the subsequent mutagenesis and enzymatic activity assays, is intriguing. However, the current manuscript lacks an explanation and discussion of the underlying reasons for these results. As previously mentioned, it would be helpful to gain insights and analysis from WT-ligand and mutant-ligand binding studies (e.g., ITC, SPR, etc.). Furthermore, the authors' analysis would be more convincing with accompanying structural analysis, such as steric hindrance analysis.

      Comment: While it is appreciated that you have included UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Response: Thank you for your valuable feedback and suggestions. Given that the protein forms a complex with PLP during its expression in E. coli and cannot be dissociated from it, obtaining quantitative data via experimental protocols is rendered impracticable.

      Author response image 3.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (F106Y). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y111F).

      Mutant proteins and wild-type proteins exhibited absorption bands at 420 nm, suggesting the formation of a Schiff base between PLP and the active-site lysine residue.

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This paper investigates host and viral factors influencing transmission of alpha and delta SARS-CoV-2 variants in the Syrian hamster model and fundamentally increases knowledge regarding transmission of the virus via the aerosol route. The strength of evidence is solid and could be improved with a clearer presentation of the data.

      We thank the editors for their assessment. We are excited to present a revised version of the manuscript with improved data presentation and an improved discussion addressing the reviewer’s concerns.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the submitted manuscript, Port et al. investigated the host and viral factors influencing the airborne transmission of SARS-CoV-2 Alpha and Delta variants of concern (VOC) using a Syrian hamster model. The authors analyzed the viral load profiles of the animal respiratory tracts and air samples from cages by quantifying gRNA, sgRNA, and infectious virus titers. They also assessed the breathing patterns, exhaled aerosol aerodynamic profile, and size distribution of airborne particles after SARS-CoV-2 Alpha and Delta infections. The data showed that male sex was associated with increased viral replication and virus shedding in the air. The relationship between co-infection with VOCs and the exposure pattern/timeframe was also tested. This study appears to be an expansion of a previous report (Port et al., 2022, Nature Microbiology). The experimental designs were rigorous, and the data were solid. These results will contribute to the understanding of the roles of host and virus factors in the airborne transmission of SARS-CoV-2 VOCs.

      Reviewer #2 (Public Review):

      This manuscript by Port and colleagues describes rigorous experiments that provide a wealth of virologic, respiratory physiology, and particle aerodynamic data pertaining to aerosol transmission of SARS-CoV-2 between infected Syrian hamsters. The data is particularly significant because infection is compared between alpha and delta variants, and because viral load is assessed via numerous assays (gRNA, sgRNA, TCID) and in tissues as well as the ambient environment of the cage. The paper will be of interest to a broad range of scientists including infectious diseases physicians, virologists, immunologists and potentially epidemiologists. The strength of evidence is relatively high but limited by unclear presentation in certain parts of the paper.

      Important conclusions are that infectious virus is only detectable in air samples during a narrow window of time relative to tissue samples, that airway constriction increases dynamically over time during infection limiting production of fine aerosol droplets, that variants do not appear to exclude one another during simultaneous exposures and that exposures to virus via the aerosol route lead to lower viral loads relative to direct inoculation suggesting an exposure dose response relationship.

      While the paper is valuable, I found certain elements of the data presentation to be unclear and overly complex.

      Reviewer #1 (Recommendations For The Authors):

      We thank the reviewer for their comments and their attention to detail. We have taken the following steps to address their suggestions and concerns.

      However, the following concerns need to be addressed.

      1. Summary seems to be too simple, and some results are not clearly described in the summary.

      We have edited the summary and hope to have addressed the concerns raised by providing more information. We think that the summary includes all relevant findings.

      “It remains poorly understood how SARS-CoV-2 infection influences the physiological host factors important for aerosol transmission. We assessed breathing pattern, exhaled droplets, and infectious virus after infection with Alpha and Delta variants of concern (VOC) in the Syrian hamster. Both VOCs displayed a confined window of detectable airborne virus (24-48 h), shorter than that observed for oropharyngeal swabs. The loss of airborne shedding was linked to airway constriction resulting in a decrease of fine aerosols (1-10 µm) produced, which are suspected to be the major driver of airborne transmission. Male sex was associated with increased viral replication and virus shedding in the air. Next, we compared the transmission efficiency of both variants and found no significant differences. Transmission efficiency varied mostly among donors, 0-100% (including a superspreading event), and aerosol transmission over multiple chain links was representative of natural heterogeneity of exposure dose and downstream viral kinetics. Co-infection with VOCs only occurred when both viruses were shed by the same donor during an increased exposure timeframe (24-48 h). This highlights that assessment of host and virus factors resulting in a differential exhaled particle profile is critical for understanding airborne transmission.”

      2. Aerosol transmission experiment should be described in Materials and Methods although it is cited as Reference 21#;

      We have modified Line 433:

      “Aerosol caging

      Aerosol cages as described by Port et al. [2] were used for transmission experiments and air sampling as indicated. The aerosol transmission system consisted of plastic hamster boxes (Lab Products) connected by a plastic tube. The boxes were modified to accept a 7.62 cm (3") plastic sanitary fitting (McMaster-Carr), which enabled the length between the boxes to be changed. Airflow was generated with a vacuum pump (Vacuubrand) attached to the box housing the naïve animals and was controlled with a float-type meter/valve (McMaster-Carr).”

      And Line 458: “During the first 5 days, hamsters were housed in modified aerosol cages (only one hamster box) hooked up to an air pump.”.

      Especially, one superspreading event of Alpha VOC (donor animal) was observed in iteration A (Figure 4). What causes that event, experiment system?

      Based on the observed variation in airborne shedding (of the cages from which this was directly measured), we believe that one plausible explanation for the super-spreading event was that the Alpha-infected donor shed considerably more virus during the exposure than other donors, and thus more readily infected the sentinels. That said, it is also conceivable that other factors such as hamster behavior (e.g., closeness to the cage outlet, sleeping) or variable sentinel susceptibility could affect the distribution of transmissions.

      3. Same reference is repeatedly listed as Refs 2 and 21#.

      Addressed. We thank the reviewer for their attention to detail. We have also removed reference 53, which was the same as 54.

      4. Two forms of described time (hour and h) are used in the manuscript. Single form should be chosen.

      This has been addressed.

      5. Virus designation located in line 371 and line 583 is inconsistent, and it needs to be revised.

      For consistency we have chosen this nomenclature for the viruses used: SARS-CoV-2 variant Alpha (B.1.1.7) (hCoV-19/England/204820464/2020, EPI_ISL_683466) and variant Delta (B.1.617.2) (hCoV-19/USA/KY-CDC-2-4242084/2021, EPI_ISL_1823618).

      6. In Figure 5F, what time were lung and nasal turbinate tissues collected after virus infection?

      This has been added to the legend. Day 5. Line 904.

      7. Line 562-563, what is the coating antigen (spike protein, generated in-house)? purified or recombinant protein?

      It is in-house purified recombinant protein. This has been added to the methods.

      8. Line 575 and line 578: 10,000x is not standard description, and it should be revised.

      Done.

      Reviewer #2 (Recommendations For The Authors):

      We thank the reviewer for their comments and suggestions to improve the manuscript, and hope we have addressed all concerns adequately.

      • Direct interpretation of the linear regression slope in Figure 3 is challenging. Is the most relevant parameter for transmission known? Intuitively, it would be the absolute number of small droplets at a given timepoint rather than the slope and it would be easier to interpret if the data were reported in this fashion.

      We decided to show a percentage of counts to normalize the data among animals, as we observed large inter-individual variation in counts. The reviewer is correct that it is most likely the number of particles that would be most relevant to transmission, though much (including the role of particle size) remains to be determined. We have added a sentence to the results which explains this in L157.

      Therefore, we decided in this first analysis to utilize the slope measurement and not raw counts. The focus was on the slopes and how particle profiles were changing post inoculation. Because we have focused on percentages, it does not seem appropriate to present particle counts within each diameter range, since the analysis, model, and results are all based on these percentages of particles.

      Use of regression to compute slope is a useful measure because it uses data from all timepoints to estimate the regression line and, therefore, the % of particles on each day. We decided on these methods because efficiency is especially important in a study with a relatively small number of animals and slopes are also a good surrogate for how animal particle profiles are changing post-inoculation.

      To assist with the interpretation: 1) We removed Figure 3C and D and replaced Figure 3B with individual line plots for all conditions to visualize the slopes. The figure legend was corrected to reflect these changes.

      2) We replaced L169 onwards to read: (Figure 3B). Females had a steeper decline at an average rate of 2.2 per day after inoculation in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm) when compared to males, while holding variant group constant. When we compared variant group while holding sex constant, we found that the Delta group had a steeper decline at an average rate of 5.6 per day in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm); a similar trend, but not as steep, was observed for the Alpha group.

      The estimated difference in slopes for Delta vs. controls and Alpha vs. controls in the percent of <0.53 μm particles was 5.4 (two-sided adjusted p= 0.0001) and 2.4 (two-sided adjusted p = 0.0874), respectively. The estimated difference in slopes for percent of 1-10 μm particles was not as pronounced, but similar trends were observed for Delta and Alpha. Additionally, a linear mixed model was considered and produced virtually the same results as the simpler analysis described above; the corresponding linear mixed model estimates were the same and standard errors were similar.
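      The slope-based comparison described above can be illustrated with an ordinary least-squares fit of the percentage of particles in a size range against day post inoculation. This is a minimal sketch under stated assumptions: the data values below are hypothetical and chosen only for illustration, the helper `ols_slope` is not from the manuscript, and the authors' actual analysis (including the adjusted p-values and the linear mixed model) is not reproduced here.

```python
def ols_slope(days, percents):
    """Least-squares slope of percent-of-particles versus day (percent per day)."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(percents) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, percents))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx


# Hypothetical percentages of 1-10 um particles for one infected and one
# control animal; these numbers are illustrative, not study data.
days = [0, 1, 2, 3, 4, 5]
infected_pct = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]
control_pct = [40.0, 39.5, 39.0, 38.5, 38.0, 37.5]

infected_slope = ols_slope(days, infected_pct)
control_slope = ols_slope(days, control_pct)

# A negative difference means the infected animal declines faster than the control.
slope_difference = infected_slope - control_slope
print(infected_slope, control_slope, slope_difference)
```

      Because every timepoint contributes to the fitted line, a single slope summarizes the whole time course, which is the efficiency argument made in the response.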

      • Fig 4: what is "limit of quality" mentioned in the legend? Are these samples undetectable?

      We have clarified this in the legend: “3.3 = limit of detection for RNA (<10 copies/rxn)”. The limit of detection is 10 copies/rxn. Samples with fewer than 10 copies per reaction are considered below the limit of detection; they are taken to be negative and set to 10 copies/rxn, which equals 3.3 log10 copies/mL of oral swab.
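      The censoring rule described above can be sketched as follows. Note the assumption: the response does not state how copies per reaction are converted to copies per mL, so the factor of 200 used here is back-calculated from the statement that 10 copies/rxn corresponds to about 3.3 log10 copies/mL, and it should be treated as illustrative only. The function name is likewise hypothetical.

```python
import math

LOD_COPIES_PER_RXN = 10.0   # limit of detection stated in the legend
ASSUMED_RXN_TO_ML = 200.0   # assumption: inferred from 10 copies/rxn ~ 3.3 log10 copies/mL


def log10_copies_per_ml(copies_per_rxn):
    """Set values below the limit of detection to the limit, then log-transform."""
    censored = max(copies_per_rxn, LOD_COPIES_PER_RXN)
    return math.log10(censored * ASSUMED_RXN_TO_ML)


# Any measurement below 10 copies/rxn is plotted at the same floor value.
floor_value = log10_copies_per_ml(0.0)
print(round(floor_value, 1))
```

      Plotting censored samples at the floor keeps negatives visible on a log axis, at the cost of hiding how far below the limit they actually were.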

      • Fig 4C would be easier to process in graphical rather than tabular form. The meaning of the colors is unclear.

      We agree with the reviewer that this is difficult to interpret, but we are uncertain if the same data in a tabular format would be easier to digest. We realized that the legend was misplaced and have added this back into the figure, which we hope clarifies the colors and the limit of detection.

      • Figure 4D & E are uninterpretable. What do the pie charts represent?

      We have remodeled this part of the figure to a schematic representation of the majority variant which transmitted for each individual sentinel, and have added a table (Table S1) which summarizes the exact sequencing results for the oral swabs. The reviewer is correct that it was difficult to interpret the pie charts, considering most values are either 0 or close to 100%. We hope this addresses the question. The legend states:

      Author response image 1.

      Airborne attack rate of Alpha and Delta SARS-CoV-2 variants. Donor animals (N = 7) were inoculated with either the Alpha or Delta variant with 10³ TCID50 via the intranasal route and paired together randomly (1:1 ratio) in 7 attack rate scenarios (A-G). To each pair of donors, one day after inoculation, 4-5 sentinels were exposed for a duration of 4 h (i.e., h 24-28 post inoculation) in an aerosol transmission set-up at 200 cm distance. A. Schematic figure of the transmission set-up. B. Day 1 sgRNA detected in oral swabs taken from each donor after exposure ended. Individuals are depicted. Wilcoxon test, N = 7. Grey = Alpha, teal = Delta inoculated donors. C. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by sgRNA on days 2, 3, and 5 for each sentinel. Animals are grouped by scenario. Colors refer to legend below. 3.3 = limit of detection of RNA (<10 copies/rxn). D. Schematic representation of majority variant for each sentinel as assessed by percentage of Alpha and Delta detected in oropharyngeal swabs taken at day 2 and day 5 post exposure by deep sequencing. Grey = Alpha, teal = Delta, white = no transmission.

      • Fig S2G is uninterpretable. Please label and explain.

      We have now included an explanation of Figure S2G. The figure is a graphic representation of the neutralization data depicted in Figure S2F. The spacing between grid lines is 1 unit of antigenic distance, corresponding to a twofold dilution of serum in the neutralization assay. The resulting antigenic distance depicted between Alpha and Delta corresponds to roughly a 4-fold difference in neutralization between homologous (e.g., Alpha sera with the Alpha virus) and heterologous (e.g., Alpha sera with the Delta virus) pairings.
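      Since one grid unit corresponds to a twofold serum dilution, a fold difference in neutralization converts to antigenic units by taking the base-2 logarithm. The short sketch below shows that conversion; the function name is hypothetical, and it covers only the unit conversion stated in the response, not how the map itself was constructed.

```python
import math


def antigenic_units(fold_difference):
    """Convert a fold difference in neutralization titer to antigenic distance units."""
    return math.log2(fold_difference)


# A roughly 4-fold difference between homologous and heterologous titers.
alpha_delta_units = antigenic_units(4.0)
print(alpha_delta_units)
```

      On this scale, the roughly 4-fold difference quoted in the response corresponds to about two grid squares.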

      • I would consider emphasizing lines 220-225 in the summary and abstract. The important implication is that aerosol transmission is more representative of natural heterogeneity of exposure dose and downstream viral kinetics. This is an often-overlooked point.

      We agree with the reviewer and have added this in Line 43.

      • Fig 5: A cartoon similar to Fig 4A showing timing of sentinel exposure with number of animals would be helpful.

      We have added this as a new panel A for Figure 5. See the redrafted Figure 5 below.

      • For Fig 5E & F It would be helpful to use a statistical test to more formally assess whether proportion at exposure predicts proportion of variants in downstream sentinel infection.

      This has been added as a new Figure 5 panel H and I, which we hope addresses the reviewer’s comment.

      Author response image 2.

      Airborne competitiveness of Alpha and Delta SARS-CoV-2 variants. A. Schematic. Donor animals (N = 8) were inoculated with Alpha and Delta variant with 5 × 10² TCID50, respectively, via the intranasal route (1:1 ratio), and three groups of sentinels (Sentinels 1, 2, and 3) were exposed subsequently at a 16.5 cm distance. Animals were exposed at a 1:1 ratio; exposure occurred on day 1 (Donors → Sentinels 1) and day 2 (Sentinels → Sentinels). B. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by gRNA, sgRNA, and infectious titers on days 2 and 5 post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA followed by Šídák's multiple comparisons test. C/D/E. Corresponding gRNA, sgRNA, and infectious virus in lungs and nasal turbinates sampled five days post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA, followed by Šídák's multiple comparisons test. Dark orange = Donors, light orange = Sentinels 1, grey = Sentinels 2, dark grey = Sentinels 3, p-values indicated where significant. Dotted line = limit of quality. F. Percentage of Alpha and Delta detected in oropharyngeal swabs taken at days 2 and 5 post exposure for each individual donor and sentinel, determined by deep sequencing. Pie-charts depict individual animals. Grey = Alpha, teal = Delta. G. Lung and nasal turbinate samples collected on day 5 post inoculation/exposure. H. Summary of data of variant composition, violin plots depicting median and quantiles for each chain link (left) and for each set of samples collected (right). Shading indicates majority of variant (grey = Alpha, teal = Delta). I. Correlation plot depicting Spearman r for each chain link (right, day 2 swab) and for each set of samples collected across all animals (left). Colors refer to legend on right. Abbreviations: TCID, Tissue Culture Infectious Dose.

      We have additionally added to the results section: L284: “Combined, a trend, while not significant, was observed for increased replication of Delta after the first transmission event, but not after the second, and in the oropharyngeal cavity (swabs) as opposed to lungs (Figure 5H) (Donors compared to Sentinels 1: p = 0.0559; Donors compared to Sentinels 2: p > 0.9999; Kruskal-Wallis test, followed by Dunn’s test). Swabs taken at 2 DPI/DPE did significantly predict variant patterns in swabs on 5 DPI/DPE (Spearman’s r = 0.623, p = 0.00436) and virus competition in the lower respiratory tract (Spearman’s r = 0.60, p = 0.00848). Oral swab samples taken on day 5 strongly correlate with both upper (Spearman’s r = 0.816, p = 0.00001) and lower respiratory tract tissue samples (Spearman’s r = 0.832, p = 0.00002) taken on the same day (Figure 5I).”

      • Fig 1A: how are pfu/hour inferred? This is somewhat explained in the supplement, but I found the inclusion of model output as the first panel confusing and am still not 100% clear how this was done. Consider, explaining this in the body of the paper.

      We have added a more detailed explanation of the PFU/h inference to the main text: The motivation for the model was to link more readily measurable quantities such as RNA measured in oral swabs to the quantity of greatest interest for transmission (infectious virus per unit time in the air). To do this, we jointly infer the kinetics of shed airborne virus and parameters relating observable quantities (infected sentinels, plaques from purified air sample filters) to the actual longitudinal shedding. The inferential model uses mechanistic descriptions of deposition of infectious virus into the air, uptake from the air, and loss of infectious virus in the environment to extract estimates of the key kinetic parameters, as well as the resultant airborne shedding, for each animal.

      We have added this information to L106 in the results and hope this clarifies the rationale and execution of the model.

      More minor points:

      • Line 292: "poor proxy" seems too strong as peak levels of viral RNA correlate with positive airway cultures. It might be more accurate to say that high levels of viral RNA during early infection only somewhat correlate with positive airway cultures.

      We have rephrased this to clarify that while peak RNA viral loads are predictive of positive cultures, measuring RNA, especially early during infection and only once, may not be sufficient to infer the magnitude or time-dependence of infectious virus shedding into the air. See Line 308: “We found that swab viral load measurements are a valuable but imperfect proxy for the magnitude and timing of airborne shedding. Crucially, there is a period early in infection (around 24 h post-infection in inoculated hamsters) when oral swabs show high infectious virus titers, but air samples show low or undetectable levels of virus. Viral shedding should not be treated as a single quantity that rises and falls synchronously throughout the host; spatial models of infection may be required to identify the best correlates of airborne infectiousness [32]. Attempts to quantify an individual’s airborne infectiousness from swab measurements should thus be interpreted with caution, and these spatiotemporal factors should be considered carefully.”

      • Line 352: Re is dependent on time of an outbreak (population immunity) and cannot be specified for a given variant as it depends on multiple other variables

      We agree that the current phrasing here could be interpreted to suggest, incorrectly, that Re is an intrinsic property of a variant. We have deleted that language and reworded the section to emphasize that the critical question is heterogeneity in transmission, not mean reproduction number. Line 348: “Moreover, at the time of emergence of Delta, a large part of the human population was either previously exposed to and/or vaccinated against SARS-CoV-2; that underlying host immune landscape also affects the relative fitness of variants. Our naïve animal model does not capture the high prevalence of pre-existing immunity present in the human population and may therefore be less relevant for studying overall variant fitness in the current epidemiological context. Analyses of the cross-neutralization between Alpha and Delta suggest subtly different antigenic profiles [35], and Delta’s faster kinetics in humans may have also helped it cause more reinfections and “breakthrough” infections [36].

      Our two transmission experiments yielded different outcomes. When sentinel hamsters were sequentially exposed, first to Alpha and then to Delta, generally no dual infections—both variants detectable—were observed. In contrast, when we exposed hamsters simultaneously to one donor infected with Alpha and another infected with Delta, we were able to detect mixed-variant virus populations in sentinels in one of the cages (Cage F, see Appendix figures S1, S2). The fact that we saw both single-lineage and multi-lineage transmission events suggests that virus population bottlenecks at the point of transmission do indeed depend on exposure mode and duration, as well as donor host shedding. Notably, our analysis suggests that the Alpha-Delta co-infections observed in the Cage F sentinels could be due to that being the one cage in which both the Alpha and the Delta donor shed substantially over the course of the exposure (Appendix figures S2, S3). Mixed variant infections were not retained equally, and the relative variant frequencies differed between investigated compartments of the respiratory tract, suggesting roles for randomness or host-and-tissue specific differences in virus fitness.

      A combination of host, environmental and virus parameters, many of which vary through time, play a role in virus transmission. These include virus phenotype, shedding in air, individual variability and sex differences, changes in breathing patterns, and droplet size distributions. Alongside recognized social and environmental factors, these host and viral parameters might help explain why the epidemiology of SARS-CoV-2 exhibits classic features of over-dispersed transmission [37]. Namely, SARS-CoV-2 circulates continuously in the human population, but many transmission chains are self-limiting, while rarer superspreading events account for a substantial fraction of the virus’s total transmission. Heterogeneity in the respiratory viral loads is high and some infected humans release tens to thousands of SARS-CoV-2 virions/min [38, 39]. Our findings recapitulate this in an animal model and provide further insights into mechanisms underlying successful transmission events. Quantitative assessment of virus and host parameters responsible for the size, duration and infectivity of exhaled aerosols may be critical to advance our understanding of factors governing the efficiency and heterogeneity of transmission for SARS-CoV-2, and potentially other respiratory viruses. In turn, these insights may lay the foundation for interventions targeting individuals and settings with high risk of superspreading, to achieve efficient control of virus transmission [40].”

      • The limitation section should mention that this animal model does not capture the large prevalence of pre-existing immunity at present in the population and may therefore be less relevant in the current epidemiologic context.

      We agree and have added this more clearly, see response above.

      • Limitation: it is unclear if airway and droplet dynamics in the hamster model are representative of humans.

      We have added the following sentence: Line 331: “It remains to be determined how well airway and particle size distribution dynamics in Syrian hamsters model those in humans.”

      • The mathematical model is termed semi-mechanistic but I think this is not accurate as the model appears to have no mechanistic assumptions.

      We describe the model as semi-mechanistic because it uses mechanistic descriptions of the shedding and uptake process (as described above), incorporating factors including respiration rate and environmental loss, and makes the mechanistic assumption that measurable swab and airborne shedding all stem from a shared within-host infection process that produces exponential growth of virus up to a peak, followed by exponential decay. The model is only semi-mechanistic, however, as we do not attempt a full model of within-host viral replication and shedding (e.g. a target-cell limited virus kinetics model).
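      The shared within-host assumption described above (exponential growth up to a peak, followed by exponential decay) is piecewise linear on a log10 scale. The sketch below shows only that functional form. All parameter names and values are hypothetical, and it is not the authors' inferential model, which additionally links shedding to air sampling, respiration rate, and environmental loss.

```python
def log10_viral_load(t, t_peak, log10_peak, growth_per_h, decay_per_h):
    """Log10 viral load at time t (hours): linear rise to a peak, then linear fall.

    growth_per_h and decay_per_h are rates in log10 units per hour, so the
    curve is exponential growth and exponential decay on the natural scale.
    """
    if t <= t_peak:
        return log10_peak - growth_per_h * (t_peak - t)
    return log10_peak - decay_per_h * (t - t_peak)


# Hypothetical parameters: peak of 6 log10 units at 24 h, fast rise, slower decay.
params = dict(t_peak=24.0, log10_peak=6.0, growth_per_h=0.2, decay_per_h=0.05)

curve = [log10_viral_load(t, **params) for t in (0.0, 12.0, 24.0, 48.0, 72.0)]
print(curve)
```

      Fitting one such curve per animal and tying swab and air measurements to it is what makes the approach mechanistic in part, without modelling target cells or within-host replication explicitly.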

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their reading of the manuscript, and their suggestions. We have extensively addressed all these concerns in the text, and have also included several new data and figures in the revised version of the manuscript. We hope that our response and the new experimental data fully address the concerns raised by the reviewers. We include a detailed, point-by-point response to each of the reviewer concerns, pointing to new data and specific changes made in the main manuscript.

      Note: These new data have resulted in a new figure (Figure 6), a new supplementary figure (Figure 2-figure supplement 2), and an increase in the number of panels in each figure, as well as in the supplementary figures.

      General response comments, highlighting a few aspects missed by the reviewers

      This manuscript contains an enormous amount of data. This is understandable, since we are in part proposing an entirely new hypothesis and way to think about mitochondrial repression, built around substantial circumstantial evidence from diverse literature sources. But to keep the narrative readable and the main idea understandable, a lot of information had to be mentioned only very briefly in the text, and is therefore included as supplemental information. As a result, it may not always be apparent that this study has set several technical benchmarks. These experiments are extremely challenging to perform, took many iterations to standardize, and are in themselves a first in the field. Yeast cells have the highest known rate of glycolytic flux for any organism. Measuring this glycolytic rate using the formation of intermediates is hard, and all current estimates have been made in vitro, using a stop-flow type setup. In this study, we optimized and directly measured the glycolytic flux using isotope-labelled glucose (13C-glucose), which has never been reported before in highly glycolytic cells such as yeast. This is due to the very rapid label saturation (within seconds) after a 13C-glucose pulse (as is now shown in Figure 2-figure supplement 1). For brevity, this is summarized in this study with sufficient information to reproduce the method, but we will put out a more detailed, associated methodology paper describing several challenges, infrastructure requirements, and resources needed to carry out these types of experiments using yeast. An added highlight of these experiments with WT and Ubp3 deletion strains is the most direct experimental demonstration to date that glycolytic flux in yeast in high glucose follows zero-order kinetics, and depends entirely on the amounts of the glycolytic enzymes (presumably operating at maximal activity). This nicely complements the recent study by Grigatis 2022 (cited in the discussion), which suggests this possibility.
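
      As background for the zero-order statement above, the textbook Michaelis-Menten rate law illustrates why flux through a saturated enzyme tracks enzyme amount rather than substrate concentration. This is a single-enzyme sketch with illustrative parameter names and values, and is not the flux analysis used in the manuscript.

```python
def michaelis_menten_rate(enzyme, kcat, km, substrate):
    """Reaction rate v = kcat * E * S / (Km + S).

    When S >> Km the fraction S / (Km + S) approaches 1, so v approaches
    kcat * E: the rate is zero-order in substrate and scales with enzyme amount.
    """
    if km + substrate <= 0:
        raise ValueError("km + substrate must be positive")
    return kcat * enzyme * substrate / (km + substrate)

# Illustrative values only: substrate is 1000x Km, i.e. saturating.
saturated = michaelis_menten_rate(enzyme=1.0, kcat=10.0, km=0.1, substrate=100.0)
more_substrate = michaelis_menten_rate(enzyme=1.0, kcat=10.0, km=0.1, substrate=200.0)
half_enzyme = michaelis_menten_rate(enzyme=0.5, kcat=10.0, km=0.1, substrate=100.0)
```

      In this regime doubling the substrate barely changes the rate, whereas halving the enzyme amount halves it, which is the behaviour expected if glycolytic flux in high glucose is set by enzyme abundance.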

      Separately, this study required the estimation of total inorganic phosphates, as well as mitochondrial pools of phosphates. To date, no studies have estimated mitochondrial pools of phosphate (for a variety of reasons). In this study, we also experimentally determined the changes in mitochondrial phosphate pools. For this, we had to establish and standardize a rapid mitochondrial isolation method in yeast. Thus, this study provides the first quantitative estimates of mitochondrial Pi amounts (in the context of measured mitochondrial outputs), as now shown in Figure 4. This component on mitochondrial isolation in yeast to assess metabolites may also be explored in future as a methods paper.

      Specific responses to the Reviews:

      Reviewer #1 (Public Review):

      The study by Vengayil et al. presented a role for Ubp3 for mediating inorganic phosphate (Pi) compartmentalization in cytosol and mitochondria, which regulates metabolic flux between cytosolic glycolysis and mitochondrial processes. Although the exact function of increased Pi in mitochondria is not investigated, findings have valuable implications for understanding the metabolic interplay between glycolysis and respiration under glucose-rich conditions. They showed that UBP3 KO cells regulated decreased glycolytic flux by reducing the key Pi-dependent glycolytic enzyme abundances, consequently increasing Pi compartmentalization to mitochondria. Increased mitochondria Pi increases oxygen consumption and mitochondrial membrane potential, indicative of increased oxidative phosphorylation. In conclusion, the authors reported that the Pi utilization by cytosolic glycolytic enzymes is a key process for mitochondrial repression under glucose conditions.

      (1) However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The full-power yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study. In the yeast model, it has been well established that many phenotypes are genotype/strain dependent (Liti 2019, Gallone 2016, Boekout 2021, etc.), with some strains utilizing mitochondrial respiration even under high glucose conditions (Kaya 2021). It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.

      “However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The full-power yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study.”

      Thank you for the suggestion. We agree that a larger, universal statement cannot be made with data from a single strain, since yeasts do have substantial diversity. In this study, we had originally used a robust, prototrophic industrial strain (CEN.PK background). We have now utilized multiple, diverse strains of S. cerevisiae to test our findings. This includes strains from the common laboratory backgrounds – W303 and BY4742 – which have different auxotrophies, as well as another robust, highly flocculent strain from a prototrophic Σ1278 background. Using all these strains, we now comprehensively find that the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression is very well conserved. In all tested strains of S. cerevisiae the loss of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels in Figure 6A, B). These data now expand the generality of our findings, and strengthen the manuscript. These results are included in the revised manuscript as a new figure- Figure 6 and the associated text.

      Some of the included data in the revised manuscript are shown below:

      Author response image 1.

      Mitochondrial activity and Cox2 levels in ubp3Δ in different genetic backgrounds

      We also used the W303 strain to assess Pi levels, and its role in increasing mitochondrial respiration. We find that the loss of Ubp3 in this genetic background also increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D).

      Author response image 2.

      Basal OCR in WT vs ubp3Δ (W303 strain background) in normal vs low Pi

      These experiments collectively have strengthened our findings on the critical role of intracellular Pi budgeting as a general constraint for mitochondrial respiration in high glucose.

      “It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.”

      This is partially addressed above. At present, the relative basal respiration in glucose across different strains is not well known. We measured mitotracker activity in high glucose in multiple WT strains of S. cerevisiae (W303, Σ1278, S288C and BY4742, compared to the CEN.PK strain). These strains all largely had similar mitotracker potential, except for a slight increase in mitochondrial membrane potential in the Σ1278 strain. We further characterized this using Cox2 protein levels as well as basal OCR, and found that these do not increase. These data are shown below, and are not included in the main text since they do not add any new component to the study.

      Author response image 3.

      Mitochondrial respiration in different WT strains

      We did find this suggestion very interesting though, and are exploring directions for future research based on this suggestion. Since we have now identified a role for intracellular Pi allocation in regulating the Crabtree effect, an interesting direction can be to understand the glucose dependent mitochondrial Pi transport in Crabtree negative yeast strains. We will have to bring in a range of new tools and strains for this, so these experiments are beyond the focus of this current study.

      We hope that these new experiments in different genetic backgrounds increase the breadth and generality of our findings, and stimulate new lines of thinking to address how important the role of Pi budgeting as a constraint for mitochondrial repression in high glucose might be.

      (2) It is not described whether the drop in glycolytic flux also affects TCA cycle flux. Are there any changes in the pyruvate level? If the TCA cycle is also impaired, what drives increased mitochondrial respiration?

      Thank you for pointing this out; we agree that this should be included. We have addressed these concerns in the revised version of the manuscript.

      Since glucose derived pyruvate must enter the mitochondrial TCA cycle, one possibility is that a decrease in glycolytic rate could decrease the TCA flux. An alternate possibility is that the cells coincidently increase the pyruvate transport to mitochondria, to thereby maintain the TCA cycle flux comparable to that of WT cells. To test both these possibilities, we first measured the steady state levels of pyruvate and TCA cycle intermediates in WT vs ubp3Δ cells. We do not observe any significant change in the levels of pyruvate, or TCA cycle intermediates (except malate, which showed a significant decrease in ubp3Δ cells). This data is now included in the revised manuscript as Figure 2 – figure supplement 1D and figure supplement 2 A, along with associated text.

      Author response image 4.

      Pyruvate levels in WT vs ubp3Δ

      Author response image 5.

      Steady state TCA cycle intermediate levels

      Next, in order to address if the TCA cycle flux is impaired in ubp3Δ cells, we also measured the TCA cycle flux in WT vs ubp3Δ cells by pulsing the cells with 13C glucose and tracking 13C label incorporation from glucose into TCA cycle intermediates. This experiment first required substantial standardization, for the time of cell collection and quenching post 13C glucose addition, by measuring the kinetics of 13C incorporation into TCA cycle intermediates at different time points after 13C glucose addition. The standardization of this method is now included in the revised manuscript as Figure 2 – figure supplement 2 C, along with associated text, and is shown below for reference.

      Author response image 6.

      Kinetics of 13C labelling in TCA cycle intermediates

      Actual TCA cycle flux results: For measuring the TCA cycle flux, cells were treated with 1% 13C glucose and quenched, and samples were collected at 7 mins post glucose addition, which is in the linear range of 13C label incorporation (Figure 2 – figure supplement 2 C).
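
      The idea of sampling within the linear range of label incorporation can be sketched as below: compute the labelled fraction of a metabolite at each time point, then check that the early time points are well described by a straight line. The helper names are hypothetical and the fit is a plain least-squares line; this is not the analysis pipeline used in the manuscript.

```python
def fraction_labelled(unlabelled, labelled):
    """Fraction of a metabolite pool carrying 13C label, from signal intensities."""
    total = unlabelled + labelled
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return labelled / total

def linear_fit(times, values):
    """Ordinary least-squares line through (time, value) points.

    Returns (slope, intercept, r_squared). A high r_squared over the early
    time points suggests they lie in the linear range of label incorporation.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired points")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    stt = sum((t - mean_t) ** 2 for t in times)
    stv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    svv = sum((v - mean_v) ** 2 for v in values)
    slope = stv / stt
    intercept = mean_v - slope * mean_t
    r_squared = 1.0 if svv == 0 else (stv * stv) / (stt * svv)
    return slope, intercept, r_squared
```

      With a time course of labelled fractions, a sampling time would be chosen from the window in which `r_squared` stays close to 1, before the label begins to saturate.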

      Result: We did not observe any significant changes in the relative 13C label incorporation in TCA cycle intermediates. This data is included in the revised manuscript as Figure 2 – figure supplement 2 D, along with associated text, and is below for your reference.

      Author response image 7.

      TCA cycle flux

      What these data show is that the TCA cycle flux itself is not altered in ubp3Δ. A likely interpretation of this data is that this is due to the increase in the pyruvate transport to mitochondria in ubp3Δ cells, as indicated by the ~10-fold increase in Mpc3 (mitochondrial pyruvate transporter) protein levels (shown in Figure 5-figure supplement 5H), allowing the net same amount of pyruvate into the mitochondria. This increased mitochondrial pyruvate transport could support maintaining the TCA flux in ubp3Δ cells, and supporting the increased respiration. Putting a hierarchy together, the increased respiration in ubp3Δ cells could therefore be primarily due to increased Pi transport, followed by a consequent increase in ETC proteins. We leave it to the readers of this study to make this conclusion.

      We hope that we have addressed all concerns that the reviewer has with respect to TCA cycle flux in ubp3Δ cells.

      (3) In addition, some of the important literature was also missed in citation and discussion. For example, in a recent study (Ouyang et al., 2022), it was reported that phosphate starvation increases mitochondrial membrane potential independent of respiration in yeast and mammalian cells, and some of the conflicting results were presented in this study.

      We are very aware of the recent study by Ouyang et al, which reports that Pi starvation increases mitochondrial membrane potential independent of respiration. However, this study is distinct from the context of our case due to the reasons listed below.

      (a) The reviewer may have misinterpreted our low Pi condition as Pi starvation. There is no Pi ‘starvation’ in this study. Here, we cultured ubp3Δ and tdh2Δtdh3Δ cells in a low Pi medium with 1 mM Pi concentration in order to bring down the intracellular free Pi to that of WT levels. These cells are therefore not Pi-starved, but have been manipulated to have the same intracellular Pi levels as that of WT cells, as shown in Figure 4-figure supplement 1D. The Pi concentration in the medium is still in the millimolar range, and the cells are grown in this medium for a short time (~4 hrs) till they reach OD600 ~ 0.8. This is entirely different from the conditions used in Ouyang et al., 2022, where the cells were grown in a Pi-starvation condition with 1-100 micromolar Pi in the medium for a time duration of 6-8 hrs. Since cells respond differentially to changes in Pi concentrations over time (Vardi et al., 2014), the response to low Pi vs Pi starvation will be completely different.

      (b) In our study, mitochondrial membrane potential is used as only one of the readouts for mitochondrial activity. Our estimations of mitochondrial respiration are established by including other measurements such as Cox2 protein levels (as an indicator of the ETC) and basal OCR measurements (measuring respiration), all of which provide distinct information. The mitochondrial membrane potential can be regulated independently of the mitochondrial respiration state (Liu et al., 2021); using membrane potential alone as a readout to estimate mitochondrial respiration can therefore be limiting in the information it provides. As indicated earlier, mitochondrial membrane potential can change independently of mitochondrial respiration (Ouyang et al., 2022) and ATP synthesis (Liu et al., 2021). Since the focus of our study is mitochondrial respiration, and not just the change in membrane potential, drawing conclusions based on potential alone is ambiguous. Most studies in the field have in fact not used the comprehensive array of distinct estimates that we use in this study, and we believe the standards set in this study should become a norm for the field.

      (c) The only mutant that is similar to the Ouyang et al study is the Mir1 deletion mutant, which results in acute Pi starvation in mitochondria. In this strain, we find an increase in mitochondrial membrane potential. The data is not included in the manuscript but is shown below.

      Author response image 8.

      Mitochondrial potential in WT vs mir1Δ

      As clear from this data, mitochondrial membrane potential is significantly high in mir1Δ cells. However, the basal OCR and Cox2 protein levels clearly show decreased mitochondrial respiration which is expected in this mutant (Figure 5 A,B). This in fact highlights the limitations of solely relying on mitochondrial membrane potential measurements to draw conclusions, as doing so will lead to a misinterpretation of the actual mitochondrial activity in these cells. We do not wish to highlight limitations in other studies, but hope we make our point clear.

      (4) An additional experiment with strains lacking mitochondrial DNA under phosphate-rich and restricted conditions would further strengthen the result.

      Strains lacking mitochondrial DNA (Rho0 cells) cannot express the mitochondrially encoded ETC subunit proteins. These strains are therefore incapable of performing mitochondrial respiration. Since Rho0 cells are known to utilize alternate mechanisms to maintain their mitochondrial membrane potential (Liu et al., 2021), using mitotracker fluorescence as a readout of mitochondrial respiration in these strains under different Pi conditions is inconclusive and misleading due to the reasons mentioned in point number 3(b and c). However, since this was a concern raised by the reviewer, we now measured basal OCR in WT and Rho0 strains with Ubp3 deletion under normal vs low Pi medium. As expected, Rho0 cells show extremely low basal OCR values, an entire order of magnitude lower than WT cells. At these very low (barely detectable) levels the deletion of Ubp3 or change in Pi concentration in the medium does not change basal OCR, since these strains are not capable of respiration. We have included this data as Figure 4-figure supplement 1G.

      Author response image 9.

      Basal OCR in Rho0 cells

      (5) Western blot control panels should include entire membrane exposure, and non-cut western blots should be submitted as supplementary.

      The non-cut western blot images and the loading controls are now included in the revised manuscript as a supplementary file 2.

      (6) In Figure 4, it is shown that Pi addition decreases basal OCR to the WT level. However, the Cox2 level remains significantly higher. This data is confusing as to whether mitochondrial Pi directly regulates respiration or not.

      As described in the previous point, the Cox2 levels and the OCR provide distinct pieces of information. In figure 4, we show that culturing ubp3Δ in low Pi significantly decreases both Cox2 protein levels and basal OCR. Since Cox2 protein levels and basal OCR are different readouts for mitochondrial activity, there could be differences in the extent by which Pi availability controls each of these factors. Basal OCR is a direct readout for mitochondrial respiration, and is regulated by multiple factors including ETC protein levels, rate of ATP synthesis, rate of Pi transport etc. In figure 4, we find that culturing ubp3Δ in low Pi decreases basal OCR to WT level. This strongly suggests that high Pi levels are necessary to increase basal OCR in ubp3Δ.

      (7) Representative images of Ubp3 KO and wild-type strains stained with CMXRos are missing.

      Thank you for noticing this. This data is now included in the revised manuscript as Figure 1-figure supplement 1C.

      Author response image 10.

      (8) Overall, mitochondrial copy number and mtDNA copy number should be analyzed in WT and Ubp3 KO cells as well as Pi-treated and non-treated cells, and basal OCR data should be normalized accordingly. The reported normalization against OD is not appropriate.

      This is a valid concern raised by the reviewer, and something we had extensively considered during the study. To normalize the total mitochondrial amounts in each strain, we always measure the protein levels of the mitochondrial outer membrane protein Tom70. While we had described this in the methods, it may not have been obvious in the text. But this information is included in Figure 1-figure supplement 1G. We did not observe any significant change in Tom70 levels, suggesting that the total mitochondrial amount does not change in ubp3Δ, and we have noted this in the manuscript (results section relevant to Figure 1). As an additional control, to directly measure the mitochondrial amount in these conditions, we have now measured the mitochondrial volume in ubp3Δ cells and WT cells treated with Pi. For this, we used a strain which encodes mitochondria targeted with mNeon green protein (described in Dua et al., JCB, 2023), and which can therefore independently assess total mitochondrial amount. We do not observe any changes in mitochondrial volume or amounts in ubp3Δ cells or WT+Pi, compared to that of WT cells. Therefore, the change in mitochondrial respiration in Ubp3 deletion and Pi addition are not due to changes in total amounts of mitochondria in these conditions. Given all these, the normalization of basal OCR using total cell number is therefore the most appropriate way for normalization. This is also conventionally used for basal OCR normalization in multiple studies.

      We have now included these additional data on mitochondrial volumes and amounts in the revised manuscript as Figure 1-figure supplement 1F and Figure 5-figure supplement 1D, with associated text; they are also shown below.

      Author response image 11.

      Mitochondrial volume in WT vs ubp3Δ cells

      Author response image 12.

      Mitochondrial volume in WT and WT+Pi

      These data collectively address the reviewer’s concerns regarding changes in mitochondrial amounts in all the conditions and strains used in this study.

      Reviewer #2 (Public Review):

      Summary:

      Cells cultured in high glucose tend to repress mitochondrial biogenesis and activity, a prevailing phenotype called the Crabtree effect that is observed in different cell types and cancer. Many signaling pathways have been put forward to explain this effect. Vengayil et al proposed a new mechanism involving Ubp3/Ubp10 and phosphate that controls the glucose repression of mitochondria. The central hypothesis is that ∆ubp3 shifts the glycolysis to trehalose synthesis, therefore leading to the increase of Pi availability in the cytosol, then mitochondria receive more Pi, and therefore the glucose repression is reduced.

      Strengths:

      The strength is that the authors used an array of different assays to test their hypothesis. Most assays were well-designed and controlled.

      Weaknesses:

      I think the main conclusions are not strongly supported by the current dataset.

      (1) Although the authors discovered ∆ubp3 cells have higher Pi and mitochondrial activity than WT in high glucose, it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity. The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which derepress the mitochondrial activity, involves Ubp3 or not. Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation.

      “The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which de-repress the mitochondrial activity, involves Ubp3 or not.”

      We would like to clarify that the focus of this research is not on Ubp3, or to address mechanistic aspects of how Ubp3 regulates mitochondrial activity, or to identify the targets of Ubp3. That would be an entirely distinct study, with a very different approach.

      In this study, while carrying out a screen, we serendipitously found that ubp3Δ cells showed an increase in mitochondrial activity in high glucose. Subsequently, we used this observation, bolstered by diverse orthogonal approaches, to identify a general, systems-level principle that governs mitochondrial repression in high glucose. Through this, we identify a role of phosphate budgeting as a controller of mitochondrial repression in high glucose. In this study, our entire focus has been to use orthogonal approaches, as well as parsimonious interpretations, to establish this new hypothesis as a possibility. We hope this idea, supported by these data, will now enable researchers to pursue other experiments to establish the generality of this phenomenon.

      We have not focused our effort in identifying the role of Ubp3, or its regulation upon changes in glucose concentration in this context. That is a very specific, and separate effort, and misses the general point we address here. It is entirely possible that Ubp3 might also regulate mitochondrial activity by additional mechanisms other than mitochondrial Pi availability (such as via the reduction of key glycolytic enzymes at nodes of glycolysis, resulting in reduced glycolytic flux and rerouted glucose metabolism). Had the goal been to identify Ubp3 substrates, it is very likely that we would not have found the role of Pi homeostasis in controlling mitochondrial respiration. This is particularly because the loss of Ubp3 does not result in an acute disruption of glycolysis, unlike say a glycolytic enzyme mutant, which would have resulted in severe effects on growth and overall metabolic state. This would have made it difficult to dissect out finer details of metabolic principles that regulate mitochondrial respiration.

      In order to further corroborate our findings, we used the glycolysis-defective mutant tdh2Δtdh3Δ cells, where we find a similar change in Pi balance. This complements the key observations made using ubp3Δ cells. Separately, we utilized the glycolytic inhibitor 2DG to independently assess the role of mitochondrial Pi transport in regulating respiration. Together, in this study we do not rely on genetic mutants alone, but combine the Ubp3 deletion strain with a reduced GAPDH activity strain and pharmacologic inhibition of glycolysis. In addition, we find that mitochondrial Pi transporter levels are repressed under high glucose (Figure 5C, Figure 5-figure supplement 1B). Further, we find that mitochondrial Pi transport is important in increasing mitochondrial respiration upon a shift to low glucose and upon glycolytic inhibition by 2DG. Therefore, we collectively unravel a more systems-level principle that regulates glucose-mediated mitochondrial repression, as opposed to a mechanistic study of Ubp3 targets.

      Of course, given the conservation of Ubp3, we are very excited to pursue a mechanistic study of Ubp3 targets in future. This is a general challenge for deubiquitinase enzymes, and to date there are very few bona fide substrates known for any deubiquitinase enzyme, from any cellular system (due to challenges in the field that we discuss separately, and have included in the discussion section of this text).

      “Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation”

      The reviewer is correct in pointing out that in low glucose, the shift to trehalose synthesis might not be as relevant. We observe that the glycolysis-defective mutant tdh2Δtdh3Δ cells do not show an increase in trehalose synthesis (Figure 3-figure supplement 1E). However, in this context, the decrease in the rate of the GAPDH-catalysed reaction alone appears to be sufficient to increase the Pi levels (Figure 3F), even without an increase in trehalose. Therefore, there might be differences in the relative contributions of these two arms towards Pi balance, based on whether it is low glucose in the environment, or a mutant such as ubp3 that modulates glycolytic flux. In ubp3Δ cells, a low rate of the GAPDH-catalysed reaction is combined with high trehalose (based on how glycolytic flux is modulated), whereas in tdh2Δtdh3Δ cells only the low rate of the GAPDH-catalysed reaction occurs. As an end point, the increase in Pi happens in both cases, but with slightly differing outcomes. It should also be noted that, in terms of free Pi sources, a low-glucose condition (with low glycolytic rate) is very different from a no-glucose, respiratory condition (where cells perform very high gluconeogenesis). In high respiration conditions such as ethanol, cells switch to high gluconeogenesis, where there is a huge increase in trehalose synthesis as a default (e.g. see Varahan et al 2019). In this condition, trehalose synthesis could be a major source of Pi (e.g. see Gupta 2021), and could support the increased mitochondrial respiration. In an ethanol medium, the directionality of the GAPDH reaction is reversed. Therefore, this reaction will also now become an added source of Pi, instead of a consumer of Pi (see illustration in Figure 3G). A reasonable interpretation is therefore that a combination of increased trehalose and increased 1,3 BPG to G3P conversion can be a major Pi source for increasing mitochondrial respiration in a non-glucose, respiratory medium.

      “it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity”

      This is a valid point raised by the reviewer. We have already found that the protein levels of the mitochondrial Pi transporter are increased in a non-glucose respiratory (ethanol) medium and a low (0.1%) glucose medium (see Figure 5C, Figure 5-figure supplement 1B). In addition, we have tried measuring mitochondrial Pi levels in cells grown in a high glucose medium vs a respiratory, ethanol medium. The results are shown below for the reviewer’s reference.

      Author response image 13.

      Mitochondrial Pi levels in ethanol vs glucose

      We observe a clear trend where mitochondrial Pi levels are high in cells grown in ethanol medium compared to cells grown in high glucose. However, estimating Pi and normalising the Pi levels in isolated mitochondria is extremely difficult in this condition (note that this has never been done before). This is likely due to the rapid rate of conversion of ADP and Pi to ATP (in ethanol), which increases the variation in the estimation of steady state Pi levels, and to the high amounts of mitochondria in ethanol-grown cells. Since the data show high variation, we have not included them in the manuscript, but we are happy to include them here in the response.

      Indeed, this study opens up the exciting question of addressing how intracellular Pi allocation is regulated in different conditions of glucose. This can be further extended to Crabtree negative strains such as K. lactis which do not show mitochondrial repression in high glucose. All of these are rich future research programs.

      (2) The central hypothesis that Pi is the key constraint behind the glucose repression of mitochondrial biogenesis/activity is supported by the data that limiting Pi will suppress mitochondrial activity increase in these conditions (e.g., ∆ubp3). However, increasing the Pi supply failed to increase mitochondrial activity. The explanation put forward by the authors is that increased Pi supply will increase glycolysis activity, and somehow even reduce the mitochondrial Pi. I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity. The authors said "...that ubp3Δ do not increase mitochondrial Pi by merely increasing the Pi transporters, but rather by increasing available Pi pools". They showed that ∆ubp3 mitochondria had higher Pi but WT cells with medium Pi supplement showed lower Pi, it is hard to understand why the same Pi increase in the cytosol had a different outcome in mitochondrial Pi. Later on, they showed that the isolated mito exposed to higher Pi showed increased activity, so why can't increased Pi in intact cells increase mito activity? Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing.

      “I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity.”

      This is an interesting point, that requires a nuanced explanation, which we try to provide below.

      For mitochondrial respiration to increase in the presence of high Pi, the cytosolic Pi has to be transported to the mitochondria sufficiently. In ubp3Δ the increased free Pi (as a consequence of rewired glycolysis) is transported to the mitochondria (Figure 4). This increased mitochondrial Pi can therefore increase mitochondrial respiration in ubp3Δ.

      In case of WT+Pi, the externally supplemented Pi cannot further enter mitochondria (as shown in Figure 5-Figure supplement 1C) and is most likely restricted to the cytosol. Because of this inability of the Pi to access mitochondria, the mitochondrial respiration does not increase in WT+Pi (Figure 5-Figure supplement 1E).

      The likely reason for this difference in mitochondrial Pi transport in ubp3Δ vs WT+Pi is the relative difference in their glycolytic rate. The glycolytic rate is inherently decreased in ubp3Δ, but not in WT+Pi. To dissect this possibility of glycolytic rate itself contributing to the Pi availability in the mitochondria, we inhibited glycolysis in WT cells (using 2DG), and then supplemented Pi. Compared to cells in the same glucose condition (with 2DG, but without supplementing excess Pi), now the WT+Pi (+2DG) has higher mitochondrial respiration (Figure 5-Figure supplement 1F). This suggests that a combination of low glycolysis and high Pi is required for increasing mitochondrial respiration (as elaborated in the discussion section of the manuscript).

      An obvious question that arises out of this observation is how does the change in glycolytic rate regulate mitochondrial Pi transport. One consequence of altering the glycolytic rate is a change in cytosolic pH. This itself will bear on the extent of Pi transport into mitochondria, as discussed in detail below.

      In mitochondria, Pi is co-transported along with protons. Therefore, changes in cytosolic pH (which change the proton gradient) can control mitochondrial Pi transport (Hamel et al., 2004). Glycolytic rate is a major factor that controls cytosolic pH. The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (Orij et al., 2011). Therefore, under conditions of decreased glycolysis (such as loss of Ubp3), cytosolic pH becomes acidic. Since mitochondrial Pi transport is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport. Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondrial Pi transport, thereby leading to increased respiration.
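
      To give a sense of scale for the pH argument above, the following minimal sketch converts a cytosolic pH into a free proton concentration and reports the fold change in proton concentration for a given acidification. Only the starting value of pH ~7 comes from the text; the acidified value of 6.5 is a purely hypothetical number chosen for illustration, and the function names are ours. The sketch says nothing about transport kinetics, only about how strongly the proton concentration responds to a modest pH shift.

```python
def proton_concentration_nM(pH):
    """Free proton concentration in nanomolar for a given pH (pH = -log10[H+])."""
    return 10.0 ** (-pH) * 1e9


def proton_fold_change(pH_before, pH_after):
    """Fold change in free proton concentration when pH moves from pH_before to pH_after."""
    return 10.0 ** (pH_before - pH_after)


if __name__ == "__main__":
    glycolytic_pH = 7.0   # approximate value quoted in the text for highly glycolytic cells
    acidified_pH = 6.5    # HYPOTHETICAL value, chosen only to illustrate the scale

    print(f"[H+] at pH {glycolytic_pH}: {proton_concentration_nM(glycolytic_pH):.0f} nM")
    print(f"[H+] at pH {acidified_pH}: {proton_concentration_nM(acidified_pH):.0f} nM")
    print(f"fold change: {proton_fold_change(glycolytic_pH, acidified_pH):.2f}")
```

      Because pH is a logarithmic scale, even a half-unit acidification corresponds to a roughly threefold rise in free proton concentration, which is why a modest change in cytosolic pH could plausibly matter for a proton-coupled transporter.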

      To explain this and integrate all these points, we have extended a discussion section in this manuscript. We include this section below:

      “Supplementing Pi under conditions of low glycolysis (where mitochondrial Pi transport is enhanced), as well as directly supplementing Pi to isolated mitochondria, increases respiration (Figure 5, Figure 5-figure supplement 1). Therefore, in order to derepress mitochondria, a combination of increased Pi along with decreased glycolysis is required. An additional systems-level phenomenon that might regulate Pi transport to the mitochondria is the decrease in cytosolic pH upon decreased glycolysis (60, 61). The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (60, 61). Therefore, under conditions of decreased glycolysis (2DG treatment, deletion of Ubp3, and decreased GAPDH activity), cytosolic pH becomes acidic. Since mitochondrial Pi transport itself is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport (62). Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3, or decreased GAPDH activity), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondria Pi transport, thereby leading to increased respiration. Alternately, increasing mitochondrial Pi transporter amounts can achieve the same result, as seen by overexpressing Mir1 (Figure 5).”

      The possibility that changes in cytosolic pH regulate mitochondrial Pi transport, and thereby respiration, is a really interesting future research question, and an idea that has not been explored to date. This can stimulate new lines of thinking towards finding conserved biochemical principles that control mitochondrial repression in high glucose.

      “Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing”

      The increase in Mir1 in ubp3Δ shown in Figure 3A comes from the analysis of the proteomics dataset from a previous study (Isasa et al., 2015). Subsequently, we assessed Mir1 levels directly and more systematically by experiment, and did not observe an increase in Mir1 (Figure 4-figure supplement 1H in the revised manuscript). It is entirely possible that in a large-scale study (as in Isasa et al., 2015), some specific proteomic targets might not fully reproduce when tested very specifically (as is described in Handler et al., 2018 and Mehta et al., 2022). We do clearly indicate this in the text, but given the density of information in this study, it is understandable that this point was missed by the reviewer.

      (3) Given that there is no degradation difference for these glycolytic enzymes in ∆ubp3, and the authors found transcriptional level changes, suggests an alternative possibility where ∆ubp3 may signal through unknown mechanisms to parallelly regulate both mitochondrial biogenesis and glycolytic enzyme expression. The increase of trehalose synthesis usually happens in cells under proteostasis stress, so it is important to rule out whether ∆ubp3 signals these metabolic changes via proteostasis dysregulation. This echoes my first point that it is unknown whether wild-type cells use a similar mechanism as ∆ubp3 cells to regulate the glucose repression of mitochondria.

      We appreciate this point raised by the reviewer, but this again requires some clarification (as made earlier). The goal of this study was to identify systems-level principles that explain mitochondrial repression in high glucose. Although we started by performing a screen to identify proteostatic regulators of mitochondrial activity in high glucose, and identified Ubp3 as a mediator of mitochondrial activity, our approach was to use ubp3Δ cells as a model to understand the metabolic principles that regulate mitochondrial repression. This has been reiterated repeatedly in the manuscript – for example lines 123-124 “We therefore decided to use ubp3Δ cells to start delineating requirements for glucose-mediated mitochondrial repression.” and again in the discussion section – lines 442-460, where we discuss some unique advantages of using ubp3Δ cells to understand a general basis of mitochondrial regulation. To test this hypothesis, we also used orthogonal approaches, as well as other mutants and conditions with defective glycolysis, such as tdh2Δtdh3Δ cells and 2DG treatments. Only with these multiple converging lines of evidence do we infer that there might be a role for the change in Pi balance (due to changes in glycolytic rate) in regulating mitochondrial activity.

      We certainly agree that there is great value in identifying the mechanistic details of how Ubp3 regulates mitochondria. But this requires very distinct approaches not pursued in this study. This is not the question that we are addressing in this story. Separately, identifying targets of DUBs is one of the exceptional challenges in biology, since there are currently no straightforward chemical-biology approaches to do so for this class of proteins. Unlike for kinase/phosphatase systems, or even ubiquitin ligases, approaches such as substrate-trapping mutants have proven to be abject failures in identifying direct targets of DUBs. A quantitative proteomics study might suggest some proteins/cellular processes regulated by Ubp3. This has been attempted for several DUBs, but rarely have any direct substrates of DUBs ever been identified, in any system. A high-quality, quantitative, descriptive proteome dataset of ubp3Δ cells is already available from a previous study (Isasa et al., 2015), which we cite extensively in this manuscript, and which was indeed invaluable for this study. We cannot improve on the outstanding dataset already available. Interestingly, the findings of that study actually help substantiate our idea of an increased mitochondrial activity and change in Pi homeostasis in ubp3Δ cells. The Isasa et al. dataset finds that proteins involved in mitochondrial respiration are high in ubp3Δ cells, and that the glycolytic enzymes and PHO regulon proteins are reduced. In our study, using these data references, we were able to conceptually piece together how changes in glycolytic flux can alter Pi balance.

      Apart from identifying changes in protein levels, a separate challenge in making sense of this quantitative proteomics data is the difficulty in pinpointing any target of Ubp3 that specifically regulates these processes. A single DUB can have multiple substrates, and these could regulate the cellular metabolic state in a combinatorial manner. This is the essence of how all signaling regulators function, and it is therefore important to understand their systems-level regulation of cell states (separate from their specific individual substrates). Therefore, identifying the specific target of Ubp3 responsible for this metabolic rewiring can be very challenging. These experiments are well beyond the scope or interest of the current manuscript.

      If we had pursued that road in this study, we would not have made any general findings related to Pi balance, nor would this more general hypothesis have emerged.

      (4) Other major concerns:

      (a) The authors selectively showed a few proteins in their manuscript to support their conclusion. For example, only Cox2 and Tom70 were used to illustrate mitochondrial biogenesis difference in line 97. Later on, they re-analyzed the previous MS dataset from Isasa et al 2015 and showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins. However, I checked that MS dataset myself and saw that many key OXPHOS proteins do not change, for example, both ATP1 and ATP2 do not change, which encode the alpha and beta subunits of F1 ATPase. They selectively reported the proteins' change in the direction along with their hypothesis.

      To clarify, we observe an increase in Cox2 protein levels but not in Tom70 levels, which suggests that there is no increase in mitochondrial biogenesis. The increase is specific to some respiration-related mitochondrial proteins such as Cox2 (Figure 1E, Figure 3A). We have clearly pointed this out in the manuscript. We used Cox2 protein levels as an additional readout for ETC activity, to validate our observations coming from the potentiometric mitotracker readouts and basal oxygen consumption rate (OCR) measurements. This was for three reasons. (i) Cox2 is a mitochondrial genome-encoded subunit of complex IV (cytochrome c oxidase) in the ETC, and has a redox centre critical for cytochrome c oxidase activity. (ii) The biogenesis and assembly of complex IV subunits have been studied with respect to multiple conditions such as glucose availability and hypoxia, and the expression and stability of the mitochondrially encoded complex IV subunits are exceptionally well correlated with changes in mitochondrial respiration (Fontanesi et al., 2006). (iii) Cox2 is very well characterised in S. cerevisiae, and the commercially available Cox2 antibodies are outstanding, which makes estimating Cox2 levels by western blotting unambiguous and reproducible.

      We re-analyzed the proteomic dataset from Isasa et al to find out additional information regarding the key nodes that are differentially regulated in ubp3Δ. We have not claimed at any point in the manuscript that all OXPHOS related proteins are upregulated in ubp3Δ, nor is there any need for that to be so. We identified Ubp3 from our screen, observed an increase in mitochondrial potential, basal OCR, and Cox2 levels. We later found out that the proteomic data set for ubp3Δ also supports our observations that mitochondrial respiration is upregulated in ubp3Δ. The reviewer points out that we “showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins”. Our conclusion is that the deletion of Ubp3 increases mitochondrial respiration. The combined readouts which we used to reach this conclusion (OCR, mitochondrial potential, mitochondrial ATP production, Cox2 levels) are far more direct, comprehensive and conclusive than showing an increase in a few proteins related to OXPHOS, as also explained earlier toward a distinct reviewer query. Since different mitochondrial proteins are regulated by different mechanisms, we need not see an increase in all the OXPHOS proteins in a mutant like ubp3Δ where mitochondrial respiration is high. An increase in some key proteins would be sufficient to increase the respiration as seen in our case.

      To summarise, the proteomic dataset supports our observation, but our conclusions are not dependent on the increase in OXPHOS proteins observed in the dataset.

      (b) The authors said they deleted ETC component Cox2 in line 111. I checked their method and table S1, I cannot figure out how they selectively deleted COX2 from mtDNA. This must be a mistake.

      Yes, we understand that for mitochondrially encoded proteins, a simple knock-out strategy has limitations. However, we first tried to generate the Cox2 deletion mutant by a standard PCR mediated gene deletion strategy (Longtine 1998), with the optimistic assumption that even if all Cox2 is not lost, a substantial fraction of the Cox2 genes would be lost via recombination. We selected the transformants after strong antibiotic selection, and then we measured the Cox2 protein levels. Gratifyingly, we found that the mutant strain had substantially decreased Cox2 protein levels (but not a complete loss), and this was retained across generations. The data is shown below.

      Author response image 14.

      Cox2 levels in WT vs Cox2 mutants

      Since the mutants have decreased Cox2 levels, we went ahead and performed growth assays using this strain, in a WT or Ubp3 deletion background. Deletion of Ubp3 in the Cox2 mutant resulted in a more severe growth defect.

      However, we fully agree that this strain is not a complete Cox2 knockout, and it is possible that the decrease in Cox2 is due to modifications in some other unrelated gene. In the text, we should also not have named this cox2Δ. Since we are not sure of the exact genetic modification in this mutant, we have removed this data from the revised manuscript.

      Instead, we have now repeated all experiments, utilizing a fully characterised Cox2 mutant, cox262, described in (5), which has defective respiration. In this revised version, we find that deletion of Ubp3 in this strain retains the originally observed severe growth defect in glucose. This is consistent with our conclusion that functional mitochondria are required for proper growth in the ubp3Δ mutant. To separately validate this conclusion, we also utilized a Rho0 strain, which does not have mitochondrial DNA and thereby cannot perform mitochondrial respiration. We show that deletion of Ubp3 results in a more severe growth defect in a Rho0 strain. These results are included in the revised manuscript as Figure 1-figure supplement 1I.

      Author response image 15.

      We further confirmed, using a Seahorse assay, that the Rho0 strain and the Rho0 ubp3 strain are incapable of respiration. This data is included in the revised manuscript as Figure 4-figure supplement 1G.

      Author response image 16.

      Basal OCR in Rho0 cells

      We hope that these new data address the reviewer’s concerns about the Cox2 mutant.

      (c) They used sodium azide in a lot of assays to inhibit complex IV. However, this chemical is nonspecific and broadly affects many ATPases as well. Not sure why they do not use more specific inhibitors that are commonly used to assay OCR in seahorse.

      We have now performed growth assays for WT and ubp3Δ cells in the presence of specific mitochondrial OXPHOS inhibitors - oligomycin and FCCP. We observe a more severe growth defect in ubp3Δ cells compared to WT cells in the presence of oligomycin and FCCP, similar to the results observed with sodium azide. All these data are now included in the revised manuscript as Figure 1I, Figure1-figure supplement 1H, along with associated text.

      Author response image 17.

      Growth rate in the presence of FCCP

      Author response image 18.

      Figure1-figure supplement 1H- Growth rate in the presence of oligomycin

      We hope that these new data address the reviewer’s concerns.

      (d) The authors measured cellular Pi level by grinding the entire cells to release Pi. However, this will lead to a mix of cytosolic and vacuolar Pi. Related to this caveat, the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration. One possibility is that the observed cytosolic Pi level changes were caused by the measurement issue mentioned above.

      The Pi estimation shown in Figure 3C, E, F and G is a measure of total Pi in the cells. The vacuole is a major storehouse of phosphate in cells. However, unlike plant cells, where free phosphate is stored in vacuoles, yeast vacuoles store phosphate only in the form of polyphosphates (Yang et al., 2017, Hürlimann et al., 2007). The free Pi formed from the hydrolysis of polyphosphate is subsequently transported to the cytosol via the exporter Pho91 (Hürlimann et al., 2007). This makes the cytosol and mitochondria the major stores of usable free Pi in yeast. Since the malachite green assay that we use for phosphate estimation is specific to free Pi, and not polyphosphate, the Pi estimates that we show in Figure 3 come from a combination of cytosolic and mitochondrial Pi. As explained earlier, in order to specifically measure mitochondrial Pi, we have established methods to rapidly isolate mitochondria, and then followed this by estimating Pi in these isolated mitochondria (Figure 4B). Here we clearly see a large increase in mitochondrial Pi in the Ubp3 deletion cells. This allows us to estimate the changes in Pi levels that are specific to mitochondria, without relying only on total Pi changes.

      “the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration”

      The reviewer has completely missed the fact that the glycolytic rate in yeast is the highest known for any cell. While the steady-state levels of glycolytic metabolites might be ~2 mM, the process of glycolysis is not static but is rapid and continuous. Glucose is continuously broken down and converted to pyruvate, along with the consumption of Pi and generation of ATP. This is the reason for the rapid 13C label saturation (within seconds of 13C glucose addition) in yeast cells (Figure 2-figure supplement 1F). This near-instantaneous label saturation makes accurate flux measurements arduous, which is why we had to optimize a method for measuring glycolytic flux in yeast cells (Figure 2D, Figure 2-figure supplement 1F). Indeed, for that reason, our measurements of glycolytic flux in yeast are the first to be reported in the field. This in itself is an enormously challenging experiment, and establishes a new benchmark.

      In highly glycolytic cells, most of the ATP is synthesized via glycolysis, and the rates of glycolysis and ATP synthesis are very high. In the reaction catalysed by GAPDH, Pi and ADP are converted to ATP. The ATP formed acts as a Pi donor for most of the Pi-consuming reactions in the cell. Some of these processes, such as protein translation, utilize ATP but release Pi and ADP, and this Pi re-enters the cellular Pi pool. Several other reactions, such as nucleotide biosynthesis, polyphosphate biosynthesis and protein phosphorylation, use ATP as a Pi donor, and the Pi is fixed in biomolecules. Increasing the rates of these ‘Pi sinks’ can therefore result in a decrease in Pi pools. This is a concept we have earlier tried to clarify more elaborately in (Gupta and Laxman, 2021). In fact, increasing nucleotide biosynthesis and polyphosphate synthesis has earlier been suggested to decrease available free Pi (Austin and Mayer 2020, Desfougères et al., 2016). When glycolytic flux is high, this is coupled/tuned to the consumption of Pi, which will be correspondingly high due to increased ATP, nucleotide and polyphosphate synthesis. Pi levels rapidly decrease upon glucose addition, due to the continuous Pi consumption during glycolysis (Hohmann et al., 1996, Van Heerden et al., 2014, Koobs et al., 1972). Therefore, changes in glycolytic rate due to changes in glycolytic enzyme levels can result in significant changes in Pi levels due to changes in the Pi consumption rate.
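
      The distinction drawn above between pool size and flux can be sketched with simple arithmetic. In the toy calculation below, the ~50 mM pool is the figure quoted by the reviewer; the consumption and release rates are hypothetical placeholders, not measured values, and the function names are ours. The point is only that a large pool can shift noticeably within seconds when the fluxes through it are large and slightly unbalanced.

```python
def turnover_time_s(pool_mM, flux_mM_per_s):
    """Time in seconds for a flux to pass an amount equal to the whole pool."""
    if flux_mM_per_s <= 0:
        raise ValueError("flux must be positive")
    return pool_mM / flux_mM_per_s


def pool_after(pool_mM, consumption_mM_per_s, release_mM_per_s, seconds):
    """Pool size after a period of constant consumption and release (floored at zero)."""
    net_rate = release_mM_per_s - consumption_mM_per_s
    return max(0.0, pool_mM + net_rate * seconds)


if __name__ == "__main__":
    pi_pool_mM = 50.0    # cytosolic Pi figure quoted in the reviewer's comment
    consumption = 5.0    # HYPOTHETICAL Pi consumption rate (mM/s), for illustration only
    release = 4.5        # HYPOTHETICAL Pi release rate (mM/s), for illustration only

    print("turnover time (s):", turnover_time_s(pi_pool_mM, consumption))
    print("pool after 20 s (mM):", pool_after(pi_pool_mM, consumption, release, 20))
```

      Under these assumed rates, a 10% imbalance between consumption and release would change the pool by several mM within tens of seconds, even though the steady-state metabolite concentrations are small relative to the pool.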

      Our results also show that, apart from Pi levels, the glycolytic state can regulate mitochondrial Pi transport as well. This is the reason that mitochondrial Pi levels and basal OCR do not increase merely upon adding Pi to cells. We show that basal OCR can be increased by adding Pi in the presence of 2DG. This regulation of mitochondrial Pi transport is a major limiting factor for mitochondrial respiration and could be mediated partly by the regulation of Mir1 levels and also by changes in the cytosolic pH, which regulates the rate of mitochondrial Pi transport. We have discussed these points in the discussion section of our manuscript.

      We hope that this clarifies the reviewer’s concerns regarding how changes in glycolytic rate can regulate changes in cytosolic Pi levels.

      (e) The authors used ∆mir1 and MIR1 OE to show that Pi viability in the mitochondrial matrix is important for mitochondrial activity and biogenesis. This is not surprising as Pi is a key substrate required for OXPHOS activity. I doubt the approach of adding a control to determine whether Pi has a specific regulatory function, while other OXPHOS substrates, like ADP, O2 etc do not have the same effect.

      To clarify, we only used the mir1Δ cells to understand the requirement for Pi transport from cytosol to mitochondria in controlling respiration. The reviewer is correct in stating that deletion of Mir1 would reduce Pi import to mitochondria and thereby inhibit respiration. This is exactly the conclusion we suggest from this experiment as stated in the manuscript – “These data suggest that mitochondrial Pi transport (via Mir1) is critical for maintaining basal mitochondrial activity even in high glucose”. We have only used these experiments to support the idea that even though glycolysis and mitochondria are in different compartments, a change in Pi balance in one compartment (cytosol) can affect Pi levels in the other (mitochondria), since there is Pi transport between these two compartments. Since mitochondria have their own polyphosphate reserves, in the absence of these experiments with mir1Δ cells it could be imagined that mitochondrial polyP is an additional source of Pi to support respiration, and therefore that changes in cytosolic Pi have only a minor effect on mitochondrial respiration. Our experiments with mir1Δ and Mir1-OE cells clearly indicate that Pi transport to mitochondria from the cytosol is important for respiration, and therefore that changes in cytosolic Pi levels (or maintaining cytosolic Pi at a lower level due to the rate of glycolysis) will have ripple effects on mitochondrial Pi availability. Further, these data suggest that, for example, under glycolytic inhibition (low glucose, or 2DG), while all factors (signalling, substrate availability etc) favour respiration (and mitochondrial derepression), cells are unable to achieve this in the absence of ample Pi transport from the cytosol. This therefore places Pi at centre stage in controlling mitochondrial respiration.

      We conclude that Pi is a major, but not the only, constraint for mitochondrial respiration. There certainly could be a role for ADP, oxygen availability etc in regulating respiration. However, these are beyond the scope of our study. We have discussed the potential role of ADP in regulating mitochondrial repression in the discussion section: “An additional consideration is the possible contribution of changes in ADP in regulating mitochondrial activity, where the use of ADP in glycolysis might limit mitochondrial ADP. Therefore, when Pi changes as a consequence of glycolysis, it could be imagined that a change in ADP balance can coincidentally occur. However, prior studies show that even though cytosolic ADP decreases in the presence of glucose, this does not limit mitochondrial ADP uptake, or decrease respiration, due to the very high affinity of the mitochondrial ADP transporter.”

      We hope that this clarifies the reviewer’s concerns regarding the use of Mir1 OE and mir1Δ strains.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Some of the experiments should be repeated in other strain backgrounds for reproducibility and rigor.

      As discussed in the response to point number 1, we have now utilized multiple strains of S. cerevisiae to test our findings. We now find that our discoveries regarding the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression are conserved across multiple genetic backgrounds of S. cerevisiae. These results are included in the revised manuscript as a new figure- Figure 6 and associated text. We used the W303, Σ1278 and BY4742 strains of S. cerevisiae to show that deletion of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels). Using the W303 strain we show that the deletion of Ubp3 increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D). These added experiments have substantially broadened the generality of our findings.

      The number of biological repeats needs to be increased in all experiments.

      We have increased the number of biological repeats in key experiments that show that the increased Pi levels are necessary for the increased mitochondrial respiration in ubp3Δ and tdh2Δtdh3Δ cells (revised Figure 4F). Apart from a few basal OCR measurements and mitotracker data in the supplementary figures, all our experiments are performed with 3 biological repeats. In the case of basal OCR measurements, yeast cells have to be aliquoted onto poly-L-lysine coated Seahorse plates and centrifuged to ensure that the cells are properly settled. This is due to the non-adherent nature of yeast cells. During the centrifugation step, the wells in the two end rows cannot be utilized due to uneven settling of cells, which affects the basal OCR readings in these wells. For several experiments that involve multiple samples, we therefore had to restrict the number of biological replicates to 2 (repeated independently), so that all samples could be accommodated in the plate.
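
      The plate constraint described above is simple bookkeeping, and can be sketched as follows. The plate dimensions, sample count and number of technical replicates used here are hypothetical values chosen for illustration (the response does not state them), and the function names are ours.

```python
def usable_wells(rows, cols, excluded_end_rows=2, background_wells=0):
    """Wells left once the end rows (uneven cell settling) and background wells are excluded."""
    usable_rows = rows - excluded_end_rows
    if usable_rows <= 0:
        raise ValueError("no usable rows left on the plate")
    return usable_rows * cols - background_wells


def max_biological_replicates(n_samples, technical_replicates, wells):
    """Largest number of biological replicates per sample that fits in the given wells."""
    return wells // (n_samples * technical_replicates)


if __name__ == "__main__":
    # HYPOTHETICAL layout: an 8 x 12 plate, 8 samples, 4 technical replicates each.
    wells = usable_wells(rows=8, cols=12)
    print("usable wells:", wells)
    print("max biological replicates:", max_biological_replicates(8, 4, wells))
```

      With these assumed numbers, losing the two end rows is enough to cap the design at two biological replicates per sample on a single plate.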

      Full western blot images should be supplemented along with the other data.

      The complete western blot images are now included in the revised manuscript as supplementary file 2.

      TCA cycle flux should be analyzed and presented in the study to conclude some of the findings.

      As discussed in detail in the response to point number 2, we have performed steady state and flux measurements for TCA cycle intermediates. This data is now included as a new supplement figure- Figure 2-figure supplement 2.

      Reviewer #2 (Recommendations For The Authors):

      (1) In Fig. 2A, they should also include the gluconeogenesis enzymes (fructose 1,6 bisphosphatase, PEP carboxykinase, and pyruvate carboxylase) to exclude the possibility that glycolytic intermediates are not rerouted to gluconeogenesis.

      We measured the protein levels of Fbp1 (fructose 1,6 bisphosphatase) and Pck1 (PEP carboxykinase). We observed an increase in the protein levels of both enzymes in ubp3Δ. The data is shown below.

      Author response image 19.

      Fbp1 and Pck1 protein levels

      While we agree that this is an interesting observation which might help us in understanding the metabolic rewiring in ubp3Δ, we have not included this data in the current revised version of the manuscript due to two main reasons.

      (1) Since ubp3Δ cells have a defective glycolysis and therefore a defective glucose repression, the mRNA and protein levels of gluconeogenic enzymes which are usually glucose-repressed might increase. This might be a response at the level of transcription and translation of these enzymes and might or might not change the rate of gluconeogenesis in these cells. This is because of multiple other factors that regulate gluconeogenic flux such as allostery, mass action etc. Therefore, to avoid confounding our main points and since we cannot make a conclusive assumption on the gluconeogenic metabolism in these mutants, we don’t include this data. The primary focus of our story is the mitochondrial repression component. Understanding the feedback controls that alter gluconeogenesis in these mutants is beyond the scope of this study and could be addressed in a separate future study.

      (2) As we highlight extensively in the response letter and in the manuscript, our aim is not to understand the specific mechanistic role of Ubp3. In this manuscript, we identify the conserved constraints that control mitochondrial repression without focusing just on the role of Ubp3 in regulating this. Whether Ubp3 regulates gluconeogenesis is a question that could be addressed in a future study that focuses on identifying the altered signalling mechanisms in ubp3Δ and the targets of Ubp3.

      (2) In line 292, page 10, there is a typo "dermine".

      We apologize for this mistake. Corrected.

      (3) In Figure 5A, is there a reason why they chose 0.1% glucose condition as a low glucose condition? Also, is there a dose-dependent change in OCR or other mitochondrial functions according to the concentration of glucose?

      The glucose concentration of 0.1% was selected to decrease (but not completely remove) the available glucose. 0.1% glucose is considered a standard low glucose condition in S. cerevisiae (Yin et al., 2003), and the effect of this glucose concentration on cellular processes has been extensively studied (Yin et al., 2003, Takeda et al., 2015 etc). A glucose concentration below 0.2% is the critical threshold for activating respiratory metabolism (Takeda et al., 2015), and shifting cells to 0.1% glucose in our experiments will activate respiration, as we show in our data. However, this is very different from completely removing glucose or using an alternate carbon source such as ethanol, because this would result in full activation of gluconeogenesis. We further find that when cells are grown in ethanol, the gluconeogenic activation will also change the Pi homeostasis. This will in part be a result of the fully reversed direction of the GAPDH-catalysed reaction (Figure 3G). If such a condition is used, it could lead to misinterpretations, and confound the conclusions that we make from this set of experiments, where Pi homeostasis plays a major role. In 0.1% glucose it has been shown that gluconeogenesis is still partly repressed (Yin et al., 2003). The pathways utilizing alternate carbon sources still remain repressed (even though to a lower extent compared to 2% glucose) in 0.1% glucose (Yin et al., 2003). We hope that this clarifies the concerns regarding the rationale behind using 0.1% glucose in our experiments.

      The extent of glucose repression is dependent on the concentration of glucose. Glucose concentration >1% has been shown to activate degradation of mRNAs involved in alternate carbon utilization. The different signaling pathways involved in growth under glucose and in glucose repression are regulated by glucose concentration. This is discussed in detail in Yin et al., 2002. We also observe a dose-dependent increase in mitochondrial membrane potential in the presence of 2DG (Figure 5–figure supplement 1A). This also suggests that the rate of glycolysis (which could also be mediated by changes in glucose concentration) can regulate the extent of mitochondrial derepression.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):  

      First, the metabolic network in this study is incomplete. For example, amino acid synthesis and lipid synthesis are important for biomass and growth, but they are not included in the three models used in this study. NADH and NADPH are as important as ATP/ADP/AMP, but they are not included in the models. In the future, a more comprehensive metabolic and biosynthesis model is required.  

      Thank you for the critical comment on the weakness of the present study. We actually tried to study a larger model, that of Turnborg et al (2021), which is a model of JCVI-syn3A, but we gave up on including it in the list of models that we study in depth. This is because we noticed that the concentration of ATP in the model can become negative (we confirmed this with one of the authors of the paper). Another "big" kinetic model of metabolism that we could list would be Khodayari et al (2017). However, we could not find other models with which to compare the dynamics of this big model. Therefore, we decided to use the model only for the central carbon metabolism for now. We would like to leave a more extended study for the near future.  

      We would like to mention that NADH and NADPH are included in the Khodayari model and the Boecker model, although NADH and NADPH are lumped together as NADH in the latter model.  

      Second, this work does not provide a mathematical explanation of the perturbation response χ. Since the perturbation analysis is performed close to the steady state (or at least belongs to the attractor of single-steady-state), local linear analysis would provide useful information. By complementing with other analysis in dynamical systems (described below) we can gain more logical insights about perturbation response.  

      We tried a linear stability analysis. However, with the perturbation strength we used here, the linearization of the model is no longer valid, in the sense that the linearized model leads to negative concentrations of the metabolites (x<sub>st</sub> + Δx < 0 for some metabolites). We have added a scatter plot of the response coefficients of trajectories sharing the same initial condition, with the dynamics computed by the original model and the linearized model, respectively (Fig. S1).

      Since the response coefficient is based on the logarithm of the concentrations, as the metabolite concentrations approach zero, the response coefficient becomes larger. The high response coefficient in the Boecker and Chassagnole model would be explained by this artifact.  The linearized Khodayari model shows either χ~1 or χ = 0 (one or more metabolite concentrations become negative). This could be due to the number of variables in the model. For the response coefficient to have a larger value, the perturbation should be along the eigenvector that leads to oscillatory dynamics with long relaxation time (i.e., the corresponding eigenvalue has a small real part in terms of absolute value and a non-zero imaginary part). However, since the Khodayari model has about 800 variables, if perturbations are along such directions, there is a high probability that one or more metabolite concentrations will become negative.

      We fully agree that if the perturbations on the metabolite concentrations are in the linear regime, the response to the perturbations can be estimated by checking the eigenvalues and eigenvectors. However, we would say that the relationship between the linearized model (and thus the spectrum of eigenvalues) and the original model is unclear in this regime. We remarked on this in Lines 158-160.

      Recommendations for the authors:

      My major suggestion is about understanding the key quantity in this study: the response coefficient χ. When the perturbed state is close to the fixed point, one could adopt local stability analysis and consider the linearized system. For a linear system with one stable fixed point P, we consider the Jacobian matrix M at P. If all eigenvalues of M are real and negative, the perturbed trajectory will return to P with each component varying monotonically. If some eigenvalues have a negative real part and a nonzero imaginary part, then the perturbed trajectory will spiral inward to the fixed point. Depending on the spiral trajectory and the initially perturbed state, some components would transiently deviate further from the fixed point along the spiral trajectory. This explains why the response coefficient χ can be greater than 1. 

      Mathematically, a locally linearized system behaves similarly to the linear system, and the examples in this study can be analyzed in a similar way. Specifically, if a system has many complex eigenvalues, then the perturbed trajectory is more likely to show further deviation. The metabolic network models investigated in this work are not extremely large, and hence the authors could analyze the spectrum of the Jacobian matrix at the steady state. Since the steady state is stable, I expect the spectrum to be located in the left half of the complex plane. If the spectrum spreads out away from the real axis, we expect to see more spiral trajectories under perturbation. I think the spectrum analysis will provide a complementary view with respect to the analysis of χ.  The authors' major findings, about the network sparsity and cofactors, can also be investigated under the framework of the spectrum analysis.  

      Of course, when the nonlinear system is perturbed far away from the fixed point, there are other geometrical properties of the vector field that can cause the response coefficient χ to be greater than 1. This could also be investigated in the future by testing the behavior of small and large perturbations and observing if the systems have signatures of nonlinearity.  

      Since all perturbed states return to the steady state, the eigenvalues of the Jacobi matrix accompanying the linearized system around the steady state are in the left half complex plane (negative real value). Also, some eigenvalues have non-zero imaginary parts.    

      The reason we emphasize the "nonlinear regime" is that the linearization is no longer valid in this regime, i.e. the metabolite concentrations can be negative when we calculate the linearized system. Certainly, there are complex eigenvalues in the Jacobi matrix of any model. However, we would say that there is no clear relationship between the eigenvalues and the response coefficient.      
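
      The spectrum check discussed in this exchange can be sketched as follows. This is a minimal illustration only: the 2x2 Jacobian used in the usage note is a hypothetical toy, not taken from any of the three models, and the function names are ours. It shows that a stable linear system with complex eigenvalues can transiently push one component further from the fixed point than the initial perturbation; it does not compute the response coefficient χ as defined in the manuscript.

```python
import numpy as np


def classify_spectrum(jacobian, tol=1e-12):
    """Summarize the eigenvalues of a Jacobian evaluated at a steady state."""
    eig = np.linalg.eigvals(jacobian)
    return {
        "stable": bool(np.all(eig.real < 0)),            # all in left half plane
        "n_complex": int(np.sum(np.abs(eig.imag) > tol)),  # oscillatory modes
        "slowest_rate": float(-eig.real.max()),          # slowest relaxation rate
    }


def linear_response(jacobian, dx0, times):
    """Deviation dx(t) = exp(J t) dx0 of the linearized system.

    Assumes the Jacobian is diagonalizable (hypothetical simplification).
    """
    vals, vecs = np.linalg.eig(jacobian)
    coeffs = np.linalg.solve(vecs, np.asarray(dx0, dtype=complex))
    return np.array(
        [(vecs @ (coeffs * np.exp(vals * t))).real for t in times]
    )
```

      For example, with the toy matrix `J = np.array([[-0.1, 2.0], [-0.5, -0.1]])` and `dx0 = [0.0, 1.0]`, the first component starts at zero and transiently grows beyond the size of the initial perturbation before decaying. Note that the linearized deviation is unconstrained in sign, which is the issue with negative concentrations raised in the response above.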

      Minor suggestions:  

      Line 127: Regarding the source of perturbation, cell division also generates unequal concentration of proteins and metabolites for two daughter cells, and it is an interesting mechanism to create metabolic perturbation. 

      Thank you for the insightful suggestion. We now mention cell division as another source of perturbation (Lines 130-131).

      Line 175: I do not quite understand the statement "fixing each metabolite concentration...", since the metabolite concentration in the ODE simulation would change immediately after this fixing.  

      We meant that we fixed the concentration of the selected metabolite at its steady-state value and set dx/dt of that metabolite to zero. We have rewritten the sentences to avoid confusion (Lines 180-181).
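
      The fixing procedure described above can be sketched as a wrapper around the right-hand side of an ODE model. This is a hedged illustration: `clamp_metabolite` and the `rhs(t, x)` calling convention are assumptions made for the sketch, not the authors' code.

```python
import numpy as np


def clamp_metabolite(rhs, x_steady, index):
    """Return a right-hand side in which one metabolite is held fixed.

    rhs(t, x) is the original model (hypothetical signature). The selected
    metabolite is pinned to its steady-state value when rates are evaluated,
    and its own time derivative is set to zero.
    """
    def clamped_rhs(t, x):
        x = np.array(x, dtype=float)       # copy, so the caller's state is untouched
        x[index] = x_steady[index]         # rates see the steady-state concentration
        dxdt = np.array(rhs(t, x), dtype=float)
        dxdt[index] = 0.0                  # the clamped metabolite does not evolve
        return dxdt

    return clamped_rhs
```

      When integrating with such a wrapper, the initial condition for the clamped metabolite should also be set to its steady-state value, so that the stored state and the value used in the rate laws agree.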

      Figure 2: There are a lot of inconsistencies between the three models. Could we learn which model is more reasonable, or the conclusion here is that the cellular response under perturbation is model-specific? The latter explanation may not be quite satisfactory since we expect the overall cellular property should not be sensitive to the model details. 

      Ideally, the overall cellular property should be insensitive to model details. However, the reality is that the behavior of the models (e.g., steady-state properties, relaxation dynamics, etc.) depends on the specific parameter choices, including what regulation is implemented. I think this situation is part of the motivation for the ensemble modeling (by J. Liao and colleagues) that has been developed.  

      Detailed responsiveness would be model specific. For example, FBP has a fairly strong effect in the Boecker model, but less so in the Khodayari model, and the opposite effect in the Chassagnole model (Fig. 2). Our question was whether there are common tendencies among kinetic models that tend to show model-specific behavior.  

      Reviewer 2 (Public Review):

      (1) In the study on determining key metabolites affecting responses to perturbations (starting from line 171), the authors fix the values of individual concentrations to their steady-state values and observe the responses. Such a procedure adds artificial constraints to the network because, in the natural responses of cells (and models) to perturbations, it is highly unlikely that metabolites will not evolve in time. By fixing the values of specific metabolites, the authors prohibit the metabolic network from evolving in the most optimal way to compensate for the perturbation. Instead of this procedure, have the authors considered for this task applying techniques from variance-based sensitivity analysis (Sobol, global sensitivity analysis), where they can calculate the first-order sensitivity index and total effect index? Using this technique, the authors would be able to determine the key metabolites while allowing for metabolic responses to perturbations without unnatural constraints. 

      Thank you for the useful suggestion for studying the role of each metabolite in responsiveness. We have computed the total sensitivity index (Homma and Saltelli, 1996) for each metabolite of each model (Fig. S5). The total sensitivity index of ATP ranks high in the Khodayari and Chassagnole models, while it ranks in the middle in the Boecker model. We believe that the importance of the adenyl cofactors is also highlighted by the Sobol’ sensitivity analysis (the figure is referred to in Lines 193-195). 

      We have encountered a minor difficulty for computing the sensitivity index. For the computation of the sensitivity index, we need to carry out the following Monte Carlo integral, 

      where the superscript (m) is the sample number index. The subscript i represents the ith element of the vector x, and ~i represents the vector x except for the ith element. The tilde stands for resampling.  

      There are several conserved quantities in each model. For independent resampling, we need to deal with the conserved quantities. For the Boecker and Chassagnole models, we picked a single metabolite from each conservation law and solved its concentration algebraically to make the metabolite concentration the dependent variable. Then, we can resample the metabolite concentration of one metabolite without changing the concentrations of other metabolites, which are independent variables.  

      However, in the Khodayari model, it was difficult to solve for the dependent variables because the model has about 800 variables. Therefore, we did not compute the sensitivity indices of the metabolites whose concentrations are part of any conserved quantity, namely NAD, NADH, NADP, NADPH, Q8, and Q8H2.
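
      The displayed integral did not survive in this copy of the response. As a hedged sketch consistent with the notation described above (sample index m, element i, all other elements ~i, tilde for resampling), the standard Homma-Saltelli form of the total sensitivity index is S_Ti = 1 - (U_~i - f0^2) / V, where U_~i is the Monte Carlo average of f(x_~i, x_i) * f(x_~i, x̃_i). The code below implements that standard form; whether it matches the authors' exact estimator is an assumption, and `f` and `sample` are hypothetical stand-ins for the response coefficient and the sampler of independent concentrations.

```python
import numpy as np


def total_sensitivity_indices(f, sample, n_samples, rng):
    """Monte Carlo estimate of total sensitivity indices (Homma-Saltelli form).

    f(x)           maps an (n, d) array of inputs to n scalar outputs
    sample(rng, n) draws an (n, d) array of independent inputs
    """
    a = sample(rng, n_samples)            # base samples x^(m)
    b = sample(rng, n_samples)            # source of the resampled values
    f_a = f(a)
    f0 = f_a.mean()                       # estimate of the mean output
    var = f_a.var()                       # estimate of the output variance
    d = a.shape[1]
    s_total = np.empty(d)
    for i in range(d):
        a_res = a.copy()
        a_res[:, i] = b[:, i]             # resample only x_i, keep x_~i
        u_not_i = np.mean(f_a * f(a_res))
        s_total[i] = 1.0 - (u_not_i - f0 ** 2) / var
    return s_total
```

      The independent resampling of column i is exactly the step that conserved quantities obstruct: if a column is tied to others by a conservation law, it cannot be redrawn on its own, which is why a dependent variable has to be eliminated first.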

      (2) To follow up on the previous remark, the authors state that the metabolites that augment the response coefficient when their concentration is fixed tend to be allosteric regulators. The authors should report which allosteric regulations are implemented in each of the models so that one can compare against Figure 2. Again, the effect of allosteric regulation by a specific metabolite that is quantified the way the authors did is biased by fixing the concentration value - it is true that negative feedback is broken when the metabolite concentration is fixed, however, in the rate law, there is still the fixed inhibition term with its value corresponding to the inhibition at the steady state. To see the effect of allosteric regulation by a metabolite, one can change the inhibition constants instead of constraining the responses with fixed concentrations.  

      We have listed the substrate-level regulations (Table S1-3). Also, instead of fixing the concentrations, we re-ran the simulations with the effect of the substrate-level regulations reduced for the reactions that are suspected to influence the change in the response coefficient (Fig. S6). 

      The impact of substrate-level regulations is discussed in Lines 203-212.   

      We replaced "allosteric regulation" with "substrate-level regulation" because we noticed that some regulations are not necessarily allosteric.

      (3) Given the role of ATP in metabolic processes, the authors' finding of the sensitivity of the three networks' responses to perturbations in the AXP concentrations seems reasonable. However, drawing such firm conclusions from only three models, with each of them built around one steady state and having one kinetic parameter set despite that they were built for different physiologies, raises some questions. It is well-known in studies related to basins of attraction of the steady states that the nonlinear responses also depend on the actual steady states, the values of kinetic parameters, and implemented kinetic rate law, i.e., not only on the topology of the underlying systems. In the population of only three models, we cannot exclude the possibility of overlaps and strong similarities in the values of kinetic parameters, steady states, and enzyme saturations that all affect and might bias the observed responses. Ideally, to eliminate the possibility of such biases, one should simulate responses of a large population of models for multiple physiologies (and the corresponding steady states) and multiple parameter sets per physiology. This can be a difficult task, but having more kinetic models in this work would go a long way toward more convincing results. Recently, E. coli nonlinear kinetic models from several groups appeared that might help in this task, e.g., Haiman et al., PLoS Comput Biol, 17(1): e1008208, (2021), Choudhury et al., Nat Mach Intell, 4, 710-719, (2022); Hu et al., Metab Eng, 82, 123-133 (2024), Narayanan et al., Nat Commun, 15:723, (2024). 

      We have computed the responsiveness of 215 models generated by the MASSpy package (Haiman et al, 2021). Several model realizations showed a strong responsiveness, i.e. a broader distribution of the response coefficient (Fig. S8); this is mentioned in Lines 339-341.

      We would like to mention that the three models studied in the present manuscript have limited overlap in terms of kinetic rate laws and, accordingly, parameter values. In the Khodayari model, all reactions are bi-uni or uni-uni reactions implemented by mass-action kinetics, while the Boecker and Chassagnole models use generalized Michaelis-Menten type rate laws. Also, the relationship between the response coefficients of the original model and the linearized model highlights the differences between the models (Fig. S1). If the models were effectively similar, the scatter plots of the response coefficients of the original and linearized models should look similar among the three models. However, the three panels show completely different trends. Thus, the three models have little similarity even when they are linearized around the steady states. 

      (4) Can the authors share their insights on what could be the underlying reasons for the bimodal distribution in Figure 1E? Even after adding random reactions, the distribution still has two modes - why is that?  

      We have not yet resolved why only the Khodayari model shows the bimodal distribution of the response coefficient. However, judging from the time courses, the dynamics of the Khodayari model look like those of excitable systems. This feature may contribute to the bimodal distribution of the response coefficient. In the future, we would like to show whether the system is indeed an excitable system and which reactions contribute to such dynamics.

      (5) Considering the effects of the sparsity of the networks on the perturbation responses (from line 223 onwards), when we compare the three analyzed models, it is clear that the Khodayari et al. model is a superset of the other two models. Therefore, this model can be considered as, e.g., the Chassagnole model with Nadd reactions (though not randomly added). Based on Figures 1b and S2, one can observe that the Khodayari model shows stronger responses, which is exactly opposite to the authors' conclusion that adding the reactions weakens the responses.

      The authors should comment on this.  

      The sparsity of the network is defined by the ratio of the number of metabolites to the number of reactions. Note that the Khodayari model is a superset of the Boecker and Chassagnole models in terms of the number of reactions, but also in terms of the number of metabolites (Boecker does not have the pentose phosphate pathway, Chassagnole does not have the TCA cycle, and neither has oxidative phosphorylation). Thus, even if we manually add reactions to the Boecker model, for example, we cannot obtain a network that is equivalent to the Khodayari model. We added one sentence to clarify the point (Lines 254-255).

      Recommendations for the authors: 

      (1) Some typos: Line 57, remove ?; Line 134, correct "relaxation". 

      Thank you for pointing these out. We fixed the typos.

      (2) Lines 510-515, please rewrite/clarify, it is confusing what are you doing. 

      We rewrote the sentences (Lines 529-532). We are sorry for the confusion.

      (3) Line 522, where are the expressions above Leq and K*? 

      Leq appears in the original paper of the Boecker model, but we decided not to use Leq. We apologize for not removing Leq from the present manuscript. The * in K* is the wildcard for representing the subscripts. We added a description for the role of “*”. 

      (4) Lines 525-530, based on the wording, it seems like you test first for 128 initial concentrations if the models converge back to the steady state and then you generate another set of 128 initial concentrations - is this what you are doing, or you simply use the 128 initial concentrations that have passed the test? 

      We apologize for the confusion. We did the former. We have rewritten the sentence to make it clearer. 

      (5) Figure 3, caption, by "broken line," did the authors mean "dashed line"? 

      We meant dashed line. We changed “broken line” to “dashed line”.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I applaud the authors for providing a thorough response to my comments from the first round of review. The authors have addressed the points I raised on the interpretation of the behavioral results as well as the validation of the model (fit to the data) by conducting new analyses, acknowledging the limitations where required and providing important counterpoints. As a result of this process, the manuscript has considerably improved. I have no further comments and recommend this manuscript for publication.

      We are pleased that our revisions have addressed all the concerns raised by Reviewer #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for assessment of memory-based tasks may provide improved early detection in Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      The authors also compare the latent cause model to the Rescorla-Wagner model and a latent state model allowing for better assessment of the latent cause model as a strong model for assessing reinstatement.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent causes by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory after reinstatement, at least based on the simulation and examples shown in figures 1 and 3. More specifically, in figure 1, the authors indicate that the posterior probability of the latent cause, z<sub>A</sub> (the putative acquisition memory), increases, partially leading to reinstatement. This does not appear to be the case as test 3 (day 36) appears to have similar posterior probabilities for z<sub>A</sub> as well as similar weights for the CS as compared to the last days of extinction. Rather, the model appears to mainly modify the weights in the most recent latent cause, z<sub>B</sub> - the putative 'extinction state', during reinstatement. The authors suggest that previous experimental data have indicated that spontaneous recovery or reinstatement effects are due to an interaction of the acquisition and extinction memory. These studies have shown that conditioned responding at a later time point after extinction is likely due to a balance between the acquisition memory and the extinction memory, and that this balance can shift towards the acquisition memory naturally during spontaneous recovery, or through artificial activation of the acquisition memory or inhibition of the extinction memory (see Lacagnina et al. for example). 
Here the authors show that the same latent cause learned during extinction, z<sub>B</sub>, appears to dominate during the learning phase of reinstatement, with rapid learning to the context - the weight for the context goes up substantially on day 35 - in z<sub>B</sub>. This latent cause, z<sub>B</sub>, dominates at the reinstatement test, and due to the increased associative strength between the context and shock, there is a strong CR. For the simulation shown in figure 1, it's not clear why a latent cause model is necessary for this behavior. This leads to the next point.

      We would like to first clarify that our behavioral paradigm did not last for 36 days, as noted by the reviewer. Our reinstatement paradigm contained 7 phases and 36 trials in total: acquisition (3 trials), test 1 (1 trial), extinction 1 (19 trials), extinction 2 (10 trials), test 2 (1 trial), unsignaled shock (1 trial), test 3 (1 trial). The day is labeled under each phase in Figure 2A. 

      We have provided explanations on how the reinstatement is explained by the latent cause model in the first round of the review. Briefly, both acquisition and extinction latent causes contribute to the reinstatement (test 3). The former retains the acquisition fear memory, and the latter has the updated w<sub>context</sub> from unsignaled shock. Although the reviewer is correct that the z<sub>B</sub> in Figure 1D makes a great contribution during the reinstatement, we would like to argue that the elevated CR from test 2 (trial 34) to test 3 (trial 36) is the result of the interaction between z<sub>A</sub> and z<sub>B</sub>.

      We provided Author response image 1 using the same data as in Figure 1D and 1E to further clarify this point. The posterior probability of z<sub>A</sub> increased after an unsignaled shock (trial 35), which may be attributed to the return of acquisition fear memory. The posterior probability of z<sub>A</sub> then decreased again after test 3 (trial 36) because there was no shock in this trial. Along with the weight change, the expected shock changes substantially in these three trials, resulting in reinstatement. Note that the mapping of expected shock to CR in the latent cause model is controlled by the parameters θ and λ. Once the expected shock exceeds the threshold θ, the CR increases rapidly when λ is small.
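
      The role of θ and λ described above can be illustrated with a short sketch. It assumes a probit (cumulative Gaussian) mapping from expected shock to CR, with θ as the threshold and λ treated as a standard-deviation-like scale; the exact parameterization in the latent cause model (for example, whether λ enters as a variance) is not stated here, so that choice is an assumption, as is the function name.

```python
from math import erf, sqrt


def conditioned_response(expected_shock, theta, lam):
    """Map expected shock to a CR in [0, 1] with a cumulative Gaussian.

    theta: threshold at which the CR equals 0.5
    lam:   scale of the transition (smaller values give a steeper rise);
           treating it as a standard deviation is an assumption of this sketch
    """
    z = (expected_shock - theta) / lam
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

      Under this form, an expected shock slightly above θ yields a CR close to 1 when λ is small and only slightly above 0.5 when λ is large, which matches the qualitative statement above.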

      Lastly, accepting the idea that separate memories are responsible for acquisition and extinction in the memory modification paradigm, the latent cause model (LCM) is a rational candidate for modeling this idea. Please see the following reply on why a simple model like the Rescorla-Wagner (RW) model is not sufficient to fully explain the behaviors observed in this study.

      Author response image 1.

      The summed posterior probability (A), the summed associative weight of the CS (B), and the summed associative weight of the context (C) of the acquisition and extinction latent causes in Figure 1D and 1E.

      (2) The authors compared the latent cause model to the Rescorla-Wagner model. This is very commendable, particularly since the latent cause model builds upon the RW model, so it can serve as an ideal test for whether a more simplified model can adequately predict the behavior. The authors show that the RW model cannot successfully predict the increased CR during reinstatement (Appendix figure 1). Yet there are some issues with the way the authors have implemented this comparison:

      (2A) The RW model is a simplified version of the latent cause model and so should be treated as a nested model when testing, or at a minimum, the number of parameters should be taken into account when comparing the models using a method such as the Bayesian Information Criterion, BIC.

      We acknowledge that the number of parameters was not taken into consideration when we compared the models. We thank the reviewer for the suggestion to use the Bayesian Information Criterion (BIC). However, we did not use BIC in this study for the following reasons. We wanted a model that can explain fear conditioning, extinction and reinstatement, so our first priority was to fit the test phases. Models that simulate CRs well in non-test phases can yield lower BIC values even if they fail to capture reinstatement. When we calculate the BIC by using the half normal distribution (μ = 0, σ = 0.3) as the likelihood for prediction error in each trial, the BIC of the 12-month-old control is -37.21 for the RW model (Appendix 1–figure 1C) and -11.60 for the LCM (Figure 3C). Based on this result, the RW model would be preferred, because the LCM is penalized for its number of parameters even though it fits better in trial 36. Because we did not think this aligned with our purpose to model reinstatement, we chose to rely on practical criteria to determine whether an estimated parameter set is accepted for our purpose (see Materials and Methods). The number of accepted samples can thus roughly be seen as the model's ability to explain the data in this study. These exclusion criteria then created imbalances in accepted samples across models (Appendix 1–figure 2). In the RW model, only one or two samples met the criteria, preventing meaningful statistical comparisons of BIC within each group. Overall, although we agree that BIC is one of the reasonable metrics for model comparison, we do not think it aligns with our purpose in this study.
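
      For readers who want to see how such a BIC is assembled, the following is a minimal sketch using the standard definition BIC = k ln(n) - 2 ln(L) with a half-normal likelihood (σ = 0.3) on the absolute prediction error of each trial. The function names, the error values, and the parameter counts in any usage are hypothetical; this is not the authors' code and does not reproduce the reported values.

```python
import math


def half_normal_logpdf(abs_error, sigma=0.3):
    """Log density of a half-normal distribution (mu = 0) at abs_error >= 0."""
    return (
        0.5 * math.log(2.0 / math.pi)
        - math.log(sigma)
        - abs_error ** 2 / (2.0 * sigma ** 2)
    )


def bic(prediction_errors, n_params, sigma=0.3):
    """BIC = k * ln(n) - 2 * ln(L), with one likelihood term per trial."""
    n_trials = len(prediction_errors)
    log_lik = sum(half_normal_logpdf(abs(e), sigma) for e in prediction_errors)
    return n_params * math.log(n_trials) - 2.0 * log_lik
```

      Because every trial contributes equally to the likelihood, a model that fits the many acquisition and extinction trials well but misses the single reinstatement test can still obtain a lower (preferred) BIC, which is the concern raised in the response above.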

      (2B) The RW model provides the associative strength between stimuli and does not necessarily require a linear relationship between V and the CR. This is the case in the original RW model as well as in the LCM. To allow for better comparison between the models, the authors should be modeling the CR in the same manner (using the same probit function) in both models. In fact, there are many instances in which a sigmoid has been applied to RW associative strengths to predict CRs. I would recommend modeling CRs in the RW as if there is just one latent cause. Or perhaps run the analysis for the LCM with just one latent cause - this would effectively reduce the LCM to RW and keep any other assumptions identical across the models.

      Regarding the suggestion to run the analysis using the LCM with one latent cause, we agree that this method is almost identical to the RW model, which is also mentioned in the original paper (Gershman et al., 2017). Importantly, it would also eliminate the RW model’s advantage of assigning distinct learning rates to different stimuli, highlighted in the next comment (2C).

      We thank the reviewer for suggesting that we apply the transformation of associative strength (V) to CR as in the LCM. We examined this possibility by heuristically selecting parameter values to test how such a transformation would influence the RW model (Author response image 2A). Specifically, we set α<sub>CS</sub> = 0.5, α<sub>context</sub> = 1, β = 1, and introduced the additional parameters θ and λ, as in the LCM. This parameter set was chosen heuristically to address the reviewer’s concern about a higher learning rate for the context. The dark blue line is the plain associative strength. The remaining lines are CR curves under different combinations of θ and λ.

      Consistent with the reviewer’s comment, under certain parameter settings (θ = 0.01, λ = 0.01), the extended RW model can reproduce higher CRs at test 3, thereby approximating the discrimination index observed in the 12-month-old control group. However, this modification changes the characteristics of CRs in other phases from those in the plain RW model. In the acquisition phase, the CRs rise more sharply. In the extinction phase, the CRs remain high when θ is small. Although changing λ can modulate the steepness, the CR curve is flat on the second day of the extinction phase, which does not reproduce the pattern in the observed data (Figure 2B). These trade-offs suggest that the RW model with the sigmoid transformation does not improve fit quality and, in fact, sacrifices features that were well captured by the simpler RW simulations (Appendix 1–figure 1A to 1D). To further evaluate this extended RW model (RW*), we applied the same parameter estimation method used for the LCM to individual data (see Materials and Methods). For each animal, α<sub>CS</sub>, α<sub>context</sub>, β, θ, and λ were estimated with their lower and upper bounds set as previously described (see Appendix 1, Materials and Methods). The results showed that the number of accepted samples slightly increased compared to the RW model without sigmoidal transformation of CR (RW* vs. RW in Author response image 2B, 2C). However, this improvement did not surpass the LCM (RW* vs. LCM in Author response image 2B, Author response image 1C). Overall, these results suggest that even when the same method is used to map the expected shock to CR, the RW model does not outperform the LCM. In practice, further extensions, such as adding novel terms, might improve the fit. We would like to note that such extensions should be carefully validated to determine whether they are reasonable and necessary for an internal model, which is beyond the scope of this study.
We hope this addresses the reviewer's concerns about the implementation of the RW model. 
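
The extended RW simulation discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for Author response image 2: it assumes the summed-prediction RW update w_i ← w_i + α_i β (shock − V) x_i with V = Σ w_i x_i, and a response mapping CR = Φ((V − θ)/λ) with Φ the standard normal CDF. The function names and the trial schedule are hypothetical; only the parameter values (α<sub>CS</sub> = 0.5, α<sub>context</sub> = 1, β = 1, context input 0.2, CS input 1) are taken from the replies in this response.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_rw(trials, alpha_cs=0.5, alpha_ctx=1.0, beta=1.0, theta=0.01, lam=0.01):
    """Simulate an RW learner with a CS and a context stimulus.

    trials: list of (cs_input, context_input, shock) tuples.
    Returns a list of (V, CR) pairs, computed before each trial's update.
    """
    w_cs, w_ctx = 0.0, 0.0
    out = []
    for x_cs, x_ctx, shock in trials:
        v = w_cs * x_cs + w_ctx * x_ctx           # summed shock prediction
        out.append((v, normal_cdf((v - theta) / lam)))
        delta = shock - v                         # prediction error
        w_cs += alpha_cs * beta * delta * x_cs    # CS weight update
        w_ctx += alpha_ctx * beta * delta * x_ctx # context weight update
    return out

# Hypothetical schedule: 3 CS-shock pairings, 10 CS-only trials,
# 1 unsignaled shock (context only), then 1 CS test trial.
schedule = [(1, 0.2, 1)] * 3 + [(1, 0.2, 0)] * 10 + [(0, 0.2, 1)] + [(1, 0.2, 0)]
for trial, (v, cr) in enumerate(simulate_rw(schedule), start=1):
    print(trial, round(v, 3), round(cr, 3))
```

With small θ and λ the mapping saturates quickly, so even a small residual V yields a high CR; this is one way to see why the transformed curve can stay high or flat during extinction.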

      Author response image 2.

      Simulation (A) and parameter estimation (B and C) in the extended Rescorla-Wagner model.

      (2C) In the paper, the model fits for the alphas in the RW model are the same across the groups. Were the alphas for the two models kept as free variables? This is an important question as it gets back to the first point raised. Because the modeling of the reinstatement behavior with the LCM appears to be mainly driven by latent cause z<sub>B</sub>, the extinction memory, it may be possible to replicate the pattern of results without requiring a latent cause model. For example, the 12-month-old App NL-G-F mice behavior may have a deficit in learning about the context. Within the RW model, if the alpha for context is set to zero for those mice, but kept higher for the other groups, say alpha_context = 0.8, the authors could potentially observe the same pattern of discrimination indices in figure 2G and 2H at test. Because the authors don't explicitly state which parameters might be driving the change in the DI, the authors should show in some way that their results cannot simply be due to poor contextual learning in the 12 month old App NL-G-F mice, as this can presumably be predicted by the RW model. The authors' model fits using RW don't show this, but this is because they don't consider this possibility that the alpha for context might be disrupted in the 12-month-old App NL-G-F mice. Of course, using the RW model with these alphas won't lead to as nice of fits of the behavior across acquisition, extinction, and reinstatement as the authors' LCM, the number of parameters are substantially reduced in the RW model. Yet the important pattern of the DI would be replicated with the RW model (if I'm not mistaken), which is the important test for assessment of reinstatement.

      We would like to clarify that we estimated three parameters in the RW model for each individual: α<sub>CS</sub>, α<sub>context</sub>, and β. Even with all three left free, many samples did not satisfy our criteria (Appendix 1–figure 2). Please refer to “Evaluation of model fit” in Appendix 1 and to the legend of Appendix 1–figure 1A to 1D, where the estimated parameter values are reported.

      We do not agree that disabling contextual learning by setting α<sub>context</sub> to 0 in the RW model can explain the CR curve of 12-month-old AD mice well. Specifically, the RW model cannot capture the between-day extinction dynamics (i.e., the increase in CR at the beginning of day 2 extinction) or the higher CR at test 3 relative to test 2 (i.e., a DI between test 3 and test 2 greater than 0.5). In addition, because the context input (= 0.2) was lower than the CS input (= 1), and there is only a single unsignaled shock trial, even setting α<sub>context</sub> = 1 results in only a limited increase in CR (Appendix 1–figure 1A to 1D; see also Author response image 2). Thus, the RW model cannot replicate the reinstatement effect or the critical pattern of the discrimination index, even under conditions of stronger contextual learning.

      (3) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure to the US alone leads to an increased response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual-US learning during the US re-exposure or to increased responding to the CS - presumably caused by reactivation of the acquisition memory. The authors do perform a comparison between the preCS and CS period, but it is not clear whether this is taken into account in the LCM. For example, the instance of the model shown in figure 1 indicates that the 'extinction cause', or cause z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. If they haven't already, I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.
In more precise terms, it's not clear whether the authors incorporate a preCS/ITI period each day the cue is presented as a vector of just the context in addition to the CS period in which the vector contains both the context and the CS. Based on the description, it seemed to me that they only model the CRs during the CS period on days when the CS is presented, and thereby the context is only ever modeled on its own (as just the context by itself in the vector) on extinction days when the CS is not presented. If they are modeling both timepoints each day that the CS is presented, then I would recommend explicitly stating this in the methods section.

      In this study, we did not model the preCS freezing rate, and we thank the reviewer for the suggestion to model preCS periods as separate context-only trials. In our view, however, this approach is not consistent with the assumptions of the LCM. Our rationale is that the available periods of context and the CS are different. We assume that observation of the context lasts from preCS to CS. If we simulate both preCS (context) and CS (context and tone), the weight of context would be updated twice. Instead, we follow the same method as described in the original code from Gershman et al. (2017) to consider the context effect. We agree that explicitly modeling preCS could provide additional insights, but we believe it would require modifying or extending the LCM. We consider this an important direction for future research, but it is outside the scope of this study.

      (4) The authors fit the model using all data points across acquisition and learning. As one of the other reviewers has highlighted, it appears that there is a high chance for overfitting the data with the LCM. Of course, this would result in much better fits than models with substantially fewer free parameters, such as the RW model. As mentioned above, the authors should use a method that takes into account the number of parameters, such as the BIC.

      Please refer to the reply to public review (2A) for the reason we did not take the suggestion to use BIC. In addition, we feel that we have adequately addressed the concern of overfitting in the first round of the review. 

      (5) The authors have stated that they do not think the Barnes maze task can be modeled with the LCM. Whether or not this is the case, if the authors do not model this data with the LCM, the Barnes maze data doesn't appear valuable to the main hypothesis. The authors suggest that more sophisticated models such as the LCM may be beneficial for early detection of diseases such as Alzheimer's, so the Barnes maze data is not valuable for providing evidence of this hypothesis. Rather, the authors make an argument that the memory deficits in the Barnes maze mimic the reinstatement effects providing support that memory is disrupted similarly in these mice. Although, the authors state that the deficits in memory retrieval are similar across the two tasks, the authors are not explicit as to the precise deficits in memory retrieval in the reinstatement task - it's a combination of overgeneralizing latent causes during acquisition, poor learning rate, over differentiation of the stimuli.

      We would like to clarify that we valued the latent cause model not solely because it is more sophisticated and fits more data points, but because it is an internal model that implicates the underlying cognitive process. Please also see our reply to the recommendations for the authors (3) for the reason why we did not take the suggestion to remove these data.

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's inability to retain competing memories. These issues are evident in Figure 3:

      (1) The model misses trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, and the faster return of fear during reinstatement compared to the gradual learning of fear during acquisition. It also underestimates the increase in fear at the start of day 2 of extinction, particularly in controls.

      (2) The model explains the higher fear response in controls during reinstatement largely through a stronger association to the context formed during the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition (as seen in Figure 3C). In the experiment, however, this memory does seem to be important for explaining the higher fear response in controls during reinstatement (as seen in Author Response Figure 3). The model does show a necessary condition for memory retrieval, which is that controls rely more on the latent causes from acquisition. But this alone is not sufficient, since the associations within that cause may have been overwritten during extinction. The Rescorla-Wagner model illustrates this point: it too uses the latent cause from acquisition (as it only ever uses a single cause across phases) but does not retain the original stimulus-shock memory, updating and overwriting it continuously. Similarly, the latent cause model may reuse a cause from acquisition without preserving its original stimulus-shock association.

      These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. over differentiation), but the model itself does not appear to capture these processes accurately.

      The authors could benefit from a model that better matches the data and captures the retention and retrieval of fear memories across phases. While they explored alternatives, including the Rescorla-Wagner model and a latent state model, these showed no meaningful improvement in fit. This highlights a broader issue: these models are well-motivated but may not fully capture observed behavior.

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current models fall short in doing so.

      We thank the reviewer for the insightful comments. For the comments (1) and (2), please refer to our previous author response to comments #26 and #27. We recognize that the models tested in this study have limitations and, as noted, do not fully capture all aspects of the observed behavioral data. We see this as an important direction for future research and value the reviewer’s suggestions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I have maintained some of the main concerns included in the first round of reviews as I think they remain concerns with the new draft, even though the authors have included substantially more analysis of their data, which is appreciated. I particularly found the inclusion of the comparative modeling valuable, although I think the analysis comparing the models should be improved.

      (1) This relates to point 1 in the public assessment or #16 in the response to reviewers from the authors. The authors raise the point that even a low posterior can drive behavioral expression (lines 361-365 in the response to authors), and so the acquisition latent cause may partially drive reinstatement. Yet in the simulation shown in figure 1D, this does not seem to be the case. As I mentioned in the public response, in figure 1, the posteriors for z<sub>A</sub> are similar on day 34 and day 36, yet only on day 36 is there a strong CR. At least in this example, it does not appear that z<sub>A</sub> contributes to the increased responding from day 34 (test 2) to day 36 (test 3). There may be a slight increase in z1 in figure 3C, but the dominant change from day 34 to day 36 appears to be the increase in the posterior of z3 and the substantial increase in w3. The authors then cite several papers which have shown the shift in balance between what is the putative acquisition memory and extinction memory (i.e. Lacagnina et al.). Yet I do not see how this modeling fits with most of the previous findings. For example, in the Lacagnina et al. paper, activation of the acquisition ensemble or inhibition of the extinction ensemble drives freezing, whereas the opposite pattern reduces freezing. What appears to be the pattern in the modeling in this paper is primarily learning of context in the extinction latent cause to predict the shock. As I mention in point 2C of the public review, it's not clear why this pattern of results would require a latent cause model. Would a high alpha for context and not the CS not give a similar pattern of results in the RW model? At least for giving similar results of the DIs in figure 2?

      First, we would like to clarify that the x-axis in Figure 1D is labeled “Trial,” not “Day.” Please refer to the reply to public review (1), where we clarified the posterior probability of the latent cause from trials 34 to 36. Second, although we did not have direct neural circuit evidence in this study, we discussed the similarities between previous findings and the modeling in the first review. Briefly, our main point focuses on the interaction between acquisition and extinction memory. In other words, responses at different times arise from distinct internal states made up of competing memories. We assume that the reviewer expects a modeling result showing nearly full recovery of acquisition memory, which aligns with previous findings where optogenetic activation of the acquisition engram can partially mimic reinstatement (Zaki et al., 2022; see also the response to comment #12 in the first round of review). We acknowledge that such a modeling result cannot be achieved with the latent cause model and see it as a potential future direction for model improvement.

      Please also refer to the reply to public review (2) about how a high alpha for context in the RW model cannot explain the pattern we observed in the reinstatement paradigm.

      (2) This is related to point 3 in the public comments and #13 in the response to reviewers. I raised the question of comparing the preCS/ITI period with the CS period, but my main point was why not include these periods in the LCM itself as mentioned in more detail in point 3 in the current public review. The inclusion of the comparisons the authors performed helped, but my main point was that the authors could have a better measure of wcontext if they included the preCS period as a stimulus each day (when only the context is included in the stimulus). This would provide better estimates of wcontext. As stated in the public review, perhaps the authors did this, but my understanding of the methods this was not the case, rather, it seems the authors only included the CS period for CRs within the model (at least on days when the CS was present).

      Please refer to the reply to public review (3) about the reason why we did not model the preCS freezing rate.

      (3) This relates to point 4 in the public review and #15 and #24 in the response to authors. The authors have several points for why the two experiments are similar and how results may be extrapolated - lines 725-733. The first point is that associative learning is fundamental in spatial learning. I'm not sure that this broad connection between the two studies is particularly insightful for why one supports the other as associative learning is putatively involved in most behavioral tasks. In the second point about reversals, why not then use a reversal paradigm that would be easier to model with LCM? This data is certainly valuable and interesting, yet I don't think it's helpful for this paper to state qualitatively the similarities in the potential ways a latent cause framework might predict behavior on the Barnes maze. I would recommend that the authors either model the behavior with LCM, remove the experiment from the paper, or change the framing of the paper that LCM might be an ideal approach for early detection of dementia or Alzheimer's disease.

      We would like to clarify that our aim was not to present the LCM as an ideal tool for early detection of AD symptoms. Rather, our focus is on the broader idea of utilizing internal models and estimating individual internal states in early-stage AD. Regarding the use of a reversal paradigm that would be easier to model with the LCM, the most straightforward approach would be to use another type of fear conditioning paradigm and then examine the extent to which similar behavioral characteristics are observed between paradigms within subjects. However, re-exposing the same mice to such paradigms is constrained by strong carry-over effects, limiting the feasibility of this experiment. Other behavioral tasks relevant to AD that avoid shock generally involve action selection for subsequent observation (Webster et al., 2014), which falls outside the structure of the LCM. Our rationale for including the Barnes maze task is that spatial memory deficit is implicated in the early stage of AD, making it relevant for translational research. While we acknowledge that exact modeling of Barnes maze behavior would require a more sophisticated model (as discussed in the first round of review), our intention in using the reversal Barnes maze paradigm was to examine presumed memory modification learning in a non-fear-conditioning paradigm. We also discussed whether similar deficits in memory modification could be observed across the two behavioral tasks.

      (4) Reviewer # mentioned that the change in pattern of behavior only shows up in the older mice questioning the clinical relevance of early detection. I do think this is a valid point and maybe should be addressed. There does seem to be a bit of a bump in the controls on day 23 that doesn't appear in the 6-month group. Perhaps this was initially a spontaneous recovery test indicated by the dotted vertical line? This vertical line does not appear to be defined in the figure 1 legend, nor in figures 2 and 3.

      We would like to emphasize that the App<sup>NL-G-F</sup> knock-in mouse is widely considered a model of early-stage AD, characterized by Aβ accumulation with little to no neurofibrillary tangle pathology or neuronal loss (see Introduction). By examining different ages, we can assess the contribution of both the amount and duration of Aβ accumulation as well as age-related factors. Modeling the deficit in the memory modification process in the older App<sup>NL-G-F</sup> knock-in mice, we suggested a diverged internal state in early-stage AD in older age, and this does not diminish the relevance of the model for studying early cognitive changes in AD.

      We would also like to clarify again that the x-axis in the figure is “Trial,” not “Day.” The vertical dashed lines in these figures indicate phase boundaries, and they were defined in the figure legend: in Figure 1C, “The vertical dashed lines separate the phases.”; in Figure 2B, “The dashed vertical line separates the extinction 1 and extinction 2 phases.”; in Figure 3, “The vertical dashed lines indicate the boundaries of phases.”

      (5) Are the examples in figure 3 good examples? The example for the 12-month-old control shows a substantial increase in weights for the context during test 3, but not for the CS. Yet in the bar plots in Figure 4 G and H, this pattern seems to be different. The weights for the context appear to substantially drop in the "after extinction" period as compared to the "extinction" period. It's hard to tell the change from "extinction" to "after extinction" for the CS weights (the authors change the y-axis for the CS weights but not for the context weights from panels G to H).

      We would like to clarify that in Figure 3C, the increase in the context weights noted by the reviewer does not occur during test 3 (trial 36); rather, it occurs in the unsignaled shock phase (trial 35).

      We suspect that the reviewer may have interpreted the labels on the left in Figure 4 (“Acquisition”, “Extinction”, and “After extinction”) as indicating time points. However, the data shown in Figure 4C to 4H are all from the same time point: test 3 (trial 36). The grouping reflects the classification of latent causes based on the trial in which they were inferred. In addition, for Figures 4G and 4H, the y‐axis limits were not set identically because the data range for “Sum of w<sub>CS</sub>” varied. This was done to ensure the visibility of all data points. In Figure 4, each dot represents one animal. Take Figure 3D as an example: the point in Figure 4G is the sum of w3 and w4 in trial 36, and the point in Figure 4H is w5 in trial 36, where the subscript numerals indicate the latent cause index. We hope this addresses the reviewer’s question about the difference between the two figures.


      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors show certain memory deficits in a mouse knock-in model of Alzheimer's Disease (AD). They show that the observed memory deficits can be explained by a computational model, the latent cause model of associative memory. The memory tasks used include the fear memory task (CFC) and the 'reverse' Barnes maze. Research on AD is important given its known huge societal burden. Likewise, better characterization of the behavioral phenotypes of genetic mouse models of AD is also imperative to advance our understanding of the disease using these models. In this light, I applaud the authors' efforts.

      Strengths:

      (1) Combining computational modelling with animal behavior in genetic knock-in mouse lines is a promising approach, which will be beneficial to the field and potentially explain any discrepancies in results across studies as well as provide new predictions for future work.

      (2) The authors' usage of multiple tasks and multiple ages is also important to ensure generalization across memory tasks and 'modelling' of the progression of the disease.

      Weaknesses:

      [#1] (1) I have some concerns regarding the interpretation of the behavioral results. Since the computational model then rests on the authors' interpretation of the behavioral results, it, in turn, makes judging the model's explanatory power difficult as well. For the CFC data, why do knock-in mice have stronger memory in test 1 (Figure 2C)? Does this mean the knock-in mice have better memory at this time point? Is this explained by the latent cause model? Are there some compensatory changes in these mice leading to better memory? The authors use a discrimination index across tests to infer a deficit in re-instatement, but this indicates a relative deficit in re-instatement from memory strength in test 1. The interpretation of these differential DIs is not straightforward. This is evident when test 1 is compared with test 2, i.e., the time point after extinction, which also shows a significant difference across groups, Figure 2F, in the same direction as the re-instatement. A clarification of all these points will help strengthen the authors' case.

      We thank the reviewer for the critical comments. According to the latent cause framework, the strength of the memory is influenced by at least two quantities: the associative weight between the CS and US given a latent cause, and the posterior probability of that latent cause. The modeling results showed that a higher posterior probability of the acquisition latent cause, but not a higher associative weight, drove the higher test 1 CR in App<sup>NL-G-F</sup> mice (Results and Discussion; Figure 4 – figure supplement 3B, 3C). In terms of the posterior, we agree that App<sup>NL-G-F</sup> mice have a strong fear memory. On the other hand, this suggests that App<sup>NL-G-F</sup> mice exhibited a tendency toward overgeneralization, favoring modification of old memories, which adversely affected the ability to retain competing memories. The strong memory in test 1 would be a compensatory effect of overgeneralization.

      To estimate the magnitude of reinstatement, at least, one would have to compare CRs between test 2 (extinction) and test 3 (reinstatement), as well as those between test 1 (acquisition) and test 3. These comparisons represent the extent to which the memory at the reinstatement is far from that in the extinction, and close to that in the acquisition. Since discrimination index (DI) has been widely used as a normalized measure to evaluate the extent to which the system can distinguish between two conditions, we applied DI consistently to behavioral and simulated data in the reinstatement experiment, and the behavioral data in the reversal Barnes maze experiment, allowing us to evaluate the discriminability of an agent in these experiments. In addition, we used DI to examine its correlation with estimated parameters, enabling us to explore how individual discriminability may relate to the internal state. We have already discussed the differences in DI between test 3 and test 1, as well as CR in test 1 between control and App<sup>NL-G-F</sup> in the manuscript and further elaborated on this point in Line 232, 745-748.   
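
As a minimal sketch of the discrimination index used in these comparisons, assuming the common ratio form DI = CR_a / (CR_a + CR_b), so that 0.5 means no discrimination and values above 0.5 mean a higher response in condition a. This form is consistent with the statement elsewhere in this response that a higher CR at test 3 than at test 2 corresponds to a DI greater than 0.5, but the exact definition should be taken from the manuscript's Materials and Methods. The function name and the freezing values are hypothetical.

```python
def discrimination_index(cr_a, cr_b):
    """Ratio-form discrimination index between two conditions.

    Returns cr_a / (cr_a + cr_b); NaN if both responses are zero.
    """
    total = cr_a + cr_b
    if total == 0:
        return float("nan")
    return cr_a / total

# Hypothetical percent-freezing values for one animal.
test1, test2, test3 = 70.0, 20.0, 60.0
print("DI (test 3 vs. test 2):", discrimination_index(test3, test2))
print("DI (test 3 vs. test 1):", discrimination_index(test3, test1))
```

The same function can be applied to observed and simulated CRs, which is what allows the behavioral and model-derived indices to be compared on a common scale.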

      [#2] (2) I have some concerns regarding the interpretation of the Barnes maze data as well, where there already seems to be a deficit in the memory at probe test 1 (Figure 6C). Given that there is already a deficit in memory, would not a more parsimonious explanation of the data be that general memory function in this task is impacted in these mice, rather than the authors' preferred interpretation? How does this memory weakening fit with the CFC data showing stronger memories at test 1? While I applaud the authors for using multiple memory tasks, I am left wondering if the authors tried fitting the latent cause model to the Barnes maze data as well.

      While we agree that the deficits shown in probe test 1 may imply impaired memory function in App<sup>NL-G-F</sup> mice in this task, it would be difficult to explain this solely in terms of impairments in general memory function. The learning curve and the daily strategy changes suggested that App<sup>NL-G-F</sup> mice would have virtually intact learning ability in the initial training phase (Figure 6B, 6F, Figure 6 – figure supplement 1 and 3). For the correspondence relationship between the reinstatement and the reversal Barnes maze learning from the aspect of memory modification process, please also see our reply to comment #24. We have explained why we did not fit the latent cause model to the Barnes maze data in the provisional response.

      [#3] (3) Since the authors use the behavioral data for each animal to fit the model, it is important to validate that the fits for the control vs. experimental groups are similar to the model (i.e., no significant differences in residuals). If that is the case, one can compare the differences in model results across groups (Figures 4 and 5). Some further estimates of the performance of the model across groups would help.

      We have added the residual (i.e., observed CR minus simulated CR) in Figure 3 – figure supplement 1D and 1E. The fit was similar between control and App<sup>NL-G-F</sup> mice groups in the test trials, except test 3 in the 12-month-old group. The residual was significantly higher in the 12-month-old control mice than App<sup>NL-G-F</sup> mice, suggesting the model underestimated the reinstatement in the control, yet the DI calculated from the simulated CR replicates the behavioral data (Figure 3 – figure supplement 1A to 1C). These results suggest that the latent cause model fits our data with little systematic bias such as an overestimation of CR for the control group in the reinstatement, supporting the validity of the comparisons in estimated parameters between groups. These results and discussion have been added in the manuscript Line 269-276.
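      As a minimal sketch of the residual check described above, the snippet below computes per-trial residuals as observed CR minus simulated CR, so that a positive value means the model underestimates the response. The function names and the example freezing rates are hypothetical and are not taken from the manuscript.

```python
def residuals(observed_cr, simulated_cr):
    """Return per-trial residuals, defined as observed CR minus simulated CR."""
    if len(observed_cr) != len(simulated_cr):
        raise ValueError("observed and simulated CR must cover the same trials")
    return [obs - sim for obs, sim in zip(observed_cr, simulated_cr)]


def mean_residual(observed_cr, simulated_cr):
    """Average residual; a value above 0 means CR is underestimated on average."""
    res = residuals(observed_cr, simulated_cr)
    return sum(res) / len(res)


# Hypothetical freezing rates (0 to 1) for one animal in tests 1, 2 and 3.
observed = [0.60, 0.20, 0.55]
simulated = [0.65, 0.20, 0.40]

test_residuals = residuals(observed, simulated)
# A positive residual in test 3 would indicate underestimated reinstatement.
underestimates_test3 = test_residuals[2] > 0
```

      Comparing such residuals between genotypes, trial by trial, is one way to check whether the fit quality differs systematically across groups.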

      One may notice that the latent cause model overestimated the CR in acquisition trials in all groups in Figure 3 – figure supplement 1D and 1E. We have discussed this point in our replies to comments #26 and #34, raised by reviewer 3.

      [#4] (4) Is there an alternative model the authors considered, which was outweighed in terms of prediction by this model? 

      Yes, we have further evaluated two alternative models: the Rescorla-Wagner (RW; Rescorla & Wagner, 1972) model and the latent state model (LSM; Cochran & Cisler, 2019). The RW model serves as a baseline, given its known limitations in explaining fear return after extinction. The LSM is another contemporary model that shares several concepts with the latent cause model (LCM) such as building upon the RW model, assuming a latent variable inferred by Bayes’ rule, and involving a ruminative update for memory modification. We evaluated the three models in terms of the prediction accuracy and reproducibility of key behavioral features. Please refer to the Appendix 1 for detailed methods and results for these two models.

      As expected, the RW model fit the data well until the end of extinction but failed to reproduce reinstatement (Appendix 1 – figure 1A to 1D). Due to a large prediction error in test 3, few samples met the acceptance criteria we set (Appendix 1 – figure 2 and 3A). Conversely, the LSM reproduced reinstatement, as well as gradual learning in acquisition and extinction phases, particularly in the 12-month-old control (Appendix 1 – figure 1G). The number of accepted samples in the LSM was higher than in the RW model but generally lower than in the LCM (Appendix 1 – figure 2). The sum of prediction errors over all trials in the LSM was comparable to that in the LCM in the 6-month-old group (Appendix 1 – figure 4A), whereas it was significantly lower in the 12-month-old group (Appendix 1 – figure 4B). In particular, the LSM generated smaller prediction errors during the acquisition trials than the LCM, suggesting that the LSM might be better at explaining the behaviors of acquisition (Appendix 1 – figure 4A and 4B; but see the reply for comment #34). While the LSM generated smaller prediction errors than the LCM in test 2 of the control group, it failed to replicate the observed DIs, a critical behavioral phenotype difference between control and App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6A to 6C; cf. Figure 2F to 2H, Figure 3 – figure supplement 1A to 1C).

      Thus, although each model could capture different aspects of reinstatement, relying on the LCM to explain the reinstatement better aligns with our purpose. It should also be noted that we did not explore the entire parameter space of the LSM, hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research.

      [#5] One concern here is also parameter overfitting. Did the authors try leaving out some data (trials/mice) and predicting their responses based on the fit derived from the training data?

      Following the reviewer’s suggestion, we examined whether overfitting occurred when all trials were used to estimate parameters. Estimating parameters while actually leaving out trials would disrupt the time lapse across trials, and thereby the prior of latent causes in each trial. Instead, we removed the prediction-error constraint for certain trials by setting their error threshold to 1, which virtually leaves these trials out. We treated these trials as a virtual “test” dataset, while the remaining trials, which kept the original constraint, served as the “training” dataset. For the median CR data of each group (Figure 3), we estimated parameters under 6 conditions with unique training and test trials, then evaluated the prediction error for the training and test trials. Note that training and test trials were arbitrarily decided. Also, the error threshold for the acquisition trials was set to 1 as described in Materials and Methods (the reason is discussed further in the reply to comment #34), so we treated acquisition trials separately from the test trials. We expect that the contribution of the data from the acquisition and test trials to parameter estimation would be discounted compared to that from the training trials with the constraint, and that if overfitting occurred, the prediction error in the test trials would be worse than that in the training trials.
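      The virtual leave-out procedure can be illustrated with a short sketch. It assumes that a simulated CR trajectory is accepted only when the absolute error on every trial is within that trial's threshold; because CR lies between 0 and 1, a threshold of 1 can never reject a trial, which is what removes its influence. The acceptance rule, function names, and numbers below are illustrative assumptions rather than the exact implementation used for the paper.

```python
def is_accepted(observed, simulated, thresholds):
    """Accept a simulated trajectory if every trial's absolute error
    is within its per-trial threshold (assumed acceptance rule)."""
    return all(abs(o - s) <= t for o, s, t in zip(observed, simulated, thresholds))


def leave_out(thresholds, held_out_trials):
    """Copy the thresholds and relax the held-out trials (0-based indices) to 1.0.

    With CR bounded in [0, 1], an absolute error can never exceed 1.0,
    so these trials no longer constrain parameter estimation."""
    relaxed = list(thresholds)
    for i in held_out_trials:
        relaxed[i] = 1.0
    return relaxed


def mean_squared_error(observed, simulated, trials):
    """Mean squared prediction error over the selected trial indices."""
    return sum((observed[i] - simulated[i]) ** 2 for i in trials) / len(trials)


# Hypothetical four-trial example; trial index 2 is poorly predicted.
observed = [0.2, 0.8, 0.5, 0.1]
simulated = [0.25, 0.75, 0.9, 0.15]
strict = [0.1, 0.1, 0.1, 0.1]

accepted_strict = is_accepted(observed, simulated, strict)
relaxed = leave_out(strict, held_out_trials=[2])
accepted_relaxed = is_accepted(observed, simulated, relaxed)

# Prediction error is then reported separately for held-out and constrained trials.
test_mse = mean_squared_error(observed, simulated, [2])
train_mse = mean_squared_error(observed, simulated, [0, 1, 3])
```

      In this toy case the trajectory is rejected under the strict thresholds and accepted once trial 2 is held out, and the held-out error exceeds the training error, which is the pattern one would read as overfitting.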

      Author response image 1A to 1F showed the simulated and observed CR under each condition, where acquisition trials were in light-shaded areas, test trials were in dark-shaded areas, and the rest of the trials were training trials. Author response image 1G showed mean squared prediction error across the acquisition, training and test trials under each condition. The dashed gray line showed the mean squared prediction error of training trials in Figure 3 as a baseline.

      In conditions i and ii, where two or four trials in the extinction were used for training (Author response image 1A and 1B), the prediction error was generally higher in test trials than in training trials. In conditions iii and iv, where ten trials in the extinction were used for training (Author response image 1C and 1D), the difference in prediction error between test and training trials became smaller. These results suggest that providing more extinction trial data would reduce overfitting. In condition v (Author response image 1E), the results showed that using trials until extinction can predict reinstatement in control mice but not App<sup>NL-G-F</sup> mice. Similarly, in condition vi (Author response image 1F), where test phase trials were left out, the prediction error differences were greater in App<sup>NL-G-F</sup> mice. These results suggest that the test trials should be used for the parameter estimation to minimize prediction error for all groups. Overall, this analysis suggests that using all trials would reduce prediction error with little overfitting.

      Author response image 1.

      Leaving trials out in parameter estimation in the latent cause model. (A – F) The observed CR (colored line) is the median freezing rate during the CS presentation over the mice within each group, which is the same as that in Figure 3. The colors indicate different groups: orange represents 6-month-old control, light blue represents 6-month-old App<sup>NL-G-F</sup> mice, pink represents 12-month-old control, and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice. Under six different leave-out conditions (i – vi), parameters were estimated and used for generating simulated CR (gray line). In each condition, trials were categorized as acquisition (light-shaded area), training data (white area), and test data (dark-shaded area) based on the error threshold during parameter estimation. Only the error threshold of the test data trials was different from the original method (see Materials and Methods) and set to 1. In conditions i to iv, the number of test data trials in the extinction phases is 27, 25, 19, and 19, respectively. In condition v, the number of test data trials is 2 (trials 35 and 36). In condition vi, test data trials were the 3 test phases (trials 4, 34, and 36). (G) Each subplot shows the mean squared prediction error for the test data trials (gray circles), training data trials (white squares), and acquisition trials (gray triangles) in each group. The left y-axis corresponds to data from test and training trials, and the right y-axis corresponds to data from acquisition trials. The dashed line indicates the results calculated from Figure 3 as a baseline.

      Reviewer #1 (Recommendations for the authors):

      Minor:

      [#6] (1) I would like the authors to further clarify why 'explaining' the reinstatement deficit in the AD mouse model is important in working towards the understanding of AD i.e., which aspect of AD this could explain etc.

      In this study, we utilized the reinstatement paradigm with the latent cause model as an internal model to illustrate how estimating internal states can improve understanding of cognitive alteration associated with extensive Aβ accumulation in the brain. Our findings suggest that misclassification in the memory modification process, manifesting as overgeneralization and overdifferentiation, underlies the memory deficit in the App<sup>NL-G-F</sup> knock-in model mice. 

      The parameters in the internal model associated with AD pathology (e.g., α and σ<sub>x</sub><sup>2</sup> in this study) can be viewed as computational phenotypes, filling the explanatory gap between neurobiological abnormalities and cognitive dysfunction in AD. This would advance the understanding of cognitive symptoms in the early stages of AD beyond conventional behavioral endpoints alone.

      We further propose that altered internal states in App<sup>NL-G-F</sup> knock-in mice may underlie a wide range of memory-related symptoms in AD as we observed that App<sup>NL-G-F</sup> knock-in mice failed to retain competing memories in the reversal Barnes maze task. We speculate on how overgeneralization and overdifferentiation may explain some AD symptoms in the manuscript:

      - Line 565-569: overgeneralization may explain deficits in discriminating highly similar visual stimuli reported in early-stage AD patients as they misclassify the lure as previously learned object

      - Line 576-579: overdifferentiation may explain impaired ability to transfer previously learned association rules in early-stage AD patients as they misclassify them as separated knowledge. 

      - Line 579-582: overdifferentiation may explain delusions in AD patients as an extended latent cause model could simulate the emergence of delusional thinking

      We provide one more example here that overgeneralization may explain that early-stage AD patients are more susceptible to proactive interference than cognitively normal elders in semantic memory tests (Curiel Cid et al., 2024; Loewenstein et al., 2015, 2016; Valles-Salgado et al., 2024), as they are more likely to infer previously learned material. Lastly, we expect that explaining memory-related symptoms within a unified framework may facilitate future hypothesis generation and contribute to the development of strategies for detecting the earliest cognitive alteration in AD.  

      [#7] (2) The authors state in the abstract/introduction that such computational modelling could be most beneficial for the early detection of memory disorders. The deficits observed here are pronounced in the older animals. It will help to further clarify if these older animals model the early stages of the disease. Do the authors expect severe deficits in this mouse model at even later time points?

      The early stage of the disease is marked by abnormal biomarkers associated with Aβ accumulation and neuroinflammation, while cognitive symptoms are mild or absent. This stage can persist for several years, during which the level of Aβ may reach a plateau. As the disease progresses, tau pathology and neurodegeneration emerge and drive the transition into the late stage and the onset of dementia. The App<sup>NL-G-F</sup> knock-in mice recapitulate the features present in the early stage (Saito et al., 2014), where extensive Aβ accumulation and neuroinflammation worsen with age (Figure 2 – figure supplement 1). Since the App<sup>NL-G-F</sup> knock-in model centers on Aβ pathology without tauopathy and neurodegeneration, it should be noted that it does not represent the full spectrum of the disease even at advanced ages. Therefore, older animals still model the early stages of the disease and are suitable for studying the long-term effects of Aβ accumulation and neuroinflammation.

      The age tested in previous reports using App<sup>NL-G-F</sup> mice spanned a wide range from 2 months old to 24 months old. Different behavioral tasks have varied sensitivity but overall suggest the dysfunction worsens with aging (Bellio et al., 2024; Mehla et al., 2019; Sakakibara et al., 2018). We have tested the reinstatement experiment with 17-month-old App<sup>NL-G-F</sup> mice before (Author response image 2). They showed more advanced deficits with the same trends observed in 12-month-old App<sup>NL-G-F</sup> mice, but their freezing rates were overall at a lower level. There is a concern that possible hearing loss may affect the results and interpretation, therefore we decided to focus on 12-month-old data.

      Author response image 2.

      Freezing rate across reinstatement paradigm in the 17-month-old App<sup>NL-G-F</sup> mice. Dashed and solid lines indicate the median freezing rate over 34 mice before (preCS) and during (CS) tone presentation, respectively. Red, blue, and yellow backgrounds represent acquisition, extinction, and unsignaled shock in Figure 2A. The dashed vertical line separates the extinction 1 and extinction 2 phases.

      [#8] (3) There are quite a few 'marginal' p-values in the paper at p>0.05 but near it. Should we accept them all as statistically significant? The authors need to clarify if all the experimental groups are sufficiently powered.

      For our study, we decided a priori that p < 0.05 would be considered statistically significant, as described in the Materials and Methods. Therefore, in our Results, we did not consider these marginal values as statistically significant but reported the trend, as they may indicate substantive significance.

      We described our power analysis method in the manuscript Line 897-898 and have provided the results in Tables S21 and S22.

      [#9] (4) The authors emphasize here that such computational modelling enables us to study the underlying 'reasoning' of the patient (in the abstract and introduction), I do not see how this is the case. The model states that there is a latent i.e. another underlying variable that was not previously considered.

      Our use of the term “reasoning” was to distinguish the internal model, which describes how an agent makes sense of the world, from other generative models implemented for biomarker and disease progression prediction. However, we agree that using “reasoning” may be misleading and imprecise, so to reduce ambiguity we have removed this word in our manuscript Line 27: Nonetheless, internal models of the patient remain underexplored in AD; Line 85: However, previous approaches did not suppose an internal model of the world to predict future from current observation given prior knowledge.   

      [#10] (5) The authors combine knock-in mice with controls to compute correlations of parameters of the model with behavior of animals (e.g. Figure 4B and Figure 5B). They run the risk of spurious correlations due to differences across groups, which they have indeed shown to exist (Figure 4A and 5A). It would help to show within-group correlations between DI and parameter fit, at least for the control group (which has a large spread of data).

      We agree that genotype (control, App<sup>NL-G-F</sup>) could be a confounder between the estimated parameters and DI, thereby generating spurious correlations. To address this concern, we have provided within-group correlations in Figure 4 – figure supplement 2 for the 12-month-old group and Figure 5 – figure supplement 2 for the 6-month-old group.

      In the 12-month-old group, the significant positive correlation between σ<sub>x</sub><sup>2</sup> and DI remained in both control and App<sup>NL-G-F</sup> mice even after we adjusted for the genotype effect, suggesting that it is very unlikely that the correlations in Figure 4B are due to genotype-related confounding. On the other hand, the positive correlation between α and DI was found to be significant in the control mice but not in the App<sup>NL-G-F</sup> mice. Most α values were distributed around the lower bound in App<sup>NL-G-F</sup> mice, which possibly reduced the variance and correlation coefficient. These results support our original conclusion that α and σ<sub>x</sub><sup>2</sup> are parameters associated with a lower magnitude of reinstatement in aged App<sup>NL-G-F</sup> mice.

      In the 6-month-old group, the correlations shown in Figure 5B were not preserved within subgroups, suggesting genotype would be a confounder for α, σ<sub>x</sub><sup>2</sup>, and DI. We recognized that significant correlations in Figure 5B may arise from group differences, increased sample size, or greater variance after combining control and App<sup>NL-G-F</sup> mice. 

      Therefore, we concluded that α and σ<sub>x</sub><sup>2</sup> are associated with the magnitude of reinstatement but modulated by the genotype effect depending on the age. 

      We have added interpretations of within-group correlation in the manuscript Line 307-308, 375-378.
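      The confounding concern addressed in this reply can be made concrete with a small sketch. It uses made-up data in which the parameter and DI are unrelated within each genotype but the group means differ, so that pooling the groups produces a strong correlation. The data, variable names, and the choice of Pearson's r are illustrative assumptions; the manuscript's own statistics are described in its Methods.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: no within-group relationship, but different group means.
control = {"param": [1.0, 2.0, 3.0, 2.0], "di": [0.8, 0.7, 0.8, 0.9]}
knock_in = {"param": [0.1, 0.2, 0.3, 0.2], "di": [0.3, 0.2, 0.3, 0.4]}

r_control = pearson_r(control["param"], control["di"])
r_knock_in = pearson_r(knock_in["param"], knock_in["di"])

# Pooling both genotypes mixes within-group and between-group variation.
r_pooled = pearson_r(
    control["param"] + knock_in["param"],
    control["di"] + knock_in["di"],
)
```

      Here both within-group coefficients are essentially zero while the pooled coefficient is strongly positive, which is why within-group correlations are the relevant check for a genotype confound.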

      [#11] (6) It is unclear to me why overgeneralization of internal states will lead to the animals having trouble recalling a memory. Would this not lead to overgeneralization of memory recall instead?

      We assume that the reviewer is referring to “overgeneralization of internal states,” a case in which the animal’s internal state remained the same regardless of the observation, thereby leading to “overgeneralization of memory recall.” We agree that this could be one possible situation and appears less problematic than the case in which this memory is no longer retrievable. 

      However, in our manuscript, we did not deal with the case of “overgeneralization of internal states”. Rather, our findings illustrated how the memory modification process falls into overgeneralization or overdifferentiation and how it adversely affects the retention of competing memories, thereby causing App<sup>NL-G-F</sup> mice to have trouble recalling the same memory as the control mice. 

      According to the latent cause model, retrieval failure is explained by a mismatch of internal states, namely when an agent perceives that the current cue does not match a previously experienced one, the old latent cause is less likely to be inferred due to its low likelihood (Gershman et al., 2017). For example, if a mouse exhibited higher CR in test 2, it would be interpreted as a successful fear memory retrieval due to overgeneralization of the fear memory. However, it reflects a failure of extinction memory retrieval due to the mismatch between the internal states at extinction and test 2. This is an example that overgeneralization of memory induces the failure of memory retrieval. 

      On the other hand, App<sup>NL-G-F</sup> mice exhibited higher CR in test 1, which is conventionally interpreted as a successful fear memory retrieval. When estimating their internal states, they would infer that their observation in test 1 closely matches those under the acquisition latent causes; that is, the fear memory is overgeneralized, as shown by a higher posterior probability of acquisition latent causes in test 1 (Figure 4 – figure supplement 3). This is an example in which overgeneralization of memory does not always induce retrieval failure, as we explained in the reply to comment #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for the assessment of memory-based tasks may provide improved early detection of Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in the production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      [#12] (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent states by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory state after reinstatement, at least based on the examples shown in Figures 1 and 3. Rather, the model appears to mainly modify the weights in the most recent state, putatively the 'extinction state', during reinstatement. Of course, the authors must rely on how the model fits the data, but this seems problematic based on prior research indicating that reinstatement is most likely due to the reactivation of the acquisition memory. This may call into question whether the model is successfully modeling the underlying processes or states that lead to behavior and whether this is a valid approach for AD.

      We thank the reviewer for insightful comments. 

      We agree that, as demonstrated in Gershman et al. (2017), the latent cause model accounts for spontaneous recovery via the inference of new latent causes during extinction and the temporal compression property provided by the prior. Moreover, it was also demonstrated that even a relatively low posterior can drive behavioral expression if the weight in the acquisition latent cause is preserved. For example, when the interval between retrieval and extinction was long enough that acquisition latent cause was not dominant during extinction, spontaneous recovery was observed despite the posterior probability of acquisition latent cause (C1) remaining below 0.1 in Figure 11D of Gershman et al. (2017). 

      In our study, a high response in test 3 (reinstatement) is explained by both acquisition and extinction latent causes. The former preserves the associative weight of the initial fear memory, while the latter has w<sub>context</sub> learned in the unsignaled shock phase. These positive w were weighted by their posterior probability and together contributed to increased expected shock in test 3. Although the posterior probability of the acquisition latent cause was lower than that of the extinction latent cause in test 3 due to the passage of time, this is a parallel instance of the case mentioned above. To clarify their contributions to reinstatement, we have conducted additional simulations, which are discussed in our reply to the reviewer’s next comment (see the reply to comment #13).
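      The weighting described above can be sketched numerically. The snippet assumes the expected US is the posterior-weighted sum of each latent cause's associative prediction, in line with the statement that the positive weights are weighted by their posterior probability. All posteriors, weights, and names below are hypothetical values chosen only for illustration, not estimates from the paper.

```python
def expected_us(posteriors, weights, stimulus):
    """Posterior-weighted expected US across latent causes.

    posteriors: probability of each latent cause on this trial
    weights:    one dict per cause mapping stimulus name -> associative weight
    stimulus:   dict mapping stimulus name -> 1.0 if present, 0.0 if absent
    """
    total = 0.0
    for q, w in zip(posteriors, weights):
        prediction = sum(w[name] * stimulus.get(name, 0.0) for name in w)
        total += q * prediction
    return total


# Hypothetical state in test 3: the acquisition cause has a low posterior but
# keeps a strong CS weight; the extinction cause dominates the posterior and
# gained a context weight during the unsignaled shock.
posteriors = [0.1, 0.9]  # [acquisition cause, extinction cause]
weights = [
    {"cs": 0.9, "context": 0.1},  # acquisition cause
    {"cs": 0.0, "context": 0.4},  # extinction cause
]

context_and_cs = expected_us(posteriors, weights, {"cs": 1.0, "context": 1.0})
context_only = expected_us(posteriors, weights, {"cs": 0.0, "context": 1.0})
cs_specific_gain = context_and_cs - context_only
```

      With these illustrative numbers, most of the expected shock comes from the context weight of the extinction cause, while the preserved CS weight of the acquisition cause adds a CS-specific increment despite its low posterior.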

      We recognize that our results might appear to deviate from the notion that reinstatement results from the strong reactivation of acquisition memory, where one would expect a high posterior probability of the acquisition latent cause. However, we would like to emphasize that the return of fear emerges from the interplay of competing memories. Previous studies have shown that contextual or cued fear reinstatement involves a neural activity switch back to fear state in the medial prefrontal cortex (mPFC), including the prelimbic cortex and infralimbic cortex, and the amygdala, including ventral intercalated amygdala neurons (ITCv), medial subdivision of central nucleus of the amygdala (CeM), and the basolateral amygdala (BLA) (Giustino et al., 2019; Hitora-Imamura et al., 2015; Zaki et al., 2022). We speculate that such transition is parallel to the internal states change in the latent cause model in terms of posterior probability and associative weight change.

      Optogenetic manipulation experiments have further revealed how fear and extinction engrams contribute to extinction retrieval and reinstatement. For instance, Gu et al. (2022) used a cued fear conditioning paradigm and found that inhibition of extinction engrams in the BLA, ventral hippocampus (vHPC), and mPFC after extinction learning artificially increased freezing to the tone cue. Similar results were observed in contextual fear conditioning, where silencing extinction engrams in the hippocampus dentate gyrus (DG) impaired extinction retrieval (Lacagnina et al., 2019). These results suggest that weakening the extinction memory can induce a return of the fear response even without a reminder shock. On the other hand, Zaki et al. (2022) showed that inhibition of fear engrams in the BLA, DG, or hippocampus CA1 attenuated contextual fear reinstatement. However, they also reported that stimulation of these fear engrams was not sufficient to induce reinstatement, suggesting these fear engrams only partially account for reinstatement.

      In summary, reinstatement likely results from bidirectional changes in the fear and extinction circuits, supporting our interpretation that both acquisition and extinction latent causes contribute to the reinstatement. Although it remains unclear whether these memory engrams represent latent causes, one possible interpretation is that w<sub>context</sub> update in extinction latent causes during unsignaled shock indicates weakening of the extinction memory, while preservation of w in acquisition latent causes and their posterior probability suggests reactivation of previous fear memory. 

      [#13] (2) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual US learning during the US re-exposure or to increased response to the CS - presumably caused by reactivation of the acquisition memory. For example, the instance of the model shown in Figure 1 indicates that the 'extinction state', or state z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. By not comparing the difference in the evoked freezing CR at the test (ITI vs CS period), the purpose of the reinstatement test is lost in the sense of whether a previous memory was reactivated - was the response to the CS restored above and beyond the freezing to the context? 
      I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.

      To clarify the contribution of context, we have provided the preCS freezing rate across trials in Figure 2 – figure supplement 2. As the reviewer pointed out, the preCS freezing rate did not remain at the same level across trials, especially within the 12-month-old control and App<sup>NL-G-F</sup> groups (Figure 2 – figure supplement 2A and 2B), suggesting an effect of context. A paired-samples t-test comparing preCS freezing (Figure 2 – figure supplement 2E) and CS freezing (Figure 2E) in test 3 revealed significant differences in all groups: 6-month-old control, t(23) = -6.344, p < 0.001, d = -1.295; 6-month-old App<sup>NL-G-F</sup>, t(24) = -4.679, p < 0.001, d = -0.936; 12-month-old control, t(23) = -4.512, p < 0.001, d = -0.921; 12-month-old App<sup>NL-G-F</sup>, t(24) = -2.408, p = 0.024, d = -0.482. These results indicate that the response to the CS was above and beyond the response to context alone. We also compared the change in freezing rate (CS freezing rate minus preCS freezing rate) in test 2 and test 3 to examine the net response to the tone. A significant difference was found in the control groups, but not in the App<sup>NL-G-F</sup> groups (Author response image 3). The increased net response to the tone in the control groups suggests that the reinstatement was partially driven by reactivation of acquisition memory, not solely by the contextual US learning during the unsignaled shock phase. We have added these results and discussion in the manuscript Line 220-231.

      Author response image 3.

      Net freezing rate in test 2 and test 3. Net freezing rate is defined as the CS freezing rate (i.e., freezing rate during the 1 min CS presentation) minus the preCS freezing rate (i.e., freezing rate during the 1 min before CS presentation). The dashed horizontal line indicates no change in freezing rate from the preCS period to the CS presentation. *p < 0.05 by paired-sample Student’s t-test, where the alternative hypothesis specifies that the freezing rate change in test 2 is less than that in test 3. Colors indicate different groups: orange represents 6-month-old control (n = 24), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 25), pink represents 12-month-old control (n = 24), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 25). Each black dot represents one animal. Statistical results were as follows: t(23) = -1.927, p = 0.033, Cohen’s d = -0.393 in 6-month-old control; t(24) = -1.534, p = 0.069, Cohen’s d = -0.307 in 6-month-old App<sup>NL-G-F</sup>; t(23) = -1.775, p = 0.045, Cohen’s d = -0.362 in 12-month-old control; t(24) = 0.86, p = 0.801, Cohen’s d = 0.172 in 12-month-old App<sup>NL-G-F</sup>.

      According to the latent cause model, if the reinstatement is merely induced by an association between the context and the US in the unsignaled shock phase, the CR given context only and that given context and CS in test 3 should be equal. However, the simulation conducted for each mouse using their estimated parameters confirmed that this was not the case in this study. The results showed that simulated CR was significantly higher in the context+CS condition than in the context only condition (Author response image 4). This trend is consistent with the behavioral results we mentioned above.

      Author response image 4.

      Simulation of the context effect in test 3. The estimated parameter set of each sample was used to run a simulation in which only the context, or the context with the CS, was present in test 3 (trial 36). The data are shown as median with interquartile range, where white bars with colored lines represent CR for context only and colored bars represent CR for context with CS. Colors indicate different groups: orange represents 6-month-old control (n = 15), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 12), pink represents 12-month-old control (n = 20), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 18). Each black dot represents one animal. **p < 0.01, and ***p < 0.001 by Wilcoxon signed-rank test comparing context only and context + CS in each group, where the alternative hypothesis specifies that CR in context only is not equal to CR in context with CS. Statistical results were as follows: W = 15, p = 0.008, effect size r = -0.66 in 6-month-old control; W = 0, p < 0.001, effect size r = -0.88 in 6-month-old App<sup>NL-G-F</sup>; W = 25, p = 0.002, effect size r = -0.67 in 12-month-old control; W = 9, p = 0.002, effect size r = -0.75 in 12-month-old App<sup>NL-G-F</sup>.

      [#14] (3) This is related to the second point above. If the question is about the memory processes underlying memory retrieval at the test following reinstatement, then I would argue that the model parameters that are not involved in testing this hypothesis be fixed prior to the test. Unlike the Gershman paper that the authors cited, the authors fit all parameters for each animal. Perhaps the authors should fit certain parameters on the acquisition and extinction phase, and then leave those parameters fixed for the reinstatement phase. To give a more concrete example, if the hypothesis is that AD mice have deficits in differentiating or retrieving latent states during reinstatement which results in the low response to the CS following reinstatement, then perhaps parameters such as the learning rate should be fixed at this point. The authors state that the 12-month-old AD mice have substantially lower learning rate measures (almost a 20-fold reduction!), which can be clearly seen in the very low weights attributed to the AD mouse in Figure 3D. Based on the example in Figure 3D, it seems that the reduced learning rate in these mice is most likely caused by the failure to respond at test. This is based on comparing the behavior in Figures 3C to 3D. The acquisition and extinction curves appear extremely similar across the two groups. It seems that this lower learning rate may indirectly be causing most of the other effects that the authors highlight, such as the low σx, and the changes to the parameters for the CR. It may even explain the extremely high K. Because the weights are so low, this would presumably lead to extremely low likelihoods in the posterior estimation, which I guess would lead to more latent states being considered as the posterior would be more influenced by the prior.

      We thank the reviewer for the suggestion about fitting and fixing certain parameters in different phases.

      However, this strategy may not be optimal for our study for the following scientific reasons.

      Our primary purpose is to explore internal states in the memory modification process that are associated with the deficit found in App<sup>NL-G-F</sup> mice in the reinstatement paradigm. We did not restrict the question to memory retrieval, nor did we have a particular hypothesis such that only a few parameters of interest account for the impaired associative learning or structure learning in App<sup>NL-G-F</sup> mice while all other parameters are comparable between groups. We are concerned that restricting questions to memory retrieval at the test is too parsimonious and might lead to misinterpretation of the results. As we explain in reply to comment #5, removing trials in extinction during parameter estimation reduces the model fit performance and runs the risk of overfitting within the individual. Therefore, we estimated all parameters for each animal, with the assumption that the estimated parameter set represents individual internal state (i.e., learning and memory characteristics) and should be fixed within the animal across all trials.  

      Figure 3 shows the parameter estimation and simulation results obtained by treating the median data of each group as an individual. The estimated parameter values are one possible case in that group and demonstrate how a typical learning curve fits the latent cause model. The “20-fold reduction in learning rate” mentioned by the reviewer is a comparison of two data points, not an actual comparison between groups. The comparison between control and App<sup>NL-G-F</sup> mice in the 12-month-old group for all parameters was provided in Table S7. The Mann-Whitney U test did not reveal a significant difference in learning rate (η): 12-month-old control (Mdn = 0.09, IQR = 0.23) vs. 12-month-old App<sup>NL-G-F</sup> (Mdn = 0.12, IQR = 0.23), U = 199, p = 0.587.

      We agree that a lower learning rate could bias learning toward inferring a new latent cause. However, this tendency may depend on the values of other parameters and vary across phases of the reinstatement paradigm. Here, we use ⍺ as an example and demonstrate the interaction in Appendix 2 – table 2 with relatively extreme values, ⍺ = {1, 3} and η = {0.01, 0.5}, while the rest of the parameters were fixed at their initial guess values.

      When ⍺ = 1, the numbers of latent causes across phases (K<sub>acq</sub>, K<sub>ext</sub>, K<sub>rem</sub>) remained unchanged and their posterior probabilities in test 3 were comparable even when η increased from 0.01 to 0.5. This is an example in which a lower η does not lead to inferring new latent causes because ⍺ is low. The effect of a low learning rate manifests in the test 3 CR through low w<sub>context, acq</sub> and w<sub>context, ext</sub>.

      When ⍺ = 3, the number of acquisition latent causes (K<sub>acq</sub>) was higher for η = 0.01 than for η = 0.5, showing the effect mentioned by the reviewer. However, the test 1 CR is much lower when η = 0.01, indicating unsuccessful learning even after inferring a new latent cause. This case was not observed in this study. During the extinction phases, the effect of η is outweighed by the effect of high ⍺: the number of extinction latent causes (K<sub>ext</sub>) is high and not affected by η. After the extinction phases, the effect of K kicks in as the total number of latent causes reaches its value (K = 33 in this example), especially in the case of η = 0.01. A new latent cause is inferred after extinction in the condition of η = 0.5, but the test 3 CR is still high because w<sub>context, acq</sub> and w<sub>context, ext</sub> are high. This is an example in which a new latent cause is inferred in spite of a higher η.

      Overall, the learning rate would not have a prominent effect alone throughout the reinstatement paradigm, and it has a joint effect with other parameters. Note that the example here did not cover our estimated results, as the estimated learning rate was not significantly different between control and App<sup>NL-G-F</sup> mice (see above). Please refer to the reply to comment #31 for more discussion about the interaction among parameters when the learning rate is fixed. We hope this clarifies the reviewer’s concern.

      [#15] (4) Why didn't the authors use the latent causal model on the Barnes maze task? The authors mention in the discussion that different cognitive processes may be at play across the two tasks, yet reversal tasks have been suggested to be solved using latent states to be able to flip between the two different task states. In this way, it seems very fitting to use the latent cause model. Indeed, it may even be a better way to assess changes in σx as there are presumably 12 observable stimuli/locations.

      Please refer to our provisional response about the application of the latent cause model to the reversal Barnes maze task. Briefly, it would be difficult to directly apply the latent cause model to the Barnes maze data because this task involves operant learning, and thereby almost all conditions in the latent cause model are not satisfied. Please also see our reply to comment #24 for the discussion of the link between the latent cause model and Barnes maze task. 

      Reviewer #2 (Recommendations for the authors):

      [#16] (1) I had a bit of difficulty finding all the details of the model. First, I had to mainly rely on the Gershman 2017 paper to understand the model. Even then, there were certain aspects of the model that were not clear. For instance, it's not quite clear to me when the new internal states are created and how the maximum number of states is determined. After reading the authors' methods and the Gershman paper, it seems that a new internal state is generated at each time point, aka zt, and that the prior for that state decays onwards from alpha. Yet because most 'new' internal states don't ever take on much of a portion of the posterior, most of these states can be ignored. Is that a correct understanding? To state this another way, I interpret the equation on line 129 to indicate that the prior is determined by the power law for all existing internal states and that each new state starts with a value of alpha, yet I don't see the rule for creating a new state, or for iterating k other than that k iterates at each timestep. Yet this seems to not be consistent with the fact that the max number of states K is also a parameter fit. Please clarify this, or point me to where this is better defined.

      I find this to be an important question for the current paper as it is unclear to me when the states were created. Most notably, in Figure 3, it's important to understand why there's an increase in the posterior of z<sub>5</sub> in the AD 12-month mice at test. Is state z<sub>5</sub> generated at trial 5? If so, the prior would be extremely small by trial 36, making it even more perplexing why z<sub>5</sub> has such a high posterior. If its weights are similar to z<sub>3</sub> and z<sub>4</sub>, and they have been much more active recently, why would z<sub>5</sub> come into play?

      We assume that the “new internal state" the reviewer is referring to is the “new latent cause." We would like to clarify that “internal state" in our study refers to all the latent causes at a given time point and observation. As this manuscript is submitted as a Research Advance article in eLife, we did not rephrase all the model details. Here, we explain when a new latent cause is created (i.e., the prior probability of a new latent cause is greater than 0) with the example of the 12-month-old group (Figure 3C and 3D). 

      Suppose that, before the start of each trial, an agent has inferred the most likely latent cause with the maximum posterior, and that it has inferred k latent causes so far. A new latent cause can be inferred when the prior of latent causes is computed at the beginning of each trial.

      In the latent cause model, the prior follows a distance-dependent Chinese Restaurant Process (CRP; Blei and Frazier, 2011). The prior of each old latent cause is its posterior probability, which is the final count of the EM update before the current trial. In addition, the prior of old latent causes is sensitive to the passage of time, so that it exponentially decreases as a forgetting function modulated by g (see Figure 2 in Gershman et al., 2017). Simultaneously, the prior of a new cause is assigned ⍺. The new latent cause is inferred at this moment. Hence, the prior of latent causes is jointly determined by ⍺, g, and the posterior probability. The maximum number of latent causes K is set a priori and does not affect the prior while k < K (see also the reply to comment #30 for the discussion of the boundary set for K and comment #31 for the discussion of the interaction between ⍺ and K). Note that only one new latent cause can be inferred in each trial, and the (k+1)<sup>th</sup> latent cause, which has never been inferred so far, is chosen as the new latent cause.

      In our manuscript, the subscript number in z<sub>k</sub> denotes the order in which the latent causes were inferred, not the trial number. In Figures 3C and 3D, z<sub>3</sub> and z<sub>4</sub> were inferred in trials 5 and 6 during extinction; z<sub>5</sub> is a new latent cause inferred in trial 36. Therefore, the prior of z<sub>5</sub> is not extremely small compared to z<sub>4</sub> and z<sub>3</sub>.
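
The prior computation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the manuscript: the function name and arguments are hypothetical, and the forgetting function is assumed here to be a power-law kernel of the elapsed time raised to -g, so the exact kernel should be taken from the manuscript and Gershman et al. (2017).

```python
def latent_cause_prior(past_posteriors, past_times, t_now, alpha, g, K_max):
    """Normalized prior over the k old latent causes plus, if allowed, one new cause.

    past_posteriors: one list per past trial with the posterior of each old cause.
    past_times:      time of each past trial (same units as t_now).
    Assumption: the temporal kernel is (t_now - t_past) ** (-g).
    """
    k = len(past_posteriors[0]) if past_posteriors else 0
    scores = []
    for j in range(k):
        # Old cause: time-discounted sum of its past posterior probabilities.
        scores.append(sum(q[j] * (t_now - t) ** (-g)
                          for q, t in zip(past_posteriors, past_times)))
    if k < K_max:
        # A single new, never-inferred cause receives mass alpha.
        scores.append(alpha)
    total = sum(scores)
    return [s / total for s in scores]


# One past trial fully assigned to cause 1 (hypothetical values).
short_gap = latent_cause_prior([[1.0]], [1.0], 2.0, alpha=1.0, g=1.0, K_max=10)
long_gap = latent_cause_prior([[1.0]], [1.0], 3.0, alpha=1.0, g=1.0, K_max=10)
print(short_gap, long_gap)
```

In this toy case the prior of the new cause rises from 1/2 to 2/3 as the gap since the last trial grows, and it is removed entirely once k reaches K_max.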

      In both control and App<sup>NL-G-F</sup> mice in the 12-month-old group (Figures 3C and 3D), z<sub>3</sub> is dominant until trial 35. The unsignaled shock at trial 35 generates a large prediction error, as only the context is presented and followed by the US. This prediction error reduces the posterior of z<sub>3</sub>, while increasing the posterior of z<sub>4</sub> and w<sub>context</sub> in z<sub>3</sub> and z<sub>4</sub>. This decrease in the posterior of z<sub>3</sub> is more obvious in the App<sup>NL-G-F</sup> group than in the control group, prompting the former to infer a new latent cause z<sub>5</sub> (Figures 3C and 3D). Although Figures 3C and 3D are illustrative examples, as we explained in the reply to comment #14, this interpretation would be plausible because the App<sup>NL-G-F</sup> group inferred a significantly larger number of latent causes after the extinction, with slightly higher posteriors, than the control group (Figure 4E).
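
How an unsignaled shock can shift the posterior away from a dominant latent cause can be sketched as below. This is a deliberately reduced illustration under stated assumptions: only a Gaussian likelihood of the US is used (the stimulus likelihood that the full model also includes is omitted), and the function names, priors, weights, and variance are hypothetical values rather than estimates from the study.

```python
import math


def gaussian_pdf(value, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(value - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def posterior_over_causes(prior, weights, x, r, sigma_r2):
    """Posterior proportional to prior times the likelihood of the US under each cause.

    weights: one weight vector per latent cause, ordered as [w_CS, w_context].
    x:       stimulus vector, ordered as [CS, context].
    r:       observed US (1 = shock, 0 = no shock).
    """
    unnormalized = []
    for p_k, w_k in zip(prior, weights):
        predicted_us = sum(w * xi for w, xi in zip(w_k, x))
        unnormalized.append(p_k * gaussian_pdf(r, predicted_us, sigma_r2))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Context-only trial followed by a shock (hypothetical numbers).
prior = [0.7, 0.3]                    # the first cause dominates before the trial
weights = [[0.0, 0.0], [0.0, 0.5]]    # the second cause predicts some shock from context
post = posterior_over_causes(prior, weights, x=[0.0, 1.0], r=1.0, sigma_r2=0.1)
print(post)
```

In this toy setting the cause with the larger prediction error loses most of its posterior mass even though it had the larger prior.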

      [#17] (2) Related to the above, Are the states z<sub>A</sub> and z<sub>B</sub> defined by the authors to help the reader group the states into acquisition and extinction states, or are they somehow grouped by the model? If the latter is true, I don't understand how this would occur based on the model. If the former, could the authors state that these states were grouped together by the author?

      We used z<sub>A</sub> and z<sub>B</sub> annotations to assist with the explanation, so this is not grouped by the model. We have stated this in the manuscript Line 181-182.

      [#18] (3) This expands on the third point above. In Figure 3D, internal states z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> appear to be pretty much identical in weights in the App group. It's not clear to me why then the posterior of z<sub>5</sub> would all of a sudden jump up. If I understand correctly, the posterior is the likelihood of the observations given the internal state (presumably this should be similar across z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub>), multiplied by the prior of the state. z<sub>3</sub> and z<sub>4</sub> are the dominant inferred states up to step 36. Why would z<sub>5</sub> become more likely if there doesn't appear to be any error? I'm inferring no error because there are little or no changes in weights on trial 36, most prominently no changes in z<sub>3</sub>, which is the dominant internal state in step 36. If there's little change in weights, or no errors, shouldn't the prior dominate the calculation of the posterior, which would lead to z<sub>3</sub> and z<sub>4</sub> being most prominent at trial 36?

      We have explained how z<sub>5</sub> of the 12-month-old App<sup>NL-G-F</sup> was inferred in the reply to comment #16. Here, we explain the process underlying the rapid changes of the posterior of z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> from trial 35 to 36.

      During the extinction, the mice inferred z<sub>3</sub> given the CS and the context in the absence of the US. In trial 35, they observed the context and the unsignaled shock in the absence of the CS. This reduced the likelihood of the CS under z<sub>3</sub> and thereby the posterior of z<sub>3</sub>, while relatively increasing the posterior of z<sub>4</sub>. The associative weight between the context and the US, w<sub>context</sub>, indeed increased in both z<sub>3</sub> and z<sub>4</sub>, but w<sub>context</sub> of z<sub>4</sub> was updated more than that of z<sub>3</sub> due to its higher posterior probability. At the beginning of trial 36, a new latent cause z<sub>5</sub> was inferred with a certain prior (see also the reply to comment #16), and w<sub>5</sub> = w<sub>0</sub>, where w<sub>0</sub> is the initial value of the weight. After normalizing the prior over latent causes, the emergence of z<sub>5</sub> reduced the prior probability of the other latent causes compared to the case where the prior of z<sub>5</sub> is 0. Since the CS was presented while the US was absent in trial 36, the likelihood of the CS and that of the US under z<sub>3</sub>, and especially z<sub>4</sub>, given the cues and w, became lower than in the case in which z<sub>5</sub> has not been inferred yet. Consequently, the posterior of z<sub>5</sub> became salient (Figure 3D).
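
The statement that a latent cause with a higher posterior receives a larger weight update can be illustrated with a posterior-weighted delta rule. This is a minimal sketch, assuming an update of the form Δw = η · x · q · (r − w·x) as in the general description of Gershman et al. (2017); the function name and the numerical values are hypothetical.

```python
def update_weights(weights, x, r, q, eta):
    """One posterior-weighted delta-rule step for every latent cause.

    weights: one weight vector per latent cause, ordered as [w_CS, w_context].
    x:       stimulus vector, ordered as [CS, context].
    r:       observed US.
    q:       posterior probability of each latent cause on this trial.
    eta:     learning rate.
    """
    updated = []
    for w_k, q_k in zip(weights, q):
        predicted_us = sum(w * xi for w, xi in zip(w_k, x))
        delta = q_k * (r - predicted_us)   # prediction error scaled by the posterior
        updated.append([w + eta * xi * delta for w, xi in zip(w_k, x)])
    return updated


# Unsignaled shock: context present, CS absent, US delivered (hypothetical values).
new_w = update_weights(weights=[[0.0, 0.0], [0.0, 0.0]],
                       x=[0.0, 1.0], r=1.0, q=[0.2, 0.8], eta=0.5)
print(new_w)
```

Here the context weight of the cause with posterior 0.8 grows four times as much as that of the cause with posterior 0.2, while the CS weights stay unchanged because the CS was absent.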

      To maintain consistency across panels, we used a uniform y-axis range. However, we acknowledge that this may make it harder to notice the changes of associative weights in Figure 3D. We have provided the subpanel in Figure 3D with a smaller y-axis limit to reveal the weight changes at trial 35 in Author response image 5.

      Author response image 5.

      Magnified view of w<sub>context</sub> and w<sub>CS</sub> in the last 3 trials in Figure 3D. The graph format is the same as in Figure 3D. The weight for the CS (w<sub>CS</sub>) and that for the context (w<sub>context</sub>) in each latent cause across trials 34 (test 2), 35 (unsignaled shock), and 36 (test 3) in 12-month-old App<sup>NL-G-F</sup> in Figure 3D are magnified in the upper and lower magenta boxes, respectively.

      [#19] (8) In Figure 4B - The figure legend didn't appear to indicate at which time points the DIs are plotted.

      We have amended the figure legend to indicate that DI between test 3 and test 1 is plotted.

      [#20] (9) Lines 301-303 state that the posterior probabilities of the acquisition internal states in the 12-month AD mice were much higher at test 1 and that this resulted in different levels of CR across the control and 12-month App group. This is shown in the Figure 4A supplement, but this is not apparent in Figure 3 panels C and D. Is the example shown in panel D not representative of the group? The CRs across the two examples in Figure 3 C and D look extremely similar at test 1. Furthermore, the posteriors of the internal states look pretty similar across the two groups for the first 4 trials. Both the App and control have substantial posterior probabilities for the acquisition period, I don't see any additional states at test 1. The pattern of states during acquisition looks strikingly similar across the two groups, whereas the weights of the stimuli are considerably different. I think it would help the authors to use an example that better represents what the authors are referring to, or provide data to illustrate the difference. Figure 4C partly shows this, but it's not very clear how strong the posteriors are for the 3rd state in the controls.

      Figure 3 serves as an example to explain the internal states in each group (see also the third paragraph in the reply to comment #14). Figure 4C to H show the results from each sample for between-group comparison of selected features. Therefore, the results of direct comparisons of the parameter values and internal states between genotypes in Figure 3 are not necessarily the same as those in Figure 4. Both examples in Figure 3C and 3D inferred 2 latent causes during the acquisition. In terms of the posterior up to test 1 (trial 4), the two examples could therefore look the same. Such examples were not rare, but their frequency differed between groups: the proportion of mice that inferred 2 latent causes during the acquisition was slightly lower than 50% in the control mice and around 90% in the App<sup>NL-G-F</sup> mice (Figure 4C). The posterior probability of the acquisition latent cause in test 1 showed a similar pattern (Figure 4 – figure supplement 3), with values near 1 in around 50% of the control mice and around 90% of the App<sup>NL-G-F</sup> mice.

      [#21] (10) Line 320: This is a confusing sentence. I think the authors are saying that because the App group inferred a new state during test 3, this would protect the weights of the 'extinction' state as compared to the controls since the strength of the weight updates depends on the probability of the posterior.

      In order to address this, we have revised this sentence to “Such internal states in App<sup>NL-G-F</sup> mice would diverge the associative weight update from those in the control mice after extinction.” in the manuscript Line 349-351.

      [#22] (11) In lines 517-519 the authors address the difference in generalizing the occurrence of stimuli across the App and control groups. It states that App mice with lower alpha generalized observations to an old cause rather than attributing it as a new state. Going back to statement 3 above, I think it's important to show that the model fit of a reduction in alpha does not go hand-in-hand with a reduction in the learning rates and hence the weights. Again, if the likelihoods are diminished due to the low weights, then the fit of alpha might be reduced as well. To reiterate my point above, if the observations in changes in generalization and differentiation occur because of a reduction in the learning rate, the modeling may not be providing a particularly insightful understanding of AD, other than that poor learning leads to ineffectual generalization and differentiation. Do these findings hold up if the learning rates are more comparable across the control and App group?

      These findings were explained on the basis of comparable learning rates between control and App<sup>NL-G-F</sup> mice in the 12-month-old group (see the reply to comment #14). In addition, we have conducted simulations for different ⍺ and σ<sub>x</sub><sup>2</sup> values under the condition of a fixed learning rate, where overgeneralization and overdifferentiation still occurred (see the reply to comment #26).

      [#23] (12) Lines 391 - 393. This is a confusing sentence. "These results suggest that App NL-G-F mice could successfully form a spatial memory of the target hole, while the memory was less likely to be retrieved by a novel observation such as the absence of the escape box under the target hole at the probe test 1." The App mice show improved behavior across days of approaching the correct hole. Is this statement suggesting that once they've approached the target hole, the lack of the escape box leads to a reduction in the retention of that memory?

      We speculated that when the mice observed the absence of the escape box, a certain prediction error would be generated, which may have driven the memory modification. In App<sup>NL-G-F</sup> mice, such modification, either overgeneralization or overdifferentiation, could render the memory of the target hole vulnerable; if overgeneralization occurred, the memory would be quickly overwritten as the goal no longer exists in this position in this maze, while if overdifferentiation occurred, a novel memory such that the goal does not exist in the maze different from previous one would be formed. In either case of misclassification, the probability of retrieving the goal position would be reduced. To reduce ambiguity in this sentence, we have revised the description in the manuscript Line 432-434 as follows: “These results suggest that App<sup>NL-G-F</sup> mice could successfully form a spatial memory of the target hole, while they did not retrieve the spatial memory of the target hole as strongly as control mice when they observed the absence of the escape box during the probe test.”

      [#24] (13) The connection between the results of Barnes maze and the fear learning paradigm is weak. How can changes in overgeneralization due to a reduction in the creation of inferred states and differentiation due to a reduced σx lead to the observations in the Barnes maze experiment?

      We extrapolated our interpretation of the reinstatement modeling to behaviors in a different behavioral task, to explore the explanatory power of the latent cause framework in formalizing mechanisms of associative learning and memory modification. Here, we explain the results of the reversal Barnes maze paradigm in terms of the latent cause model, while referring to the reinstatement paradigm.

      Whilst we acknowledge that fear conditioning and spatial learning are not fully comparable, the reversal Barnes maze paradigm used in our study shares several key learning components with the reinstatement paradigm. 

      First, associative learning is fundamental in spatial learning (Leising & Blaisdell, 2009; Pearce, 2009). Although we did not make any specific assumptions about what kind of associations were learned in the Barnes maze, performance improvements in the learning phases likely reflect trial-and-error updates of these associations involving sensory preconditioning or secondary conditioning. Second, the reversal training phases could resemble the extinction phase in the reinstatement paradigm, as both challenge a previously established memory. In terms of the latent cause model, both the reversal learning phase in the reversal Barnes maze paradigm and the extinction phase in the reinstatement paradigm induce a mismatch of the internal state. This process likely introduces large prediction errors, triggering memory modification to reconcile competing memories.

      Under the latent cause framework, we posit that the mice would either infer new memories or modify existing memories for the unexpected observations in the Barnes maze (e.g., changed location or absence of escape box) as in the reinstatement paradigm, but learn a larger number of association rules between stimuli in the maze compared to those in the reinstatement. In the reversal Barnes maze paradigm, the animals would infer that a latent cause generates the stimuli in the maze at certain associative weights in each trial, and would adjust behavior by retaining competing memories.

      Both overgeneralization and overdifferentiation could explain the lower exploration time of the target hole in the App<sup>NL-G-F</sup> mice in probe test 1. In the case of overgeneralization, the mice would overwrite the existing spatial memory of the target hole with a memory that the escape box is absent. In the case of overdifferentiation, the mice would infer a new memory such that the goal does not exist in the novel field, in addition to the old memory where the goal exists in the previous field. In both cases, the App<sup>NL-G-F</sup> mice would not infer that the location of the goal is fixed at a particular point and would fail to retain competing spatial memories of the goal, leading them to rely on a less precise, non-spatial strategy to solve the task.

      Since there is no established way to formalize the Barnes maze learning in the latent cause model, we did not directly apply the latent cause model to the Barnes maze data. Instead, we used the view above to explore common processes in memory modification between the reinstatement and the Barnes maze paradigm. 

      The above description was added to the manuscript on page 13 (Line 410-414) and page 19-20 (Line 600-602, 626-639).

      [#25] (14) In the fear conditioning task, it may be valuable to separate responding to the context and the cue at the time of the final test. The mice can learn about the context during the reinstatement, but there must be an inference to the cue as it's not present during the reinstatement phase. This would provide an opportunity for the model to perhaps access a prior state that was formed during acquisition. This would be more in line with the original proposal by Gershman et al. 2017 with spontaneous recovery.

      Please refer to the reply to comment #13 regarding separating the response to context in test 3.  

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering the early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's ability to retain competing memories. These issues are evident in Figure 3:

      [#26] (1) The model misses key trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, the increase in fear at the start of day 2 of extinction (especially in controls), and the more rapid reinstatement of fear observed in older controls compared to acquisition.

      We acknowledge these limitations and explained why they arise in the latent cause model as follows.

      a. Absence of a fear response at the start of the experiment and the gradual learning of fear during acquisition 

      In the latent cause model, the CR is derived from a sigmoidal transformation from the predicted outcome with the assumption that its mapping to behavioral response may be nonlinear (see Equation 10 and section “Conditioned responding” in Gershman et al., 2017). 

      The magnitude of the unconditioned response (trial 1) is determined by w<sub>0</sub>, θ, and λ. An example was given in Appendix 2 – table 3. In general, a higher w<sub>0</sub> and a lower θ produce a higher trial 1 CR when other parameters are fixed. During the acquisition phase, once the expected shock exceeds θ, CR rapidly approaches 1, and further increases in expected shock produce few changes in CR. This rapid increase was also evident in the spontaneous recovery simulation (Figure 11) in Gershman et al. (2017). The steepness of this rapid increase is modulated by λ such that a higher value produces a shallower slope. This is a characteristic of the latent cause model, assuming CR follows a sigmoid function of expected shock, while the ordinal relationship over CRs is maintained with or without the sigmoid function, as Gershman et al. (2017) mentioned. If one assumes that the CR should be proportional to the expected shock, the model can reproduce the gradual response as a linear combination of w and posteriors of latent causes while omitting the sigmoid transformation (Figure 3). 
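
      The qualitative behavior described above can be illustrated with a minimal sketch. It assumes the sigmoid is a Gaussian cumulative distribution function of the expected shock centered on θ, with λ treated as a standard-deviation-like scale. The function name `conditioned_response` and the parameter values are hypothetical and are not taken from the fitted models.

```python
import math


def conditioned_response(expected_shock, theta, lam):
    """Map an expected shock to a CR in [0, 1] with a Gaussian-CDF sigmoid.

    Assumption: lam acts as a standard-deviation-like scale, so a larger
    lam gives a shallower slope around the threshold theta.
    """
    z = (expected_shock - theta) / lam
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative values only: sweep the expected shock from 0 to 1.
theta = 0.3
for lam in (0.05, 0.3):
    curve = [round(conditioned_response(v / 10, theta, lam), 3) for v in range(0, 11, 2)]
    print(f"lambda={lam}: {curve}")
```

      With a small λ the simulated CR jumps from near 0 to near 1 once the expected shock crosses θ, whereas a larger λ spreads the same transition over a wider range of expected shock.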

      b. Increase in fear at the start of day 2 extinction

      This point is partially reproduced by the latent cause model. As shown in Figure 3, trial 24 (the first trial of day 2 extinction) showed an increase in both the posterior probability of the latent cause retaining fear memory and the simulated CRs in all groups except the 6-month-old control group, though the increase in CR was small due to the sigmoid transformation (see above). The latent cause model explains this as follows: the 24 h time lapse between extinction 1 and 2 decreases the prior of the previously inferred latent cause, leading to an increase in those of other latent causes.

      Unlike other groups, the 6-month-old control did not exhibit increased observed CR at trial 24 but at trial 25 (Figure 3A). The latent cause model failed to reproduce it, as there was no increase in posterior probability in trial 24 (Figure 3A). This could be partially explained by the low value of g, which counteracts the effect of the time interval between days: lower g keeps the prior of the latent causes at the same level as in the previous trial. Despite some failures in capturing this effect, our fitting policy was set to optimize prediction among the test trials given our primary purpose of explaining reinstatement.

      c. More rapid reinstatement of fear observed in older controls compared to acquisition

      We would like to point out that this was replicated by the latent cause model as shown in Figure 3 – figure supplement 1C. The DI between test 3 and test 1 calculated from the simulated CR was significantly higher in 12-month-old control than in App<sup>NL-G-F</sup> mice (cf. Figure 2C to E).  

      [#27] (2) The model attributes the higher fear response in controls during reinstatement to a stronger association with the context from the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition. These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. overdifferentiation), but the model itself does not appear to capture these processes accurately. The authors could benefit from a model that better matches the data and that can capture the retention and recollection of a fear memory across phases.

      First, we would like to clarify that the latent cause model explains the reinstatement not only by the extinction latent cause with increased w<sub>context</sub> but also by the acquisition latent cause with preserved w<sub>CS</sub> and w<sub>context</sub> (see also reply to comment #13). Second, the latent cause model primarily attributes the higher fear reinstatement in control to a lower number of latent causes inferred after extinction (Figure 4E) and higher w<sub>context</sub> in extinction latent cause (Figure 4G). We noted that there was a trend toward significance in the posterior probability of latent causes inferred after extinction (Figure 4E), which in turn influences those of acquisition latent causes. Although the posterior probability of acquisition latent cause appeared trivial and no significance was detected between control and App<sup>NL-G-F</sup> mice (Figure 4C), it was suppressed by new latent causes in App<sup>NL-G-F</sup> mice (Author response image 6).

      This indicates that App<sup>NL-G-F</sup> mice retrieved acquisition memory less strongly than control mice. Therefore, we argue that the latent cause model attributed a higher fear response in control during reinstatement not solely to the stronger association with the context but also to CS fear memory from acquisition. Although we tested whether additional models fit the reinstatement data in individual mice, these models did not satisfy our fitting criteria for many mice compared to the latent cause model (see also reply to comment #4 and #28).

      Author response image 6.

      Posterior probability of acquisition, extinction, and after extinction latent causes in test 3. The values within each bar indicate the mean posterior probability of acquisition latent cause (darkest shade), extinction latent cause (medium shade), and latent causes inferred after extinction (lightest shade) in test 3 over mice within genotype. Source data are the same as those used in Figure 4C–E (posterior of z).

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current model falls short in doing so.

      Reviewer #3 (Recommendations for the authors):

      [#28] Other computational models may better capture the data. Ideally, I'd look for a model that can capture the gradual learning during acquisition, and, in some mice, the inferring of a new latent cause during extinction, allowing the fear memory to be retained and referenced at the start of day 2 extinction and during later tests.

      We have further evaluated another computational model, the latent state model, and compared it with the latent cause model. The simulation of reinstatement and parameter estimation method of the latent state model were described in the Appendix.

      The latent state model proposed by Cochran and Cisler (2019) shares several concepts with the latent cause model, and well replicates empirical data under certain conditions. We expect that it can also explain the reinstatement. 

      Following the same analysis flow as for the latent cause model, we estimated the parameters and simulated reinstatement in the latent state model from individual CRs and from their median. In the median freezing rate data of the 12-month-old control mice, the simulated CR replicated the observed CR well and exhibited the ideal features that the reviewer looked for: gradual learning during acquisition and an increase in fear at the start of the second-day extinction (Appendix 1 – figure 1G). However, many samples were not fit well by the latent state model. The number of anomalies was generally higher than that in the latent cause model (Appendix 1 – figure 2). Within the accepted samples, the sum of squared prediction errors over all trials was significantly lower in the latent state model, which resulted from lower prediction error in the acquisition trials (Appendix 1 – figure 4A and 4B). In the three test trials, the squared prediction error was comparable between the latent state model and the latent cause model except for the test 2 trials in the control group (Appendix 1 – figure 4A and 4B, rightmost panel). On the other hand, almost all accepted samples continued to infer the acquisition latent states during extinction without inferring new states (Appendix 1 – figure 5B and 5E, left panel), which differed from the ideal internal states the reviewer expected. Although the fit performance of the latent state model seems to be better than that of the latent cause model, the accepted samples cannot reproduce the lower DI between test 3 and test 1 in aged App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6C). These results make the latent state model less suitable for our purpose, and we therefore decided to stay with the latent cause model. It should also be noted that we did not explore the full parameter space of the latent state model; hence, we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the latent state model may be a valuable direction for future research.

      If you decide not to go with a new model, my preference would be to drop the current modeling. However, if you wish to stay with the current model, I'd like to see justification or acknowledgment of the following:

      [#29] (1) Lower bound on alpha of 1: This forces the model to infer new latent causes, but it seems that some mice, especially younger AD mice, might rely more on classical associative learning (e.g., Rescorla-Wagner) rather than inferring new causes.

      We acknowledge that the default value set in Gershman et al. (2017) is 0.1, and the constraint we set is a much higher value. However, ⍺ = 1 does not always force the model to infer new latent causes.

      In the standard form of the Chinese restaurant process (CRP), the prior that the n<sup>th</sup> observation is assigned to a new cluster is given by ⍺ / (n - 1 + ⍺) (Blei & Gershman, 2012). When ⍺ = 1, the prior of the new cluster for the 2nd observation will be 0.5; when ⍺ = 3, this prior increases to 0.75. Thus, when ⍺ > 1, the prior of the new cluster is above chance early in the sequence, which may relate to the reviewer’s concern. However, this effect diminishes as the number of observations increases. For instance, the prior of the new cluster drops to 0.1 and 0.25 for the 10th observation when ⍺ = 1 and 3, respectively. Furthermore, the prior in the latent cause model is governed not only by ⍺ but also by g, a scaling parameter for the temporal difference between successive observations (see Results in the manuscript), following the “distance-dependent” CRP, and is then normalized over all latent causes including a new latent cause. Thus, ⍺ greater than 1 does not necessarily force agents to infer a new latent cause. As shown in Appendix 2 – table 4, the number of latent causes does not inflate in each trial when ⍺ = 1. On the other hand, the high number of latent causes due to ⍺ = 2 can be suppressed when g = 0.01. More importantly, the driving force is the prediction error generated in each trial (see also comment #31 about the interaction between ⍺ and σ<sub>x</sub><sup>2</sup>). Raising the value of ⍺ per se can be viewed as increasing the probability of inferring a new latent cause, not as forcing the model to do so.
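
      The arithmetic for the standard CRP prior quoted above can be checked with a short sketch. It covers only the standard form ⍺ / (n - 1 + ⍺); the distance-dependent weighting by g used in the latent cause model is not included, and the function name `new_cluster_prior` is hypothetical.

```python
def new_cluster_prior(alpha, n):
    """Prior probability that the n-th observation opens a new cluster
    in a standard Chinese restaurant process with concentration alpha."""
    if n < 1 or alpha <= 0:
        raise ValueError("n must be >= 1 and alpha must be positive")
    return alpha / (n - 1 + alpha)


# Reproduce the four cases discussed in the text.
for alpha in (1.0, 3.0):
    for n in (2, 10):
        print(f"alpha={alpha}, n={n}: prior={new_cluster_prior(alpha, n):.2f}")
```

      The prior for a new cluster shrinks as n grows for any fixed ⍺, which is the diminishing effect referred to above.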

      During parameter exploration using the median behavioral data under a wider range of ⍺ with a lower boundary at 0.1, the estimated value eventually exceeded 1. Therefore, we set the lower bound of ⍺ to 1 to reduce inefficient sampling.

      [#30] (2) Number of latent causes: Some mice infer nearly as many latent causes as trials, which seems unrealistic.

      We set the upper boundary for the maximum number of latent causes (K) to be 36 to align with the infinite features of CRP. This allowed some mice to infer more than 20 latent causes in total. When we checked the learning curves in these mice, we found that they largely fluctuated or did not show clear decreases during the extinction (Author response image 7, colored lines). The simulated learning curves were almost flat in these trials (Author response image 7, gray lines). It might be difficult to estimate the internal states of such atypical mice if the sampling process tried to fit them by increasing the number of latent causes. Nevertheless, most of the samples have a reasonable total number of latent causes: 12-month-old control mice, Mdn = 5, IQR = 4; 12-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 1.75; 6-month-old control mice, Mdn = 7, IQR = 12.5; 6-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 5.25. These data were provided in Tables S9 and S12.  

      Author response image 7.

      Samples with a high number of latent causes. Observed CR (colored line) and simulated CR (gray line) for individual samples with a total number of inferred latent causes exceeding 20. 

      [#31] (3) Parameter estimation: With 10 parameters fitting one-dimensional curves, many parameters (e.g., α and σx) are likely highly correlated and poorly identified. Consider presenting scatter plots of the parameters (e.g., α vs σx) in the Supplement.

      We have provided the scatter plots with a correlation matrix in Figure 4 – figure supplement 1 for the 12-month-old group and Figure 5 – figure supplement 1 for the 6-month-old group. As pointed out by the reviewer, there are significant rank correlations between parameters including ⍺ and σ<sub>x</sub><sup>2</sup> in both the 6 and 12-month-old groups. However, we also noted that there are no obvious linear relationships between the parameters.

      The correlation above raises a potential problem of non-identifiability among parameters. First, we computed the variance inflation factor (VIF) for all parameters to examine the risk of multicollinearity, though we did not consider a linear regression between parameters and DI in this study. All VIF values were below the conventional threshold of 10 (Appendix 2 – table 5), suggesting that severe multicollinearity is unlikely to bias our conclusions. Second, we conducted simulations with different combinations of ⍺, σ<sub>x</sub><sup>2</sup>, and K to clarify their contribution to the overgeneralization and overdifferentiation observed in the 12-month-old group.
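
      For readers unfamiliar with the VIF, the following is a minimal sketch of one standard way to compute it: the VIF of each variable equals the corresponding diagonal entry of the inverse of the correlation matrix. This is not necessarily the procedure used for Appendix 2 – table 5, and the parameter values below are hypothetical placeholders, not estimates from the study.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF per variable: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical per-mouse estimates of alpha, sigma_x^2 and K.
alpha = [1.0, 1.5, 2.0, 2.5, 3.0, 1.2]
sigma_x2 = [0.1, 0.9, 1.4, 2.6, 2.9, 0.5]
k_max = [30, 20, 12, 6, 4, 25]
for name, vif in zip(("alpha", "sigma_x2", "K"),
                     variance_inflation_factors([alpha, sigma_x2, k_max])):
    print(f"VIF({name}) = {vif:.2f}")
```

      A VIF of 1 means a variable is uncorrelated with the others, and values above the conventional threshold of 10 are usually read as a sign of severe multicollinearity.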

      In Appendix 2 – table 6, the values of ⍺ and σ<sub>x</sub><sup>2</sup> were either their upper or lower boundary set in parameter estimation, while the value of K was selected heuristically to demonstrate its effect. Given the observed positive correlation between ⍺ and σ<sub>x</sub><sup>2</sup>, and their negative correlation with K (Figure 4 – figure supplement 1), we considered the Cartesian product of K = {4, 35}, ⍺ = {1, 3}, and σ<sub>x</sub><sup>2</sup> = {0.01, 3}. Among these combinations, the representative condition for the control group is ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 3, and that for the App<sup>NL-G-F</sup> group is ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 0.01. In the latter condition, overgeneralization and overdifferentiation were strongly induced, as shown by a higher test 1 CR, a lower number of acquisition latent causes (K<sub>acq</sub>), a lower test 3 CR, a lower DI between test 3 and test 1, and a higher number of latent causes after extinction (K<sub>rem</sub>).

      We found that conditions falling outside the empirical correlation, such as ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 0.01, also reproduced overgeneralization and overdifferentiation. Similarly, the combination ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 3 exhibited control-like behavior when K = 4 but shifted toward App<sup>NL-G-F</sup>-like behavior when K = 36. The effect of K was also evident when ⍺ = 3 and σ<sub>x</sub><sup>2</sup> = 3, where K = 36 led to overdifferentiation. We note that these conditions were artificially set and are likely not biologically plausible. These results underscore the non-identifiability concern raised by the reviewer. Therefore, we acknowledge that merely attributing overgeneralization to lower ⍺ or overdifferentiation to lower σ<sub>x</sub><sup>2</sup> may be overly reductive. Instead, these patterns likely arise from the joint effect of ⍺, σ<sub>x</sub><sup>2</sup>, and K. We have revised the manuscript accordingly in Results and Discussion (page 11-13, 18-19).

      [#32] (4) Data normalization: Normalizing the data between 0 and 1 removes the interpretability of % freezing, making mice with large changes in freezing seem similar to mice with small changes.

      As we describe in our reply to comment #26, the conditioned response in the latent cause model was scaled between 0 and 1, and we assume 0 and 1 mean the minimal and maximal CR within each mouse, respectively. Furthermore, although we initially tried to fit simulated CRs to raw CRs, we found that the fitting level was low due to the individual difference in the degree of behavioral expression: some mice exhibited a larger range of CR, while others showed a narrower one. Thus, we decided to normalize the data. We agree that this processing will make the mice with high changes in freezing% indistinguishable from those with low changes. However, the freezing% changes within the mouse were preserved and did not affect the discrimination index.
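
      The trade-off acknowledged above can be illustrated with a minimal sketch of within-mouse min-max scaling, under the assumption that 0 and 1 correspond to each mouse's minimal and maximal CR. The function name `normalize_within_mouse` and the freezing values are hypothetical.

```python
def normalize_within_mouse(freezing):
    """Rescale one mouse's freezing values so its minimum is 0 and maximum is 1."""
    lowest, highest = min(freezing), max(freezing)
    if highest == lowest:
        raise ValueError("freezing values are constant; scaling is undefined")
    return [(value - lowest) / (highest - lowest) for value in freezing]


# Two hypothetical mice: one with a wide range of freezing %, one with a narrow range.
wide_range_mouse = [10.0, 90.0, 50.0]
narrow_range_mouse = [40.0, 60.0, 50.0]
print(normalize_within_mouse(wide_range_mouse))
print(normalize_within_mouse(narrow_range_mouse))
```

      Both hypothetical mice map to the same normalized curve, so differences in absolute range between mice are lost while the relative pattern across trials within each mouse is kept.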

      [#33] (5) Overlooking parameter differences: Differences in parameters, like w<sub>0</sub>, that didn't fit the hypothesis may have been ignored.

      Our initial hypothesis is that internal states were altered in App<sup>NL-G-F</sup> mice, and we did not have a specific hypothesis on which parameter would contribute to such a state. We mainly focus on the parameters (1) that are significantly different between control and App<sup>NL-G-F</sup> mice and (2) that are significantly correlated with the empirical behavioral data, DI between test 3 and test 1.

      In the 12-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, w<sub>0</sub> and K showed marginal p-values in the Mann-Whitney U test (Table S7) and moderate correlations with the DI (Table S8). While differences in K were already discussed in the manuscript, in the previous manuscript we did miss the point that w<sub>0</sub> could contribute to the differences in w between control and App<sup>NL-G-F</sup> mice (Figure 4G). We explain the contribution of w<sub>0</sub> to the reinstatement results here. When other parameters are fixed, higher w<sub>0</sub> would lead to higher CR in test 3, because higher w<sub>0</sub> would allow w<sub>context</sub> to be increased by the unsignaled shock, leading to reinstatement (Appendix 2 – table 7). It is likely that higher w<sub>0</sub> would be sampled through the parameter estimation in the 12-month-old control but not App<sup>NL-G-F</sup> mice. On the other hand, the number of latent causes is not sensitive to w<sub>0</sub> when other parameters are fixed at the initial guess value (Appendix 2 – table 1), suggesting that w<sub>0</sub> makes a small contribution to the memory modification process.

      Thus, we speculate that although the difference in w<sub>0</sub> between control and App<sup>NL-G-F</sup> mice may arise from the sampling process, resulting in a positive correlation with DI between test 3 and test 1, its contribution to diverged internal states would be smaller relative to α or σ<sub>x</sub><sup>2</sup> as a wide range of w<sub>0</sub> has no effect on the number of latent causes (Appendix 2 – table 7). We have added the discussion of differences in w<sub>0</sub> in the 12-month-old group in manuscript Line 357-359.

      In the 6-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, 𝜃 is significantly higher in the AD mice group (Table S10) but not correlated with the DI (Table S11). We have already discussed this point in the manuscript.  

      [#34] (6) Initial response: Higher initial responses in the model at the start of the experiment may reflect poor model fit.

      Please refer to our reply to comment #26 for our explanation of what contributes to high initial responses in the latent cause model.

      In addition, achieving a good fit for the acquisition CRs was not our primary purpose, as the response measured in the acquisition phase includes not only a conditioned response to the CS and context but also an unconditioned response to the novel stimuli (CS and US). This mixed response presumably increased the variance of the measured freezing rate across individuals; therefore, we did not cover the results in the discussion.

      Rather, we favor models at least replicating the establishment of conditioning, extinction and reinstatement of fear memory in order to explain the memory modification process. As we mentioned in the reply for comment #4, alternative models, the latent state model and the Rescorla-Wagner model, failed to replicate the observation (cf. Figure 3 – figure supplement 1A-1C). Thus, we chose to stand on the latent cause model as it aligns better with the purpose of this study. 

      [#35] In addition, please be transparent if data is excluded, either during the fitting procedure or when performing one-way ANCOVA. Avoid discarding data when possible, but if necessary, provide clarity on the nature of excluded data (e.g., how many, why were they excluded, which group, etc?).

      We clarify the information of excluded data as follows. We had 25 mice for the 6-month-old control group, 26 mice for the 6-month-old App<sup>NL-G-F</sup> group, 29 mice for the 12-month-old control group, and 26 mice for the 12-month-old App<sup>NL-G-F</sup> group (Table S1). 

      Our first exclusion procedure was applied to the freezing rate data in the test phase. If a mouse had a freezing rate outside of the 1.5 IQR in any of the test phases, it was regarded as an outlier and removed from the analysis (see Statistical analysis in Materials and Methods). One mouse in the 6-month-old control group, one mouse in the 6-month-old App<sup>NL-G-F</sup> group, five mice in the 12-month-old control group, and two mice in the 12-month-old App<sup>NL-G-F</sup> group were excluded.
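
      A minimal sketch of this kind of exclusion rule is given below. It assumes the criterion corresponds to the common Tukey fences (below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR) and uses one particular quartile convention; the exact definition is the one given in Materials and Methods. The function name `tukey_outliers` and the freezing values are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: quartiles use the 'inclusive' method of statistics.quantiles;
    other quartile conventions can shift the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical freezing rates (%) for one group in one test phase.
freezing = [10, 12, 14, 16, 18, 20, 95]
print(tukey_outliers(freezing))
```

      In the procedure described above, a mouse flagged in any of the test phases would be removed from all analyses, so the check would be repeated for each test phase within each group.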

      Our second exclusion procedure was applied during the fitting and parameter estimation (see parameter estimation in Materials and Methods). We have provided the number of anomaly samples during parameter estimation in Appendix 1 – figure 2.   

      Lastly, we would like to state that all the sample sizes written in the figure legends do not include outliers detected through the exclusion procedure mentioned above.

      [#36] Finally, since several statistical tests were used and the differences are small, I suggest noting that multiple comparisons were not controlled for, so p-values should be interpreted cautiously.

      We have provided power analyses in Tables S21 and S22 with methods described in the manuscript (Line 897-898) and added a note that not all of the multiple comparisons were corrected for in the manuscript (Line 898-899).

      References cited in the response letter only 

      Bellio, T. A., Laguna-Torres, J. Y., Campion, M. S., Chou, J., Yee, S., Blusztajn, J. K., & Mellott, T. J. (2024). Perinatal choline supplementation prevents learning and memory deficits and reduces brain amyloid Aβ42 deposition in App<sup>NL-G-F</sup> Alzheimer’s disease model mice. PLOS ONE, 19(2), e0297289. https://doi.org/10.1371/journal.pone.0297289

      Blei, D. M., & Frazier, P. I. (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research, 12(74), 2461–2488.

      Cochran, A. L., & Cisler, J. M. (2019). A flexible and generalizable model of online latent-state learning. PLOS Computational Biology, 15(9), e1007331. https://doi.org/10.1371/journal.pcbi.1007331

      Curiel Cid, R. E., Crocco, E. A., Duara, R., Vaillancourt, D., Asken, B., Armstrong, M. J., Adjouadi, M., Georgiou, M., Marsiske, M., Wang, W., Rosselli, M., Barker, W. W., Ortega, A., Hincapie, D., Gallardo, L., Alkharboush, F., DeKosky, S., Smith, G., & Loewenstein, D. A. (2024). Different aspects of failing to recover from proactive semantic interference predicts rate of progression from amnestic mild cognitive impairment to dementia. Frontiers in Aging Neuroscience, 16. https://doi.org/10.3389/fnagi.2024.1336008

      Giustino, T. F., Fitzgerald, P. J., Ressler, R. L., & Maren, S. (2019). Locus coeruleus toggles reciprocal prefrontal firing to reinstate fear. Proceedings of the National Academy of Sciences, 116(17), 8570–8575. https://doi.org/10.1073/pnas.1814278116

      Gu, X., Wu, Y.-J., Zhang, Z., Zhu, J.-J., Wu, X.-R., Wang, Q., Yi, X., Lin, Z.-J., Jiao, Z.-H., Xu, M., Jiang, Q., Li, Y., Xu, N.-J., Zhu, M. X., Wang, L.-Y., Jiang, F., Xu, T.-L., & Li, W.-G. (2022). Dynamic tripartite construct of interregional engram circuits underlies forgetting of extinction memory. Molecular Psychiatry, 27(10), 4077–4091. https://doi.org/10.1038/s41380-022-01684-7

      Lacagnina, A. F., Brockway, E. T., Crovetti, C. R., Shue, F., McCarty, M. J., Sattler, K. P., Lim, S. C., Santos, S. L., Denny, C. A., & Drew, M. R. (2019). Distinct hippocampal engrams control extinction and relapse of fear memory. Nature Neuroscience, 22(5), 753–761. https://doi.org/10.1038/s41593-019-0361-z

      Loewenstein, D. A., Curiel, R. E., Greig, M. T., Bauer, R. M., Rosado, M., Bowers, D., Wicklund, M., Crocco, E., Pontecorvo, M., Joshi, A. D., Rodriguez, R., Barker, W. W., Hidalgo, J., & Duara, R. (2016). A Novel Cognitive Stress Test for the Detection of Preclinical Alzheimer’s Disease: Discriminative Properties and Relation to Amyloid Load. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 24(10), 804–813. https://doi.org/10.1016/j.jagp.2016.02.056

      Loewenstein, D. A., Greig, M. T., Curiel, R., Rodriguez, R., Wicklund, M., Barker, W. W., Hidalgo, J., Rosado, M., & Duara, R. (2015). Proactive Semantic Interference Is Associated With Total and Regional Abnormal Amyloid Load in Non-Demented Community-Dwelling Elders: A Preliminary Study. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 23(12), 1276–1279. https://doi.org/10.1016/j.jagp.2015.07.009

      Valles-Salgado, M., Gil-Moreno, M. J., Curiel Cid, R. E., Delgado-Álvarez, A., Ortega-Madueño, I., Delgado-Alonso, C., Palacios-Sarmiento, M., López-Carbonero, J. I., Cárdenas, M. C., Matías-Guiu, J., Díez-Cirarda, M., Loewenstein, D. A., & Matias-Guiu, J. A. (2024). Detection of cerebrospinal fluid biomarkers changes of Alzheimer’s disease using a cognitive stress test in persons with subjective cognitive decline and mild cognitive impairment. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1373541

      Zaki, Y., Mau, W., Cincotta, C., Monasterio, A., Odom, E., Doucette, E., Grella, S. L., Merfeld, E., Shpokayte, M., & Ramirez, S. (2022). Hippocampus and amygdala fear memory engrams reemerge after contextual fear relapse. Neuropsychopharmacology, 47(11), 1992–2001. https://doi.org/10.1038/s41386-022-01407-0

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      The chosen classification scheme for aGPCRs may require reassessment and amendment by the authors in order to prevent confusion with previously issued classification attempts of this family. (…) Can the authors suggest another scheme (mind to avoid the subfamily I-IX or the alternative ADGRA-G,L,V subfamily schemes of metazoan aGPCRs), and adapt their numbering throughout the text and all figures/supplementary figures/supplementary files?

      We appreciate the reviewer's comment and agree that a different nomenclature should be used for choanoflagellate aGPCRs to avoid possible confusion. We have now re-labeled the choanoflagellate aGPCR subfamilies, previously numbered from I to XIX, using alphabetical enumeration (from A to S). Changes have been made throughout the main text, in Figure 5, and in Supplementary Figures S6 and S7.

      line 10: The abbreviation 'GPCR-TKL/Ks' is not explained.

      Thank you for pointing this out. We have now revised the text to explain the abbreviation:

      “Adhesion GPCRs and a class of GPCRs fused to kinases (the GPCR-TKL/Ks) are the most abundant GPCRs in choanoflagellates.”

      line 30: "7TM domain is diagnostic for GPCRs": strange wording. Use an alternative expression.

      We changed the wording to: 

      “A conserved seven transmembrane (7TM) domain is a hallmark of GPCRs, while the wide spectrum of extracellular and intracellular domains in some GPCRs reflects the diversification of the gene family and its functions (Schiöth and Lagerström 2008).”

      line 33: In the case of rhodopsins, not the GPCR (i.e., the apoprotein) responds directly to photons, but the retinal, which isomerises upon illumination.

      We thank the reviewer for bringing this to our attention, and we have now removed mention of photons from the list of cues detected by GPCRs.

      “For example, the extracellular N-terminus and the three extracellular loops of the 7TM domain respond to a wide range of cues, including odorant molecules, peptides, amines, lipids, nucleotides, and other molecules (Yang et al. 2021).”

      line 111: What are "genome-enabled choanoflagellates"? Explain the term. As it stands, it doesn't make sense to me.

      We meant only to highlight that these two species have sequenced genomes. We have deleted the phrase “genome enabled.”

      “To assess the predictive power of our protein-detection pipeline, we then compared the new GPCR and cytosolic signaling component datasets from two choanoflagellates – Salpingoeca rosetta and Monosiga brevicollis – with previously published GPCR and downstream GPCR signaling component counts for these two species (Nordström et al. 2009a; Krishnan et al. 2012; De Mendoza et al. 2014; Krishnan et al. 2015; Lokits et al. 2018).”

      line 145: Please give a reasoning for the naming of each of the new families (e.g., RemiSens, Hidden Gold, GPCR-TKL/K, etc.) or at least the explanations of the acronyms/names early in the manuscript, even if they are discussed later in more detail.

      Thank you for identifying this as an area of confusion. While we feel that going into the rationale behind each of the names here would interrupt the flow of the manuscript, we have added a phrase encouraging readers to “hold that thought” with the hope that they can wait for the sections that specifically focus on each of these new GPCR families.

      “This left twelve new GPCR families that had not, to our knowledge, been previously detected in choanoflagellates: Rhodopsin, TMEM145, GPR180, TMEM87, GPR155, GPR157, and six additional GPCR families that appear to fall outside all previously characterized GPCR families in eukaryotes. For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      lines 297/298 and 2049: Rename tethered agonist "peptide" to "element". Synthetic peptides resembling the TA were used in experiments to test for the sufficiency of the TA for receptor activation, but because the naturally occurring TAs are part of the receptor protein, they are not peptides.

      Thank you for pointing this out. We have revised the text as suggested.

      line 2026: I think the letters in the acronym "CMR" are mixed up and were intended to read "CRM".

      Good catch! We have corrected the text.

      line 2048: "diagnostic" again. Change to "tell-tale", "hallmark", or another similar descriptor.

      We have corrected the text accordingly.

      line 2058: Strike "motif" in order to avoid confusion with the now obsolete term "GPS motif", which entailed the five most C-terminal β-strands of GAIN subdomain B (and thus neither the full GAIN domain nor the GPS).

      Thank you for pointing this out. We have corrected the text.

      Figure 5: Did the authors also find homologs placed in the aGPCR family based on their 7TM domain sequence but lacking a GAIN domain similar to vertebrate ADGRA/GPR123, the only aGPCR known to lack a GAIN domain (10.1016/j.tips.2013.06.002)? Irrespective of the authors' findings or non-finding on that matter, please insert a note on this in the results text.

      We thank the reviewer for bringing this interesting point to our attention. We have now added a new panel A to Fig. S9 to address the reviewer's comment. We also modified the legend of Fig. S9 to take this change into account and uploaded a new supplementary data file 20 to support Fig. S9A. Finally, we revised the main text under the section “Adhesion GPCRs” as requested:

      Lines 328-331: “While the GAIN and aGPCR 7TM domains evolved before the origin of opisthokonts (Araç et al. 2012; Krishnan et al. 2012; De Mendoza et al. 2014), we detected the fusion of these two domains into a single module (GAIN/7TM) in most, but not all, holozoan aGPCRs (Fig. 5D, Fig. S7B and S9A; Supplementary file 20; Prömel et al. 2013; Krishnan et al. 2014).”

      Reviewer #2:

      While the study contributes several interesting observations, it does not radically revise the evolutionary history of the GPCR family. However, in an era increasingly concerned with the reproducibility of scientific findings, this is arguably a strength rather than a weakness. It is encouraging to see that previously established patterns largely hold, and that with expanded sampling and improved methods, new insights can be gained, especially at the level of specific GPCR subfamilies. Then, no functional follow-ups are provided in the model system Salpingoeca rosetta, but I am sure functional work on GPCRs in choanoflagellates is set to reveal very interesting molecular adaptations in the future.

      We agree with the reviewer and anticipate that this work will provide a useful resource to motivate the future functional characterization of GPCRs in choanoflagellates, other CRMs, as well as in metazoans.

      The GPCR-TKL fusion is a particularly interesting finding, especially given the presence of such sequences in sponges. This could potentially represent a synapomorphy shared between sponges and choanoflagellates, later lost in other animals. The authors mention that BLASTP searches using the kinase domain recover the sponge GPCR-TKLs, suggesting the fusion may be ancestral. It would be useful to include phylogenetic trees of both the GPCR and TKL domains to assess this possibility. The authors might also consider examining sponge genomes released by the DTOL project to increase representation from this group.

      We agree and thank the reviewer for this suggestion. We have now added the requested phylogenetic analyses to the new Figure S17, revised the supplementary files and Methods accordingly, and commented on these results in the main text under the section “GPCR-TKL/K and GPCR-TKs”.

      Lines 579-589: “While no metazoan homologs were found when using the 7TM domain of choanoflagellate GPCR-TKs as queries, using the conserved tyrosine kinase domains as queries recovered GPCR-TKs in sponges but not in other metazoan lineages or other holozoans (Fig. S17E). To test whether GPCR-TKs in sponges and choanoflagellates are homologous, we performed phylogenetic analyses of their TK and 7TM domains (Fig. S17F and G). While the TK domains of GPCR-TKs from sponges and choanoflagellates formed a well-supported clade, their 7TM domains did not. These results point to a heterogeneous evolutionary history that may include domain swapping (i.e. ancestral GPCR-TKs in which the 7TM domain was replaced in either the sponge or choanoflagellate lineages) or convergent evolution, in which homologous TK domains fused with unrelated 7TM domains in the sponge and choanoflagellate lineages.”

      Added to the Method section “Sequence alignment and phylogenetic analyses”:

      Lines 913-933: “Phylogenetic analyses of holozoan aGPCRs, Glutamate Receptors, and Gα subunits, as well as the 7TM and Kinase domains from GPCR TK/TKL/Ks, were performed in this study. (…) To construct the phylogenies of the Kinase domain and 7TM domain from the GPCR TK/TKL/Ks, we first built a dataset including all the GPCR TK/TKL/K sequences identified in choanoflagellates and in sponges, as well as the GPCR TKL/Ks previously published in oomycetes and amoebozoans (Van Den Hoogen et al. 2018). We extracted the 7TM domain and Kinase domain from each sequence by combining the transmembrane domain prediction tool TMHMM-2.0 and the protein domain prediction tool InterProScan with the alignment tool MAFFT (E-INS-i algorithm) on Geneious Prime v2024.07 (Supplementary Files 30 and 32). We then aligned the aGPCR, Glutamate Receptor, and GPCR TK/TKL/K 7TMs, the GPCR TK/TKL/K Kinase domains, or the full-length Gα sequences using MAFFT with the E-INS-i algorithm. The resulting alignments were then used for Maximum-likelihood and/or Bayesian inference of phylogenies (Fig. 3B, Fig. 5A, Fig. S3D, Fig. S6A, and Fig. S17F and G; Supplementary Files 5, 9, 16, 18, 31, and 33).”
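
      The domain extraction and alignment steps quoted above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline (which was run in Geneious Prime): the helper names, the example sequence, and the domain coordinates are hypothetical, and the only external detail assumed is that MAFFT's E-INS-i strategy corresponds to the `--genafpair --maxiterate 1000` options.

```python
from typing import List, Tuple


def extract_domain(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive).

    Domain predictors typically report coordinates this way; the
    coordinates used with this helper are hypothetical placeholders.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("domain coordinates fall outside the sequence")
    return sequence[start - 1:end]


def to_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def mafft_einsi_command(fasta_path: str) -> List[str]:
    """Build (but do not run) a MAFFT command using the E-INS-i strategy."""
    return ["mafft", "--genafpair", "--maxiterate", "1000", fasta_path]


if __name__ == "__main__":
    # Hypothetical toy sequence and coordinates, for illustration only.
    toy = "MKTAYIAKQR"
    domain = extract_domain(toy, 3, 6)
    print(to_fasta([("toy_domain", domain)]), end="")
    print(" ".join(mafft_einsi_command("domains.fasta")))
```

      In practice the command returned by `mafft_einsi_command` would be passed to `subprocess.run` with standard output redirected to an alignment file, which assumes a local MAFFT installation.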

      Rhodopsin-like receptors are proposed in the discussion to be potential cases of lateral gene transfer (LGT) between eukaryotes. To support or refute this hypothesis, it would be valuable to place the choanoflagellate and ichthyosporean Rhodopsins within a broader phylogeny of this family, including (a few) representatives from animals and other eukaryotes. Even if deep branching relationships remain unresolved, signs such as unusually short branches could point toward recent LGT events.

      Thank you for your suggestion. While we originally considered testing these alternative hypotheses in this manuscript by building a phylogeny, the rapid sequence evolution of the Rhodopsin family has stymied similar efforts in the past and instead motivated others to use clustering approaches like those used in our study (Hu et al. 2017; Thiel et al. 2023). Unfortunately, these types of analyses cannot be used to readily identify instances of LGT.

      Therefore, following the suggestion of the reviewer, we bit the bullet and performed phylogenetic analyses on the sequences in question. Unfortunately, these analyses were completely inconclusive, and we feel they do not warrant inclusion in the manuscript. The topologies of the sequence trees recovered were poorly supported and sensitive to most of the variables we tested – the set of rhodopsin sequences included, the multiple alignment algorithms used, and the probabilistic methods employed to infer the phylogenies. 

      Instead, we have revised the manuscript to highlight the challenge of differentiating between the different hypotheses that are consistent with the phylogenetic distribution of Rhodopsins:

      Lines 670-678: “Thus, while it is formally possible that Rhodopsins existed in stem choanoflagellates and were lost in most modern choanoflagellate lineages, either horizontal gene transfer or convergent evolution in the shared ancestor of S. macrocollata and S. punica are similarly plausible explanations for their presence in these species. Differentiating between these alternative evolutionary scenarios is challenging because of the rapid rate of sequence evolution within the family and the resultant loss of phylogenetic signal. Our own preliminary investigations of Rhodopsin evolution in non-metazoans were inconclusive. Therefore, ambiguities about the provenance and function of CRM Rhodopsins currently obscure the ancestry of metazoan Rhodopsins and opsins.”

      While the study surveys most available holozoan genomes, it appears that the genomes of Amoebidium spp.-which are cited in the manuscript- were not included. It may not be necessary to repeat all analyses with these two species (A. appalachense and A. parasiticum), but a preliminary search indicates the presence of four candidate 7tm_1 (Rhodopsin-like) proteins in their proteomes. These may warrant closer inspection (e.g., via BLASTP against animal databases) to confirm whether they are genuine GPCRs or false positives.

      Author response image 1.

      We thank the reviewer for bringing these sequences to our attention. To be clear, we did not analyze the Amoebidium spp. genome and we can find no reference to it in our manuscript. If the reviewer had the impression that the genome was analyzed, we would be grateful to know the source of the confusion so that it can be corrected. (We did not intentionally exclude the genome; it simply was not available on the Multicell Genome database from which we retrieved the ichthyosporean genomes and transcriptomes used in this study.)

      Nevertheless, out of curiosity, we have now analyzed the sequences provided by the reviewer and summarize our findings here for the interest of the reviewer. Although the sequences were annotated as 7tm_1 (Rhodopsin-like) proteins in the original genome study, none of these sequences group with metazoan or choanoflagellate Rhodopsins in our clustering analysis; instead, we found that these putative GPCRs form a distinct cluster that only weakly resembles cAMP receptors, both on the basis of their sequence and predicted structures. 

      It is not surprising to find new GPCR clusters as new taxa are folded into the study, and these Amoebidium sequences do not add to our understanding of Rhodopsin evolution. Therefore, we have not added their analysis to the manuscript, but we hope the reviewer finds our quick analysis of interest.

      Author response image 2.

      In Figure 2, perhaps expanding the other holozoan clades would have been nice, as there are not too many species, but I understand if that's beyond the point of the manuscript, focused on choanoflagellates.

      Thank you for this comment. However, given the focus of this study, we feel that an expansion of the other holozoan clades would reduce the clarity of the figure.

      line 87 - "To this end, the 671 validated choanoflagellate GPCRs were sorted by sequence similarity, resulting in 18 clusters." Some details in the results section would be nice, or at least clear references to where this is explained in more detail. How were the extra choanoflagellate GPCRs added if they failed to be identified with quite sensitive HMM profiles?

      We apologize for the possible confusion and thank the reviewer for the suggestion; we have now added specific references to the related sections from the material and methods for interested readers.

      We believe that the "extra choanoflagellate GPCRs" mentioned by the reviewer are the choanoflagellate GPCRs that were not detected when the choanoflagellate genomes and transcriptomes were searched with the predominantly metazoan-derived GPCRHMM and the HMMs from the GPCR_A Pfam clan (CL0192). We recovered these extra choanoflagellate GPCRs by using custom choanoflagellate-specific GPCR HMMs and by using the previously identified choanoflagellate GPCRs as BLAST queries against the 23 choanoflagellate proteomes. We hope that referencing the Methods section "Recovering additional choanoflagellate GPCRs using choanoflagellate GPCR BLAST queries and custom choanoflagellate GPCR HMMs" in lines 91 and 93 will help clarify this point.

      line 108 - Well, from the figure it seems that most eukaryotes have an 'animal-like' G protein signalling, so that's perhaps more of an eukaryotic signature than something that puts choanoflagellates and animals together.

      Excellent point! We have revised the text.

      line 132 - It is unclear what the criteria are to include these taxa as helpers for choanoflagellate classification, and not adding the other unicellular holozoans. Just some text justification could help.

      Thank you for pointing this out. We have added an explanation of the rationale to the methods — section “Clustering of the 918 validated choanoflagellate GPCRs” — and referred to it in the main text.

      New text added to methods:

      “The non-choanoflagellate sequences added to the dataset were either top blast hits recovered after searching the entire Eukprot v3 dataset (993 species) with choanoflagellate GPCRs as queries, or previously published and well-documented GPCR sequences from metazoans.”

      line 145 - These families are listed, but perhaps it would be nice to explicitly mention that they will be covered in more detail later on in the manuscript. I found myself wondering about those exotic names, until I reached the sections in the manuscript where they are explained.

      Thank you for this suggestion. We have now modified our sentence to refer to the related sections.

      “For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      line 199 - perhaps would be nice to explain domain architecture of validated Dictyostelium GABA-like receptors (ANF domain?).

      Thank you for your suggestion. We have now modified the sentence to mention the protein domain composition of the validated GABA-like receptor, GrlE, in Dictyostelium.

      “The Glutamate Receptors from the amoebozoan Dictyostelium discoideum, of which at least one, GrlE, binds both GABA and Glutamate presumably through its conserved ANF domain (Anjard and Loomis 2006; Taniura et al. 2006; Wu and Janetopoulos 2013), grouped separately from metazoan and CRM GPCRs in our analysis.”

      Figure S4 - Perhaps a stacked bar chart would be easier to browse than a bunch of pie charts, notoriously difficult to quantify.

      Thank you for this comment. Opinions differ on whether pie charts or bar charts are more effective in this context (including among the authors of this manuscript). However, we think the point of Figure S4 is a minor one, likely to be appreciated by only a small number of readers, and we have therefore left the data presentation as it was in the original submission.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Li et al. investigate Ca2+ signaling in T. gondii and argue that Ca2+ tunnels through the ER to other organelles to fuel multiple aspects of T. gondii biology. They focus in particular on TgSERCA as the presumed primary mechanism for ER Ca2+ filling. However, when TgSERCA was knocked out, there was still Ca2+ release in response to TG.

      Note that we did not generate a complete SERCA knockout, as this gene is essential, and its complete loss would not permit the isolation of viable parasites. Instead, we created conditional mutants that downregulate the expression of SERCA. Importantly, some residual activity is present in the mutant after 24 h of ATc treatment, as shown in Fig 4C. This is consistent with our Western blots, which demonstrate the presence of residual SERCA protein at 1, 1.5 and 2 days post ATc treatment (Fig. 3B). We have clarified this point in the revised manuscript (lines 232-233). See also lines 97-102.

      Overall, the Ca2+ signaling data do not support the conclusion of Ca2+ tunneling through the ER to other organelles; in fact, they argue for direct Ca2+ uptake from the cytosol. The authors show EM membrane contact sites between the ER and other organelles, so Ca2+ released by the ER could presumably be taken up by other organelles, but that is not ER Ca2+ tunneling. They clearly show that SERCA is required for T. gondii function.

      Overall, the data presented do not fully support the conclusions reached.

      We agree that the data does not support Ca²⁺ tunneling as defined and characterized in mammalian cells. In response to this comment, we have modified the title and the text accordingly.

      However, we respectfully would like to emphasize that the study demonstrates more than just the role of SERCA in T. gondii “function”. Our findings reveal that the ER, through SERCA activity, sequesters calcium following influx through the PM (see reviewer 2 comment). The ER calcium pool is important for replenishing other intracellular compartments.

      The experiments support a model in which the ER actively takes up cytosolic Ca²⁺ as it enters the parasite and contributes to intracellular Ca²⁺ redistribution during transitions between distinct extracellular calcium environments. We believe that the role of the ER in modulating intracellular calcium dynamics is demonstrated in Figures 1H–K, 4G-H, and 5H–K. To highlight the relevance of these findings, we have included an expanded discussion in the revised manuscript. See lines 443-449 and 510-522.

      Data argue for direct Ca2+ uptake from the cytosol

      The ER most likely takes up calcium from the cytosol following its entry through the PM and redistributes it to the other organelles. We deleted any mention of the word “tunneling” and replaced it with transfer and re-distribution as they reflect our experimental findings more accurately.

      We interpret the experiments shown in Figure 1 H and I as re-distribution because the amount of calcium released after nigericin or GPN is greatly enhanced after TG addition. We first add calcium to allow intracellular stores to become filled, followed by the addition of TG, which allows calcium leakage from the ER. This leaked calcium can either enter the cytosol and be pumped out or be taken up by other organelles. Our interpretation is that this process leads to an increased calcium content in acidic compartments.

      We conducted an additional experiment in which SERCA was inhibited prior to calcium addition, allowing cytosolic calcium to be exported or taken up by acidic stores. We observed a change in the GPN response (Fig. S2A), possibly indicating that the PLVAC can sequester calcium when SERCA is inactive. While this may support the reviewer’s view, TG treatment does not reflect physiological conditions and may enhance calcium transfer to other compartments. Although the result is interesting, interpretation is complicated by the use of parasites in suspension and drug exposure in solution. Single-parasite measurements are not feasible due to weak signals, and adhered parasites are even less physiological than those in suspension.

      In support of our view, the experiments shown in Figs 4G and H show that downregulating SERCA significantly reduces the response to GPN, indicating diminished acidic store loading. In Fig 5I we observe that mitochondrial calcium uptake is reduced in the iDSERCA (+ATc) mutant in response to GPN. Fig 2B demonstrates that TgSERCA can take up calcium at 55 nM, close to resting cytosolic calcium, while in Figures 5E and S5B we show that the mitochondrion is not responsive to an increase in cytosolic calcium. Uptake by the mitochondria requires much higher concentrations (Fig 5B-C), which may be achieved within microdomains at MCS between the ER and mitochondrion. This is also consistent with findings reported by Li et al. (Nat Commun. 2021), where similar microdomain-mediated transfer of calcium to the apicoplast was observed (Fig. 7 E and F of the mentioned reference).

      Reviewer 2 (Public review):

      The role of the endoplasmic reticulum (ER) calcium pump TgSERCA in sequestering and redistributing calcium to other intracellular organelles following influx at the plasma membrane.

      T. gondii transitions through life cycle stages within and exterior to the host cells, with very different exposures to calcium, adds significance to the current investigation of the role of the ER in redistributing calcium following exposure to physiological levels of extracellular calcium

      They also use a conditional knockout of TgSERCA to investigate its role in ER calcium store-filling and the ability of other subcellular organelles to sequester and release calcium. These knockout experiments provide important evidence that ER calcium uptake plays a significant role in maintaining the filling state of other intracellular compartments.

      We thank the reviewer.

      While it is clearly demonstrated, and not surprising, that the addition of 1.8 mM extracellular CaCl2 to intact T. gondii parasites preincubated with EGTA leads to an increase in cytosolic calcium and subsequent enhanced loading of the ER and other intracellular compartments, there is a caveat to the quantitation of these increases in calcium loading. The authors rely on the amplitude of cytosolic free calcium increases in response to thapsigargin, GPN, nigericin, and CCCP, all measured with fura2. This likely overestimates the changes in calcium pool sizes because the buffering of free calcium in the cytosol is nonlinear, and fura2 (with a Kd of 100-200 nM) is a substantial, if not predominant, cytosolic calcium buffer. Indeed, the increases in signal noise at higher cytosolic calcium levels (e.g. peak calcium in Figure 1C) are indicative of fura2 ratio calculations approaching saturation of the indicator dye.

      We acknowledge the limitations associated with using Fura-2 for cytosolic calcium measurements. However, according to the literature (Grynkiewicz, G. et al. (1985). J. Biol. Chem. 260 (6): 3440–3450. PMID 3838314), Fura-2 is suited for measurements between 100 nM and 1 µM calcium. The responses in our experiments were within that range, and the experiments with the SERCA mutant and mitochondrial GCaMPfs support the conclusions of our work.

      However, we agree with the reviewer that the experiment shown in Fig 1C (now Fig 1D) presents a response that approaches the limit of the linear range of Fura-2. In response to this, we have replaced this panel with a more representative experiment that remains within the linear range of the indicator (revised Fig 1D). Additionally, we have included new experiments adding GPN along with corresponding quantifications, which further support our conclusions regarding calcium dynamics in the parasite.
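
      The saturation concern can be illustrated with the standard ratiometric calibration from Grynkiewicz et al. (1985), [Ca²⁺] = Kd · β · (R − Rmin)/(Rmax − R). The Python sketch below is only illustrative: the function names and all calibration values are hypothetical and are not taken from this study. It shows that the slope of the conversion grows as 1/(Rmax − R)², so the same amount of ratio noise translates into much larger calcium noise as R approaches Rmax.

```python
def fura2_calcium(ratio: float, r_min: float, r_max: float,
                  kd_nm: float, beta: float) -> float:
    """Convert a 340/380 ratio to free calcium (nM) with the Grynkiewicz equation."""
    if not (r_min < ratio < r_max):
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nm * beta * (ratio - r_min) / (r_max - ratio)


def fura2_sensitivity(ratio: float, r_min: float, r_max: float,
                      kd_nm: float, beta: float) -> float:
    """Derivative d[Ca]/dR (nM per ratio unit) of the equation above.

    It equals Kd * beta * (Rmax - Rmin) / (Rmax - R)**2, so it diverges
    as the measured ratio approaches Rmax (indicator saturation).
    """
    if not (r_min < ratio < r_max):
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nm * beta * (r_max - r_min) / (r_max - ratio) ** 2


if __name__ == "__main__":
    # Hypothetical calibration constants, chosen only for easy arithmetic.
    params = dict(r_min=1.0, r_max=11.0, kd_nm=100.0, beta=2.0)
    for r in (2.0, 6.0, 10.0):
        ca = fura2_calcium(r, **params)
        slope = fura2_sensitivity(r, **params)
        print(f"R={r:4.1f}  [Ca]={ca:8.1f} nM  d[Ca]/dR={slope:8.1f} nM per unit")
```

      With these hypothetical constants, the slope at R = 10 is 25 times the slope at R = 6, which is why traces recorded near the top of the indicator's range look noisier even when the underlying ratio noise is unchanged.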

      Another caveat, not addressed, is that loading of fura2/AM can result in compartmentalized fura2, which might modify free calcium levels and calcium storage capacity in intracellular organelles.

      We are aware of the potential issue of Fura-2 compartmentalization, and our protocol was designed to minimize this effect. We load cells with Fura-2 for 26 min at room temperature, then maintain them on ice, and restrict the use of loaded parasites to 2-3 hours. We have observed evidence of compartmentalization as this is reflected in increasing concentrations of resting calcium with time. We carry out experiments within a time frame in which the resting calcium stays within the 100 nM range. We have included a sentence in the Materials and Methods section. Lines 604-606.

      Additionally, following this reviewer’s suggestion, we performed further experiments to directly assess compartmentalization. See below the full response to reviewer 2.

      The finding that the SERCA inhibitor cyclopiazonic acid (CPA) only mobilizes a fraction of the thapsigargin-sensitive calcium stores in T. gondii coincides with previously published work in another apicomplexan parasite, P. falciparum, showing that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools (Borges-Pereira et al., 2020, DOI: 10.1074/jbc.RA120.014906). It would be valuable to determine whether this reflects the off-target effects of thapsigargin or the differential sensitivity of TgSERCA to the two inhibitors.

      This is an interesting observation, and we now include a discussion of this result considering the Plasmodium study and include the citation. Lines 436-442.

      Figure S1 suggests differential sensitivity, and it shows that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools in T. gondii. Also important is that we used 1 µM TG as we are aware that TG has shown off-target effects at higher concentrations. TG is a well-characterized, irreversible SERCA inhibitor that ensures complete and sustained inhibition of SERCA activity. In contrast, CPA is a reversible inhibitor whose effectiveness is influenced by ATP levels, and it may only partially inhibit SERCA or dissociate over time, allowing residual Ca²⁺ reuptake into the ER.

      Additionally, as suggested by the reviewer, we performed experiments using the Mag-Fluo-4 protocol to compare the inhibitory effects of CPA and TG. These results are presented in Fig. S3 (Lines 217-223). Under the conditions of the Mag-Fluo-4 assay with digitonin-permeabilized cells, both TG and CPA showed similar rates of Ca²⁺ leakage following the addition of the inhibitor. This may indicate that under the conditions of the Mag-Fluo-4 experiments the rate of Ca²⁺ leak is mostly determined by the intrinsic leak mechanism and not by the nature of the inhibitor. By contrast, in intact Fura-2–loaded cells, CPA induces a smaller cytosolic Ca²⁺ increase than TG, consistent with less efficient SERCA inhibition likely due to its reversibility and possibly incomplete inhibition under cellular conditions.

      The authors interpret the residual calcium mobilization response to Zaprinast observed after ATc knockdown of TgSERCA (Figures 4E, 4F) as indicative of a target calcium pool in addition to the ER. While this may well be correct, it appears from the description of this experiment that it was carried out using the same conditions as Figure 4A where TgSERCA activity was only reduced by about 50%.

      We partially agree with the reviewer that 50% knockdown of TgSERCA means that the ER may still be targeted by zaprinast, and that there is no definitive evidence of the involvement of another calcium pool. The Mag-Fluo-4 experiment, while we acknowledge that the fluorescence of Mag-Fluo-4 is not linear with respect to calcium, indicates that SERCA activity is present even after 24 hr of ATc treatment. However, when Zaprinast is added after TG, we observed a significant calcium release in wild type cells. This result suggests the presence of a large calcium pool other than the one mobilized by TG (PMID: 2693306).

      We recently published work describing the Golgi as a calcium store in Toxoplasma (PMID: 40043955) and we showed in Fig. S4 D-G of that work, that GPN treatment of tachyzoites loaded with Fura-2 diminished the Zaprinast response indicating that they could be impacting a similar store. In the present study we performed additional experiments in which TG was followed by GPN and Zaprinast showing a similar pattern. GPN significantly diminished the Zaprinast response. These results are shown now in Figure S2B. We address these possibilities in the discussion and interpretation of the result. Lines 451-460.

      The data in Figures 4A vs 4G and Figures 4B vs 4H indicate that the size of the response to GPN is similar to that with thapsigargin in both the presence and absence of extracellular calcium. This raises the question of whether GPN is only releasing calcium from acidic compartments or whether it acts on the ER calcium stores, as previously suggested by Atakpa et al. 2019 DOI: 10.1242/jcs.223883. Nonetheless, Figure 1H shows that there is a robust calcium response to GPN after the addition of thapsigargin.

      The results of the indicated experiments do not exclude the possibility that GPN can also mobilize some calcium from the ER in addition to acidic organelles. However, we do not have any evidence that GPN can mobilize calcium from the ER either. Based on our unpublished work, we think GPN mainly releases calcium from the PLVAC. We included the mentioned citation and discuss the result considering the possibility that GPN may be acting on more than one store. Lines 451-460.

      An important advance in the current work is the use of state-of-the-art approaches with targeted genetically encoded calcium indicators (GECIs) to monitor calcium in important subcellular compartments. The authors have previously done this with the apicoplast, but now add the mitochondria to their repertoire. Despite the absence of a canonical mitochondrial calcium uniporter (MCU) in the Toxoplasma genome, the authors demonstrate the ability of T. gondii mitochondria to accumulate calcium, albeit at high calcium concentrations. Although the calcium concentrations here are higher than needed for mammalian mitochondrial calcium uptake, there too calcium uptake requires calcium levels higher than those typically attained in the bulk cytosolic compartment. And just like in mammalian mitochondria, the current work shows that ER calcium release can elicit mitochondrial calcium loading even when other sources of elevated cytosolic calcium are ineffective, suggesting a role for ER-mitochondrial membrane contact sites. With these new tools in hand, it will be of great value to elucidate the bioenergetics and transport pathways associated with mitochondrial calcium accumulation in T. gondii.

      We thank the reviewer for praising our work. Studies of the bioenergetics and transport pathways associated with mitochondrial calcium accumulation are part of our future plans, as mentioned in lines 520-522 and 545.

      The current studies of calcium pools and their interactions with the ER and dependence on SERCA activity in T. gondii are complemented by super-resolution microscopy and electron microscopy that do indeed demonstrate the presence of close appositions between the ER and other organelles (see also videos). Thus, the work presented provides good evidence for the ER acting as the orchestrating organelle delivering calcium to other subcellular compartments through contact sites in T. gondii, as has become increasingly clear from work in other organisms.

      Thank you

      Reviewer #3 (Public review):

      This manuscript describes an investigation of how intracellular calcium stores are regulated and provides evidence that is in line with the role of the SERCA-Ca2+ATPase in this important homeostasis pathway. Calcium uptake by mitochondria is further investigated and the authors suggest that ER-mitochondria membrane contact sites may be involved in mediating this, as demonstrated in other organisms.

      The significance of the findings is in shedding light on key elements within the mechanism of calcium storage and regulation/homeostasis in the medically important parasite Toxoplasma gondii whose ability to infect and cause disease critically relies on calcium signalling. An important strength is that despite its importance, calcium homeostasis in Toxoplasma is understudied and not well understood.

      We agree with the reviewer. Thank you

      A difficulty in the field, and a weakness of the work, is that following calcium in the cell is technically challenging and thus requires reliance on artificial conditions. In this context, the main weakness of the manuscript is the extrapolation of data. The language used could be more careful, especially considering that the way to measure the ER calcium is highly artificial - for example utilising permeabilization and over-loading the experiment with calcium. Measures are also indirect - for example, when the response to ionomycin treatment was not fully in line with the suggested model the authors hypothesise that the result is likely affected by other storage, but there is no direct support for that.

      The Mag-Fluo-4-based protocol for measuring intraluminal calcium is well established and has been extensively used in mammalian cells, DT40 cells and other cells for measuring intraluminal calcium, activity of SERCA and response to IP3 (Some examples: PMID: 32179239, PMID: 15963563, PMID: 19668195, PMID: 30185837, PMID: 19920131).

      Furthermore, we have successfully employed this protocol in previous work, including the characterization of the Trypanosoma brucei IP3R (PMID: 23319604) and the assessment of SERCA activity in Toxoplasma (PMID: 40043955 and 34608145). The citation PMID: 32179239 provides a detailed description of the protocol, including references to its prior use. In addition, the schematic at the top of Figure 2 summarizes the experimental workflow, reinforcing that the protocol follows established methodologies. We included more references and an expanded discussion, lines 425-435.

      We respectfully disagree with the concern regarding potential calcium overloading. The cells used in our assays were permeabilized, a critical step that allows calcium concentrations to be precisely controlled. All experiments were conducted at 220 nM free calcium, a concentration within the physiological range of cytosolic calcium fluctuations. This concentration was used consistently across all the studies described above. Importantly, permeabilization ensures that the dye present in the cytosol becomes diluted, allows MgATP (which cannot cross intact membranes) to access the ER membrane, and exposes the ER to precise calcium concentrations.

      The Mag-Fluo-4 loading conditions are designed to allow compartmentalization of the indicator into all intracellular compartments. The calcium uptake stimulated by MgATP occurs exclusively in the compartment occupied by SERCA, as only SERCA carries out MgATP-dependent transport in this experimental setup.

      Regarding the use of IO, we would like to clarify that its broad-spectrum activity is well documented. As a calcium ionophore, IO facilitates calcium release across multiple membranes, not just the ER, leading to a more substantial calcium release than the more selective effect of TG. The results observed with IO were consistent with this expected broader activity and support our interpretation.

      Lastly, we emphasize that the experiment in Figure 2 was designed specifically to assess SERCA activity in situ under defined conditions. It was not intended to provide a comprehensive characterization of the role of TgSERCA in the parasite. We now clarify this distinction in the revised Discussion lines 425-435.

      Below we provide some suggestions to improve controls, however, even with those included, we would still be in favour of revising the language and trying to avoid making strong and definitive conclusions. For example, in the discussion perhaps replace "showed" with "provide evidence that are consistent with..."; replace or remove words like "efficiently" and "impressive"; revise the definitive language used in the last few lines of the abstract (lines 13-17); etc. Importantly we recommend reconsidering whether the data is sufficiently direct and unambiguous to justify the model proposed in Figure 7 (we are in favour of removing this figure at this early point of our understanding of the calcium dynamic between organelles in Toxoplasma).

      We thank the reviewer for the suggestions, and we modified the language accordingly. We limited the use of the word "showed" to references to previously published work and deleted the other words.

      Figure 7 is intended as a conceptual model to summarize our proposed pathways, and, like all models, it represents a working hypothesis that may not fully capture the complexity of calcium dynamics in the parasite. In light of the reviewer’s comments, we revised the figure and legend to clearly distinguish pathways for which there is experimental evidence from those that are hypothetical.

      Another important weakness is poor referencing of previous work in the field. Lines 248-250 read almost as if the authors originally hypothesised the idea that calcium is shuttled between ER and mitochondria via membrane contact sites (MCS) - but there is extensive literature on other eukaryotes which should be first cited and discussed in this context. Likewise, the discussion of MCS in Toxoplasma does not include the body of work already published on this parasite by several groups. It is informative to discuss observations in light of what is already known.

      The sentence in which we state the hypothesis about the calcium transfer refers specifically to Toxoplasma. To clarify this, we have now added the phrase “In mammalian cells” (Line 311) and included additional citations, as suggested by the reviewer. While only a few studies have described membrane contact sites (MCSs) in Toxoplasma, we do cite several pertinent articles (e.g., lines 479-486). We believe that we cited all articles mentioning MCS in T. gondii.

      However, we must clarify to the reviewer that the primary focus of our study is not to characterize or confirm the presence of MCSs in T. gondii, but rather to demonstrate functional calcium transfer between the ER and mitochondria. Our data support the conclusion that this transfer requires close apposition of these organelles, consistent with the presence of MCSs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 45: change influx to release as Ca2+ influx usually referred to Ca2+ entry from the extracellular space. Same for line 71.

      Corrected, lines 47 and 73.

      (2) Line 54: consider toning down the strong statement of 'widely' accepted as ER Ca2+ subdomain heterogeneity remains somewhat debated.

      Changed the sentence to “it has been proposed”, Line 56

      (3) Line 119-21: A lower release in response to TG is typical and does not reflect TG specific for SERCA. It is due to the slow kinetics of Ca2+ leak out of the ER allowing other buffering and transport mechanisms to act. Also, could be a reflection of the duration after TG treatment to allow complete store depletion. Figure S1A-B shows that there is still Ca2+ in the stores following TG but the TG signal does not go back to baseline arguing that the leak is still active. Hence the current data does not address the specificity of TG for TgSERCA. Please revise the statement accordingly.

      Thank you for the suggestion. We changed the sentence to: “This result could reflect the slow kinetics of Ca²⁺ leak from the ER, allowing other buffering and transport mechanisms to mitigate the phenomenon. Alternatively, it may indicate the duration after TG treatment allowing time to complete store depletion. As shown in Figure S1A-B, residual Ca²⁺ remains in the stores after TG treatment, and the TG-induced phenomenon does not return to baseline, suggesting that the leak remains active”. Lines 124-128.

      (4) Figure 1C: the authors interpret the data 'This Ca2+ influx appeared to be immediately taken up by the ER as the response to TG was much greater in parasites previously exposed to extracellular Ca2+'. I don't understand this interpretation; in Ca2+-containing solution one would expect a larger signal, as TG is likely to activate store-operated Ca2+ entry, which would contribute to a larger cytosolic Ca2+ transient. Does T. gondii have SOCE? It cannot be uptake into the ER, as SERCA is blocked. Unless the authors are arguing for another ER Ca2+ uptake pathway? But Ca2+ uptake into the ER would lower the signal, whereas the data show an increased signal.

      We pre-incubated the suspension with calcium to allow filling of the stores while SERCA was still active, and added thapsigargin (TG) at 400 seconds to measure calcium release. The experiment was designed to introduce the concept that the ER may have access to extracellular calcium, a phenomenon not yet clearly demonstrated in Toxoplasma. We did not expect less release by TG; rather, if the ER were not efficient in filling after extracellular calcium entry, a similar response to TG would be expected. Yes, it is very possible that when we add TG we are also seeing more calcium entry through the PM, as we previously proposed that increased cytosolic Ca²⁺ may regulate Ca²⁺ entry. However, the evidence does not support that this increased entry is triggered by store depletion. The experiments with the SERCA mutant (Fig. 4D) show that in the conditional knockout mutant the ER is partially depleted, yet this does not lead to enhanced calcium entry, suggesting that depletion alone is not sufficient to trigger increased influx.

      There is no experimental evidence supporting the regulation of calcium entry by store depletion in Toxoplasma (PMID: 24867952). We revised the text to clarify this point and expanded the discussion on store-operated calcium entry (SOCE). While it is possible that a channel similar to Orai exists in Toxoplasma, it is highly unlikely to be regulated by store depletion, as there is no gene homologous to STIM. If store-regulated calcium entry does occur in Toxoplasma, it is likely mediated through a different, still unidentified, mechanism. Lines 461-467.

      (5) The choice of adding Ca2+ first followed by TG is curious as it is more difficult to interpret. Would be more informative to add TG, allow the leak to complete, and then add Ca2+ which would allow temporal separation between Ca2+ release from stores and Ca2+ influx from the extracellular space. Was this experiment done? If not would be useful to have the data.

      Yes, this experiment was already published: PMID: 24867952 and PMID: 38382669.

      It mainly highlighted that increased cytosolic calcium may regulate calcium entry most likely through a TRP channel. See our response to point 4 and the description of the new Fig. S2 in the response to point 7.

      (6) Line 136-39: these experiments as designed - partly because of the issues discussed above - do not address the ability of organelles to access extracellular Ca2+ or the state of refilling of intracellular Ca2+ stores. They can simply be interpreted as the different agents (TG, Nig, GPN, CCCP) inducing various levels of Ca2+ influx.

      Concerning TG, the experiment shown in Fig. 4D shows that depletion of the ER calcium does not result in stimulation of calcium entry, indicating the absence of classical SOCE activation in Toxoplasma.

      To our knowledge, neither mitochondria nor lysosomes (or other acidic compartments) are capable of triggering classical SOCE in mammalian cells.

      Given that the ER in Toxoplasma lacks the canonical components required to initiate SOCE, it is unclear why the mitochondria or acidic compartments would be able to do so. While it is possible that T. gondii utilizes an alternative mechanism for store-operated calcium entry, investigating such a pathway would require a comprehensive study. In mammalian systems, it took almost 15 years and the efforts of multiple research groups to identify the molecular components of SOCE. Expecting this complex question to be resolved within the scope of a single study is unrealistic.

      Our current data show that the mitochondrion is unable to access calcium from the cytosol, as shown in Figure 5E. Performing a similar experiment for the PLVAC would be ideal; however, expression of fluorescent calcium indicators in this organelle has not been successful. This is likely due to the presence of several proteases that degrade expressed proteins, as well as the acidic environment, which quenches fluorescence. These challenges have made studying calcium dynamics in the PLVAC particularly difficult.

      To address the reviewer’s comment, we performed an additional experiment presented in Fig. S2A. In this experiment, we first inhibited SERCA with thapsigargin (TG), preventing calcium uptake into the ER, and subsequently added calcium to the suspension. Under these conditions, calcium cannot be sequestered by the ER. We then applied GPN and quantified the response, comparing it to a similar experimental condition without TG. Indeed, under these conditions, we observed a significant but modest increase in the GPN-induced response, suggesting that the PLVAC may be capable of directly taking up calcium from the cytosol. However, this occurs under conditions of SERCA inhibition which creates nonphysiological conditions with elevated cytosolic calcium levels and the presence of TG may promote additional ER leakage, both of which could artificially enhance PLVAC uptake. Under physiological conditions, with functional SERCA activity, the ER would likely sequester cytosolic calcium more efficiently, thereby limiting calcium availability for PLVAC direct uptake. Thus, while the result is intriguing, it may not reflect calcium handling under normal cellular conditions. See lines 172-178.

      (7) Figure 1H-I: I disagree with the authors' interpretation of the results (lines 144-153). The data argue that by blocking ER Ca2+ uptake by TG, other organelles take up Ca2+ from the cytosol where it accumulates due to the leak and Ca2+ influx as is evident from the data allowing more release. The data does not argue for ER Ca2+ tunneling to other organelles. Tunneling would be reduced in the presence of TG (see PMID: 30046136, 24867608).

      We partially agree with this concern. In our experiments, TG was used to inhibit SERCA and block calcium uptake into the ER, allowing calcium to leak into the cytosol. We propose that this leaked calcium is subsequently taken up by other intracellular compartments. This effect is observed immediately upon TG addition. However, pre-incubation with TG or knockdown of SERCA reduces calcium storage in the ER, thereby diminishing the transfer of calcium to other stores.

      To further support our claim, we performed additional experiments in the absence of extracellular calcium, now presented in Figure 1J-K. We observed that calcium release triggered by GPN or nigericin was significantly enhanced when both agents were added after TG. These results suggest that calcium initially released from the ER can be sequestered by other compartments. As mentioned, we deleted any mention of “tunneling,” but we believe the data support the occurrence of calcium transfer. New results described in lines 166-171.

      The experiment in Fig S2A described in the response to (6) also addresses this concern. Under physiological conditions with functional SERCA, cytosolic calcium would likely be rapidly sequestered by the ER, limiting its availability to other compartments. See lines 172-178.

      (8) Line 175: SERCA-dependent Ca2+ uptake is higher at 880 nM as would be expected yet the authors state that it's optimal at 220 nM Ca2+ ?

      Yes, it is true that the SERCA-dependent Ca²⁺ uptake rate is higher at elevated Ca²⁺ concentrations. We chose to use 220 nM free calcium for several reasons: 1) this concentration is close to physiological fluctuations in cytosolic levels; 2) it is commonly used in studies of mammalian SERCA; and 3) calcium uptake is readily detectable at this level. While this may not represent the conditions for maximal SERCA activity, we believe it is a reasonable and physiologically relevant choice for assessing SERCA-dependent calcium transport activity. We added one sentence to the Results explaining this reasoning (lines 204-207) and deleted the word “optimal”.

      (9) Figure 3H: the saponin egress data support the conclusion that organelles Ca2+ take up cytosolic Ca2+ directly without the need for ER tunneling.

      The saponin concentration used permeabilizes the host cell membrane, exposing the intracellular tachyzoite to the added, higher extracellular calcium concentration. It does not affect the tachyzoite membrane, as the parasite is still moving and calcium oscillations were clearly seen under similar conditions (PMID: 26374900). The resulting calcium increase in the tachyzoite cytosol is what stimulates parasite motility and egress. Since SERCA activity is reduced in the mutant, cytosolic calcium accumulates more rapidly, reaching the threshold for egress sooner and thereby accelerating parasite exit. The result does not support a contribution from the other stores, because the ionomycin response shows that egress is diminished in the mutant, likely because the calcium stores are depleted. We added an explanation in the Results, lines 262-269, and the Discussion, lines 532-539.

      (10) Figure S2: the HA and SERCA signals do not match perfectly? Could this reflect issues with HA tagging, potentially off-target effects? Was this tested?

      These are not off-target effects, as we did not observe them in the control cells lacking HA tagging. The HA signal also disappeared after treatment with ATc, further confirming that the IFA signal is specific. We agree with the reviewer that the signals do not align perfectly. This discrepancy could be due to differences in antibody accessibility or to the fact that the two antibodies recognize different regions of the protein. We added a sentence about this in the Results, lines 240-243.

      Reviewer #2 (Recommendations for the authors):

      The description of the data of Figures 1B and S1A starting on line 108 would be easier to follow if Figure S1A was actually incorporated into Figure 1. It is not clear why these two complementary experiments were separated since they are both equally important in understanding and interpreting the data.

      We re-arranged figure 1 and incorporated S1A now as Fig 1C.

      As noted in the public comments, loading of fura2/AM can result in compartmentalized fura2, which can contaminate the cytosolic calcium measurements and might modify free calcium levels and calcium storage capacity in intracellular organelles. This can be assessed using the digitonin permeabilization method used in the MagFluo4 measurements, but in this case, detecting the fura2 signal remaining after cell permeabilization.

      As suggested by the reviewer, we measured Fura-2 compartmentalization by permeabilizing cells with digitonin as we do for the Mag-Fluo-4 and the fluorescence was reduced almost completely and was unresponsive to any additions (see Author response image 1).

      Author response image 1.

      T. gondii tachyzoites in suspension exposed to thapsigargin, calcium and GPN. The dashed lines show an experiment using the same conditions, but in which parasites were permeabilized with digitonin to release the cytosolic Fura. Part B shows a similar experiment with parasites exposed to MgATP.

      Following the public comment regarding the residual calcium mobilization response to Zaprinast observed after 24 h ATc knockdown of SERCA (Figures 4E, 4F, as explained in the legend to Figure 4), was there still a response to Zaprinast after 48 h knockdown, where the thapsigargin response was apparently fully ablated?

      Unfortunately, we were unable to perform this experiment as it is not possible to obtain sufficient cells at 48 h with ATc. Due to the essential role of TgSERCA, parasites are unable to replicate after 24 h.

      As noted in the public comments, the data in Figure 4A vs 4G and Figure 4B vs 4H appear to show that the calcium responses to GPN are similar to that with thapsigargin, which seems unexpected if the acidic compartment is loaded from the ER. The results with GPN addition after thapsigargin (Figure 1H) argue against this, but the authors should still cite the work of Atakpa et al.

      We think that the reviewer is concerned that GPN may also be acting on the ER. This is a possibility that we considered, and we now included the suggested citation (line 457). However, we believe that it is difficult to directly compare the responses, as the kinetics of calcium release from the ER may differ from those of release from the PLVAC. This could be due to differences in the calcium buffering capacity between the two compartments. Additionally, it is possible that calcium leaked from the ER is more efficiently sequestered by other stores or extruded through the plasma membrane than calcium released from the PLVAC. Besides, GPN is known to have a more disruptive effect on membranes compared to TG, which may also influence their responses. As noted by the reviewer, Figure 1H also supports the idea that the acidic compartment is loaded from the ER.

      The abbreviation for the plant-like vacuolar compartment (PLVAC) only appears in a figure legend but should be defined in the main text on first use.

      Corrected, lines 140-143.

      The authors should cite the previous study of Borges-Pereira et al., 2020 (PMID: 32848018) that also demonstrates the incomplete overlap of the calcium pools mobilized by thapsigargin and CPA in P. falciparum. The ability to measure calcium in intracellular stores using MagFluo4 opens the possibility to further investigate this discrepancy between CPA and thapsigargin, but CPA does not appear to have been used in the permeabilized cell experiments with MagFluo4. I would suggest that this could be added to Figure 2 and/or Figure 4, or at least as a supplementary figure.

      In response to this reviewer’s critique, we performed additional experiments with Mag-Fluo-4-loaded parasites. These are presented in the new Figure S3. We added CPA and TG, alone and in combination, to inhibit SERCA and to allow calcium leak from the loaded organelle. Under these conditions, we observed a very similar leak rate after the addition of the inhibitors, as measured by the slope of Ca²⁺ leak. We believe that the leak rate is most likely determined by the intrinsic ER mechanism. See the discussion of this result in lines 436-442 and the previous response to the same reviewer comment.
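
      The leak-rate comparison described above rests on the slope of the fluorescence trace after inhibitor addition. The following is a minimal sketch of how such a slope could be estimated by least squares; it is not the authors' analysis pipeline, and the function name `leak_slope`, the time window, and the example trace are all hypothetical.

```python
from typing import Sequence


def leak_slope(times: Sequence[float], signal: Sequence[float],
               start: float, end: float) -> float:
    """Least-squares slope of signal versus time within [start, end].

    Hypothetical helper: `times` are in seconds and `signal` is a
    fluorescence reading in arbitrary units. The window is assumed to
    cover the period after inhibitor addition.
    """
    pts = [(t, s) for t, s in zip(times, signal) if start <= t <= end]
    if len(pts) < 2:
        raise ValueError("need at least two points inside the window")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_s = sum(s for _, s in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    if sxx == 0:
        raise ValueError("all time points in the window are identical")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in pts)
    return sxy / sxx


# Made-up trace: flat baseline, then a linear decline after t = 100 s.
times = [float(t) for t in range(0, 200, 10)]
trace = [100.0 if t < 100 else 100.0 - 0.5 * (t - 100) for t in times]
slope = leak_slope(times, trace, start=100, end=190)
```

      Slopes obtained this way for each inhibitor condition could then be compared across replicates; the choice of window is an assumption that would need to match the actual addition times.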

      Reviewer #3 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data, or analyses

      (1) Figure 1A is not mentioned in the main text even though it is discussed.

      Corrected

      (2) Figure 1G: Values do not match, how can GPN be so high?

      These figures were replaced by new traces and individual quantification analyses for each experiment.

      (3) Figure 1H and I: Is this type of data/results also available for the mitochondrion?

      Unfortunately, we were not able to include this experiment because we were unable to accurately quantify the mitochondrial calcium release. Instead, we used mitochondrial GECIs and the results are shown in Figure 5 to study mitochondrial calcium uptake.

      (4) Figure 1H: where does the calcium go after GPN addition? Taken up by another calcium store?

      Most likely calcium is extruded through the plasma membrane by the activity of the Calcium ATPase TgA1.

      However, the reviewer’s suggestion is also possible, and calcium could be taken up by another store such as the mitochondrion. In this regard, we did observe a large mitochondrial calcium increase (parasites expressing SOD2-GCaMP6) after adding GPN (Fig 5I), suggesting that the mitochondrion may take up calcium from the organelle targeted by GPN. However, the calcium affinity of the mitochondrion is very low, so the concentration of calcium needs to be very high to activate uptake, and these concentrations are most likely achieved at the microdomains formed between the mitochondrion and other organelles.

      (5) Figure 2B-C: Further explanation of why these particular values were chosen for the follow-up experiments would be helpful for the reader.

      We tested a wide range of MgATP and free calcium concentrations to measure ER Ca²⁺ uptake catalyzed by TgSERCA. The concentrations shown fall within the linear range.

      We followed the free calcium concentrations used in studies of mammalian SERCA (https://doi.org/10.1016/j.ceca.2020.102188). That protocol used 220 nM free calcium, which is close to cytosolic Ca²⁺ levels. TgSERCA can take up calcium efficiently at this concentration, as shown in Fig 2. We used less MgATP than the mammalian cell protocols, since we did not observe a significant increase in SERCA activity beyond 0.5 mM MgATP. We added one more sentence explaining this in the Results, lines 204-207.

      (6) Figure 3E: Revise the error bar? (and note that colours do not match the graph legend).

      The colors do match; the apparent mismatch arises because vacuoles containing a single parasite are virtually absent in the control group without ATc treatment.

      (7) Figure 3H: 'Interestingly, when testing egress after the addition of saponin in the presence of extracellular Ca2+, we observed that the tachyzoites egressed sooner (Figure 3H, saponin egress).' This is the only graph showing egress timing, and thus it is not clear what is the comparison. The egressed here is sooner compared to what condition? Egress in the absence of Ca2+? This requires clarification and might require the control data to be added.

      In the saponin experiment we compare time to egress of the mutant grown with or without ATc. The measurement is for time to egress after adding saponin. This experiment is in the presence of extracellular calcium. The protocol was previously used to measure time to egress: PMID: 40043955, PMID: 38382669, PMID: 26374900. See also response to question 9 of reviewer 1.

      (8) Figure 4C: There is a small peak appearing right after TG addition this should be discussed and explained.

      This trace was generated in a different fluorometer, the F-4000. The peak was an artifact caused by a jump in the signal when TG was added. Multiple repeats of the same experiment in the newer F-7000 did not show the peak. We now mention in the Materials and Methods that the F-4000 fluorometer was used for some experiments. We apologize for the omission. Lines 609-610.

      (9) Figure 5A: An important control that is missing is co-localisation with a mitochondrial marker.

      The expression of the SOD2-GCaMP6 has been characterized: PMID: 31758454

      (10) Figure 5H: This line was made for this study however the line genetic verification is missing.

      In response to this concern we now include a new Figure S5 showing the fluorescence of GCaMP6 in the mitochondrion of the iDTgSERCA mutant (Fig. S5A). We include several parasites. In addition, we show fluorescence measurements after addition of Calcium showing that the cells are unresponsive indicating that the indicator is not in the cytosol. Lines 650-651 and 344-348.

      (11) Figure 6D: since the membranes are hard to see, it is not clear whether the arrows show structures that are in line with the definition of membrane contact sites. The authors should provide an in-depth analysis of the length of the interaction between the membranes where the distance is less than 30 nm, and discuss how many structures corresponding to the definition were analysed.

      All the requested details are now included in the legend to Figure S3.
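
      The measurement the reviewer requests, the length of membrane apposition where the inter-membrane distance is below 30 nm, can be illustrated with a short sketch. This is only an assumed way of performing such an analysis, not the procedure used for the figure; the function name `contact_segments`, the sampling scheme, and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def contact_segments(positions_nm: Sequence[float],
                     gaps_nm: Sequence[float],
                     max_gap_nm: float = 30.0) -> List[Tuple[float, float]]:
    """Return (start, end) positions of contiguous runs with gap < max_gap_nm.

    Hypothetical inputs: `positions_nm` are sampling points along one
    membrane profile and `gaps_nm` is the measured distance to the
    opposing membrane at each point.
    """
    if len(positions_nm) != len(gaps_nm):
        raise ValueError("positions and gaps must have the same length")
    segments: List[Tuple[float, float]] = []
    run_start = None
    last_close = None
    for pos, gap in zip(positions_nm, gaps_nm):
        if gap < max_gap_nm:
            if run_start is None:
                run_start = pos      # a new close-apposition run begins
            last_close = pos         # extend the current run
        elif run_start is not None:
            segments.append((run_start, last_close))
            run_start = None
    if run_start is not None:        # close a run that reaches the end
        segments.append((run_start, last_close))
    return segments


# Made-up profile sampled every 10 nm along the membrane.
positions = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
gaps = [50.0, 40.0, 25.0, 20.0, 28.0, 45.0, 29.0, 60.0]
segments = contact_segments(positions, gaps)
lengths = [end - start for start, end in segments]
```

      With this convention a run consisting of a single sampled point has length zero, so the sampling interval limits the resolution of the reported contact lengths.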

      Minor corrections to the text and figures

      (1) Unify statistical labelling throughout the paper replacing *** with p values.

      Corrected. We replaced the *** with the actual p value in some figures. For Figure 2 and Fig S1, we still use *** because of space limitations.

      (2) Unify ATC vs ATc throughout the paper.

      Corrected

      (3) Unify capitalization of line name (iΔTgserca/i ΔTgSERCA) throughout the paper.

      Corrected

      (4) Unify capitalization of p value (p/P) throughout the paper.

      Corrected in figures

      (5) Unify Fig X vs Fig. X throughout the text.

      Corrected

      (6) Add values of scale bars to legends (eg Figure S2).

      Corrected

      (7) What is the time point for the data in Figures 4E-H, 5H, and S3? 24hrs? include in the legend.

      Added 24 h to the legends. Fig S3 is now S4.

      (8) Figure 3F: The second graph is NS thus perhaps no need for the p-value?

      Corrected

      (9) Figure 3G: Worth considering swapping the two around: first attachment and then invasion?

      Corrected. Invasion and attachment bars were swapped.

      (10) Figure 4A/B: Wrong colour match for Figure 4B.

      Corrected

      (11) Figure 4F: In the main text, the authors reference to Figure 1F, correct to 4F.

      Corrected

      (12) Figure 4H: In the main text, authors reference to Figure 1H, correct to 4H.

      Corrected

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: The authors of this study sought to define a role for IgM in responses to house dust mites in the lung.

      Strengths:

      Unexpected observation about IgM biology

      Combination of experiments to elucidate function

      Weaknesses:

      Would love more connection to human disease

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and others in people with low IgM. Allergic disorders are also common in people with IgM deficiency; other studies have reported an incidence as high as 33-47%. The mechanisms behind the high incidence of allergic diseases are unclear, as these patients generally have normal IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Hadebe and colleagues describes a striking reduction in airway hyperresponsiveness in IgM-deficient mice in response to HDM, OVA and papain across the B6 and BALB/c backgrounds. The authors suggest that the deficit is not due to improper type 2 immune responses, nor an aberrant B cell response, despite a lack of class switching in these mice. Through RNA-Seq approaches, the authors identify few differences between the lungs of WT and IgM-deficient mice, but see that two genes involved in actin regulation are greatly reduced in IgM-deficient mice. The authors target these genes by CRISPR-Cas9 in in vitro assays of smooth muscle cells to show that these may regulate cell contraction. While the study is conceptually interesting, there are a number of limitations, which stop us from drawing meaningful conclusions.

      Strengths:

      Fig. 1. The authors clearly show that IgMKO mice have strikingly reduced AHR in the HDM model, despite the presence of a good cellular B cell response.

      Weaknesses:

      Fig. 2. The authors characterize the CD4 T cell response to HDM in IgMKO mice. They have restimulated medLN cells with anti-CD3 for 5 days to look for IL-4 and IL-13, and find no discernible difference between WT and KO mice. The absence of PBS-treated WT and KO mice in this analysis means it is unclear if HDM-challenged mice are showing IL-4 or IL-13 levels above that seen at baseline in this assay.

      We thank the Reviewer for this comment. We would like to mention that only very low levels of IL-4 and IL-13 were detected in PBS mice. We have added a dotted line to the Figure to indicate cytokine levels in unstimulated or naïve mice. Please see Author response image 1 below, from anti-CD3-stimulated cytokine ELISA data. The levels of these cytokines are very low and do not differ between WT and IgM⁻/⁻ mice; this is also true for PMA/ionomycin-stimulated cells.

      Author response image 1.

      The choice of 5 days is strange, given that the response the authors want to see is in already primed cells. A 1-2 day assay would have been better.

      We agree with the reviewer that a shorter stimulation period would work. Over the years we have settled for 5-day re-stimulation for both anti-CD3 and HDM. We have tried other time points, but we consistently get better secretion of cytokines after 5 days.

      It is concerning that the authors state that HDM restimulation did not induce cytokine production from medLN cells, since countless studies have shown that restimulation of medLN would induce IL-13, IL-5 and IL-10 production from medLN. This indicates that the sensitization and challenge model used by the authors is not working as it should.

      We thank the reviewer for this observation. In our recent paper showing how antigen load affects B cell function, we used very low levels of HDM to sensitise and challenge mice (1 ug and 3 ug respectively). See below article, Hadebe et al., 2021 JACI. This is because Labs that have used these low HDM levels also suggested that antigen load impacts B cell function, especially in their role in germinal centres. We believe the reason we see low or undetectable levels of cytokines is because of this low antigen load sensitisation and challenge. In other manuscripts we have published or about to publish, we have shown that normal HDM sensitisation load (1 ug or 100 ug) and challenge (10 ug) do induce cytokine release upon restimulation with HDM. See the below article by Khumalo et al, 2020 JCI Insight (Figure 4A).

      Sabelo Hadebe, Jermaine Khumalo, Sandisiwe Mangali, Nontobeko Mthembu, Hlumani Ndlovu, Amkele Ngomti, Martyna Scibiorek, Frank Kirstein, Frank Brombacher. Deletion of IL-4Ra signalling on B cells limits hyperresponsiveness depending on antigen load. doi.org/10.1016/j.jaci.2020.12.635.

      Jermaine Khumalo, Frank Kirstein, Sabelo Hadebe, Frank Brombacher. IL-4Rα signalling in regulatory T cells is required for dampening allergic airway inflammation through inhibition of IL-33 by type 2 innate lymphoid cells. JCI Insight. 2020 Oct 15;5(20):e136206. doi: 10.1172/jci.insight.136206

      The IL-13 staining shown in panel c is also not definitive. One should be able to optimize their assays to achieve a better level of staining, to my mind.

      We agree with the reviewer that a much higher frequency of IL-13-producing CD4 T cells should be observed. We do not think this is a technical glitch or a non-optimal set-up, as we see much higher frequencies of IL-13-producing CD4 T cells, between 7% and 20% in WT mice, when using higher doses of HDM to sensitise and challenge (see Author response image 2, lung cells stimulated with PMA/ionomycin + monensin). Please note that this image is for illustration purposes only and is not linked to the current manuscript; it merely demonstrates a point from other experiments we have conducted in the lab.

      Author response image 2.

      In d-f, the authors perform a serum transfer, but they only do this once. The half life of IgM is quite short. The authors should perform multiple naïve serum transfers to see if this is enough to induce FULL AHR.

      We thank the reviewer for this comment. We apologise if this was not clear enough in the Figure legend and methods. We did transfer serum three times (a day before sensitisation, on the day of sensitisation and a day before the challenge) to circumvent the short half-life of IgM. In our subsequent experiments, we have now used busulfan to deplete bone marrow in IgM-deficient mice and replace it with WT bone marrow, and this method restores AHR (Figure 3).

      This now appears in lines 165 to 169 and reads:

      “Adoptive transfer of naïve serum

      Naïve wild-type mice were euthanised and blood was collected via cardiac puncture before being spun down (5500 rpm, 10 min, RT) to collect serum. Serum (200 µL) was injected intraperitoneally into IgM-deficient mice at day -1, day 0, and a day before the challenge with HDM (day 10).”

      The presence of negative values of total IgE in panel F would indicate some errors in calculation of serum IgE concentrations.

      We thank the reviewer for this observation. For clarity, we have now indicated these values as undetected in the Figure, as they were below our detection limit.

      Overall, it is hard to be convinced that IgM-deficiency does not lead to a reduction in Th2 inflammation, since the assays appear suboptimal.

      We disagree with the reviewer in this instance, because we have shown in 3 different models, in 2 different strains and at 2 doses of HDM (high and low) that Th2 responses remain intact. Our reason for choosing low-dose HDM was based on our previous work and that of others, which showed that, depending on antigen load, B cells can either be redundant or have functional roles. Since our interest was to tease out the role of B cells, and specifically IgM, it was important to look at a scenario where B cells are known to have a function (low antigen load). We obtained similar results at high-dose HDM: the effects on AHR were not as strong, but Th2 responses were not changed, and in some instances were higher in IgM-deficient mice.

      Fig. 3. Gene expression differences between WT and KO mice in PBS and HDM challenged settings are shown. PCA analysis does not show clear differences between all four groups, but genes are certainly up and downregulated, in particular when comparing PBS to HDM challenged mice. In both PBS and HDM challenged settings, three genes stand out as being upregulated in WT v KO mice. These are Baiap2l1, Erdr1 and Chil1.

      Noted

      Fig. 4. The authors attempt to quantify BAIAP2L1 in mouse lungs. It is difficult to know if the antibody used really detects the correct protein. A BAIAP2L1-KO is not used as a control for staining, and I am not sure if competitive assays for BAIAP2L1 can be set up. The flow data is not convincing. The immunohistochemistry shows BAIAP2L1 (in red) in many, many cells, essentially throughout the section. There is also no discernible difference between WT and KO mice, which one might have expected based on the RNA-Seq data. So, from my perspective, it is hard to say if/where this protein is located, and whether there truly exists a difference in expression between wt and ko mice.

      We thank the reviewer for this comment. We are confident that the antibody detects BAIAP2L1. We have used it in 3 assays, which we admit may show varying specificities since it is a polyclonal antibody. However, in our western blot, the antibody detects a single band at 56.7 kDa and no other bands, apart from what we think are isoforms. We agree that BAIAP2L1 is expressed by many cell types, including CD45+ cells and alpha-smooth muscle-negative cells, and we show this in our Supplementary Figure 9. Where we think there is a difference in expression between WT and IgM-deficient mice is in alpha-smooth muscle-positive cells. We have tested antibodies from different companies and obtained similar results. We do not have access to BAIAP2L1 KO mice. To test specificity, we have used single-stain controls with or without secondary antibody and an isotype control, which show no binding in western blot and immunofluorescence assays, and fluorescence-minus-one controls in flow cytometry. We are therefore convinced that the signal we are seeing is specific to BAIAP2L1.

      Fig. 5 and 6. The authors use a single cell contractility assay to measure whether BAIAP2L1 and ERDR1 impact on bronchial smooth muscle cell contractility. I am not familiar with the assay, but it looks like an interesting way of analysing contractility at the single cell level.

      The authors state that targeting these two genes with Cas9gRNA reduces smooth muscle cell contractility, and the data presented for contractility supports this observation. However, the efficiency of Cas9-mediated deletion is very unclear. The authors present a PCR in supp fig 9c as evidence of gene deletion, but it is entirely unclear with what efficiency the gene has been deleted. One should use sequencing to confirm deletion. Moreover, if the antibody was truly working, one should be able to use the antibody used in Fig 4 to detect BAIAP2L1 levels in these cells. The authors do not appear to have tried this.

      We thank the reviewer for these observations. We are in the process of optimising this using new polyclonal BAIAP2L1 antibodies from other companies, since the one we have tried does not seem to work well on human cells by western blot. We hope that in our next version we will be able to demonstrate this by immunofluorescence or western blot.

      Other impressions:

      The paper is lacking a link between the deficiency of IgM and the effects on smooth muscle cell contraction.

      The levels of IL-13 and TNF in lavage of WT and IGMKO mice could be analysed.

      We have measured the Th2 cytokine IL-13 in BAL fluid and found no differences between IgM-deficient and WT mice challenged with HDM (Author response image 3). We could not detect TNF-alpha in the BAL fluid; it was below the detection limit.

      Author response image 3.

      IL-13 levels are not changed in IgM-deficient mice in the lung. IL-13 in bronchoalveolar lavage fluid of WT or IgM-deficient mice sensitised and challenged with HDM. TNF-α levels were below the detection limit.

      Moreover, what is the impact of IgM itself on smooth muscle cells? In the Fig. 7 schematic, are the authors proposing a direct role for IgM on smooth muscle cells? Does IgM in cell culture media induce contraction of SMC? This could be tested and would be interesting, to my mind.

      We thank the Reviewer for these comments. We are still trying to test this; unfortunately, we have experienced delays in getting reagents such as human IgM to South Africa. We hope to add this in subsequent versions of the article. We agree that it is an interesting experiment to do, if not for this manuscript then for our general understanding of this interaction, at least in an in vitro system.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Sabelo et al. describes a new pathway by which lack of IgM in the mouse lowers bronchial hyperresponsiveness (BHR) in response to methacholine in several mouse models of allergic airway inflammation in Balb/c mice and C57/Bl6 mice. Strikingly, loss of IgM does not lead to less eosinophilic airway inflammation, Th2 cytokine production or mucus metaplasia, but to a selective loss of BHR. This occurs irrespective of the dose of allergen used. This was important to address since several prior models of HDM allergy have shown that the contribution of B cells to airway inflammation and BHR is dose dependent.

      After a description of the phenotype, the authors try to elucidate the mechanisms. There is no loss of B cells in these mice. However, there is a lack of class switching to IgE and IgG1, with a concomitant increase in IgD. Restoring immunoglobulins by transfer of naïve serum into IgM-deficient mice leads to restoration of allergen-specific IgE and IgG1 responses, although the paper does not really explain how this might work. There is also no restoration of IgM responses and, concomitantly, the phenotype of reduced BHR still holds when serum is given, leading the authors to conclude that the mechanism is IgE and IgG1 independent. Wild-type B cell transfer also does not restore IgM responses, due to lack of engraftment of the B cells. Next, the authors perform whole-lung RNA sequencing and pinpoint reduced BAIAP2L1 mRNA as the culprit of the phenotype of IgM<sup>-/-</sup> mice. However, this cannot be validated fully at the protein level or by immunohistology, since differences between WT and IgM KO are not statistically significant, and B cell and IgM restoration are impossible. The histology and flow cytometry seem to suggest that expression is mainly found in alpha-smooth muscle-positive cells, which could still be smooth muscle cells or myofibroblasts. The authors therefore move to CRISPR knockdown of BAIAP2L1 in a human smooth muscle cell line, and show that loss leads to less contraction of these cells in vitro in a microscopic FLECS assay, in which smooth muscle cells bind to elastomeric contractible surfaces.

      Strengths:

      (1) There is a strong reduction in BHR in IgM-deficient mice, without alterations in B cell number, disconnected from effects on eosinophilia or Th2 cytokine production

      (2) BAIAP2L1 has never been linked to asthma in mice or humans

      Weaknesses:

      (1) While the observations of reduced BHR in IgM deficient mice are strong, there is insufficient mechanistic underpinning on how loss of IgM could lead to reduced expression of BAIAP2L1. Since it is impossible to restore IgM levels by either serum or B cell transfer and since protein levels of BAIAP2L1 are not significantly reduced, there is a lack of a causal relationship that this is the explanation for the lack of BHR in IgM-deficient mice. The reader is unclear if there is a fundamental (maybe developmental) difference in non-hematopoietic cells in these IgM-deficient mice (which might have accumulated another genetic mutation over the years). In this regard, it would be important to know if littermates were newly generated, or historically bred along with the KO line.

      We thank the reviewer for asking this question and getting us to think about this in a different way. It prompted us to use a different method to try to restore IgM function and, since our animal facility no longer allows irradiation, we opted for busulfan. We present these data as new data in Figure 3. We had to go back and breed this strain, and we then generated bone marrow chimeras. Using these chimeras, we now show that we can deplete bone marrow from IgM-deficient mice and replace it with congenic WT bone marrow when we allow the mice to rest for 2 months before challenge with HDM (new Supplementary Figure 6 a-c). We also show that AHR (resistance and elastance) is partially restored in this way (Figure 3 a and b): mice that receive congenic WT bone marrow after chemical irradiation can mount AHR, whereas those that receive IgM-deficient bone marrow cannot mount AHR upon challenge with HDM. If the mice had accumulated an unknown genetic mutation in non-hematopoietic cells, the transfer of WT bone marrow would not make a difference. We therefore do not believe that the colony has gained a mutation that we are unaware of. We have also shipped these mice to other groups, and in their hands this strain still behaves only as an IgM-only knockout mouse. See their publication below.

      Mark Noviski, James L Mueller, Anne Satterthwaite, Lee Ann Garrett-Sinha, Frank Brombacher, Julie Zikherman 2018. IgM and IgD B cell receptors differentially respond to endogenous antigens and control B cell fate. eLife 2018;7:e35074. DOI: https://doi.org/10.7554/eLife.35074

      We have also added methods for bone marrow chimaeras, together with the results sections and new Figures related to these methods.

      Methods (lines 171-182).

      “Busulfan Bone marrow chimeras

      WT (CD45.2) and IgM<sup>-/-</sup> (CD45.2) congenic mice were treated with 25 mg/kg busulfan (Sigma-Aldrich, Aston Manor, South Africa) per day for 3 consecutive days (75 mg/kg in total), dissolved in 10% DMSO and phosphate-buffered saline (0.2 mL, intraperitoneally), to ablate bone marrow cells. Twenty-four hours after the last administration of busulfan, mice were injected intravenously with fresh bone marrow (10x10<sup>6</sup> cells, 100 µL) isolated from hind leg femurs of either WT (CD45.1) or IgM<sup>-/-</sup> mice(33). Animals were then allowed to complement their haematopoietic cells for 8 weeks. In some experiments, the level of bone marrow ablation was assessed 4 days post-busulfan treatment in mice that did not receive donor cells. At the end of the experiment, the level of complemented cells was also assessed in WT and IgM<sup>-/-</sup> mice that received WT (CD45.1) bone marrow.”

      Results (lines 491-521)

      “Replacement of IgM-deficient mice with functional hematopoietic cells in busulfan chimeric mice restores airway hyperresponsiveness.

      We then generated bone marrow chimeras by chemical radiation using busulfan(33). We treated mice with busulfan once daily for 3 consecutive days and, after 24 hrs, transferred naïve bone marrow from congenic CD45.1 WT mice or CD45.2 IgM<sup>-/-</sup> mice (Fig. 3a and Supplementary Fig. 5a). We showed that, 4 days post-treatment, recipient mice that did not receive donor bone marrow had significantly reduced lineage markers (CD45+Sca-1+) or lineage negative (Lin-) cells in the bone marrow when compared to untreated or vehicle (10% DMSO) treated mice (Supplementary Figure 5b-c). We allowed mice to reconstitute bone marrow for 8 weeks before sensitisation and challenge with low dose HDM (Figure 3a). We showed that WT (CD45.2) recipient mice that received WT (CD45.1) donor bone marrow had higher airway resistance and elastance, and this was comparable to IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1) bone marrow (Figure 3b). As expected, IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor IgM<sup>-/-</sup> (CD45.2) bone marrow had significantly lower AHR compared to WT (CD45.2) or IgM<sup>-/-</sup> (CD45.2) recipient mice that received WT (CD45.1) bone marrow (Figure 3b). We confirmed that the differences observed were not due to differences in bone marrow reconstitution, as we saw similar frequencies of CD45.1 cells within the lymphocyte populations in the lungs and other tissues (Supplementary Fig. 5d). We observed no significant changes in the lung neutrophils, eosinophils, inflammatory macrophages, CD4 T cells or B cells in WT or IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1/CD45.2) or IgM<sup>-/-</sup> (CD45.2) bone marrow when sensitised and challenged with low dose HDM (Fig. 3c).

      Restoring IgM function through adoptive reconstitution with congenic CD45.1 bone marrow in non-chemically irradiated recipient mice or sorted B cells into IgM<sup>-/-</sup> mice (Supplementary Fig. 6a) did not replenish IgM B cells to levels observed in WT mice and as a result did not restore AHR, total IgE and IgM in these mice (Supplementary Fig. 6b-c).”

      The 2 new figures are Figure 3 and Supplementary Figure 5; each moved the subsequent figures (main and supplementary, respectively) down in the numbering.

      The discussion appears in lines 757-766 of the untracked version of the article.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation only partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      (2) There is no mention of the potential role of complement in activation of AHR, which might be altered in IgM-deficient mice 

      We thank the reviewer for this comment. We have not directly looked at complement in this instance; however, in our previous work, C3-/- mice showed AHR comparable to that of WT mice under HDM challenge.

      (3) What is the contribution of elevated IgD in the phenotype of the IgM-deficient mice? It has been described by this group that IgD levels are clearly elevated.

      We thank the reviewer for this question. We believe that IgD is essentially what drives partial class switching to IgG. We have shown, in the case of VSV and of Trypanosoma congolense and Trypanosoma brucei brucei, that elevated IgD drives delayed but effective IgG responses in the absence of IgM (Lutz et al, 2001, Nature). This is also supported by the studies of Noviski et al., who show that IgM and IgD share some endogenous antigens, so it is likely that external antigens can activate IgD in a similar manner to prompt class switching.

      (4) How can transfer of naïve serum in class switching deficient IgM KO mice lead to restoration of allergen specific IgE and IgG1?

      We thank the Reviewer for these comments. We believe that naïve serum transferred to IgM-deficient mice is able to bind to the surface of B cells via IgM receptors (FcμR / Fcα/μR), which are still present on B cells, and that this is sufficient to facilitate class switching. Our IgM<sup>-/-</sup> mouse lacks both membrane-bound and secreted IgM, and the transferred serum contains at least secreted IgM, which can bind to cell surfaces via its Fc portion. We measured HDM-specific IgE and found very low levels, but these were not different between WT mice and IgM<sup>-/-</sup> mice adoptively transferred with WT serum. We also detected HDM-specific IgG1 in IgM<sup>-/-</sup> mice transferred with WT serum at the same level as in WT mice, consistent with possible class switching, although we cannot rule out that the transferred serum also contains some IgG1. We also cannot rule out that elevated IgD levels are partially responsible for class-switched IgG1, as discussed above.

      In the discussion (lines 804-812), we also added the following:

      “We speculate that IgM can directly activate smooth muscle cells by binding a number of its surface receptors including FcμR, Fcα/μR and pIgR(52-54). IgM binds to FcμR strictly, but shares Fcα/μR and pIgR with IgA(5,52,54). Both Fcα/μR and pIgR can be expressed by non-structural cells at mucosal sites(54,55). We would not rule out that the mechanisms of muscle contraction might be through one of these IgM receptors, especially the ones expressed on smooth muscle cells(54,55). Certainly, our future studies will be directed towards characterizing the mechanism by which IgM potentially activates the smooth muscle.”

      We have discussed this in the Discussion section, lines 731 to 757. In addition, since we have now performed bone marrow chimaeras, we have added the following to our discussion in lines 757-766.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation only partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      After performing bone marrow chimaeras, we removed the following lines, since the new data changed some aspects.

      Our efforts to adoptively transfer wild-type bone marrow or sorted B cells into IgM-deficient mice were also largely unsuccessful partly due to poor engraftment of wild-type B cells into secondary lymphoid tissues. Natural secreted IgM is mainly produced by B1 cells in the peritoneal cavity, and it is likely that any transfer of B cells via bone marrow transfer would not be sufficient to restore soluble levels of IgM(3,10).

      (5) Alpha smooth muscle antigen is also expressed by myofibroblasts. This is insufficiently worked out. The histology mentions "expression in cells in close contact with smooth muscle". This needs more detail since it is a very vague term. Is it in smooth muscle or in myofibroblasts?

      Response: We appreciate that alpha-smooth muscle actin-positive cells are a small fraction in the lung and even within CD45 negative cells, but their contribution to airway hyperresponsiveness is major. We also concede that by immunofluorescence BAIAP2L1 seems to be expressed by cells adjacent to alpha-smooth muscle actin (Fig. 5b), however, we know that cells close to smooth muscle (such as extracellular matrix and myofibroblasts) contribute to its hypertrophy in allergic asthma.

      James AL, Elliot JG, Jones RL, Carroll ML, Mauad T, Bai TR, et al. Airway Smooth Muscle Hypertrophy and Hyperplasia in Asthma. Am J Respir Crit Care Med [Internet]. 2012;185:1058–64. Available from: https://doi.org/10.1164/rccm.201110-1849OC

      (6) Have polymorphisms in BAIAP2L1 ever been linked to human asthma?

      No. We have looked in asthma GWAS studies, at least at the summary statistics, and we have not seen any SNPs that could be associated with human asthma.

      (7) IgM deficient patients are at increased risk for asthma. This paper suggests the opposite. So the translational potential is unclear

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, as the reviewer correctly points out, with some studies reporting a prevalence as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear, as these patients generally have normal or higher IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      TMC7 knockout mice were generated by the authors and the phenotype was analyzed. They found that Tmc7 is localized to Golgi and is needed for acrosome biogenesis.

      Strengths:

      The phenotype of infertility is clear, and the results of TMC7 localization and the failed acrosome formation are highly reliable. In this respect, they made a significant discovery regarding spermatogenesis.

      Weaknesses:

      There are also some concerns, which are mainly related to the molecular function of TMC7 and Figure 5.

      (1) It is understandable that TMC7 exhibits some channel activity in the Golgi and somehow affects luminal pH or Ca2+, leading to the failure of acrosome formation. On the other hand, since they are conducting the pH and calcium imaging from the cytoplasm, I do not think that the effect of TMC7 channel function in Golgi is detectable with their methods.

      We agree with the reviewer that there is no direct evidence showing the effect of TMC7 channel function in the Golgi. We have changed the description in the revised manuscript.

      (2) Rather, it is more likely that they are detecting apoptotic cells that have no longer normal ion homeostasis.

      We thank the reviewer for raising this concern. We apologize for not labeling the postnatal stage in the original Figure 5. We measured intracellular Ca2+, pH and ROS in PD30 testes (revised Fig. S6a-c); no apoptotic cells were observed at this stage (revised Fig. S6e, f). Apoptotic cells were found in the seminiferous tubules and cauda epididymis of 9-week-old Tmc7–/– mice (revised Fig. 5e-f). We have included TUNEL data from testes of PD21, PD30 and 9-week-old mice (revised Fig. 5e, f and Fig. S6e, f). In accordance with our findings, Tmc1 mutation has also been shown to result in reduced Ca2+ permeability, thus triggering hair cell apoptosis (Fettiplace, R, PNAS. 2022) [1].

      (3) Another concern is that n is only 3 for these imaging experiments.

      As suggested by the reviewer, more replicates were included in imaging experiments.

      Reviewer #2 (Public Review):

      Summary:

      This study presents a significant finding that enhances our understanding of spermatogenesis. TMC7 belongs to a family of transmembrane channel-like proteins (TMC1-8), primarily known for their role in the ear. Mutations to TMC1/2 are linked to deafness in humans and mice and were originally characterized as auditory mechanosensitive ion channels. However, the function of the other TMC family members remains poorly characterized. In this study, the authors begin to elucidate the function of TMC7 in acrosome biogenesis during spermatogenesis. Through analysis of transcriptomics datasets, they identify TMC7 as a transmembrane channel-like protein with elevated transcript levels in round spermatids in both mouse and human testis. They then generate Tmc7-/- mice and find that male mice exhibit smaller testes and complete infertility. Examination of different developmental stages reveals spermatogenesis defects, including reduced sperm count, elongated spermatids, and large vacuoles. Additionally, abnormal acrosome morphology is observed beginning at the early-stage Golgi phase, indicating TMC7's involvement in proacrosomal vesicle trafficking and fusion. They observed localization of TMC7 in the cis-Golgi and suggest that its presence is required for maintaining Golgi integrity, with Tmc7-/- leading to reduced intracellular Ca2+, elevated pH, and increased ROS levels, likely resulting in spermatid apoptosis. Overall, the work delineates a new function of TMC7 in spermatogenesis and the authors suggest that its ion channel activity is likely important for Golgi homeostasis. This work is of significant interest to the community and is of high quality.

      Strengths:

      The biggest strength of the paper is the phenotypic characterization of the TMC7-/- mouse model, which has clear acrosome biogenesis/spermatogenesis defects. This is the main claim of the paper and it is supported by the data that are presented.

      Weaknesses:

      The claim is that TMC7 functions as an ion channel. It is reasonable to assume this given what has been previously published on the more well-characterized TMCs (TMC1/2), but the data supporting this is preliminary here, and more needs to be done to solidify this hypothesis. The authors are careful in their interpretation and present this merely as a hypothesis supporting this idea.

      We appreciate the insightful comment. It is indeed a limitation of our study that we lack strong evidence that TMC7 functions as an ion channel. We planned to perform cellular electrophysiology in GC-1 cells heterologously expressing TMC7. However, TMC7 was trapped in the endoplasmic reticulum, like TMC1 and TMC2 (Yu X, PNAS. 2020)[2], and failed to localize to the Golgi. Following the reviewer’s suggestion, we have provided a more careful and detailed interpretation of the molecular function of TMC7 in the revised manuscript.

      Reviewer #3 (Public Review):

      Summary:

      In this study, Wang et al. have demonstrated that TMC7, a testis-enriched multipass transmembrane protein, is essential for male reproduction in mice. Tmc7 KO male mice are sterile due to reduced sperm count and abnormal sperm morphology. TMC7 co-localizes with GM130, a cis-Golgi marker, in round spermatids. The absence of TMC7 results in reduced levels of Golgi proteins, elevated abundance of ER stress markers, as well as changes of Ca2+ and pH levels in the KO testis. However, further confirmation is required because the analyses were performed with whole testis samples in spite of the differences in the germ cell composition in WT and KO testis. In addition, the causal relationships between the reported anomalies await thorough interrogation.

      Strengths:

      The microscopic images are of great quality, all figures are properly arranged, and the entire manuscript is very easy to follow.

      Weaknesses:

      (1) Tmc7 KO male mice show multiple anomalies in sperm production and morphogenesis, such as reduced sperm count, abnormal sperm head, and deformed midpiece. Thus, it is confusing that the authors focused solely on impaired acrosome biogenesis.

      We are grateful for your comments and suggestions. We agree and have added these spermiogenesis defects of Tmc7–/– mice to the abstract and discussion sections of the revised manuscript.

      (2) Further investigations are warranted to determine whether the abnormalities reported in this manuscript (e.g., changes in protein, Ca2+, and pH levels) are directly associated with the molecular function of TMC7 or are the byproducts of partially arrested spermiogenesis. Please find additional comments in "Recommendations for the authors".

      Thank you for raising this concern. Per your comments, we have included data of intracellular Ca2+, pH and ROS in PD21 testes. The intracellular homeostasis was impaired as early as PD21, indicating TMC7 depletion impairs cellular homeostasis which in turn results in arrested spermiogenesis.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As noted by all three reviewers, current flow cytometry data does not necessarily support the 'ion channel' hypothesis, thus the phenotypic analysis is compelling but the molecular mechanism of how TMC7 facilitates acrosome biogenesis remains incomplete. It is highly recommended for the authors to at least discuss or test alternative hypotheses (as reviewer #2 suggested) such as the possibility of acting as 'lipid scramblase'. Also, the authors need to provide further explanation for other morphological defects if TMC7 is truly a functional ion channel in Golgi (and thus later at acrosome), which is also related to the key question of whether TMC7 is a functional ion channel.

      We thank the reviewing editor for the comments and suggestions. We agree that our study lacks strong evidence that TMC7 functions as an ion channel. We have discussed the possibility of TMC7 acting as a 'lipid scramblase' as suggested. We have also included data on intracellular Ca2+, pH and ROS in PD21 and PD30 testes.

      Indeed, Tmc7–/– mice exhibit other defects, including abnormal head morphology and disorganized mitochondrial sheaths. TMC7 is localized to the cis-Golgi apparatus and is required for maintaining Golgi integrity. Previous studies of Golgi-localized proteins, including GOPC (Yao R, PNAS. 2002)[2], HRB (Kang-Decker N. Science. 2001)[3] and PICK1 (Xiao N, JCI. 2009)[4], reported spermiogenesis defects similar to those of Tmc7–/– mice. It is therefore possible that the morphological defects in Tmc7–/– mice are due to impaired Golgi function.

      Reviewer #1 (Recommendations For The Authors):

      (1) The authors should provide more details about the imaging experiments using FACS. Since they only describe catalog numbers (Beyotime, S1056, S1006, S0033S) for imaging reagents, it is not immediately clear what reagents they actually used. Since they used Fluo3, BCECF, and DCFH, it would be better to mention their names.

      Thanks. We have provided more detailed antibody information as suggested.

      (2) I am also concerned that in the FACS there is no information at all about laser wavelength and filter properties. This is especially important for BCECF because the wavelength spectrum changes with pH. Also, if there are any positive controls for these imaging reagents, such as ionophores, it would be more convincing to include them.

      Thank you for your comment. The excitation wavelength was 488 nm for detecting Ca2+, pH and ROS by FACS. BCECF is the most popular pH probe to monitor cellular pH and the reagent from Beyotime (S1006) has been used by other studies (Chen S, Blood. 2016)[5], (Liu H, Cell Death Dis. 2022)[6]. To make the results more reliable, we have repeated these experiments in PD21 testes (revised Figure 5a-c). No positive controls for these reagents were used in our experiments.

      (3) As noted above, it is better to avoid directly linking the cell's abnormal ion homeostasis to TMC7 ion channel function in the text. The discussion should be changed to emphasize that the TMC7-deficient cells are apoptotic and that these physiological phenomena are occurring as a side effect of this apoptosis.

      Thank you for raising this concern. We agree with the reviewer that there is no direct evidence showing the effect of TMC7 channel function in the Golgi, and we have changed the description in the revised manuscript.

      We performed a new experiment to measure apoptosis and intracellular Ca2+, pH and ROS in PD21 testes. No apoptotic cells were observed at this stage. However, impaired cellular homeostasis was still found in the testes of PD21 Tmc7–/– mice. These data suggest that TMC7 depletion impairs cellular homeostasis and hence induces spermatid apoptosis.

      (4) While I understand that it appears to be difficult to experimentally verify the ion channel function of TMC7, it may be supportive to compare its amino acid sequence and/or 3D predicted structure with that of TMC1/2. Including a supplemental figure for this purpose would emphasize the possibility that TMC7 functions as an ion channel.

      We thank the reviewer for this great suggestion. We compared the amino acid sequences and predicted structures of TMC1 and TMC2 with those of TMC7. TMC1 had 81% sequence similarity with TMC7, with an RMSD (root mean square deviation) of 3.079. TMC2 had 82% sequence similarity with TMC7, with an RMSD of 2.176. These data suggest that TMC7 has an amino acid sequence and predicted structure similar to those of TMC1/2 and might function as an ion channel. We have included the predicted structures in revised Fig. S7.
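
      For readers unfamiliar with the metric, the following is a minimal sketch of how an RMSD between two structures is computed once matched atoms have been paired and superimposed. The response does not state which tool, atom selection, or units were used, so the function name `rmsd` and the coordinates below are purely illustrative assumptions, not the authors' procedure.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumes the two point sets are already paired and superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates, not taken from any TMC structure.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))
```

      A lower value means the paired atoms lie closer together after superposition; the value alone does not establish shared function.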

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      I do not have any experimental comments or concerns to address, but I do ask that the authors consider an alternative hypothesis. Based on prior data demonstrating that TMC1 is a mechanosensitive ion channel, the authors reasonably assume that TMC7 may also function as an ion channel. Although the authors observe alterations in cytosolic Ca2+ and pH upon loss of TMC7 by flow cytometry, which begins to support this hypothesis, these data do not directly demonstrate ion channel activity.

      I was wondering if the authors had considered whether TMC7 could also function as a lipid scramblase. TMC1 has also been proposed to function as a Ca2+-inhibited scramblase, where knockout of TMC1 leads to a loss of phosphatidylserine (PS) exposure and membrane blebbing at the apical region of hair cells (Ballesteros, A. and Swartz, K., Science Advances, 2022). Furthermore, TMC proteins are structurally related to the Anoctamin/TMEM16 family of chloride channels and lipid scramblases, where TMEM16A-B are bona fide Ca2+-activated chloride channels, and TMEM16C-H are characterized as Ca2+-dependent scramblases. Based on their structural similarity and the observation that TMC1 may also exhibit lipid scrambling properties based on the PS exposure, I wonder if the authors may have data that support a TMC7 scramblase hypothesis. I was intrigued by this idea, especially given the authors' observations of large vacuoles in the seminiferous tubules and cauda epididymis and the vesicle accumulation phenotype in their TEM data. Incorporating this hypothesis into the discussion section, at minimum, could provide a valuable perspective, and this line of thought may lead to interesting data interpretation throughout the paper.

      We thank the reviewer for the valuable suggestion. We have discussed the possibility of TMC7 acting as 'lipid scramblase' as suggested.

      Reviewer #3 (Recommendations For The Authors):

      (1) Gene symbols should be italicized, and protein symbols should be capitalized.

      Thanks. We have made changes to the manuscript as recommended.

      (2) Tmc7 KO males show reduced sperm count, which alters the germ cell composition in the testis (Figure 2g). Thus, it is inappropriate to compare protein levels using whole testis lysates (Figure 3e, 4h, 5d, 5f). Instead, the same immunoblotting analyses could be done with purified round spermatids or 3-wk-old testis. Likewise, the significance of the intracellular Ca2+ and pH measurements is potentially diminished by the differences in the germ cell composition in WT and KO mice.

      We appreciate this constructive suggestion. We agree with the reviewer that whole testis lysates diminish the differences between WT and Tmc7–/– mice. However, we are unable to purify round spermatids due to the lack of specific markers.

      (3) Figures 2i, 2j: How sperm motility was measured should be specified in the Methods.

      Thank you for this important reminder. We have added the sperm motility assessment to the Methods section.

      (4) Figure 4g: It does not make sense to compare the fluorescence intensity of these proteins without making sure that the seminiferous tubules are in the same stage. As shown in Figures S5a and S5b, TMC7 exhibits varied abundance in spermatids at different steps.

      We thank the reviewer for the insightful comment. We have replaced the images with images of seminiferous tubules at the same stage and compared the fluorescence intensity of the new images as suggested.

      (5) Figure 4h: How were the band intensities measured? The third band from the left is visually stronger than the first one, but it does not seem to be so according to the column graph. The reviewer measured the intensity of GRASP65 bands relative to alpha-tubulin by ImageJ and obtained relative intensities of 0.35, 0.87, 0.6, and 0.08 for the bands from left to right. Additional replicates of the western blots should be included in the supplementary figures.

      Thank you for this insightful comment. The density and size of the bands were quantified with ImageJ. We have checked the first GRASP65 band from the left, and it seems that the protein was not fully transferred onto the PVDF membrane. We have performed new experiments and replaced the original bands (Revised Fig. 4h). Additional replicates of the western blots have been included in revised Fig. S8.

      (6) Figures 5a, 5b: Based on the observation of abnormal intracellular Ca2+ and pH levels in the KO germ cells, the authors concluded that TMC7 maintains the homeostasis of Golgi pH and ion (Lines 223-224, 263-264). However, intracellular Ca2+ and pH levels do not directly reflect those in the Golgi apparatus.

      We thank the reviewer for this important comment. We agree and have changed “Golgi” to “intracellular” as suggested.

      (7) Figure 5c: ROS is produced during apoptosis. Thus, it is not appropriate to conclude that the increased ROS levels in Tmc7 KO germ cells lead to apoptosis.

      Following the reviewer’s comment, we measured ROS and apoptosis in the testes of PD21 and PD30 mice. ROS levels were increased, but no apoptotic cells were observed in the testes of PD21 and PD30 Tmc7–/– mice. Apoptotic cells were observed in the testes of 9-week-old Tmc7–/– mice (Revised Fig. 5e-f). These data suggest that TMC7 depletion results in the accumulation of ROS, which in turn leads to apoptosis.

      (1) Fettiplace, R., D.N. Furness, and M. Beurg, The conductance and organization of the TMC1-containing mechanotransducer channel complex in auditory hair cells. Proc Natl Acad Sci U S A, 2022. 119(41): p. e2210849119.

      (2) Yu, X., et al., Deafness mutation D572N of TMC1 destabilizes TMC1 expression by disrupting LHFPL5 binding. Proc Natl Acad Sci U S A, 2020. 117(47): p. 29894-29903.

      (3) Kang-Decker, N., et al., Lack of acrosome formation in Hrb-deficient mice. Science, 2001. 294(5546): p. 1531-3.

      (4) Xiao, N., et al., PICK1 deficiency causes male infertility in mice by disrupting acrosome formation. J Clin Invest, 2009. 119(4): p. 802-12.

      (5) Chen, S., et al., Sympathetic stimulation facilitates thrombopoiesis by promoting megakaryocyte adhesion, migration, and proplatelet formation. Blood, 2016. 127(8): p. 1024-35.

      (6) Liu, H., et al., PRMT5 critically mediates TMAO-induced inflammatory response in vascular smooth muscle cells. Cell Death Dis, 2022. 13(4): p. 299.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This research advance article describes a valuable image analysis method to identify individual cells (neurons) within a population of fluorescently labeled cells in the nematode C. elegans. The findings are solid and the method succeeds in identifying cells with high precision. The method will be valuable to the C. elegans research community.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this paper, the authors developed an image analysis pipeline to automatically identify individual neurons within a population of fluorescently tagged neurons. This application is optimized to deal with multi-cell analysis and builds on a previous software version, developed by the same team, to resolve individual neurons from whole-brain imaging stacks. Using advanced statistical approaches and several heuristics tailored for C. elegans anatomy, the method successfully identifies individual neurons with a fairly high accuracy. Thus, while specific to C. elegans, this method can become instrumental for a variety of research directions such as in-vivo single-cell gene expression analysis and calcium-based neural activity studies.

      The analysis procedure depends on the availability of an accurate atlas that serves as a reference map for neural positions. Thus, when imaging a new reporter line without fair prior knowledge of the tagged cells, such an atlas may be very difficult to construct. Moreover, usage of available reference atlases, constructed based on other databases, is not very helpful (as shown by the authors in Fig 3), so for each new reporter line a de-novo atlas needs to be constructed.

      We thank the reviewer for pointing out a place where some clarification is needed. While in principle every new reporter line would require fair prior knowledge, atlases are either already available or not difficult to construct. If one can assume that the anatomy of a particular line is similar to existing atlases (Yemini 2021, Nejatbakhsh 2023, Toyoshima 2020), cell ID can be performed immediately. Even when one suspects that the anatomy may differ from existing atlases (e.g. when examining mutants), existing atlases can serve as a starting point to provide a draft ID, which facilitates manual annotation. Once manual annotations on ~5 animals are available, as we have shown in this work (which is a manageable number in practice), this new dataset can be used to build an updated atlas for future inferences. We have added this discussion in the manuscript: “If one determines that the anatomy of a particular animal strain is substantially different from existing atlases, new atlases can be easily constructed using existing atlases as starting points.” (page 18).

      I have a few comments that may help to better understand the potential of the tool to become handy.

      1. I wonder about the degree to which strain mosaicism affects the analysis (Figs 1-4), as it was performed on a non-integrated reporter strain. As stated, for constructing the reference atlas, the authors used worms in which they could identify the complete set of tagged neurons. But how sensitive is the analysis when assaying worms with different levels of mosaicism? Do the results shown in the paper stem from animals with full neural set expression? Could the authors add results for which the assayed worms show partial expression, where only 80%, 70%, or 50% of the cell population is observed, and show how this affects identification accuracy? This may be important as many non-integrated reporter lines show highly mosaic patterns and may therefore not be suitable for this analytic method. In that sense, could the authors describe the degree of mosaicism of the line used for validating the method?

      We thank the reviewer for this comment. We want to clarify that most of the worms used in the construction of the atlas are indeed affected by mosaicism and thus do not express the full set of candidate neurons. We have added such a plot as requested (Figure 3 – figure supplement 2, copied below). Our data show that there is no correlation between the fraction of cells expressed in a worm and neuron ID correspondence. We agree with the reviewer that this additional insight may be helpful; we have modified the text to include this discussion: “Note that we observed no correlation between the degree of mosaicism and neuron ID correspondence (Figure 3- figure supplement 2).” (page 10).

      Author response image 1.

      No correlation between the degree of mosaicism (fraction of cells expressed in the worm) and neuron ID correspondence.
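
      A correlation check of this kind can be sketched as a Pearson coefficient between the per-worm fraction of expressed cells and the per-worm neuron ID correspondence. The response does not specify which correlation measure was used, so Pearson's r is an assumption here, and the numbers below are hypothetical placeholders rather than values from the figure supplement.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-worm values, for illustration only.
fraction_expressed = [0.5, 0.6, 0.7, 0.8, 0.9]
id_correspondence = [0.8, 0.7, 0.9, 0.7, 0.8]
print(pearson_r(fraction_expressed, id_correspondence))
```

      A coefficient near zero indicates no linear relationship in the sample; it does not rule out effects at levels of mosaicism that were not sampled.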

      2. For the gene expression analysis (Fig 5), where was the intensity of the GFP extracted from? As it has no nuclear tag, the protein should be cytoplasmic (as seen in Fig 5a), but in Fig 5c it is shown as if the region of interest used to extract fluorescence was nuclear. If fluorescence was indeed extracted from the cytoplasm, then it will be helpful to include in the software and in the results description how this was done, as a huge hurdle in dissecting such multi-cell images is avoiding cross-reads between adjacent/intersecting neurons.

      For this work, we used nuclear-localized RFP co-expressed in the animal, and the GFP intensities were extracted from the same region from which the RFP intensities were extracted. If cytosolic reporters are used, a membrane label would presumably be necessary to discern the borders of the cells. We clarified our reagents and approach in the text: “The segmentation was done on the nuclear-localized mCherry signals, and GFP intensities were extracted from the same region.” (page 21).

      3. In the same matter: In the methods, it is specified that the strain expressing GCaMP was also used in the gene expression analysis shown in Figure 5. But the calcium indicator may show transient intensities depending on spontaneous neural activity during the imaging. This will introduce a significant variability that may affect the expression correlation analysis as depicted in Figure 5.

      We apologize for the error in the text. The strain used in the gene expression analysis did not express GCaMP. We did not analyze GCaMP expression in Figure 5. We have corrected the error in the methods.

      Reviewer #2 (Public Review):

      The authors succeed in generalizing the pre-alignment procedure for their cell identification method to allow it to work effectively on data with only small subsets of cells labeled. They convincingly show that their extension accurately identifies head angle, based on finding autofluorescent tissue and looking for a symmetric l/r axis. They demonstrate that the method works to identify known subsets of neurons with varying accuracy depending on the nature of underlying atlas data. Their approach should be a useful one for researchers wishing to identify subsets of head neurons in C. elegans, for example in whole brain recording, and the ideas might be useful elsewhere.

      The authors also strive to give some general insights on what makes a good atlas. It is interesting and valuable to see (at least for this specific set of neurons) that 5-10 ideal examples are sufficient. However, some critical details would help in understanding how far their insights generalize. I believe the set of neurons in each atlas version is matched to the known set of cells in the sparse neuronal marker; however, this critical detail isn't explicitly stated anywhere I can see.

      This is an important point. We have made text modifications to make it clear to the readers that for all atlases, the number of entities (candidate list) was kept consistent as listed in the methods. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subset-specific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (Page 20).

      In addition, it is stated that some neuron positions are missing in the neuropal data and replaced with the (single) position available from the open worm atlas. It should be stated how many neurons are missing and replaced in this way (providing weaker information).

      We modified the text in the result section as follows: “Eight out of 37 candidate neurons are missing in the neuroPAL atlas, which means 40% of the pairwise relationships of neurons expressing the glr-1p::NLS-mcherry transgene were not augmented with the NeuroPAL data but were assigned the default values from the OpenWorm atlas” (page 10).

      It also is not explicitly stated that the putative identities for the uncertain cells (designated with Greek letters) are used to sample the neuropal data. Large numbers of openworm single positions, or misidentified uncertain cells forcing alignment against the positions of nearby but different cells, would both handicap the neuropal atlas relative to the matched fluorescence atlas. This is an important question since sufficient performance from an ideal neuropal atlas (subsampled) would avoid the need for building custom atlases per strain.

      The putative identities are not used to sample the NeuroPAL data. They were used in the glr-1 multi-cell case to indicate low confidence in manual identification/annotation. For all steps of manual annotation and CRF_ID predictions, we used real neuron labels, and the Greek labels were used for reporting purposes only. It is true that the OpenWorm values (40% of the atlas) would be a handicap for the neuroPAL atlas. This is mainly due to the difficulty of obtaining NeuroPAL data as it requires 3-color fluorescence microscopy and significant time and labor to annotate the large set of neurons. This is one reason to take a complementary approach as we do in this paper.

      Reviewer #1 (Recommendations For The Authors):

      1. Figure 3, there is a confusion in the legend relating to panels c-e (e.g. panel c is neuron ID accuracy but it is described per panel e in the legend).

      We made the necessary changes.

      2. Figure 3, were statistical tests performed for panels d-e? If so, and the outcome was not significant, then it might be good to indicate this in the legend.

      We have added results of statistical tests in the legend as the following sentence: “All distributions in panel d and e had a p-value of less than 0.0001 for one sample t-test against zero.” One-sample t-tests were performed because what is plotted already represents each atlas’ differences to the glr-1 25 dataset atlas; we did not think that statistical analyses between the other atlases would add significant value.
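
      A one-sample t-test against zero asks whether the mean of a set of differences departs from zero. The sketch below computes only the t statistic, using hypothetical difference values rather than data from the manuscript; a p-value would additionally require the t distribution with n − 1 degrees of freedom, for example from a statistics package.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """t statistic for testing whether the mean of `values` differs from mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("t statistic is undefined when all values are equal")
    return (mean - mu0) / (sd / math.sqrt(n))

# Hypothetical accuracy differences relative to a reference atlas.
differences = [-0.12, -0.08, -0.15, -0.10, -0.09]
print(one_sample_t(differences))
```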

      3. Figure 4, no asterisks are shown in the figure so it is possible to remove the sentence in the legend describing what the asterisk stands for.

      Thank you. We made the necessary changes.

      Reviewer #2 (Recommendations For The Authors):

      Comparison with deep learning approaches could be more nuanced and structured. The authors' (prior) approach, extended here, combines a specific set of comparative relationship measurements with a general optimization approach for matching based on comparative expectations. Other measurements could be used, whether explicit (like neighbor expectations) or learned differences in embeddings. These alternate measurements would both need to be extensively re-calibrated for different sets of cells but might provide significant performance gains. In addition, deep learning approaches don't solve the optimization part of the matching problem, so the authors' approach seems to bring something strong to the table even if one is committed to learned methods (necessary, I suspect, for human-level performance in denser cell sets than the relatively small number here). A more complete discussion of these themes might better frame the impact of the work and help readers think about the advantages and disadvantages of different methods for their own data.

      We thank the reviewer for bringing up this point. We apologize for perhaps not making the point clearer in the original submission. This extension of the original work (Chaudhary et al) does not change the CRF-based framework, but only augments the approach with a better-defined set of axes (solely because in multi-cell, as opposed to whole-brain, datasets, the sparsity of neurons degrades the axis definition and consequently the neuron ID predictions). We are not fundamentally changing the framework, and therefore all the advantages (over registration-based approaches, for example) also apply here. The other purpose of this paper is to demonstrate a couple of use-cases for gene expression analysis, which is common in studies in C. elegans (and other organisms). We hope that by showing a use-case others can see how this approach is useful for their own applications.

      We have clarified these points in the paper (page 18). “The fundamental framework has not been changed from CRF_ID 1.0, and therefore the advantages of CRF_ID outlined in the original work apply for CRF_ID 2.0 as well.”

      The attribution of anatomical differences to strain is interesting, but seems purely speculative, and somewhat unlikely. I would suspect the fundamentally more difficult nature of aligning N items to M>>N items in an atlas accounts for the differences in using the neuroPAL vs custom atlas here. If this is what is meant, it could be stated more clearly.

      It is important to note that the same neuron candidate list (listed in methods) was used for all atlases, so there is no difference among the atlases in terms of the number of cells in the query vs. candidate list. In other words, the same values for M and for N are used regardless of the reference atlas used.

      We have preliminary data indicating differences between the NeuroPAL and custom atlas. For instance, the NeuroPAL atlas scales smaller than the custom glr-1 atlas. Since direct comparisons of the different atlases are beyond the scope of this paper, we will leave the exact comparisons for future work. We suspect that the differences are from a combination of differences in anatomy and imaging conditions. While NeuroPAL atlas may not be exactly fitting for the custom dataset, it can serve as a good starting point for guesses when no custom atlases are available, as we have discussed earlier (response to Public Comments from Reviewer 1 Point 1). As explained earlier, we have added these discussions in the paper (see page 18).

      I was also left wondering if the random removal of landmarks had to be adjusted in this work given it is (potentially) helping cope with not just occasional weak cells but the systematic loss of most of the cells in the atlas. If the parameters of this part of the algorithm don't influence the success for N to M>>N alignment (here when the neuroPAL or OpenWorm atlas is used) this seems interesting in itself and worth discussing. Conversely, if these parameters were optimized for the matched atlas and used for the others, this would seem to bias performance results.

      We may have failed to make this clear in the main text. As we have stated in our responses in the public review section, we do systematically limit the neuron labels in the candidate list to neurons that are known to be expressed by the promoter. The candidate list, which is kept consistent for all atlases, has more neurons than cells in the query, so it is always an N-to-M matching where M>N. We did not use landmarks, but such usage is possible and will only improve the matching.

      We have attempted to clarify these points in the manuscript. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subset-specific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (Page 20).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We thank the reviewers for the detailed assessment of our work as well as their praise and constructive feedback which helped us to significantly improve our manuscript.

      Reviewer #1 (Public Review):

      The inferior colliculus (IC) is the central auditory system's major hub. It integrates ascending brainstem signals to provide acoustic information to the auditory thalamus. The superficial layers of the IC ("shell" IC regions as defined in the current manuscript) also receive a massive descending projection from the auditory cortex. This auditory cortico-collicular pathway has long fascinated the hearing field, as it may provide a route to funnel "high-level" cortical signals and impart behavioral salience upon an otherwise behaviorally agnostic midbrain circuit.

      Accordingly, IC neurons can respond differently to the same sound depending on whether animals engage in a behavioral task (Ryan and Miller 1977; Ryan et al., 1984; Slee & David, 2015; Saderi et al., 2021; De Franceschi & Barkat, 2021). Many studies also report a rich variety of non-auditory responses in the IC, far beyond the simple acoustic responses one expects to find in a "low-level" region (Sakurai, 1990; Metzger et al., 2006; Porter et al., 2007). A tacit assumption is that the behaviorally relevant activity of IC neurons is inherited from the auditory cortico-collicular pathway. However, this assumption has never been tested, owing to two main limitations of past studies:

      (1) Prior studies could not confirm if data were obtained from IC neurons that receive monosynaptic input from the auditory cortex.

      (2) Many studies have tested how auditory cortical inactivation impacts IC neuron activity; the consequence of cortical silencing is sometimes quite modest. However, all prior inactivation studies were conducted in anesthetized or passively listening animals. These conditions may not fully engage the auditory cortico-collicular pathway. Moreover, the extent of cortical inactivation in prior studies was sometimes ambiguous, which complicates interpreting modest or negative results.

      Here, the authors' goal is to directly test if auditory cortex is necessary for behaviorally relevant activity in IC neurons. They conclude that, surprisingly, task-relevant activity in cortico-recipient IC neurons persists in the absence of auditory cortico-collicular transmission. To this end, a major strength of the paper is that the authors combine a sound-detection behavior with clever approaches that unambiguously overcome the limitations of past studies.

      First, the authors inject a transsynaptic virus into the auditory cortex, thereby expressing a genetically encoded calcium indicator in the auditory cortex's postsynaptic targets in the IC. This powerful approach enables 2-photon Ca2+ imaging from IC neurons that unambiguously receive monosynaptic input from auditory cortex. Thus, any effect of cortical silencing should be maximally observable in this neuronal population. Second, they abrogate auditory cortico-collicular transmission using lesions of auditory cortex. This "sledgehammer" approach is arguably the most direct test of whether cortico-recipient IC neurons will continue to encode task-relevant information in absence of descending feedback. Indeed, their method circumvents the known limitations of more modern optogenetic or chemogenetic silencing, e.g. variable efficacy.

      I also see three weaknesses which limit what we can learn from the authors' hard work, at least in the current form. I want to emphasize that these issues do not reflect any fatal flaw of the approach. Rather, I believe that their datasets likely contain the treasure-trove of knowledge required to completely support their claims.

      (1) The conclusion of this paper requires the following assumption to be true: That the difference in neural activity between Hit and Miss trials reflects "information beyond the physical attributes of sound." The data presentation complicates asserting this assumption. Specifically, they average fluorescence transients of all Hit and all Miss trials in their detection task. Yet, Figure 3B shows that mice's d' depends on sound level, and since this is a detection task the smaller d' at low SPLs presumably reflects lower Hit rates (and thus higher Miss rates). As currently written, it is not clear if fluorescence traces for Hits arise from trials where the sound cue was played at a higher sound level than on Miss trials. Thus, the difference in neural activity on Hit and Miss trials could indeed reflect mice's behavior (licking or not licking). But in principle could also be explained by higher sound-evoked spike rates on Hit compared to Miss trials, simply due to louder click sounds. Indeed, the amplitude and decay tau of their indicator GCaMP6f is non-linearly dependent on the number and rate of spikes (Chen et al., 2013), so this isn't an unreasonable concern.

      (2) The authors' central claim effectively rests upon two analyses in Figures 5 and 6. The spectral clustering algorithm of Figure 5 identifies 10 separate activity patterns in IC neurons of control and lesioned mice; most of these clusters show distinct activity on averaged Hit and Miss trials. They conclude that although the proportions of neurons from control and lesioned mice in certain clusters deviate from an expected 50/50 split, neurons from lesioned mice are still represented in all clusters. A significant issue here is that in addition to averaging all Hit and Miss trials together, the data from control and lesioned mice are lumped for the clustering. There is no direct comparison of neural activity between the two groups, so the reader must rely on interpreting a row of pie charts to assess the conclusion. It's unclear how similar task-relevant activity is between control and lesioned mice; we don't even have a ballpark estimate of how auditory cortex does or does not contribute to task-relevant activity. Although ideally the authors would have approached this by repeatedly imaging the same IC neurons before and after lesioning auditory cortex, this within-subjects design may be unfeasible if lesions interfere with task retention. Nevertheless, they have recordings from hundreds to thousands of neurons across two groups, so even a small effect should be observable in a between-groups comparison.

      (3) In Figure 6, the authors show that logistic regression models predict whether the trial is a Hit or Miss from their fluorescence data. Classification accuracy peaks rapidly following sound presentation, implying substantial information regarding mice's actions. The authors further show that classification accuracy is reduced, but still above chance, in mice with auditory cortical lesions. The authors conclude from this analysis that task-relevant activity persists in absence of auditory cortex. In principle I do not disagree with their conclusion.

      The weakness here is in the details. First, the reduction in classification accuracy of lesioned mice suggests that auditory cortex does nevertheless transmit some task relevant information, however minor it may be. I feel that as written, their narrative does not adequately highlight this finding. Rather one could argue that their results suggest redundant sources of task-relevant activity converging in the IC. Secondly, the authors conclude that decoding accuracy is impaired more in partially compared to fully lesioned mice. They admit that this conclusion is at face value counterintuitive, and provide compelling mechanistic arguments in the Discussion. However, aside from shaded 95% CIs, we have no estimate of variance in decoding accuracy across sessions or subjects for either control or lesioned mice. Thus we don't know if the small sample sizes of partial (n = 3) and full lesion (n = 4) groups adequately sample from the underlying population. Their result of Figure 6B may reflect spurious sampling from tail ends of the distributions, rather than a true non-monotonic effect of lesion size on task relevant activity in IC.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Besides filling in key information about how our original analysis aimed at minimizing any potential impact of differences in sound level distributions - namely that trials used for decoding were limited to a subset of sound levels - and which was accidentally omitted in the original manuscript, we have now carried out several additional analyses.

      We would like to highlight one of these because it supplements both the clustering and decoding analysis that we conducted to compare hit and miss trial activity, and directly addresses what the reviewer identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions) and the request for an analysis that operates at the level of single units rather than the population level. Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #2 (Public Review):

      Summary:

      This study takes a new approach to studying the role of corticofugal projections from auditory cortex to inferior colliculus. The authors performed two-photon imaging of cortico-recipient IC neurons during a click detection task in mice with and without lesions of auditory cortex. In both groups of animals, they observed similar task performance and relatively small differences in the encoding of task-response variables in the IC population. They conclude that non-cortical inputs to the IC can provide substantial task-related modulation, at least when AC is absent.

      Strengths:

      This study provides valuable new insight into big and challenging questions around top-down modulation of activity in the IC. The approach here is novel and appears to have been executed thoughtfully. Thus, it should be of interest to the community.

      Weaknesses:

      There are, however, substantial concerns about the interpretation of the findings and limitations to the current analysis. In particular, analysis of single-unit activity is absent, which makes the population clusters and decoding results harder to interpret. These concerns should be addressed to make sure that the results can be interpreted clearly in an active field that already contains a number of confusing and possibly contradictory findings.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Several additional analyses have now been carried out including ones that operate at the level of single units rather than the population level, as requested by the reviewer. We would like to briefly highlight one here because it supplements both the clustering and decoding analysis that we conducted to compare hit and miss trial activity and directly addresses what the other reviewers identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions). Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #3 (Public Review):

      Summary:

      This study aims to demonstrate that cortical feedback is not necessary to signal behavioral outcome to shell neurons of the inferior colliculus during a sound detection task. The demonstration is achieved by the observation of the activity of cortico-recipient neurons in animals which have received lesions of the auditory cortex. The experiment shows that neither behavior performance nor neuronal responses are significantly impacted by cortical lesions except for the case of partial lesions which seem to have a disruptive effect on behavioral outcome signaling.

      Strengths:

      The experimental procedure is based on state of the art methods. There is an in depth discussion of the different effects of auditory cortical lesions on sound detection behavior.

      Weaknesses:

      The analysis is not documented enough to be correctly evaluated. Have the authors pooled together trials with different sound levels for the key hit vs miss decoding/clustering analysis? If so, the conclusions are not well supported, as there are more misses for low sound levels, which would completely bias the outcome of the analysis. It is possible that the classification of hits versus misses actually only reflects a decoding of sound level based on sensory responses in the colliculus, and it would not be surprising then that in the presence or absence of cortical feedback, some neurons respond more to higher sound levels (hits) and less to lower sound levels (misses). It is important that the authors clarify and in any case perform an analysis in which the classification of hits vs misses is done only for the same sound levels. The description of feedback signals could be more detailed although it is difficult to achieve good temporal resolution with the calcium imaging technique necessary for targeting cortico-recipient neurons.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Besides filling in key information about how our original analysis aimed at minimizing any potential impact of differences in sound level distributions - namely that trials used for decoding were limited to a subset of sound levels - and which was accidentally omitted in the original manuscript, we have now carried out several additional analyses to directly address what the reviewer identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions). This includes an analysis in which we were able to demonstrate for one imaging session with a sufficiently large number of trials that limiting the trials entered into the decoding analysis to those from a single sound level did not meaningfully impact decoding accuracy. We would like to highlight another new analysis here because it supplements both the clustering and decoding analyses that we conducted to compare hit and miss trial activity and addresses the other reviewers’ request for an analysis that operates at the level of single units rather than the population level. Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #1 (Recommendations For The Authors):

      Thank you for the opportunity to read your paper. I think the conclusion is exciting. Indeed, you indicate that perhaps contrary to many of our (untested) assumptions, task-relevant activity in the IC may persist in absence of auditory cortex.

      As mentioned in my public review: Despite my interest in the work, I also think that there are several opportunities to significantly strengthen your conclusions. I feel this point is important because your work will likely guide the efforts of future students and post-docs working on this topic. The data can serve as a beacon to move the field away from the (somewhat naïve) idea that the evolved forebrain imparts behavioral relevance upon an otherwise uncivilized midbrain. This knowledge will inspire a search for alternative explanations. Indeed, although you don't highlight it in your narrative, your results dovetail nicely with several studies showing task-relevant activity in more ventral midbrain areas that project to the IC (e.g., pedunculopontine nuclei; see work from Hikosaka in monkeys, and more recently in mice from Karel Svoboda's lab).

      Thanks for the kind words.

      These studies, in particular the work by Inagaki et al. (2022) outlining how the transformation of an auditory go signal into movement could be mediated via a circuit involving the PPN/MRN (which might rely on the NLL for auditory input) and the motor thalamus, are indeed highly relevant.

      We made the following changes to the manuscript text.

      Line 472:”...or that the auditory midbrain, thalamus and cortex are bypassed entirely if simple acousticomotor transformations, such as licking a spout in response to a sound, are handled by circuits linking the auditory brainstem and motor thalamus via pedunculopontine and midbrain reticular nuclei (Inagaki et al., 2022).”

      The beauty of the eLife experiment is that you are free to incorporate or ignore these suggestions. After all, it's your paper, not mine. Nevertheless, I hope you find my comments useful.

      First, a few suggestions to address my three comments in the public review.

      Suggestion for public comment #1: An easy way to address this issue is to average the neural activity separately for each trial outcome at each sound level. That way you can measure if fluorescence amplitude (or integral) varies as a function of mice's action rather than sound level. This approach to data organization would also open the door to the additional analyses for addressing comment #2, such as directly comparing auditory and putatively non-auditory activity in neurons recorded from control and lesioned mice.

      We have carried out additional analyses for distinguishing between the two alternative explanations of the data put forward by the reviewer: That the difference in neural activity between hit and miss trials reflects a) behavior or b) sound level (more precisely: differences in response magnitude arising from a higher proportion of high-sound-level trials in the hit trial group than in the miss trial group). If the data favored b), we would expect no difference in activity between hit and miss trials when plotted separately for each sound level. The new Figure 4 - figure supplement 1 indicates that this is not the case. Hit and miss trial activity are clearly distinct even when plotted separately for different sound levels, confirming that this difference in activity reflects the animals’ behavior rather than sensory information.

      Changes to manuscript.

      Line 214: “While averaging across all neurons cannot capture the diversity of responses, the averaged response profiles suggest that it is mostly trial outcome rather than the acoustic stimulus and neuronal sensitivity to sound level that shapes those responses (Figure 4 – figure supplement 1).”

      Additionally, we assessed for each neuron separately whether there was a significant difference between hit and miss trial activity and therefore whether the activity of the neuron could be considered “task-modulated”. To achieve this, we used equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions and thus rule out any potential confound between sound level distributions and trial outcome. This analysis revealed that the proportion of task-modulated neurons was very high (close to 50%) and not significantly different between lesioned and non-lesioned mice (Figure 6 - figure supplement 3).

      Changes to the manuscript.

      Line 217: “Indeed, close to half (1272 / 2649) of all neurons showed a statistically significant difference in response magnitude between hit and miss trials…”

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”
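      To make the balancing step concrete, the following is a minimal sketch of how hit and miss trials could be matched at each sound level before testing a single neuron for task modulation. It is an illustration only, not the code used in the study: the function names, the `(sound_level, outcome, response)` trial layout and the permutation test on mean response magnitude are all assumptions, since the response above does not name the statistical test.

```python
import random
from collections import defaultdict
from statistics import mean


def balance_trials(trials, rng):
    """Subsample so every sound level contributes equal numbers of hits and misses.

    trials: list of (sound_level_db, outcome, response) tuples for one neuron,
    with outcome either "hit" or "miss" (hypothetical data layout).
    """
    by_level = defaultdict(lambda: {"hit": [], "miss": []})
    for level, outcome, response in trials:
        by_level[level][outcome].append(response)

    hits, misses = [], []
    for groups in by_level.values():
        # Keep only as many trials as the rarer outcome provides at this level.
        n = min(len(groups["hit"]), len(groups["miss"]))
        hits.extend(rng.sample(groups["hit"], n))
        misses.extend(rng.sample(groups["miss"], n))
    return hits, misses


def permutation_p_value(hits, misses, rng, n_perm=2000):
    """Two-sided permutation test on the difference in mean response.

    Assumption: labels are shuffled across the pooled, already balanced trials.
    """
    if not hits or not misses:
        raise ValueError("no balanced trials available")
    observed = abs(mean(hits) - mean(misses))
    pooled = hits + misses
    n_hits = len(hits)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_hits]) - mean(pooled[n_hits:])) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

      Because the two groups end up with identical sound level distributions, any remaining difference in response magnitude cannot be attributed to louder stimuli on hit trials. A neuron would count as task-modulated in this sketch if its p-value fell below a chosen threshold.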

      Differences in the distributions of sound levels in the different trial types could also potentially confound the decoding into hit and miss trials. Our original analysis was actually designed to take this into account but, unfortunately, we failed to include sufficient details in the methods section.

      Changes to the manuscript.

      Line 710: “Rather than including all the trials in a given session, only trials of intermediate difficulty were used for the decoding analysis. More specifically, we only included trials across five sound levels, comprising the lowest sound level that exceeded a d’ of 1.5 plus the two sound levels below and above that level. That ensured that differences in sound level distributions would be small, while still giving us a sufficient number of trials to perform the decoding analysis.”
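      The selection rule quoted above can be sketched as follows. This is a hedged illustration rather than the study's implementation: the helper names, the clipping of hit and false alarm rates (a common correction that the manuscript text does not specify) and the truncation of the window at the edges of the tested range are assumptions.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, n_signal, n_catch):
    """d' = z(hit rate) - z(false alarm rate).

    Rates are clipped to [1/(2N), 1 - 1/(2N)] so the z-scores stay finite
    (assumed correction, not taken from the manuscript).
    """
    def clip(rate, n):
        return min(max(rate, 1 / (2 * n)), 1 - 1 / (2 * n))

    z = NormalDist().inv_cdf
    return z(clip(hit_rate, n_signal)) - z(clip(false_alarm_rate, n_catch))


def select_decoding_levels(d_prime_by_level, threshold=1.5, flank=2):
    """Return the lowest level with d' above threshold plus `flank` levels on each side."""
    levels = sorted(d_prime_by_level)
    centre = next(
        (i for i, level in enumerate(levels) if d_prime_by_level[level] > threshold),
        None,
    )
    if centre is None:
        raise ValueError("no sound level exceeds the d' threshold")
    # Truncate at the edges of the tested range (assumption).
    return levels[max(centre - flank, 0):centre + flank + 1]
```

      With hypothetical d’ values that first exceed 1.5 at 65 dB SPL, `select_decoding_levels` would return the five levels from 59 to 71 dB SPL in 3 dB steps.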

      In this context, it is worth bearing in mind that a) the decoding analysis was done on a frame-by-frame basis, meaning that the decoding score achieved early in the trial has no impact on the decoding score at later time points in the trial, b) sound-driven activity predominantly occurs immediately after stimulus onset and is largely over about 1 s into the trial (see cluster 3, for instance, or average miss trial activity in Figure 4 – figure supplement 1), c) decoding performance of the behavioral outcome starts to plateau 500-1000 ms into the trial and remains high until it very gradually begins to decline after about 2 s into the trial. In other words, decoding performance remains high far longer than the stimulus would be expected to have an impact on the neurons’ activity. Therefore, we would expect any residual bias due to differences in the sound level distribution that our approach did not control for to be restricted to the very beginning of the trial and not to meaningfully impact the conclusions derived from the decoding analysis.

      Finally, we carried out an additional decoding analysis for one imaging session in which we had a sufficient number of trials to perform the analysis not only over the five (59, 62, 65, 68, 71 dB SPL) original sound levels, but also over a reduced range of three (62, 65, 68 dB SPL) sound levels, as well as a single (65 dB SPL) sound level (Figure 6 - figure supplement 1). The mean sound level differences between the hit trial distributions and miss trial distributions for these three conditions were 3.08, 1.01 and 0 dB, respectively. This analysis suggests that decoding performance is not meaningfully impacted by changing the range of sound levels (and sound level distributions), other than that including fewer sound levels means fewer trials and thus noisier decoding.

      Changes to manuscript.

      Line 287: ”...and was not meaningfully affected by differences in sound level distributions between hit and miss trials (Figure 6 – figure supplement 1).”
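      The mean sound level differences reported above (3.08, 1.01 and 0 dB) are simply the mean level of the hit trials minus the mean level of the miss trials within the chosen range. A minimal sketch of that calculation follows; the function name and the `(sound_level, outcome)` trial layout are hypothetical, and the snippet does not reproduce the session's actual trial counts.

```python
from statistics import mean


def mean_level_difference(trials, allowed_levels):
    """Mean sound level of hit trials minus that of miss trials, in dB.

    trials: list of (sound_level_db, outcome) with outcome "hit" or "miss".
    allowed_levels: the sound levels admitted to the decoding analysis.
    """
    allowed = set(allowed_levels)
    hit_levels = [lv for lv, outcome in trials if outcome == "hit" and lv in allowed]
    miss_levels = [lv for lv, outcome in trials if outcome == "miss" and lv in allowed]
    return mean(hit_levels) - mean(miss_levels)
```

      Narrowing the range shrinks the difference, and restricting the analysis to a single sound level makes it exactly zero, at the cost of fewer trials.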

      Suggestion for public comment #2: Perhaps a solution would be to display example neuron activity in each cluster, recorded in control and lesioned mice. The reader could then visually compare example data from the two groups, and immediately grasp the conclusion that task relevant activity remains in absence of auditory cortex. Additionally, one possibility might be to calculate the difference in neural activity between Hit and Miss trials for each task-modulated neuron. Then, you could compare these values for neurons recorded in control and lesion mice. I feel like this information would greatly add to our understanding of cortico-collicular processing.

      I would also argue that it's perhaps more informative to show one (or a few) example recordings rather than averaging across all cells in a cluster. Example cells would give the reader a better handle on the quality of the imaging, and this approach is more standard in the field. Finally, it would be useful to show the y axis calibration for each example trace (e.g. Figure 5 supp 1). That is also pretty standard so we can immediately grasp the magnitude of the recorded signal.

      We agree that while the information we provided shows that neurons from lesioned and non-lesioned groups are roughly equally represented across the clusters, it does not allow the reader to appreciate how similar the activity profiles of neurons are from each of the two groups. However, picking examples can be highly subjective and thus potentially open to bias. We therefore opted instead to display, separately for lesioned and non-lesioned mice, the peristimulus time histograms of all neurons in each cluster, as well as the cluster averages of the response profiles (Figure 5 - figure supplement 3). This, we believe, convincingly illustrates the close correspondence between neural activity in lesioned and non-lesioned mice across different clusters. All our existing and new figures indicate the response magnitude either on the figures’ y-axis or via scale/color bars.

      Changes to manuscript.

      Line 254: “Furthermore, there was a close correspondence between the cluster averages of lesioned and non-lesioned mice (Figure 5 – figure supplement 3).”

      Furthermore, we’ve now included a video of the imaging data which, we believe, gives the reader a much better handle on the data quality than further example response profiles would.

      Changes to manuscript.

      Line 197: ”...using two-photon microscopy (Figure 4B, Video 1).”

      Suggestion for public comment #3: In absence of laborious and costly follow-up experiments to boost the sample size of partial and complete lesion groups, it may be more prudent to simply tone down the claims that lesion size differentially impacts decoding accuracy. The results of this analysis are not necessary for your main claims.

      Our new results on the proportions of ‘task-modulated’ neurons (Figure 6 - figure supplement 3) across different experimental groups show that there is no difference between non-lesioned and lesioned mice as a whole, but mice with partial lesions have a smaller proportion of task-modulated neurons than the other two groups. While this corroborates the results of the decoding analysis, we certainly agree that the small sample size is a caveat that needs to be acknowledged.

      Changes to manuscript.

      Line 477: “Some differences were observed for mice with only partial lesions of the auditory cortex. Those mice had a lower proportion of neurons with distinct response magnitudes in hit and miss trials than mice with (near-)complete lesions. Furthermore, trial outcomes could be read out with lower accuracy from these mice. While this finding is somewhat counterintuitive and is based on only three mice with partial lesions, it has been observed before that smaller lesions…”

      A few more suggestions unrelated to public review:

      Figure 1: This is somewhat of an oddball in this manuscript, and its inclusion is not necessary for the main point. Indeed, the major conclusion of Fig 1 is that acute silencing of auditory cortex impairs task performance, and thus optogenetic methods are not suitable to test your hypothesis. However, this conclusion is also easily supported from decades of prior work, and thus citations might suffice.

      We do not agree that these data can easily be substituted with citations of prior published work. While previous studies (Talwar et al., 2001, Li et al., 2017) have demonstrated the impact of acute pharmacological silencing on sound detection in rodents, pharmacological and optogenetic silencing are not equivalent. Furthermore, we are aware of only one published study (Kato et al., 2015) that investigated the impact of optogenetically perturbing auditory cortex on sound detection (others have investigated its impact on discrimination tasks). Kato et al. (2015) examined the effect of acute optogenetic silencing of auditory cortex on the ability of mice to detect the offsets of very long (5-9 seconds) sounds, which is not easily comparable to the click detection task employed by us. Furthermore, when presenting our work at a recent meeting and leaving out the optogenetics results due to time constraints, audience members immediately enquired whether we had tried an optogenetic manipulation instead of lesions. Therefore, we believe that these data represent a valuable piece of information that will be appreciated by many readers and have decided not to remove them from the manuscript.

      A worst case scenario is that Figure 1 will detract from the reader's assessment of experimental rigor. The data of 1C are pooled from multiple sessions in three mice. It is not clear if the signed-rank test compares performance across n = 3 mice or n = 13 sessions. If the latter, a stats nitpicker could argue that the significance might not hold up with a nested analysis considering that some datapoints are not independent of one another. Finally, the experiment does not include a control group, gad2-cre mice injected with a EYFP virus. So as presented, the data are equally compatible with the pessimistic conclusion that shining light into the brain impairs mice's licking. My suggestion is to simply remove Figure 1 from the paper. Starting off with Figure 3 would be stronger, as the rest of the study hinges upon the knowledge that control and lesion mice's behavior is similar.

      Instead of reporting the results session-wise and doing stats on the d’ values, we now report results per mouse and perform stats on the proportions of hits and false alarms separately for each mouse. The results are statistically significant for each mouse and suggest that the differences in d’ are primarily caused by higher false alarm rates during the optogenetic perturbation than in the control condition.

      Changes to manuscript.

      New Figure 1.

      We agree that including control mice not expressing ChR2 would be important for fully characterizing the optogenetic manipulation and that the lack of this control group should be acknowledged. However, in the context of this study, the outcome of performing this additional experiment would be inconsequential. We originally considered using an optogenetic approach to explore the contribution of cortical activity to IC responses, but found that this altered the animals’ sound detection behavior. Whether that change in behavior is due to activation of the opsin or simply due to light being shone on the brain has no bearing on the conclusion that this type of manipulation is unsuitable for determining whether auditory cortex is required for the choice-related activity that we recorded in the IC.

      Changes to manuscript.

      Line 106: “Although a control group in which the auditory cortex was injected with an EYFP virus lacking ChR2 would be required to confirm that the altered behavior results from an opsin-dependent perturbation of cortical activity, this result shows that this manipulation is also unsuitable…”

      Figure 2, comment #1: The micrograph of panel B shows the densest fluorescence in the central IC. You interpret this as evidence of retrograde labeling of central IC neurons that project to the shell IC. This is a nice finding, but perhaps a more relevant micrograph would be to show the actual injection site in the shell layers. The rest of Figure 2 documents the non-auditory cortical sources of forebrain feedback. Since non-auditory cortical neurons may or may not target distinct shell IC sub-circuits, it's important to know where the retrograde virus was injected. Stylistic comment: The flow of the panels is somewhat unorthodox. Panel A and B follow horizontally, then C and D follow vertically, followed by E-H in a separate column. Consider sequencing either horizontally or vertically to maximize the reader's experience.

      Figure 2, comment # 2: It would also be useful to show more rostral sections from these mice, perhaps as a figure supplement, if you have the data. I think there is a lot of value here given a recent paper (Olthof et al., 2019 Jneuro) arguing that the IC receives corticofugal input from areas more rostral to the auditory cortex. So it would be beneficial for the field to know if these other cortical sources do or do not represent likely candidates for behavioral modulation in absence of auditory cortex.

      Figure 2, comment #3: You have a striking cluster of retrogradely labeled PPC neurons, and I'm not sure PPC has been consistently reported as targeting the IC. It would be good to confirm that this is a "true" IC projection as opposed to viral leakage into the SC. Indeed, Figure 2, supplement 2 also shows some visual cortex neurons that are retrogradely labeled. This has bearing on the interpretations, because choice-related activity is rampant in PPC, and thus could be a potential source of the task relevant activity that persists in your recordings. This could be addressed as the point above, by showing the SC sections from these same mice.

      All IC injections were made under visual guidance with the surface of the IC and adjacent brain areas fully exposed after removal of the imaging window. Targeting the IC and steering clear of surrounding structures, including the SC, was therefore relatively straightforward.

      We typically observed strong retrograde labeling in the central nucleus after viral injections into the dorsal IC and, given the moderate injection volume (~50 nL at each of up to three sites), it was also typical to see spatially fairly confined labeling at the injection sites. For the mouse shown in Figure 2, we do not have further images of the IC. This was one of the earliest mice to be included in the study and we did not have access to an automatic slide scanner at the time. We had to acquire confocal images in a ‘manual’ and very time-consuming manner and therefore did not take further IC images for this mouse. We have now included, however, a set of images spanning the whole IC and the adjacent SC sections for the mouse for which we already show sections in Figure 2 - figure supplement 2. These were added as Figure 2 - figure supplement 3A to the manuscript. These images show that the injections were located in the caudal half of the IC and that there was no spillover into the SC - close inspection of those sections did not reveal any labeled cell bodies in the SC. Furthermore, we include as Figure 2 - figure supplement 3B a dozen additional rostral cortical sections of the same mouse illustrating corticocollicular neurons in regions spanning visual, parietal, somatosensory and motor cortex. Given the inclusion of the IC micrographs in the new supplementary figure, we removed panel B from Figure 2. This should also make it easier for the reader to follow the sequencing of the remaining panels.

      Changes to manuscript.

      New Figure 2 - figure supplement 3.

      Line 159: “After the experiments, we injected a retrogradely-transported viral tracer (rAAV2-retro-tdTomato) into the right IC to determine whether any corticocollicular neurons remained after the auditory cortex lesions (Figure 2, Figure 2 – figure supplement 2, Figure 2 – figure supplement 3). The presence of retrogradely-labeled corticocollicular neurons in non-temporal cortical areas (Figure 2) was not the result of viral leakage from the dorsal IC injection sites into the superior colliculus (Figure 2 – figure supplement 3).”

      Line 495: “...projections to the IC, such as those originating from somatosensory cortical areas (Lohse et al., 2021; Lesicko et al., 2016) and parietal cortex may have contributed to the response profiles that we observed.”

      Figure 5 (see also public review point #2): I am not convinced that this unsupervised method yields particularly meaningful clusters; a grain of salt should be provided to the reader. For example, Clusters 2, 5, 6, and 7 contain neurons that pretty clearly respond with either short latency excitation or inhibition following the click sound on Hits. I would argue that neurons with such diametrically opposite responses should not be "classified" together. You can see the same issue in some of Namboodiri/Stuber's clustering (their Figure 1). It might be useful to make it clear to the reader that these clusters can reflect idiosyncrasies of the algorithm, the behavior task structure, or both.

      We agree.

      Changes to manuscript.

      Line 666: “While clustering is a useful approach for organizing and visualizing the activity of large and heterogeneous populations of neurons, we need to be mindful that, given continuous distributions of response properties, the locations of cluster boundaries can be somewhat arbitrary and/or reflect idiosyncrasies of the chosen method and thus vary from one algorithm to another. We employed an approach very similar to that described in Namboodiri et al. (2019) because it is thought to produce stable results in high-dimensional neural data (Hirokawa et al. 2019).”

      Methods:

      How was a "false alarm" defined? Is it any lick happening during the entire catch trial, or only during the time period corresponding to the response window on stimulus trials?

      The response window was identical for catch and stimulus trials and a false alarm was defined as licking during the response window of a catch trial.

      Changes to manuscript.

      Line 598: “During catch trials, neither licking (‘false alarm’) during the 1.5-second response window …”
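
      The trial definitions above can be sketched in a few lines. This is only an illustrative sketch, not the code used in the study: the function names are hypothetical, lick times are assumed to be given in seconds relative to the start of the response window, and the clipping of extreme rates before computing d’ is an assumption rather than something stated in the text.

```python
from statistics import NormalDist

RESPONSE_WINDOW_S = 1.5  # response window length quoted in the text


def classify_trial(is_catch, lick_times_s, window_s=RESPONSE_WINDOW_S):
    """Label one trial from its lick times (seconds from window start; assumed)."""
    licked = any(0.0 <= t < window_s for t in lick_times_s)
    if is_catch:
        return "false_alarm" if licked else "correct_rejection"
    return "hit" if licked else "miss"


def d_prime(outcomes):
    """Standard d' = z(hit rate) - z(false alarm rate)."""
    n_stim = sum(o in ("hit", "miss") for o in outcomes)
    n_catch = sum(o in ("false_alarm", "correct_rejection") for o in outcomes)
    hit_rate = outcomes.count("hit") / n_stim
    fa_rate = outcomes.count("false_alarm") / n_catch

    def clip(p, n):
        # Keep rates away from 0 and 1 so the inverse normal CDF stays finite
        # (this correction is an assumption, not taken from the text).
        return min(max(p, 0.5 / n), 1.0 - 0.5 / n)

    z = NormalDist().inv_cdf
    return z(clip(hit_rate, n_stim)) - z(clip(fa_rate, n_catch))
```

      With this definition, a lick at 2.0 s in a catch trial falls outside the response window and the trial counts as a correct rejection rather than a false alarm.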

      L597 and so forth: What's the denominator in the conversion from the raw fluorescence traces into DF/F? Did you take the median or mode fluorescence across a chunk of time? Baseline subtract average fluorescence prior to click onset? Similarly, please provide some more clarification as to how neuropil subtraction was achieved. This information will help us understand how the classifier can decode trial outcome from data prior to sound onset.

      Signal processing did not involve the subtraction of a pre-stimulus period.

      Changes to manuscript.

      Line 629: “Neuropil extraction was performed using default suite2p parameters (https://suite2p.readthedocs.io/en/latest/settings.html), neuropil correction was done using a coefficient of 0.7, and calcium ΔF/F signals were obtained by using the median over the entire fluorescence trace as F0. To remove slow fluctuations in the signal, a baseline of each neuron’s entire trace was calculated by Gaussian filtering in addition to minimum and maximum filtering using default suite2p parameters. This baseline was then subtracted from the signal.”
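
      As a minimal sketch of the two steps that are fully specified above (neuropil correction with a coefficient of 0.7 and ΔF/F with the whole-trace median as F0), the computation could look as follows. The function and argument names are hypothetical, the code does not call suite2p, and the slow-baseline subtraction is left out because its exact order relative to the ΔF/F step is not stated.

```python
from statistics import median


def delta_f_over_f(f_roi, f_neuropil, coefficient=0.7):
    """Neuropil-correct a raw trace, then express it relative to its median."""
    # Subtract the scaled neuropil signal from the ROI signal, frame by frame.
    corrected = [f - coefficient * n for f, n in zip(f_roi, f_neuropil)]
    # F0 is the median over the entire corrected trace, so no pre-stimulus
    # period is used as a reference.
    f0 = median(corrected)
    return [(f - f0) / f0 for f in corrected]
```

      Because F0 is taken over the whole trace rather than over a pre-stimulus window, activity before sound onset is not forced to zero on every trial.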

      Was the experimenter blinded to the treatment group during the behavior experiments? If not, were there issues that precluded blinding (limited staffing owing to lab capacity restrictions during the pandemic)? This is important to clarify for the sake of rigor and reproducibility.

      Changes to manuscript.

      Line 574: “The experimenters were not blinded to the treatment group, i.e. lesioned or non-lesioned, but they were blind to the lesion size both during the behavior experiments and most of the data processing.”

      Minor:

      L127-128: "In order to test...lesioned the auditory cortex bilaterally in 7 out of 16 animals". I would clarify this by changing the word animals to "mice" and 7 out of 16 by stating n = 9 and n = 7 are control and lesion groups, respectively.

      Agreed.

      Changes to manuscript.

      Line 129: “...compared the performance of mice with bilateral lesions of the auditory cortex (n = 7) with non-lesioned controls (n = 9)”

      L225-226: You rule out self-generated sounds as a likely source of behavioral modulation by citing Nate Sawtell's paper in the DCN. However, Stephen David's lab suggested that in marmosets, post sound activity in central IC may in fact reflect self-generated sounds during licking. I suggest addressing this with a nod to SVD's work (Singla et al., 2017; but see Shaheen et al., 2021).

      Agreed.

      Changes to manuscript.

      Line 243: “(Singla et al., 2017; but see Shaheen et al., 2021)”

      Line 238 - 239: You state that proportions only deviate greater than 10% for one of the four statistically significant clusters. Something must be unclear here because I don't understand: The delta between the groups in the significant clusters of Fig 5C is (from left to right) 20%, 20%, 38%, and 12%. Please clarify.

      Our wording was meant to convey that a deviation “from a 50/50 split” of 10% means that each side deviates from 50 by 10% resulting in a 40/60 (or 60/40) split. We agree that that has the potential to confuse readers and is not as clear as it could be and have therefore dropped the ambiguous wording.

      Changes to manuscript.

      Line 253: “...the difference between the groups was greater than 20% for only one of them.”

      L445: I looked at the cited Allen experiment; I'd be cautious with the interpretation here. A monosynaptic IC->striatum projection is news to me. I think Allen Institute used an AAV1-EGFP virus for these experiments, no? As you know, AAV1 is quite transsynaptic. The labeled fibers in striatum of that experiment may reflect disynaptic labeling of MGB neurons (which do project to striatum).

      Agreed. We deleted the reference to this Allen experiment.

      L650: Please define "network activity". Is this the fluorescence value for each ROI on each frame of each trial? Averaged fluorescence of each ROI per frame? Total frame fluorescence including neuropil? Depending on who you ask, each of these measures provides some meaningful readout of network activity, so clarification would be useful.

      Changes to manuscript.

      Line 707: “Logistic regression models were trained on the network activity of each session, i.e., the ΔF/F values of all ROIs in each session, to classify hit vs miss trials. This was done on a frame-by-frame basis, meaning that each time point (frame) of each session was trained separately.”
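
      The frame-by-frame idea can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the plain gradient-descent logistic regression stands in for whatever implementation was actually used, the even/odd train/test split is a placeholder for the real validation scheme, and the synthetic data and all names are hypothetical.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (no regularization)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # keep exp() well inside float range
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
        b -= lr * grad_b / len(X)
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
    return w, b


def decode_per_frame(activity, labels, train_idx, test_idx):
    """activity[trial][roi][frame] -> held-out accuracy for each frame."""
    scores = []
    for f in range(len(activity[0][0])):
        X = [[roi[f] for roi in trial] for trial in activity]
        # A separate model is fitted for every frame.
        w, b = fit_logistic([X[i] for i in train_idx], [labels[i] for i in train_idx])
        n_correct = sum(
            (b + sum(wj * xj for wj, xj in zip(w, X[i])) > 0) == bool(labels[i])
            for i in test_idx
        )
        scores.append(n_correct / len(test_idx))
    return scores


# Synthetic demo: 40 trials, 3 ROIs, 2 frames. Frame 0 is pure noise;
# in frame 1 the first ROI separates hit (1) from miss (0) trials.
rng = random.Random(0)
labels = [(i // 2) % 2 for i in range(40)]


def make_trial(label):
    signal = 2.0 if label else -2.0
    rois = []
    for r in range(3):
        frame0 = rng.gauss(0, 0.5)  # uninformative frame
        frame1 = signal + rng.gauss(0, 0.1) if r == 0 else rng.gauss(0, 0.5)
        rois.append([frame0, frame1])
    return rois


activity = [make_trial(lab) for lab in labels]
train_idx = list(range(0, 40, 2))
test_idx = list(range(1, 40, 2))
scores = decode_per_frame(activity, labels, train_idx, test_idx)
```

      Because each frame gets its own model, the score at one time point cannot carry over to any other time point.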

      Figure 3 narrative or legend: Listing the F values for the anova would be useful. There is pretty clearly a main effect of training session for hits, but what about for the false alarms? That information is important to solidify the result, and would help more specialized readers interpret the d-prime plot in this figure.

      Agreed. There were significant main effects of training day for both hit rates and false alarm rates (as well as d’).

      Changes to manuscript.

      Line 165: “The ability of the mice to learn and perform the click detection task was evident in increasing hit rates and decreasing false alarm rates across training days (Figure 3A, p < 0.01, mixed-design ANOVAs).”

      In summary, thank you for undertaking this work. Your conclusions are provocative, and thus will likely influence the field's direction for years to come.

      Thank you for those kind words and valuable and constructive feedback, which has certainly improved the manuscript.

      Reviewer #2 (Recommendations For The Authors):

      MAJOR CONCERNS

      (1) (Fig. 5) What fraction of individual neurons actually encode task-related information in each animal group? How many neurons respond to sound? The clustering and decoding analyses are interesting, but they obscure these simple questions, which get more directly at the main questions of the study. Suggested approach: For a direct comparison of AC-lesioned and -non-lesioned animals, why not simply compare the mean difference between PSTH response for each neuron individually? To test for trial outcome effects, compare Hit and Miss trials (same stimulus, different behavior) and for sound response effects, compare Hit and False alarm trials (same behavior, different response). How do you align for time in the latter case when there's no stimulus? Align to the first lick event. The authors should include this analysis or explain why their approach of jumping right to analysis of clusters is justified.

      We have now calculated the fraction of neurons that encode trial outcome by comparing hit and miss trial activity. That fraction does not differ between non-lesioned animals and lesioned animals as a whole, but is significantly smaller in mice with partial lesions. The reviewer’s suggestion of comparing hit and false alarm trial activity to assess sound responsiveness is problematic because hit trials involve reward delivery and consumption. Consequently, they are behaviorally very different from false alarm trials (not least because hit trials tend to contain much more licking). Therefore, we calculated the fraction of neurons that respond to the acoustic stimulus by comparing activity before and after stimulus onset in miss trials. We found no significant difference between the non-lesioned and lesioned mice or between subgroups.

      We have addressed these points with the following changes to the manuscript:

      Line 217: “Indeed, close to half (1272 / 2649) of all neurons showed a statistically significant difference in response magnitude between hit and miss trials, while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”

      Line 648: “Analysis of task-modulated and sound-driven neurons. To identify individual neurons that produced significantly different response magnitudes in hit and miss trials, we calculated the mean activity for each stimulus trial by taking the mean activity over the 5 seconds following stimulus presentation and subtracting the mean activity over the 2 seconds preceding the stimulus during that same trial. A Mann-Whitney U test was then performed to assess whether a neuron showed a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) in response magnitude between hit and miss trials. The analysis was performed using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions. If, for a given sound level, there were more hit than miss trials, we randomly selected a sample of hit trials (without substitution) to match the sample size for the miss trials and vice versa. Sound-driven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation.”
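
      The supporting steps of this analysis (baseline-subtracted response magnitude, balancing hit and miss trials per sound level, and the Benjamini-Hochberg correction) can be sketched as below. This is an illustrative sketch under stated assumptions: all function names are hypothetical, the frame rate is a free parameter because it is not given here, and the Mann-Whitney U test itself is left to any standard statistics package.

```python
import random
from collections import defaultdict


def response_magnitude(trace, stim_frame, frame_rate_hz, post_s=5.0, pre_s=2.0):
    """Mean activity after the stimulus minus mean activity before it."""
    pre = trace[stim_frame - int(pre_s * frame_rate_hz):stim_frame]
    post = trace[stim_frame:stim_frame + int(post_s * frame_rate_hz)]
    return sum(post) / len(post) - sum(pre) / len(pre)


def balance_trials(hit_levels, miss_levels, rng):
    """Return hit and miss trial indices with equal counts at each sound level."""
    by_level = defaultdict(lambda: ([], []))
    for i, level in enumerate(hit_levels):
        by_level[level][0].append(i)
    for i, level in enumerate(miss_levels):
        by_level[level][1].append(i)
    hit_keep, miss_keep = [], []
    for hits, misses in by_level.values():
        n = min(len(hits), len(misses))
        hit_keep += rng.sample(hits, n)    # each trial is drawn at most once
        miss_keep += rng.sample(misses, n)
    return sorted(hit_keep), sorted(miss_keep)


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: True where a test is significant at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        significant[i] = rank <= cutoff_rank
    return significant
```

      Balancing per sound level means that any remaining difference between hit and miss trials cannot be attributed to one trial type containing louder stimuli on average.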

      Some more specific concerns about focusing only on cluster-level and population decoding analysis are included below.

      (2) (L 234) "larger field of view". Do task-related or lesion-dependent effects depend on the subregion of IC imaged? Some anatomists would argue that the IC shell is not a uniform structure, and concomitantly, task-related effects may differ between fields. Did coverage of IC subregions differ between experimental groups? Is there any difference in task related effects between subregions of IC? Or maybe all this work was carried out only in the dorsal area? The differences between lesioned and non-lesioned animals are relatively small, so this may not have a huge impact, but a more nuanced discussion that accounts for observed or potential (if not tested) differences between regions of the IC.

      The specific subregion coverage could also impact the decoding analysis (Fig 6), and if possible it might be worth considering an interaction between field of view and lesion size on decoding.

      Each day we chose a new imaging location to avoid recording the same neurons more than once and aimed to sample widely across the optically accessible surface of the IC. We typically stopped the experiment only when there were no more new areas to record from. In terms of the depth of the imaged neurons, we were limited by the fact that corticorecipient neurons become sparser with depth and that the signal available from the GCaMP6f labeling of the Ai95 mice becomes rapidly weaker with increasing distance from the surface. This meant that we recorded no deeper than 150 µm from the surface of the IC. Consequently, while there may have been some variability in the average rostrocaudal and mediolateral positioning of imaging locations from animal to animal due to differences between mice in how much of the IC surface was visible, cranial window positioning, and in neuronal labeling etc, our dataset is anatomically uniform in that all recorded neurons receive input from the auditory cortex and are located within 150 µm of the surface of the IC. Therefore, we think it highly unlikely that small sampling differences across animals could have a meaningful impact on the results.

      Given that there is no consensus as to where the border between the dorsal and external/lateral cortices of the IC is located and that it is typically difficult to find reliable anatomical reference points (the location of the borders between the IC and surrounding structures is not always obvious during imaging, i.e. a transition from a labeled area to a dark area near the edge of the cranial window could indicate a border with another structure, but also the IC surface sloping away from the window or simply an unlabeled area within the IC), we made no attempt to assign our recordings from corticorecipient neurons to specific subdivisions of the IC.

      Changes to manuscript.

      Line 195: “We then proceeded to record the activity of corticorecipient neurons within about 150 µm of the dorsal surface of the IC using two-photon microscopy (Figure 4B, Video 1).”

      Line 375: “We imaged across the optically accessible dorsal surface of the IC down to a depth of about 150 µm below the surface. Consequently, the neurons we recorded were located predominantly in the dorsal cortex. However, identifying the borders between different subdivisions of the IC is not straightforward and we cannot rule out the possibility that some were located in the lateral cortex.”

      (3) (L 482-483) "auditory cortex is not required for the task-related activity recording in IC neurons of mice performing a sound detection task". Most places in the text are clearer, but this statement is confusing. Yes, animals with lesions can have a "normal"-looking IC, but does that mean that AC does not strongly modulate IC during this behavior in normal animals? The authors have shown convincingly that subcortical areas can both shape behavior and modulate IC normally, but AC may still be required for IC modulation in non-lesioned animals. Given the complexity of this system, the authors should make sure they summarize their results consistently and clearly throughout the manuscript.

      The reviewer raises an important point. What we have shown is that corticorecipient dorsal IC neurons in mice without auditory cortex show neural activity during a sound detection task that is largely indistinguishable from the activity of mice with an intact auditory cortex. In lesioned mice, the auditory cortex is thus not required. Whether the IC activity of the non-lesioned group can be shaped by input from the auditory cortex in a meaningful way in other contexts, such as during learning, is a question that our data cannot answer.

      Changes to manuscript.

      Line 508: "While modulation of IC activity by this descending projection has been implicated in various functions, most notably in the plasticity of auditory processing, we have shown in mice performing a sound detection task that IC neurons show task-related activity in the absence of auditory cortical input."

      LESSER CONCERNS

      (L. 106-107) "Optogenetic suppression of cortical activity is thus also unsuitable..." It appears that behavior is not completely abolished by the suppression. One could also imagine using a lower dose of muscimol for partial inactivation of AC feedback. When some behavior persists, it does seem possible to measure task-related changes in the IC. This may not be necessary for the current study, but the authors should consider how these transient methods could be applied usefully in the Discussion. What about inactivation of cortical terminals in the IC? Is that feasible?

      Our argument is not that acute manipulations are unsuitable because they completely abolish the behavior, but because they significantly alter the behavior. Although it would not be trivial to precisely measure the extent of pharmacological cortical silencing in behaving mice that have been fitted with a midbrain window, it should be possible to titrate the size of a muscimol injection to achieve partial silencing of the auditory cortex that does not fully abolish the ability to detect sounds. However, such an outcome would likely render the data uninterpretable. If no effect on IC activity was observed, it would not be possible to conclude whether this was due to the fact that the auditory cortex was only partially silenced or that projections from the auditory cortex have no influence on the recorded IC activity. Similarly, if IC activity was altered, it would not be possible to say whether this was due to altered descending modulation resulting from the (partially) silenced auditory cortex or to the change in behavior, which would likely be reflected in the choice-related activity measured in the IC.

      Silencing of corticocollicular axons in the IC is potentially a more promising approach and we did devote a considerable amount of time and effort to establishing a method that would allow us to simultaneously image IC neurons while silencing corticocollicular axons, trying both eNpHR3.0 and Jaws with different viral labeling approaches and mouse lines. However, we ultimately abandoned those attempts because we were not convinced that we had achieved sufficient silencing or that we would be able to convincingly verify this. Furthermore, axonal silencing comes with its own pitfalls and the interpretation of its consequences is not straightforward. Given that our discussion already contains a section (line 421) on axonal silencing, we do not feel there would be any benefit in adding to that.

      (Figure 1). Can the authors break down the performance for FA and HR, as they do in Fig. 3? It would be helpful to know what aspect of behavior is impaired by the transient inactivation.

      Good point. Figure 1 has been updated to show the results separately for hit rates, false alarms and d’. The new figure indicates that the change in d’ is primarily a consequence of altered false alarm rates. Please also see our response to a related comment by reviewer #1.

      Changes to manuscript.

      New figure 1.

      (Figure 4 legend). Minor: Please clarify, what is time 0 in panel C? Time of click presentation?

      Yes, that is correct.

      Changes to manuscript.

      Line 209: “Vertical line at time 0 s indicates time of click presentation.”

      (L. 228-229). There has been a report of lick and other motor related activity in the IC - e.g., see Shaheen, Slee et al. (J Neurosci 2021), the timing of which suggests that some of it may be acoustically driven.

      Thanks for pointing this out. Shaheen et al., 2021 should certainly have been cited by us in this context as well as in other parts of the manuscript.

      Changes to manuscript.

      Line 243: “(Singla et al., 2017; but see Shaheen et al., 2021)”

      Also, have the authors considered measuring a peri-lick response? The difference between hit and miss trials could be perceptual or it could reflect differences in motor activity. This may be hard to tease apart, but, for example, one can test whether activity is stronger on trials with many licks vs. few licks?

      (L. 261) "Behavior can be decoded..." similar or alternative to the previous question of evoked activity, can you decode lick events from the population activity?

      The difference between hit and miss trial activity almost certainly partially reflects motor activity associated with licking. This was stated in the Discussion, but to make that point more explicitly, we now include a plot of average false alarm trial activity, i.e. trials without sound (catch trials) in which animals licked (but did not receive a reward).

      Given a sufficient number of catch trials, it should be possible to decode false alarm and correct rejection trials. However, our experiment was not designed with that in mind and contains a much smaller number of catch trials than stimulus trials (approximately one tenth the number of stimulus trials), so we have not attempted this.

      Changes to manuscript.

      New Figure 4 - figure supplement 1.

      (L. 315) "Pre-stimulus activity..." Given reports of changes in activity related to pupil-indexed arousal in the auditory system, do the authors by any chance have information about pupil size in these datasets?

      Given that all recordings were performed in the dark, fluctuations in pupil diameter were relatively small. Therefore, we have not made any attempt to relate pupil diameter to any of the variables assessed in this manuscript.

      (L. 412) "abolishes sound detection". While not exactly the same task, the authors might comment on Gimenez et al (J Neurophys 2015) which argued that temporary or permanent lesioning of AC did not impair tone discrimination. More generally, there seems to be some disagreement about what effects AC lesions have on auditory behavior.

      Thank you for this suggestion. Gimenez et al. (2015) investigated the ability of freely moving rats to discriminate sounds (and, in addition, how they adapt to changes in the discrimination boundary). Broadly consistent with later reports by Ceballo et al. (2019) (mild impairment) and O’Sullivan et al. (2019) (no impairment), Gimenez et al. (2015) reported that discrimination performance is mildly impaired after lesioning auditory cortex. Where the results of Gimenez et al. (2015) stand out is in the comparatively mild impairments that were seen in their task when they used muscimol injections, which contrast with the (much) larger impairments reported by others (e.g. Talwar et al., 2001; Li et al., 2017; Jaramillo and Zador, 2014).

      Changes to manuscript.

      Line 433: “However, transient pharmacological silencing of the auditory cortex in freely moving rats (Talwar et al., 2001), as well as head-fixed mice (Li et al., 2017), completely abolishes sound detection (but see Gimenez et al., 2015).”

      (L. 649) "... were generally separable" Is the claim here that the clusters are really distinct from each other? This is unexpected, and it might be helpful if the authors could show this result in a figure.

      The half-sentence that this comment refers to has been removed from the methods section. Please also see a related comment by reviewer #1 which prompted us to add the following to the methods section.

      Changes to manuscript.

      Line 666: “While clustering is a useful approach for organizing and visualizing the activity of large and heterogeneous populations of neurons we need to be mindful that, given continuous distributions of response properties, the locations of cluster boundaries can be somewhat arbitrary and/or reflect idiosyncrasies of the chosen method and thus vary from one algorithm to another. We employed an approach very similar to that described in Namboodiri et al. (2019) because it is thought to produce stable results in high-dimensional neural data (Hirokawa et al. 2019).”

      Reviewer #3 (Recommendations For The Authors):

      (1) The authors must absolutely clarify if the hit versus misses decoding and clustering analysis is done for a single sound level or for multiple sound levels (what is the fraction of trials for each sound level?). If the authors did it for multiple sound levels they should redo all analyses sound-level by sound-level, or for a single sound level if there is one that dominates. No doubt that there is information about the trial outcome in IC, but it should not be over-estimated by a confound with stimulus information.

      This is an important point. The original clustering analysis was carried out across different sound levels. We have now carried out additional analysis for distinguishing between two alternative explanations of the data, which were also raised by reviewer #1: that the difference in neural activity between hit and miss trials could reflect a) the animals’ behavior or b) relatively more hit trials at higher sound levels, which would be expected to produce stronger responses. If the data favored b), we would expect no difference in activity between hit and miss trials when plotted separately for different sound levels. The new figure 4 - figure supplement 1 indicates that this is not the case. Hit and miss trial activity are clearly distinct even when plotted separately for different sound levels, confirming that this difference in activity reflects the animals’ behavior rather than sensory information.

      We made the following changes to the manuscript.

      Line 214: “While averaging across all neurons cannot capture the diversity of responses, the averaged response profiles suggest that it is mostly trial outcome rather than the acoustic stimulus and neuronal sensitivity to sound level that shapes those responses (Figure 4 – figure supplement 1).”

      Differences in the distributions of sound levels in the different trial types could also potentially confound the decoding into hit and miss trials. Our analysis actually aimed to take this into account but, unfortunately, we failed to include sufficient details in the methods section.

      Changes to manuscript.

      Line 710: “Rather than including all the trials in a given session, only trials of intermediate difficulty were used for the decoding analysis. More specifically, we only included trials across five sound levels, comprising the lowest sound level that exceeded a d’ of 1.5 plus the two sound levels below and above that level. That ensured that differences in sound level distributions would be small, while still giving us a sufficient number of trials to perform the decoding analysis.”
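
      The selection rule described above can be written down compactly. The following is a minimal sketch with hypothetical names; it assumes that one d’ value is available per tested sound level and that "below and above" refers to neighbouring levels in the tested set.

```python
def intermediate_levels(levels_db, d_primes, criterion=1.5, n_each_side=2):
    """Lowest level whose d' exceeds the criterion, plus its neighbours."""
    pairs = sorted(zip(levels_db, d_primes))  # order by sound level
    for i, (_, dp) in enumerate(pairs):
        if dp > criterion:
            lo = max(0, i - n_each_side)
            return [level for level, _ in pairs[lo:i + n_each_side + 1]]
    return []  # no level reached the criterion
```

      For example, if 65 dB SPL is the lowest level with a d’ above 1.5 and levels are spaced 3 dB apart, the rule returns 59, 62, 65, 68 and 71 dB SPL.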

      In this context, it is worth bearing in mind that a) the decoding analysis was done on a frame-by-frame basis, meaning that the decoding score achieved early in the trial has no impact on the decoding score at later time points in the trial, b) sound-driven activity predominantly occurs immediately after stimulus onset and is largely over about 1 s into the trial (see cluster 3, for instance, or average miss trial activity in figure 4 - figure supplement 1), c) decoding performance of the behavioral outcome starts to plateau 500-1000 ms into the trial and remains high until it very gradually begins to decline after about 2 s into the trial. In other words, decoding performance remains high far longer than the stimulus would be expected to have an impact on the neurons’ activity. Therefore, we would expect any residual bias due to differences in the sound level distribution that our approach did not control for to be restricted to the very beginning of the trial and not to meaningfully impact the conclusions derived from the decoding analysis.

      Furthermore, we carried out an additional decoding analysis for one imaging session in which we had a sufficient number of trials to perform the analysis not only over the five (59, 62, 65, 68, 71 dB SPL) original sound levels, but also over a reduced range of three (62, 65, 68 dB SPL) sound levels, as well as a single (65 dB SPL) sound level (Figure 6 - figure supplement 1). The mean sound level differences between the hit trial distributions and miss trial distributions for these three conditions were 3.08, 1.01 and 0 dB, respectively. This analysis suggests that decoding performance is not meaningfully impacted by changing the range of sound levels (and sound level distributions), other than that including fewer sound levels means fewer trials and thus noisier decoding.

      Changes to manuscript.

      Line 287: “...and was not meaningfully affected by differences in sound level distributions between hit and miss trials (Figure 6 – figure supplement 1).”

      Finally, in order to supplement the decoding analysis, we determined for each individual neuron whether there was a significant difference between the average hit and average miss trial activity. Note that this was done using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions and to rule out any potential confound of sound level. This revealed that the proportion of neurons containing “information about trial outcome” was generally very high, close to 50% on average, and not significantly different between lesioned and non-lesioned mice.

      Changes to manuscript.

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”

      Line 648: “Analysis of task-modulated and sound-driven neurons. To identify individual neurons that produced significantly different response magnitudes in hit and miss trials, we calculated the mean activity for each stimulus trial by taking the mean activity over the 5 seconds following stimulus presentation and subtracting the mean activity over the 2 seconds preceding the stimulus during that same trial. A Mann-Whitney U test was then performed to assess whether a neuron showed a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) in response magnitude between hit and miss trials. The analysis was performed using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions. If, for a given sound level, there were more hit than miss trials, we randomly selected a sample of hit trials (without substitution) to match the sample size for the miss trials and vice versa.”

      (2) I have the feeling that the authors do not exploit fully the functional data recorded with two-photon imaging. They identify several clusters but do not describe their functional differences. For example, cluster 3 is obviously mainly sensory driven as it is not modulated by outcome. This could be mentioned. This could also be used to rule out that trial outcome is the result of insufficient sensory inputs. Could this cluster be used to predict trial outcome at the onset response? Could it be used to predict the presence of the sound, and with what accuracy? The authors discuss the different cluster types a little, but in a very elusive manner. I recognize that one should be careful with the use of signal analysis methods in calcium imaging, but a simple linear deconvolution of the calcium dynamics would help to illustrate the conclusions that the authors propose based on peak responses. It would also be very interesting to align the cluster responses (deconvolved) to the timing of licking and reward events to check if some clusters do not fire when mice perform licks before the sound comes. It would help clarify if the behavioral signals described here require both the presence of the sound and the behavioral action or are just the reflection of the motor command. As noted by the authors, some clusters have late peak responses (2 and 5). However, 2 and 5 are not equivalent and a deconvolution would show that much better. 2 has late onset firing. 5 has early onset but prolonged firing.

      We agree with the reviewer’s statement that “cluster 3 is obviously mainly sensory driven”. In the Discussion we refer to cluster 3 as having a “largely behaviorally invariant response profile to the auditory stimulus” (line X), which is consistent with the statement of the reviewer. With regard to the reviewer’s suggestion to describe the “functional differences” between the clusters, we would like to refer to the subsequent three sentences of the same paragraph in which we speculate on the cognitive and behavioral variables that may underlie the response profiles of different clusters. Given the limitations imposed by the task structure, we do not think it is justified to expand on this.

      We have added an additional analysis in order to explicitly address the question of which neurons are sound responsive (please also see response to point 3 below and to point 1 of reviewer #2). That trial outcome could be predicted on the basis of only the sound-responsive neurons’ activity during the initial period of the trial (“predict trial outcome at the onset response”) is unlikely given their small number (only 97 of 2649 neurons show a statistically significant sound-evoked response) and given that only a minority (42/98) of those sound-driven neurons are also modulated by trial outcome within that initial trial period (i.e. 0-1s after stimulus onset; data not shown).

      Changes to manuscript.

      Line 219: “..., while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 658: “Sound-driven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation. This analysis was performed using miss trials with click intensities from 53 dB SPL to 65 dB SPL (many sessions contained very few or no miss trials at higher sound levels).”
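
      The multiple-comparison step mentioned here (Benjamini-Hochberg adjustment of the per-neuron p-values) can be illustrated with a short sketch. The function name is an assumption, and the software the authors actually used is not stated in the response.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

      Under this scheme, a neuron would count as sound-driven when its adjusted p-value falls below the 0.05 threshold.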

      While calcium traces represent an indirect measure of neural activity, deconvolution does not necessarily provide an accurate picture of the spiking underlying those traces and has the potential to introduce additional problems. For instance, deconvolution algorithms tend to perform poorly at inferring the spiking of inhibited neurons (Vanwalleghem et al., 2021). Given that suppression is such a prominent feature of IC activity and is evident both in our calcium data and in the electrophysiology data of others (Franceschi and Barkat, 2021), we decided against using deconvolved spikes in our analyses. See also the side-by-side comparison below of the hit and miss trial activity of one example neuron based on either the calcium trace (left) or deconvolved spikes (right) (extracted using the OASIS algorithm (Friedrich et al., 2017) incorporated into suite2p (Pachitariu et al., 2016)).

      Author response image 1.

      (3) Along the same lines, the very small proportion of truly sensory-driven neurons (cluster 3) is not discussed. Is it what one would expect in typical shell or core IC neurons?

      As requested by reviewer #2 and mentioned in response to the previous point, we have now quantified the number of neurons in the dataset that produced significant responses to sound (97 / 2649). For a given imaging area, the fraction of neurons that show a statistically significant change in neural activity following presentation of a click of between 53 dB SPL and 65 dB SPL rarely exceeded ten percent. While that number is low, it is not necessarily surprising given the moderate intensity and very short duration of the stimuli. For comparison: Using the same transgenics, labeling approach and imaging setup and presenting 200-ms long pure tones at 60 dB SPL with frequencies between 2 kHz and 64 kHz, we typically find that between a quarter and a third of neurons in a given imaging area exhibit a statistically significant response (data not shown).

      Changes to manuscript.

      Line 219: “..., while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 658: “Sound-driven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation. This analysis was performed using miss trials with click intensities from 53 dB SPL to 65 dB SPL (many sessions contained very few or no miss trials at higher sound levels).”

      Line 220: “While the number of sound-responsive neurons is low, it is not necessarily surprising given the moderate intensity and very short duration of the stimuli. For comparison: Using the same transgenics, labeling approach and imaging setup and presenting 200-ms long pure tones at 60 dB SPL with frequencies between 2 kHz and 64 kHz, we typically find that between a quarter and a third of neurons in a given imaging area exhibit a statistically significant response (data not shown).”

      (4) In the discussion, the interpretation of the different transient and permanent cortical inactivation experiments is very interesting and well balanced given the complexity of the issue. There is nevertheless a comment that is difficult to follow. The authors state:

      If cortical lesioning results in a greater weight being placed on the activity in spared subcortical circuits for perceptual judgements, we would expect the accuracy with which trial-by-trial outcomes could be read out from IC neurons to be greater in mice without auditory cortex. However, that was not the case.

      However, there is no indication that the activity they observe in shell IC is causal to the behavioral decision and likely it is not. There is also no indication that the behavioral signals seen by the authors reflect the weight put on the subcortical pathway for behavior. I find this argument handwavy and would remove it.

      While we are happy to amend this section, we would not wish to remove it because a) we believe that the point we are trying to make here is an important and reasonable one and b) because it is consistent with the reviewer’s comment. Hopefully, the following will make this clearer: In order for the mouse to make a perceptual judgment and act upon it - in the context of our task, hearing a sound and then licking a spout - auditory information needs to be read out and converted into a motor command. If the auditory cortex normally plays a key role in such perceptual judgments, cortical lesions would require the animal to base its decisions on the information available from the remaining auditory structures, potentially including the auditory midbrain. This might result in a greater correspondence between the mouse’s behavior and the neural activity in those structures. That we did not observe this outcome for the IC could mean that the auditory cortex did not contribute to the relevant perceptual judgments (sound detection) in the first place. Therefore, no reweighting of signals from the other structures is necessary. Alternatively, greater weight might be placed exclusively on structures other than the auditory midbrain, e.g. the thalamus. The latter would imply that the contribution of the IC remains the same. This includes the possibility that the IC shell does not play a causal role in the behavioral decision – in either control mice or mice with cortical lesions – as suggested by the reviewer.

      Changes to manuscript.

      Line 471: “This could imply that, following cortical lesions, greater weight is placed on structures other than the IC, with the thalamus being the most likely candidate, ..”

      (5) In Fig. 5 the two colors used in B and C are the same although they describe different categories.

      The dark green and ‘deep orange’ we used to distinguish between non-lesioned and lesioned in Figure 5C are slightly lighter than the colors used to distinguish between these two categories in other figures and therefore might be more easily confused with the blue and red in Figure 5B. This has been changed.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the Reviewers and Editors for the constructive comments, which we believe have significantly improved the quality of our manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) With respect to the predictions, the authors propose that the subjects, depending on their linguistic background and the length of the tone in a trial, can put forward one or two predictions. The first is a short-term prediction based on the statistics of the previous stimuli and identical for both groups (i.e. short tones are expected after long tones and vice versa). The second is a long-term prediction based on their linguistic background. According to the authors, after a short tone, Basque speakers will predict the beginning of a new phrasal chunk, and Spanish speakers will predict it after a long tone.

      In this way, when a short tone is omitted, Basque speakers would experience the violation of only one prediction (i.e. the short-term prediction), but Spanish speakers will experience the violation of two predictions (i.e. the short-term and long-term predictions), resulting in a higher amplitude MMN. The opposite would occur when a long tone is omitted. So, to recap, the authors propose that subjects will predict the alternation of tone durations (short-term predictions) and the beginning of new phrasal chunks (long-term predictions).

      The problem with this is that subjects are also likely to predict the completion of the current phrasal chunk. In speech, phrases are seldom left incomplete. In Spanish it is very unlikely to hear a function-word that is not followed by a content-word (and the opposite happens in Basque). On the contrary, after the completion of a phrasal chunk, a speaker might stop talking and a silence might follow, instead of the beginning of a new phrasal chunk.

      Considering that the completion of a phrasal chunk is more likely than the beginning of a new one, the prior endowed to the participants by their linguistic background should make us expect a pattern of results actually opposite to the one reported here.

      We thank Reviewer #1 for this pertinent comment and the opportunity to address this issue. A very similar concern was also raised by Reviewer #2. Below we try to clarify the motivations that led us to predict that the hypothesized long-term predictions should manifest at the onset (and not within or at the end) of a perceptual chunk.

      Reviewers #1 and #2 contest a critical assumption of our study, i.e., that long-term predictions should occur at the beginning of a rhythmic chunk as opposed to its completion. They also contest the prediction deriving from this view, i.e., that omitting the first sound in a perceptual chunk (short for Spanish, long for Basque) would lead to larger error responses than omitting a later element. They suggest an alternative view: the omission of tones at the end of a perceptual rhythmic chunk would evoke larger error responses than omissions at its onset, as subjects are more likely to predict the completion of the chunk than its beginning. This view predicts an interaction effect in the opposite direction of our findings.

      While we acknowledge this as a plausible hypothesis, we believe that the current literature provides strong support for our view. Indeed, many studies in the rhythm and music perception literature have investigated the ERP responses to deviant sounds and omissions placed at different positions within rhythmic patterns (e.g., Ladinig et al., 2009; Bouwer et al., 2016; Brochard et al., 2003; Potter et al., 2009; Yabe et al., 2001). For instance, Ladinig et al. (2009) presented participants with metrical rhythmical sound sequences composed of eight tones. In some deviant sequences, the first or a later tone was omitted. They found that earlier omissions elicited earlier and higher-amplitude MMN responses than later omissions (irrespective of attention). Overall, this and other studies showed that the amplitude of ERP responses is larger when deviants occur at positions that are expected to be the “start” of a perceptual group - “on the beat” in musical terms - and declines toward the end of the chunk. According to some of these studies, the first element of a chunk is particularly important to track the boundaries of temporal sequences, which is why more predictive resources are invested at that position. We believe that this body of evidence provides a robust basis for our hypotheses and the directionality of our predictions.

      An additional point that should be considered concerns the amplitude of the prediction error response elicited by the omission. From a predictive coding perspective, the omission of the onset of a chunk should elicit larger error responses because the system is expecting the whole chunk (i.e., two tones/more acoustic information). On the other hand, the omission of the second tone - in the transition between two tones within the chunk - should elicit a smaller error response because the system is expecting only the missing tone (i.e. less acoustic information). 

      Given the importance of these points, we have now included them in the updated version of the paper, in which we try to better clarify the rationale behind our hypothesis (see Introduction section, around the 10th paragraph).

      (2) The authors report an interaction effect that modulates the amplitude of the omission response, but caveats make the interpretation of this effect somewhat uncertain. The authors report a widespread omission response, which resembles the classical mismatch response (in MEG) with strong activations in sensors over temporal regions. Instead, the interaction found is circumscribed to four sensors that do not overlap with the peaks of activation of the omission response.

      We thank the Reviewer for this comment. As mentioned in the provisional response, the approach employed to identify the presence of an interaction effect was conservative: We utilized a non-parametric test on combined gradiometer data, without making a priori assumptions about the location of the effect, and employed small cluster thresholds (cfg.clusteralpha = 0.05) to increase the chances of detecting highly localized clusters with large effect sizes. The fact that the interaction effect arises in a relatively small cluster of sensors does not alter its statistical robustness. It should also be considered that in the present analyses we focused on planar gradiometer data, which, compared to magnetometers and axial gradiometers, offer finer spatial resolution and are better suited for picking up relatively small effects.

      The partial overlap of the cluster with the activation peaks may simply reflect the fact that different sources contribute to the generation of the omission-MMN, which has been reported in several studies (e.g., Zhang et al., 2018; Ross & Hamm, 2020). We value the Reviewer’s input and are grateful for the opportunity to address these considerations.

      Furthermore, the boxplot in Figure 2E suggests that part of the interaction effect might be due to the presence of two outliers (if removed, the effect is no longer significant). Overall, it is possible that the reported interaction is driven by a main effect of omission type which the authors report, and find consistently only in the Basque group (showing a higher amplitude omission response for long tones than for short tones). Because of these points, it is difficult to interpret this interaction as a modulation of the omission response.

      We thank the Reviewer for the comment and appreciate the opportunity to address these concerns. We have re-evaluated the boxplot in Figure 2E and want to clarify that the two participants mentioned by Reviewer #1, despite being somewhat distant from the rest of the group, are not outliers according to the standard Tukey’s rule. As shown in the figure below, no participant fell outside the upper (Q3+1.5xIQR) and lower whiskers (Q1-1.5xIQR) of the boxplot. 
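
      For readers who want to reproduce the outlier check, Tukey’s rule can be sketched as below. This is an illustration only: the function name is an assumption, and the quartile convention (here the "inclusive" method of Python's `statistics.quantiles`) may differ from the one used by the authors' plotting software, which can shift the fences slightly.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) under Tukey's rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A value is flagged only if it lies strictly beyond a fence.
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)
```

      A participant that sits far from the group but inside both fences, as described in the reply, would not be returned in `outliers`.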

      Moreover, we believe that the presence of a main effect of omission type does not impact the interpretation of the interaction, especially considering that these effects emerge over distinct clusters of channels (see Fig. 1 C; Supplementary Fig. 2 A). 

      Based on these considerations - and along with the evidence collected in the control study and the source reconstruction data reported in the new version of the manuscript - we find it unlikely that the interaction effect is driven by outliers or by a main effect of omission type. We appreciate the opportunity provided by the Reviewer to address these concerns, as we believe they strengthen the claim that the observed effect is driven by the hypothesized long-term linguistic priors rather than uncontrolled group differences.

      Author response image 1.

      It should also be noted that in the source analysis, the interaction only showed a trend in the left auditory cortex, but in its current version the manuscript does not report the statistics of such a trend.

      We appreciate the Reviewer’s suggestion to incorporate more comprehensive source analyses. In the new version of the paper, we performed new analyses on the source data using a new atlas with more fine-grained parcellations of the regions of interest (ROIs) (Brainnetome atlas; Fan et al., 2016), focusing on peak activity to increase the sensitivity of the response in space and time. We therefore invite the Reviewer to read the updated part on source reconstruction included in the Results and Methods sections of the paper.

      Reviewer #1 (Recommendations For The Authors):

      While I have described my biggest concerns with respect to this work in the public review, here I list more specific points that I hope will help to improve the manuscript. Some of these are very minor, but I hope you will still find them constructive. 

      (1) I understand the difficulties implied in recruiting subjects from two different linguistic groups, but with 20 subjects per group and a between-groups design, the current study is somewhat underpowered. A post-hoc power analysis shows an achieved power of 46% for medium effect sizes (d = 0.5, and alpha = 0.05, one-sided test). A sensitivity analysis shows that the experiment only has 80% power for effect sizes of d = 0.8 and above. It would be important to acknowledge this limitation in the manuscript. 

      We thank the Reviewer for reporting these analyses. It must be noted that our effect of interest was based on Molnar et al.’s (2016) behavioral experiment, in which a sample size of 16 subjects per group was sufficient to detect the perceptual grouping effect. In Yoshida et al., (2010), the perceptual grouping effect emerged with two groups of 20 7–8-month-old Japanese and English-learning infants. Based on these previous findings, we believe that a sample size of 20 participants per group can be considered appropriate for the current MEG study. We clarified these aspects in the Participants section of the manuscript, in which we specified that previous behavioral studies detected the perceptual grouping with similar sample sizes. Moreover, to acknowledge the limitation highlighted by the Reviewer, we also include the power and sensitivity analysis in a note in the same section (see note 2 in the Participants section).
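
      The power and sensitivity figures discussed in this exchange can be approximated with a short sketch. It assumes a one-sided, two-sample comparison with equal group sizes and uses a normal approximation rather than the noncentral t distribution that dedicated power software typically uses, so it slightly overstates power. The function names are illustrative, and the tool used by the Reviewer is not stated.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate one-sided power for a two-sample test (normal approx.)."""
    noncentrality = d * sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(noncentrality - _STD_NORMAL.inv_cdf(1 - alpha))


def detectable_effect(power, n_per_group, alpha=0.05):
    """Smallest Cohen's d detectable with the requested power (normal approx.)."""
    z_sum = _STD_NORMAL.inv_cdf(1 - alpha) + _STD_NORMAL.inv_cdf(power)
    return z_sum / sqrt(n_per_group / 2)
```

      With 20 participants per group, `approx_power(0.5, 20)` gives roughly 0.47 and `detectable_effect(0.8, 20)` roughly 0.79, close to the 46% and d = 0.8 quoted by the Reviewer.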

      (2) All the line plots in the manuscript could be made much more informative by adding 95% CI bars. For example, in Figure 4A, the omission response for the long tone departs from the one for the short tone very early. Adding CIs would help to assess the magnitude of that early difference. Error bars are present in Figure 3, but it is not specified what these bars represent. 

      Thanks for the comments. We added the explanation of the error bars in the new version of Figure 3. For the remaining figures, we prefer maintaining the current version of the ERF, as the box-plots accompanying them provide information about the distribution of the effect across participants.

      (3) In the source analysis, there is only mention of an interaction trend in the left auditory cortex, but no statistics are presented. If the authors prefer to mention such a trend, I think it would be important to provide its stats to allow the reader to assess its relevance. 

      We performed new analyses on the source data, all reported in the updated version of the manuscript.

      (4) In the discussion section, the authors refer to the source analysis and state that "the interaction is evident in the left". But if only a statistical trend was observed, this statement would be misleading. 

      We agree with this comment. We invite the Reviewer to check the new part on source reconstruction, in which contrasts going in the same direction as the sensor-level data are performed.

      (5) In the discussion the authors argue that "This result highlights the presence of two distinct systems for the generation of auditory" that operate at different temporal scales, but the current work doesn't offer evidence for the existence of two different systems. The effects of long-term priors and short-term priors presented here are not dissociated and instead sum up. It remains possible that a single system is in place, collecting statistics of stimuli over a lifetime, including the statistics experienced during the experiment. 

      Thanks for pointing that out. We changed the sentence above as follows: “This result highlights the presence of an active predictive system that relies on natural sound statistics learned over a lifetime to process incoming auditory input”.

      (6) In the discussion, the authors acknowledge that the omission response has been interpreted both as pure prediction and as pure prediction error. Then they declare that "Overall, these findings are consistent with the idea that omission responses reflect, at least in part, prediction error signals.". However an argument for this statement is not provided. 

      Thanks for pointing out this lack of argument. In the new version of the manuscript, we explained our rationale as follows: “Since sensory predictive signals primarily arise in the same regions as the actual input, the activation of a broader network of regions in omission responses compared to tones suggests that omission responses reflect, at least in part, prediction error signals”.

      (7) In the discussion the authors present an alternative explanation in which both groups might devote more resources to the processing of long events, because these are relevant content words. Following this, they argue that "Independently on the interpretation, the lack of a main effect of omission type in the control condition suggests that the long omission effect is driven by experience with the native language." However as there was no manipulation of duration in the control experiment, a lack of the main effect of omission type there does not rule out the alternative explanation that the authors put forward. 

      This is correct; thanks for noticing it. We removed the sentence above to avoid ambiguities.

      Minor points: 

      (8) The scale of the y-axis in Figure 2C might be wrong, as it goes from 9 to 11 and then to 12. If the scale is linear, the top value should be 13, or the bottom value should be 10. 

      Figure 2C has been modified accordingly, thanks for noticing the error.

      (9) There is a very long paragraph starting on page 7 and ending on page 8. Toward the end of the paragraph, the analysis of the control condition is presented. That could start a new paragraph.

      Thanks for the suggestion. We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      (1) Despite the evidence provided on neural responses, the main conclusion of the study reflects a known behavioral effect on rhythmic sequence perceptual organization driven by linguistic background (Molnar et al. 2016, particularly). Also, the authors themselves provide a good review of the literature that evidences the influence of long-term priors in neural responses related to predictive activity. Thus, in my opinion, the strength of the statements the authors make on the novelty of the findings may be a bit far-fetched in some instances.

      Thanks for the suggestion. A similar point was also advanced by Reviewer 1. In general, we believe our work speaks to the predictive nature of such experience-dependent effects, and shows that these linguistic priors shape sensory processes at very early stages. This is discussed in the sixth and seventh paragraphs of the Discussion section. In the new version of the article, we modified some statements and tried to make them more coherent with the scope of the present work. For instance, we replaced “This result highlights the presence of two distinct systems for the generation of auditory predictive models, one relying on the transition probabilities governing the recent past, and another relying on natural sound statistics learned over a lifetime” with “This result highlights the presence of an active predictive system that relies on natural sound statistics learned over a lifetime to process incoming auditory input”.

      (2) Albeit the paradigm is well designed, I fail to see the grounding of the hypotheses laid by the authors as framed under the predictive coding perspective. The study assumes that responses to an omission at the beginning of a perceptual rhythmic pattern will be stronger than at the end. I feel this is unjustified. If anything, omission responses should be larger when the gap occurs at the end of the pattern, as that would be where stronger expectations are placed: if in my language a short sound occurs after a long one, and I perceptually group tone sequences of alternating tone duration accordingly, when I hear a short sound I will expect a long one following; but after a long one, I don't necessarily need to expect a short one, as something else might occur.

      A similar point was advanced by Reviewer #1. We tried to clarify the rationale behind our hypothesis. Please refer to the response provided to the first comment of Reviewer #1 above.

      (3) In this regard, it is my opinion that what is reflected in the data may be better accounted for (or at least, additionally) by a different neural response to an omission depending on the phase of an underlying attentional rhythm (in terms of Large and Jones rhythmic attention theory, for instance) and putative underlying entrained oscillatory neural activity (in terms of Lakatos' studies, for instance). Certainly, the fact that the aligned phase may differ depending on linguistic background is very interesting and would reflect the known behavioral effect.

      We thank the Reviewer for this comment. We explored in more detail the possibility that the aligned phase may differ depending on linguistic background, which is indeed a very interesting hypothesis. In the phase analyses reported below we focused on the instantaneous phase angle time locked to the onset of short and long tones presented in the experiment.

      In short, we extracted time intervals of two seconds centered on the onset of the tones for each participant (~200 trials per condition) and, using a wavelet transform (implemented in Fieldtrip ft_freqanalysis), targeted the 0.92 Hz frequency that corresponds to the presentation rate of our pairs of tones. We extracted the phase angle for each time point and, using the circular statistics toolbox implemented in Matlab, computed the Rayleigh z scores across the whole sensor space for each tone (long and short) and group (Spanish (Spa) dominants and Basque (Eus) dominants). This method quantifies the instantaneous phase clustering at a specific time point, and thus tests for the presence of a specific oscillatory pattern at the onset of each tone.
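
      The phase-clustering measure described above can be illustrated with a minimal sketch of the Rayleigh z statistic (z = n·R², where R is the mean resultant length of the single-trial phase angles). The function name and the input format (a list of phase angles in radians, one per trial, at a single time point and sensor) are assumptions for the example; the authors' analysis was run in Matlab and is not reproduced here.

```python
from math import atan2, cos, hypot, sin


def rayleigh_z(angles):
    """Return (z, mean_angle) for a list of phase angles in radians."""
    n = len(angles)
    mean_cos = sum(cos(a) for a in angles) / n
    mean_sin = sum(sin(a) for a in angles) / n
    # Mean resultant length: 1 for identical phases, near 0 for uniform phases.
    r = hypot(mean_cos, mean_sin)
    return n * r * r, atan2(mean_sin, mean_cos)
```

      Larger z values indicate tighter phase clustering across trials, and the returned mean angle is the quantity that would be compared between groups and tone types.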

      Author response image 2.

      Here we observe that the phase clustering was stronger in the right sensors for both groups. The critical point is to evaluate the phase angle (in radians) for the two groups and the two tones and test for statistical differences. We focused first on the sensor with the highest clustering (right temporal MEG1323) and observed very similar phase angles for the two groups for both long and short tones (see image below). We then focused on the four left fronto-temporal sensor pairs that showed the significant interaction: here we observed one sensor (MEG0412) with different effects for the two groups (the group-by-tone interaction was significant, p=0.02): for short tones the “Watson (1961) approximation U2 test” showed a p-value of 0.11, while for long tones the p-value was 0.03 (after correction for multiple comparisons).

      Overall, the present findings suggest a tendency for the two groups to phase-align differently to long and short tones over left fronto-temporal sensors. However, the effect could be detected in only one gradiometer sensor and was not statistically robust. The effect in the right hemisphere was statistically more robust, but it was not sensitive to group language dominance.

      Due to the inconclusive nature of these analyses regarding the role of language experience in shaping the phase alignment to rhythmic sound sequences, we prefer to keep these results in the public review rather than incorporating them in the article. Nonetheless, we believe that this decision does not undermine the main finding that the group differences in the MMN amplitude are driven by long-term predictions – especially in light of the many studies indicating the MMN as a putative index of prediction error (e.g., Bendixen et al., 2012; Heilbron and Chait, 2018). Moreover, as suggested in the preliminary reply, although evoked responses and oscillations are often considered distinct electrophysiological phenomena, current evidence suggests that these phenomena are interconnected (e.g., Studenova et al., 2023). In our view, the hypotheses that the MMN reflects differences in phase alignment and long-term prediction errors are not mutually exclusive.

      Author response image 3.

      (4) Source localization is performed on sensor-level significant data. The lack of source-level statistics weakens the conclusions that can be extracted. Furthermore, only the source reflecting the interaction pattern is taken into account in detail as supporting their hypotheses, overlooking other sources. Also, the right IFG source activity is not depicted, but looking at whole brain maps it seems even stronger than the left. To sum up, source localization data, as informative as it could be, does not strongly support the authors’ claims in its current state.

      A similar comment was also advanced by Reviewer #1 (comment 2). We appreciate the suggestion to incorporate more comprehensive source analyses. In the new version of the paper, we performed new analyses on the source data using a new atlas with more fine-grained parcellations of the ROIs, focusing on peak activity to increase the sensitivity of the response in space and time. We therefore invite the Reviewer to read the updated part on source reconstruction included in the Results and Methods sections of the paper.

      In the article, we report only the source reconstruction data from ROIs in the left hemisphere, because it is there that the interaction effect arises at the sensor level. However, we also explored the homologous regions in the right hemisphere, as requested by the Reviewer. A cluster-based permutation test focusing on the interaction between language group and omission type was performed on both the right STG and IFG data. No significant interaction emerged in either of these regions. Below is a plot of the source activity time series over ROIs in the right STG and IFG.

      Author response image 4.

      Reviewer #2 (Recommendations For The Authors):

      In this set of private recommendations for the authors, I will outline a couple of minor comments and try to encourage additional data analyses that, in my opinion, would strengthen the evidence provided by the study. 

      (1) As I noted in the public review, I believe an oscillatory analysis of the data would, on one hand, provide stronger support for the behavioral effect of rhythmic perceptual organization given the lack of behavioral direct evidence; and, on the other hand, provide evidence (to be discussed if so) for a role of entrained oscillation phase in explaining the different pattern of omission responses. One analysis the authors could try is to measure the phase angle of an oscillation, the frequency of which relates to the length of the binary pattern, at the onset of short and long tones, separately, and compare it across groups. Also, single trials of omission responses could be sorted according to that phase. 

      Thanks for the suggestion. Please see phase analyses reported above.

      (2) I wonder why source activity for the right IFG was not shown. I urge the authors to provide and discuss a more complete picture of the source activity found. Given the lack of source statistics (which could be performed), I find it a must to give an overall view. I find it so because I believe the distinction between perceptual grouping effects due to inherent acoustic differences across languages or semantic differences is so interesting. 

      Thanks again for the invitation to provide a more complete picture of the source activity data. As mentioned in the response above, we invite the Reviewer to read the new related part included in the Results and Methods sections of the paper. In our updated source reconstruction analysis, we find that some regions around the left STG show a pattern that resembles the one found at the sensor-level, providing further support for the “acoustic” (rather than syntactic/semantic) nature of the effect. 

      We did not report ROI analysis on the right hemisphere because the interaction effect at sensor level emerged on the left hemisphere. Yet, we included a summary of this analysis in the public response above. 

      (3) Related to this, I have to acknowledge I had to read the whole Molnar et al. (2016) study to find the only evidence so far that, acoustically, in terms of sound duration, Basque and Spanish differ. This was hypothesized before but only at Molnar, an acoustic analysis is performed. I think this is key, and the authors should give it a deeper account in their manuscript. I spend my review of this study thinking, well, but when we speak we actually bind together different words and the syllabic structure does not need to reflect the written one, so maybe the effect is due to a high-level statistical prior related to the content of the words... but Molnar showed me that actually, acoustically, there's a difference in accent and duration: "Taken together, Experiments 1a and 1b show that Basque and Spanish exhibit the predicted differences in terms of the position of prosodic prominence in their phonological phrases (Basque: trochaic, Spanish: iambic), even though the acoustic realization of this prominence involves not only intensity in Basque but duration, as well. Spanish, as predicted, only uses duration as a cue to mark phrasal prosody." 

      Thanks for the suggestion, the distinction in terms of sound duration in Spanish and Basque reported by Molnar is indeed very relevant for the current study. 

      We add a few sentences to highlight the acoustic analysis by Molnar and the consequent acoustic nature of the reported effect.

      In the introduction: “Specifically, the effect has been proposed to depend on the quasiperiodic alternation of short and long auditory events in the speech signal – reported in previous acoustic analyses (Molnar et al., 2016) – which reflect the linearization of function words (e.g., articles, prepositions) and content words (e.g., nouns, adjectives, verbs).”

      In the discussion, paragraph 3, we changed “We hypothesized that this effect is linked to a long-term “duration prior” originating from the syntactic function-content word order of language, and specifically, from its acoustic consequences on the prosodic structure” with “We hypothesized that this effect is linked to a long-term “duration prior” originating from the acoustic properties of the two languages, specifically from the alternation of short and long auditory events in their prosody”.

      In the discussion, end of paragraph eight: “The reconstruction of cortical sources associated with the omission of short and long tones in the two groups showed that an interaction effect mirroring the one at the sensor level was present in the left STG, but not in the left IFG (fig. 3, B, C, D). Pairwise comparisons within different ROIs of the left STG indicated that the interaction effect was stronger over primary (BA 41/42) rather than associative (BA 22) portions of the auditory cortex. Overall, these results suggest that the “duration prior” is linked to the acoustic properties of a given language rather than its syntactic configurations”.

      Now, some minor comments: 

      (1) Where did the experiments take place? Were they in accordance with the Declaration of Helsinki? Did participants give informed consent? 

      All the requested information has been added to the updated version of the manuscript. Thanks for pointing this out.

      (2) The fixed interval should be called inter-stimulus interval. 

      Thanks for pointing this out. We changed the wording as suggested.

      (3) The authors state that "Omission responses allow to examine the presence of putative error signals decoupled from bottom-up sensory input, offering a critical test for predictive coding (Walsh et al 2020, Heilbron and Chait, 2018).". However the way omission responses are computed in their study is by subtracting the activity from the previous tone. This necessarily means that in the omission activity analyzed, there's bottom-up sensory input activity. As performing another experiment with a control condition in which a sequence of randomly presented tones with different durations to compare directly the omission activity in both sequences (experimental and control) is possibly too demanding, I at least urge the authors to incorporate the fact that their omission responses do reflect also tone activity. And consider, for future experiments, the inclusion of further control conditions. 

      Thanks for the opportunity to clarify this aspect. The omission MMN was not computed by subtracting the activity of the previous tone from the omission, but by subtracting the activity of tones selected at random across the whole experiment. That is, we randomly selected around 120 long and short tones (i.e., about the same number as the omissions); we computed the ERFs for the long and short tones; and we subtracted these ERFs from the ERFs of the corresponding short and long omissions. We clarified these aspects in both the Materials and Methods (ERF analysis paragraph) and Results sections.
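
      As an illustration only (this is not the analysis code used in the study), the subtraction described above can be sketched in plain Python. The function names, the fixed random seed, and the toy trial values are hypothetical; real ERFs would be computed per sensor and per condition.

```python
import random


def mean_waveform(trials):
    """Average equal-length trials sample by sample to obtain an ERF."""
    n_trials = len(trials)
    return [sum(samples) / n_trials for samples in zip(*trials)]


def omission_difference(omission_trials, tone_trials, n_select, seed=0):
    """Subtract the ERF of randomly selected tones from the omission ERF."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = rng.sample(tone_trials, n_select)  # tones drawn from the whole run
    erf_omission = mean_waveform(omission_trials)
    erf_tone = mean_waveform(selected)
    return [o - t for o, t in zip(erf_omission, erf_tone)]


# Toy data (hypothetical): 3 omission trials and 5 tone trials, 4 samples each.
omissions = [[1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 5.0, 2.0], [1.0, 2.0, 4.0, 2.0]]
tones = [[1.0, 1.0, 1.0, 1.0] for _ in range(5)]
diff = omission_difference(omissions, tones, n_select=3)
```

      Because the tones are drawn from across the run rather than from the position immediately before each omission, the difference wave is not tied to the response to the preceding tone.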

      Moreover, the subtraction strategy, which is the standard approach to calculating the MMN, allows us to handle possible neural carryover effects arising from the perception of the tone preceding the omission.

      The sentence "Omission responses allow to examine the presence of putative error signals decoupled from bottom-up sensory input, offering a critical test for predictive coding (Walsh et al 2020, Heilbron and Chait, 2018)." simply refers to the fact that the error responses resulting from an omission are purely endogenous, as omissions are just the absence of an expected input (i.e., silence). On the other hand, when a predicted sequence of tones is disrupted by an auditory deviant (e.g., a tone with a different pitch or duration than the expected one), the resulting error response is not purely endogenous, but partially includes the response to the acoustic properties of the deviant.

      (4) When multiple clusters emerged from a comparison, only the most significant cluster was reported. Why? 

      We found more than one significant cluster only in the comparison between pure omissions vs tones (figure 2 A, B). The additional significant cluster from this comparison is associated with a P-value of 0.04, emerges slightly earlier in time, and goes in the same direction as the cluster reported in the paper i.e., larger ERF responses for omission vs tones. We added a note specifying the presence of this second cluster, along with a figure on the supplementary material (Supplementary Fig. 1 A, B).

      (5) Fig 2, if ERFs are baseline corrected -50 to 0ms, why do the plots show pre-stimulus amplitudes not centered at 0? 

      This is because we combined the latitudinal and longitudinal gradiometers on the ERF obtained after baseline correction, by computing the root mean square of the signals at each sensor position (see also  https://www.fieldtriptoolbox.org/example/combineplanar_pipelineorder/). This information is reported in the methods part of the article.
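
      A minimal sketch of why the combined signal is not centred at zero, assuming the combination is a root mean square of the two gradiometer signals at each sensor position (the exact scaling used by a given toolbox may differ, and the helper names and toy values below are hypothetical):

```python
import math


def baseline_correct(signal, n_baseline):
    """Subtract the mean of the first n_baseline samples from every sample."""
    base = sum(signal[:n_baseline]) / n_baseline
    return [s - base for s in signal]


def combine_planar(horizontal, vertical):
    """Root mean square of two orthogonal gradiometer signals, per sample."""
    return [math.sqrt((h * h + v * v) / 2.0) for h, v in zip(horizontal, vertical)]


# Toy gradiometer pair: the first 4 samples are the pre-stimulus baseline.
h = baseline_correct([0.5, -0.5, 0.5, -0.5, 2.0, 3.0], n_baseline=4)
v = baseline_correct([-0.3, 0.3, -0.3, 0.3, 1.0, 2.0], n_baseline=4)
combined = combine_planar(h, v)

baseline_mean_h = sum(h[:4]) / 4                # zero after baseline correction
baseline_mean_combined = sum(combined[:4]) / 4  # strictly positive
```

      Each gradiometer has a zero-mean baseline after correction, but the combination squares the signals, so any residual baseline fluctuation contributes a positive value and the pre-stimulus amplitude sits above zero.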

      (6) Fig 2, add units to color bars. 

      Sure.

      (7) Fig 2 F and G, put colorbar scale the same for all topographies. 

      Sure, thanks for pointing this out.

      (8) The interaction effect language (Spanish; Basque) X omission type (short; long) appears only in a small cluster of 4 sensors not located at the locations with larger amplitudes to omissions. Authors report it as left frontotemporal, but it seems to me frontocentral with a slight left lateralization.

      (1) The fact that the cluster reflecting the interaction effect does not overlap with the peaks of activity is not surprising in our view. Many sources contribute to the generation of the MMN. The goal of our work was to establish whether there is also evidence for a long-term system (among the many) contributing to this. That is why we perform a first analysis on the whole omission response network (likely including many sources and predictive/attentional systems), and then we zoom in and focus on our hypothesized interaction. We never claim that the main source underlying the omission MMN is the long-term predictive system.

      (2) The exact location of those sensors is at the periphery of the left-hemisphere omission response, which mainly reflects activity from the left temporal regions. The sensor location of this cluster could be influenced by multiple factors, including (i) the direction of the source dipoles determining an effect; (ii) the combination of multiple sources contributing to the activity measured at a specific sensor location, whose unmixing could be solved only with a beamforming source approach. Based on all the evidence we collected, including the source analyses, we concluded that the major contributors to the sensor-level interaction emerge from both frontal and temporal regions.

      Reviewer #3 (Public Review):

      (1) The main weaknesses are the strength of the effects and generalisability. The sample size is also relatively small by today's standards, with N=20 in each group. Furthermore, the crucial effects are all mostly in the .01>P<.05 range, such as the crucial interaction P=.03. It would be nice to see it replicated in the future, with more participants and other languages. It would also have been nice to see behavioural data that could be correlated with neural data to better understand the real-world consequences of the effect.

      We appreciate the positive feedback from Reviewer #3. We agree that it would be nice to see this study replicated in the future with larger sample sizes and a behavioral counterpart. Below are a few comments concerning the weakness highlighted: 

      (i) Concerning the sample size: a similar point was raised by Reviewer #1. We report our reply as presented above: “Although a sample size of 20 participants per group can be considered relatively small for detecting an effect in a between-group design, it must be noted that our effect of interest was based on Molnar et al.’s (2016) experiment, where a sample size of 16 subjects per group was sufficient to detect the perceptual grouping effect. In Yoshida et al., 2010, the perceptual grouping effect arose with two groups of 20 7–8-month-old Japanese and English-learning infants. Based on these findings, we believe that a sample size of 20 participants per group can be considered appropriate for the current study”. We clarified these aspects in the new version of the manuscript.

      (ii) We believe that the lack of behavioral data does not undermine the main findings of this study, given the careful selection of the participants and the well-known robustness of the perceptual grouping effect (e.g., Iversen et al., 2008; Yoshida et al., 2010; Molnar et al., 2014; Molnar et al., 2016). As highlighted by Reviewer #2, having Spanish and Basque dominant “speakers as a sample equates that in Molnar et al. (2016), and thus overcomes the lack of direct behavioral evidence for a difference in rhythmic grouping across linguistic groups. Molnar et al. (2016)'s evidence on the behavioral effect is compelling, and the evidence on neural signatures provided by the present study aligns with it”.

      (iii) Regarding the fact that the “crucial effects are all mostly in the .01>P<.05 range”: we want to stress that the approach we used to detect the interaction effect was conservative, using a cluster-based permutation approach with no a priori assumptions about the location of the effect. The robustness of our approach has also been highlighted by Reviewer 2: “Data analyses. Sound, state-of-the-art methodology in the event-related field analyses at the sensor level.” In sum, despite some crucial effects being in the .01>P<.05 range, we believe that the statistical soundness of our analysis, combined with the lack of effect in the control condition, provides compelling evidence for our H1.

      Reviewer #3 (Recommendations For The Authors):

      Figures - Recommend converting all diagrams and plots to vector images to ensure they remain clear when zoomed in the PDF format. 

      Sure, thanks. 

      Figure 1: To improve clarity, the representation of sound durations in panels C and D should be revisited. The use of quavers/eighth notes can be confusing for those familiar with musical notation, as they imply isochrony. If printed in black and white, colour distinctions may be lost, making it difficult to discern the different durations. A more universal representation, such as spectrograms, might be more effective. 

      Thanks for the suggestion. It’s true that the quavers/eighth notes might be confusing in that respect. However, we consider this notation a relatively standard way to depict paradigms in auditory neuroscience; see for instance the two papers below. In the new version of the manuscript, we specified in the captions under the figure that the notes refer to individual tones, in order to avoid ambiguities.

      - Wacongne, C., Labyt, E., Van Wassenhove, V., Bekinschtein, T., Naccache, L., & Dehaene, S. (2011). Evidence for a hierarchy of predictions and prediction errors in human cortex. Proceedings of the National Academy of Sciences, 108(51), 20754-20759.

      - Dehaene, S., Meyniel, F., Wacongne, C., Wang, L., & Pallier, C. (2015). The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees. Neuron, 88(1), 2-19.

      Figure 2 : In panel C of Figure 2, please include the exact p-value for the interaction observed. Refrain from using asterisks or "n.s." and opt for exact p-values throughout for the sake of clarity. 

      Thank you for your suggestion. We have included the exact p-value for the interaction in panel C of Figure 2. However, for the remaining figures, we have chosen to maintain the use of asterisks and "n.s.". We would like our pictures to convey the key findings concisely, while the numerical details can be found in the article text. The caption below the image also provides guidance on the interpretation of the p-values: (statistical significance: **p < 0.01, *p < 0.05, and ns p > 0.05).  

      Figure 3 Note typo "Omission reponse"

      Fixed. Thanks for noticing the typo. 

      A note: we moved the figure reflecting the main effect of long tone omission and the lack of main effect of language background (Figure 4 in the previous manuscript) to the supplementary material (Supplementary Figure 2).

      References

      Bendixen, A., SanMiguel, I., & Schröger, E. (2012). Early electrophysiological indicators for predictive processing in audition: a review. International Journal of Psychophysiology, 83(2), 120-131.

      Heilbron, M., & Chait, M. (2018). Great expectations: is there evidence for predictive coding in auditory cortex?. Neuroscience, 389, 54-73.

      Iversen, J. R., Patel, A. D., & Ohgushi, K. (2008). Perception of rhythmic grouping depends on auditory experience. The Journal of the Acoustical Society of America, 124(4), 2263-2271.

      Molnar, M., Lallier, M., & Carreiras, M. (2014). The amount of language exposure determines nonlinguistic tone grouping biases in infants from a bilingual environment. Language Learning, 64(s2), 45-64.

      Molnar, M., Carreiras, M., & Gervain, J. (2016). Language dominance shapes non-linguistic rhythmic grouping in bilinguals. Cognition, 152, 150-159.

      Ross, J. M., & Hamm, J. P. (2020). Cortical microcircuit mechanisms of mismatch negativity and its underlying subcomponents. Frontiers in Neural Circuits, 14, 13.

      Simon, J., Balla, V., & Winkler, I. (2019). Temporal boundary of auditory event formation: An electrophysiological marker. International Journal of Psychophysiology, 140, 53-61.

      Studenova, A. A., Forster, C., Engemann, D. A., Hensch, T., Sander, C., Mauche, N., ... & Nikulin, V. V. (2023). Event-related modulation of alpha rhythm explains the auditory P300 evoked response in EEG. bioRxiv, 2023-02.

      Yoshida, K. A., Iversen, J. R., Patel, A. D., Mazuka, R., Nito, H., Gervain, J., & Werker, J. F. (2010). The development of perceptual grouping biases in infancy: A Japanese-English cross-linguistic study. Cognition, 115(2), 356-361.

      Zhang, Y., Yan, F., Wang, L., Wang, Y., Wang, C., Wang, Q., & Huang, L. (2018). Cortical areas associated with mismatch negativity: A connectivity study using propofol anesthesia. Frontiers in Human Neuroscience, 12, 392.

      Ladinig, O., Honing, H., Háden, G., & Winkler, I. (2009). Probing attentive and preattentive emergent meter in adult listeners without extensive music training. Music Perception, 26(4), 377-386. 

      Brochard, R., Abecasis, D., Potter, D., Ragot, R., & Drake, C. (2003). The “ticktock” of our internal clock: Direct brain evidence of subjective accents in isochronous sequences. Psychological Science, 14(4), 362-366.

      Potter, D. D., Fenwick, M., Abecasis, D., & Brochard, R. (2009). Perceiving rhythm where none exists: Event-related potential (ERP) correlates of subjective accenting. Cortex, 45(1), 103-109.

      Bouwer, F. L., Werner, C. M., Knetemann, M., & Honing, H. (2016). Disentangling beat perception from sequential learning and examining the influence of attention and musical abilities on ERP responses to rhythm. Neuropsychologia, 85, 80-90.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides important evidence supporting the ability of a new type of neuroimaging, OPM-MEG system, to measure beta-band oscillation in sensorimotor tasks on 2-14 years old children and to demonstrate the corresponding development changes, since neuroimaging methods with high spatiotemporal resolution that could be used on small children are quite limited. The evidence supporting the conclusion is solid but lacks clarifications about the much-discussed advantages of OPM-MEG system (e.g., motion tolerance), control analyses (e.g., trial number), and rationale for using sensorimotor tasks. This work will be of interest to the neuroimaging and developmental science communities.

      We thank the editors and reviewers for their time and comments on our manuscript. We have responded in detail to the comments, on a point-by-point basis, below. Included in our responses (and our revised manuscript) are additional analyses to control for trial count, clarification of the advantages of OPM-MEG, and justification of our use of sensory (as distinct from motor) stimulation. In what follows, our responses are in bold typeface; additions to our manuscript are in bold italic typeface. 

      Reviewer #1 (Public Review):

      Summary:

      Compared with conventional SQUID-MEG, OPM-MEG offers theoretical advantages of sensor configurability (that is, sizing to suit the head size) and motion tolerance (the sensors are intrinsically in the head reference frame). This study purports to be the first to experimentally demonstrate these advantages in a developmental study from age 2 to age 34. In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance - neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      Thank you for reviewing our manuscript. We agree that our results demonstrate substantial equivalence with conventional MEG. However, as mentioned by Reviewer 3, most past studies have “focused on older children and adolescents (e.g., 9-15 years old)” whereas our youngest group is 25 years. We believe that by obtaining data of sufficient quality in these age groups, without the need for any restriction of head movement, we have demonstrated the advantage of OPM-MEG. We now have made this clear in our discussion:

      “…our primary aim was to test the feasibility of OPM-MEG for neurodevelopmental studies. Our results demonstrate we were able to scan children down to age 2 years, measuring high-fidelity electrophysiological signals and characterising the neurodevelopmental trajectory of beta oscillations. The fact that we were able to complete this study demonstrates the advantages of OPM-MEG over conventional-MEG, the latter being challenging to deploy across such a large age range…”

      Strengths:

      A replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      As noted above the demonstration of equivalence was one of our primary aims. We have elaborated further on the advantages below.

      Weaknesses:

      The authors describe 64 tri-axial detectors, which they refer to as 192 channels. This is in keeping with some of the SQUID-MEG description, but possibly somewhat disingenuous. For the scientific literature, perhaps "64 tri-axial detectors" is a more parsimonious description.

      The number of channels in a MEG system refers to the number of independent measurements of magnetic field. This, in turn, tells us the number of degrees of freedom in the data that can be exploited by algorithms like signal space separation (SSS) or beamforming. For example, the MEGIN (cryogenic) MEG system has 306 channels: 102 magnetometers and 204 planar gradiometers. Sensors are constructed as “triple sensor elements” with one magnetometer and 2 gradiometers (in orthogonal orientations) centred on a single location. In our system, each sensor has three orthogonal metrics of magnetic field which are (by definition) independent. We have 64 such sensors, and therefore 192 independent channels – indeed when implementing algorithms like SSS we have shown we can exploit this number of degrees of freedom (ref. 1). 192 channels is therefore an accurate description of the system.

      A small fraction (<20%) of trials were eliminated for analysis because of "excess interference" - this warrants further elaboration.

      We agree that this is an important point. We now state in our methods section:

      “…Automatic trial rejection was implemented with trials containing abnormally high variance (exceeding 3 standard deviations from the mean) removed. All experimental trials were also inspected visually by an experienced MEG scientist, to exclude trials with large spikes/drifts that were missed by the automatic approach. In the adult group, there was a significant overlap between automatically and manually detected bad trials (0.7 ± 1.6 trials were only detected manually). In the children, 10.0 ± 9.4 trials were only detected manually…”
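
      The automatic step quoted above can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: it assumes one variance value per trial (a real implementation would also have to decide how to pool across channels), and the function name and toy data are hypothetical.

```python
import statistics


def reject_high_variance(trials, n_sd=3.0):
    """Flag trials whose variance exceeds the mean variance by n_sd SDs."""
    variances = [statistics.pvariance(trial) for trial in trials]
    mean_var = statistics.mean(variances)
    sd_var = statistics.pstdev(variances)
    threshold = mean_var + n_sd * sd_var
    kept = [i for i, v in enumerate(variances) if v <= threshold]
    rejected = [i for i, v in enumerate(variances) if v > threshold]
    return kept, rejected


# Toy data (hypothetical): 19 clean trials and one trial with a large artefact.
clean = [[-1.0, 1.0, -1.0, 1.0] for _ in range(19)]
noisy = [[-10.0, 10.0, -10.0, 10.0]]
kept, rejected = reject_high_variance(clean + noisy)
```

      Note that the outlier itself inflates the standard deviation: with the population SD used here, a single deviant trial among n trials can reach at most sqrt(n - 1) SDs, so a 3 SD rule can only flag trials when more than ten are available. Spikes or drifts that do not raise the overall variance enough would pass this check, which is consistent with the additional visual inspection described in the quoted methods.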

      We also note that the other reviewers and editor questioned whether the higher rejection rate in children had any bearing on results. This is an extremely important question. In revising the manuscript this has also been taken into account with all data reanalysed with equal trial counts in children and adults. Results are presented in Supplementary Information Section 5.

      Figure 3 shows a reduced beta ERD in the youngest children. Although the authors claim that OPM-MEG would be similarly sensitive for all ages and that SQUID-MEG would be relatively insensitive to young children, one trivial counterargument that needs to be addressed is that OPM has NOT in fact increased the sensitivity to young child ERD. This can possibly be addressed by analogous experiments using a SQUID-based system. An alternative would be to demonstrate similar sensitivity across ages using OPM to a brain measure such as evoked response amplitude. In short, how does Figure 3 demonstrate the (theoretical) sensitivity advantage of OPM-MEG in small heads?

      We completely understand the referee’s point – indeed the question of whether a neuromagnetic effect really changes with age, or only appears to change due to a drop in sensitivity (caused by reduced head size or, in conventional MEG and fMRI, increased subject movement) is a question that can be raised in all neurodevelopmental studies.

      Our authors have many years’ experience conducting studies using conventional MEG (including in neurodevelopment) and agreed that the idea of scanning subjects down to age two in conventional MEG would not be practical; their heads are too small and they typically fail to tolerate an environment where they are forced to remain still for long periods. Even if we tried a comparative study using conventional MEG, the likely data exclusion rate would be so high that the study would be confounded. This is why most conventional MEG studies only scan older children and adolescents. For this reason, we cannot undertake the comparative study the reviewer suggests. There are however two reasons why we believe sensitivity is not driving the neurodevelopmental effects that we observe:

      Proximity of sensors to the head: 

      For an ideal wearable MEG system, the distance between the sensors and the scalp surface (sensor proximity) would be the same regardless of age (and size), ensuring maximum sensitivity in all subjects. To test how our system performed in this regard, we undertook analyses to compute scalp-to-sensor distances. This was done in two ways:

      (1) Real distances in our adaptable system: We took the co-registered OPM sensor locations and computed the Euclidean distance from the centre of the sensitive volume (i.e. the centre of the vapour cell) to the closest point on the scalp surface. This was measured independently for all sensors, and an average across sensors calculated. We repeated this for all participants (recall participants wore helmets of varying size and this adaptability should help minimise any relationship between sensor proximity and age).

      (2) Simulated distances for a non-adaptable system: Here, the aim was to see how proximity might have changed with age, had only a single helmet size been used. We first identified the single example subject with the largest head (scanned wearing the largest helmet) and extracted the scalp-to-sensor distances as above. For all other subjects, we used a rigid body transform to co-register their brain to that of the example subject (placing their head (virtually) inside the largest helmet). Proximity was then calculated as above and an average across sensors calculated. This was repeated for all participants.

      In both analyses, sensor proximity was plotted against age and significant relationships probed using Pearson correlation. 
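
      The proximity measure used in both analyses (distance from each sensor to the closest scalp point, averaged across sensors) can be sketched as below. This is an illustrative sketch only: it assumes the scalp is available as a list of 3D surface points in the same coordinate frame as the sensors, and the function names and toy coordinates are hypothetical.

```python
import math


def nearest_scalp_distance(sensor, scalp_points):
    """Euclidean distance from one sensor to the closest scalp point."""
    return min(math.dist(sensor, point) for point in scalp_points)


def mean_sensor_proximity(sensors, scalp_points):
    """Average, across sensors, of the distance to the nearest scalp point."""
    distances = [nearest_scalp_distance(s, scalp_points) for s in sensors]
    return sum(distances) / len(distances)


# Toy geometry (hypothetical coordinates, arbitrary units).
scalp = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
sensors = [(0.0, 0.0, 3.0), (10.0, 0.0, 4.0)]
proximity = mean_sensor_proximity(sensors, scalp)  # (3.0 + 4.0) / 2
```

      For the simulated non-adaptable case, the same function would be applied after transforming each subject's scalp points into the frame of the largest helmet; that transform is not shown here.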

      In addition, we also wanted to probe the relation between sensor proximity and head circumference. Head circumference was estimated by binarising the whole-head MRI (to delineate the volume of the head) and selecting the axial slice with the largest circumference. We then plotted sensor proximity versus head circumference, for both the real (adaptive) and simulated (non-adaptive) case (expecting a negative relationship – i.e. larger heads mean closer sensor proximity). The slope of the relationship was measured and we used a permutation test to determine whether the use of adaptable helmets significantly lowered the identified slope (i.e. do adaptable helmets significantly improve sensor proximity in those with smaller head circumference).
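
      The permutation scheme is not specified in this response, so the following is only one plausible way to test a difference between two slopes measured in the same subjects, not necessarily the procedure that was used. It relies on the fact that the difference between the two least-squares slopes equals the slope of the per-subject difference in distance, and it shuffles those differences across subjects to build the null distribution. All names and values are hypothetical.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def slope_difference_test(circumference, adaptive, fixed, n_perm=2000, seed=0):
    """Two-sided permutation p-value for slope(adaptive) - slope(fixed)."""
    rng = random.Random(seed)
    diff = [a - f for a, f in zip(adaptive, fixed)]
    observed = slope(circumference, diff)  # equals the difference of slopes
    shuffled = diff[:]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between subject and difference
        if abs(slope(circumference, shuffled)) >= abs(observed):
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): head circumference in cm, distances in mm.
circ = [46.0, 48.0, 50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
adaptive = [20.0, 19.5, 19.2, 18.9, 18.4, 18.1, 17.8, 17.5]
fixed = [40.0, 37.0, 35.0, 32.0, 30.0, 27.0, 25.0, 22.0]
observed, p_value = slope_difference_test(circ, adaptive, fixed)
```

      A positive observed value here means the adaptive-helmet slope is less negative than the single-helmet slope, i.e. proximity depends less on head size when helmets are matched to the head.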

      Results are shown in Figure R1. We found no measurable relationship between sensor proximity and age (r = -0.195; p = 0.171) in the case of the real helmets (panel A). When simulating a non-adaptable helmet, we did see a significant effect of age on scalp-to-sensor distance (r = -0.46; p = 0.001; panel B). This demonstrates the advantage of the adaptability of OPM-MEG; without the ability to flexibly locate sensors, we would have a significant confound of sensor proximity. 

      Plotting sensor proximity against head circumference we found a significant negative relationship in both cases (r = -0.37; p = 0.007 and r = -0.78; p = 0.000001); however, the difference between slopes was significant according to a permutation test (p < 0.025), suggesting that adaptability has indeed improved sensor proximity in those with smaller head circumference. This again shows the benefits of adaptability to head size.

      Author response image 1.

      Scalp-to-sensor distance as a function of age (A/B) and head circumference (C/D). A and C show the case for the real helmets; B and D show the simulated non-adaptable case.

      In sum, the ideal wearable system would see sensors located on the scalp surface, to get as close as possible to the brain in all subjects. Our system of multiple helmet sizes is not perfect in this regard (there is still a significant relationship between proximity and head circumference). However, our solution has offered a significant improvement over a (simulated) non-adaptable system. Future systems should aim to improve even further on this, either by using additively manufactured bespoke helmets for every subject (this is a gold standard, but also costly for large studies), or potentially adaptable flexible helmets.

      Burst amplitudes:

      The reviewer suggested to “demonstrate similar sensitivity across ages using OPM to a brain measure”. We decided not to use the evoked response amplitude (as suggested), since this would be expected to change with age. Instead, we used the amplitude of the bursts.

      Our manuscript shows a significant correlation between beta modulation and burst probability – implying that the stimulus-related drop in beta amplitude occurs because bursts are less likely to occur. Further, we showed significant age-related changes in both beta amplitude and burst probability leading to a conclusion that the age dependence of beta modulation was caused by changes in the likelihood of bursts (i.e. bursts are less likely to ’switch off’ during sensory stimulation in children). We have now extended these analyses to test whether burst amplitude also changes significantly with age – we reasoned that if burst amplitude remained the same in children and adults, this would not only suggest that beta modulation is driven by burst probability (distinct from burst amplitude), but also show directly that the beta effects we see are not attributable to a lack of sensitivity in younger people. 

      We took the (unnormalized) beamformer-projected electrophysiological time series from sensorimotor cortex and filtered it to 5-48 Hz (we chose this broad band because bursts are known to be pan-spectral and have lower frequency content in children; it captures most of the range of burst frequencies highlighted in our spectra). We then extracted the timings of the bursts and, for each burst, took the maximum projected signal amplitude. These values were averaged across all bursts in each subject, and plotted against age for all subjects.
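
      As a rough illustration of the per-burst amplitude measure described above, the sketch below takes an already filtered time series and a boolean burst mask (for example, a thresholded HMM state time course), finds each contiguous burst, and averages the per-burst peaks. The function names are hypothetical, and taking the peak of the absolute value is an assumption; the response only says "maximum projected signal amplitude".

```python
def burst_segments(mask):
    """Return (start, stop) index pairs for each contiguous run of True values.

    The stop index is exclusive, so signal[start:stop] is one burst.
    """
    segments = []
    start = None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        # The final burst runs to the end of the recording.
        segments.append((start, len(mask)))
    return segments


def mean_burst_amplitude(signal, mask):
    """Average, over bursts, of the peak absolute amplitude within each burst."""
    peaks = [max(abs(v) for v in signal[a:b]) for a, b in burst_segments(mask)]
    if not peaks:
        return float("nan")
    return sum(peaks) / len(peaks)
```

      Computing one such value per participant and regressing it on age would give the kind of plot shown in the response image.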

      Author response image 2.

      Beta burst amplitude as a function of age; (A) shows index finger stimulation trials; (B) shows little finger stimulation trials. In both cases there was no significant modulation of burst amplitude with age.

      Results (see Figure R2) showed that the amplitude of the beta burst showed no significant age-related modulation (R² = 0.01, p = 0.48 for the index finger and R² = 0.01, p = 0.57 for the little finger). This is distinct from both burst probability and task-induced beta modulation. This adds weight to the argument that the diminished beta modulation in children is not caused by a lack of sensitivity to the MEG signal and supports our conclusion that burst probability is the primary driver of the age-related changes in beta oscillations.

      Both of the above analyses have been added to our supplementary information and mentioned in the main manuscript. The first shows no confound of sensor proximity to the scalp with age in our study. The second shows that the bursts underlying the beta signal are not significantly lower amplitude in children – which we reasoned they would be if sensitivity was diminished at younger ages. We believe that the two together suggest that we have mitigated a sensitivity confound in our study.

      The data do not make a compelling case for the motion tolerance of OPM-MEG. Although an apparent advantage of a wearable system, an empirical demonstration is still lacking. How was motion tracked in these participants?

      We agree that this was a limitation of our experiment. 

      We have the equipment to track motion of the head during an experiment, using IR retroreflective markers placed on the helmet and a set of IR cameras located inside the MSR. However, the process takes a long time to set up, it lacks robustness, and would have required an additional computer (the one we typically use was already running the somatosensory stimulus and video). When the study was designed, we were concerned that the increased set up time for motion tracking would cause children to get bored, and result in increased participant drop out. For this reason we decided not to capture motion of the head during this study.

      With hindsight this was a limitation which – as the reviewer states – makes us unable to prove that motion robustness was a significant advantage for this study. That said, during scanning there was both a parent and an experimenter in the room for all of the children scanned, and anecdotally we can say that children tended to move their head during scans – usually to talk to the parent. Whilst this cannot be quantified (and is therefore unsatisfactory) we thought it worth mentioning in our discussion, which reads:

      “…One limitation of the current study is that practical limitations prevented us from quantitatively tracking the extent to which children (and adults) moved their head during a scan. Anecdotally however, experimenters present in the room during scans reported several instances where children moved, for example to speak to their parents who were also in the room. Such levels of movement could not be tolerated in conventional MEG or MRI and so this again demonstrates the advantages afforded by OPM-MEG…”

      As a note, empirical demonstrations of the motion tolerance of OPM-MEG have been published previously: Early demonstrations included Boto et al.2, who captured beta oscillations in adults playing a ball game, and Holmes et al., who measured visual responses as participants moved their head to change viewing angle3. In more recent demonstrations, Seymour et al. measured the auditory evoked field in standing mobile participants4; Rea et al. measured beta modulation as subjects carried out a naturalistic handwriting task5 and Holmes et al. measured beta modulation as a subject walked around a room.6

      Furthermore, while the introduction discusses at some length the phenomenon of PMBR, there is no demonstration of the recording of PMBR (or post-sensory beta rebound). This is a shame because there is literature suggesting an age-sensitivity to this, that the optimal sensitivity of OPM-MEG might confirm/refute. There is little evidence in Figure 3 for adult beta rebound. Is there an explanation for the lack of sensitivity to this phenomenon in children/adolescents? Could a more robust paradigm (button-press) have shed light on this?

      We understand the question. There are two limitations to the current study in respect to measuring the PMBR:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed. For this reason we opted for entirely passive stimulation, requiring no active engagement from our participants. The advantage of this was a stimulus that all subjects could engage with. However, this was at the cost of a diminished rebound.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s 7,8. Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline9, though this has rarely been adhered to in the literature. Here, we wanted to keep recordings short for the comfort of the younger participants, so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly; one can only measure beta modulation with the task. This limitation has now been addressed explicitly in our discussion:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation7–9. Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      Data on functional connectivity are valuable but do not rely on OPM recording. They further do not add strength to the argument that OPM MEG is more sensitive to brain activity in smaller heads - in fact, the OPM recordings seem plagued by the same insensitivity observed using conventional systems.

      Given the demonstration above that bursts are not significantly diminished in amplitude in children relative to adults; and further given the demonstrations in the literature (e.g. Seedat et al.10) that functional connectivity is driven by bursts, we would argue that the effects of connectivity changing with age are not related to sensitivity but rather genuinely reflect a lack of coordination of brain activity.

      The discussion of burst vs oscillations, while highly relevant in the field, is somewhat independent of the OPM recording approach and does not add weight to the OPM claims.

      We agree that the burst vs. oscillations discussion does not add weight to the OPM claims per se. However, we had two aims of our paper, the second being to “investigate how task-induced beta modulation in the sensorimotor cortices is related to the occurrence of pan-spectral bursts, and how the characteristics of those bursts change with age.” As the reviewer states, this is highly relevant to the field, and therefore we believe adds impact, not only to the paper, but also by extension to the technology.

      In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance, neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      We thank the referee for their time and important contributions to this paper. We believe the fact that we were able to record good data in children as young as two years old was, in itself, an experimental realisation of the ‘theoretical advantages’ of OPM-MEG. Our additional analyses, inspired by the reviewer’s comments, help to clarify the advantages of OPM-MEG over conventional technology. The reviewers’ insights have without doubt improved the paper.

      Reviewer #2 (Public Review):

      Summary:

      The authors introduce a new 192-channel OPM system that can be configured using different helmets to fit individuals from 2 to 34 years old. To demonstrate the veracity of the system, they conduct a sensorimotor task aimed at mapping developmental changes in beta oscillations across this age range. Many past studies have mapped the trajectory of beta (and gamma) oscillations in the sensorimotor cortices, but these studies have focused on older children and adolescents (e.g., 9-15 years old) and used motor tasks. Thus, given the study goals, the choice of a somatosensory task was surprising and not justified. The authors recorded a final sample of 27 children (2-13 years old) and 24 adults (21-34 years) and performed a time-frequency analysis to identify oscillatory activity. This revealed strong beta oscillations (decreases from baseline) following the somatosensory stimulation, which the authors imaged to discern generators in the sensorimotor cortices. They then computed the power difference between the 0.3-0.8 s period and the 1.0-1.5 s post-stimulation period and showed that the beta response became stronger with age (more negative relative to the stimulation period). Using these same time windows, they computed the beta burst probability and showed that this probability increased as a function of age. They also showed that the spectral composition of the bursts varied with age. Finally, they conducted a whole-brain connectivity analysis. The goals of the connectivity analysis were not as clear, as prior studies of sensorimotor development have not conducted such analyses, and typically such whole-brain connectivity analyses are performed on resting-state data, whereas here the authors performed the analysis on task-based data. In sum, the authors demonstrate that they can image beta oscillations in young children using OPM and discern developmental effects.

      Thank you for this summary and for taking the time to review our manuscript.

      Strengths:

      Major strengths of the study include the novel OPM system and the unique participant population going down to 2-year-olds. The analyses are also innovative in many respects.

      Thank you – we also agree that the major strength is in the unique cohort.

      Weaknesses:

      Several weaknesses currently limit the impact of the study. 

      First, the choice of a somatosensory stimulation task over a motor task was not justified. The authors discuss the developmental motor literature throughout the introduction, but then present data from a somatosensory task, which is confusing. Of note, there is considerable literature on the development of somatosensory responses so the study could be framed with that.

      We completely understand the referee’s point, and we agree that the motivation for the somatosensory task was not made clear in our original manuscript.

      Our choice of task was motivated completely by our targeted cohort; whilst a motor task would have been our preference, it was generally felt that making two-year-olds comply with instructions to press a button would have been a significant challenge. In addition, there would likely have been differences in reaction times. By opting for a passive sensory stimulation we ensured compliance, and the same stimulus for all subjects. We have added text on this to our introduction as follows:

      “…Here, we combine OPM-MEG with a burst analysis based on a Hidden Markov Model (HMM) 10–12 to investigate beta dynamics. We scanned a cohort of children and adults across a wide age range (upwards from 2 years old). Because of this, we implemented a passive somatosensory task which can be completed by anyone, regardless of age…”

      We also state in our discussion:

      “…here we chose to use passive (sensory) stimulation. This helped ensure compliance with the task in subjects of all ages and prevented confounds of e.g. reaction time, force, speed and duration of movement which would be more likely in a motor task.7,8 However, there are many other systems to choose and whether the findings here regarding beta bursts and the changes with age also extend to other brain networks remains an open question.…”

      Regarding the neurodevelopmental literature – we are aware of the literature on somatosensory evoked responses – particularly median nerve stimulation – but we can find little on the neurodevelopmental trajectory of somatosensory induced beta oscillations (the topic of our paper). We have edited our introduction as follows:

      “…All these studies probed beta responses to movement execution; in the case of tactile stimulation (i.e. sensory stimulation without movement) both task induced beta power loss, and the post stimulus rebound have been consistently observed in adults9,13–18. Further, beta amplitude in sensory cortex has been related to attentional processes19 and is broadly thought to carry top-down influence on primary areas20. However, there is less literature on how beta modulation changes with age during purely sensory tasks.…”

      We would be keen for the reviewer to point to any specific papers in the literature that we may have missed.

      Second, the primary somatosensory response actually occurs well before the time window of interest in all of the key analyses. There is an established literature showing mechanical stimulation activates the somatosensory cortex within the first 100 ms following stimulation, with the M50 being the most robust response. The authors focus on a beta decrease (desynchronization) from 0.3-0.8 s which is obviously much later, despite the primary somatosensory response being clear in some of their spectrograms (e.g., Figure 3 in older children and adults). This response appears to exhibit a robust developmental effect in these spectrograms so it is unclear why the authors did not examine it. This raises a second point; to my knowledge, the beta decrease following stimulation has not been widely studied and its function is unknown. The maps in Figure 3 suggest that the response is anterior to the somatosensory cortex and perhaps even anterior to the motor cortex. Since the goal of the study is to demonstrate the developmental trajectory of well-known neural responses using an OPM system, should the authors not focus on the best-understood responses (i.e., the primary somatosensory response that occurs from 0.0-0.3 s)?

      We understand the reviewer’s point. The original aim of our manuscript was to investigate the neurodevelopmental trajectory of beta oscillations, not the evoked response. In fact, the evoked response in this paradigm is complicated by the fact that there are three stimuli in a very short (<500 ms) time window. For this reason, we prefer the focus of our paper to remain on oscillations.

      Nevertheless, we agree that not including the evoked responses was a missed opportunity.  We have now added evoked responses to our analysis pipeline and manuscript. As surmised by the reviewer, the M50 shows neurodevelopmental changes (an increase with age). Our methods section has been updated accordingly and Figure 3 has been modified. The figure and caption are copied below for the convenience of the reviewer.

      Author response image 3.

      Beta band modulation with age: (A) Brain plots show slices through the left motor cortex, with a pseudo-T-statistical map of beta modulation (blue/green) overlaid on the standard brain. Peak MNI coordinates are indicated for each subgroup. Time frequency spectrograms show modulation of the amplitude of neural oscillations (fractional change in spectral amplitude relative to the baseline measured in the 2.5-3 s window). Vertical lines indicate the time of the first braille stimulus. In all cases results were extracted from the location of peak beta desynchronisation (in the left sensorimotor cortex). Note the clear beta amplitude reduction during stimulation. The inset line plots show the 4-40 Hz trial averaged phase-locked evoked response, with the expected prominent deflections around 20 and 50 ms. (B) Maximum difference in beta-band amplitude (0.3-0.8 s window vs 1-1.5 s window) plotted as a function of age (i.e., each data point shows a different participant; triangles represent children, circles represent adults). Note significant correlation (R² = 0.29, p = 0.00004 *). (C) Amplitude of the P50 component of the evoked response plotted against age. There was no significant correlation (R² = 0.04, p = 0.14). All data here relate to the index finger stimulation; similar results are available for the little finger stimulation in Supplementary Information Section 1.

      Regarding the developmental effects, the authors appear to compute a modulation index that contrasts the peak beta window (.3 to .8) to a later 1.0-1.5 s window where a rebound is present in older adults. This is problematic for several reasons. First, it prevents the origin of the developmental effect from being discerned, as a difference in the beta decrease following stimulation is confounded with the beta rebound that occurs later. A developmental effect in either of these responses could be driving the effect. From Figure 3, it visually appears that the much later rebound response is driving the developmental effect and not the beta decrease that is the primary focus of the study. Second, these time windows are a concern because a different time window was used to derive the peak voxel used in these analyses. From the methods, it appears the image was derived using the .3-.8 window versus a baseline of 2.5-3.0 s. How do the authors know that the peak would be the same in this other time window (0.3-0.8 vs. 1.0-1.5)? Given the confound mentioned above, I would recommend that the authors contrast each of their windows (0.3-0.8 and 1.0-1.5) with the 2.5-3.0 window to compute independent modulation indices. This would enable them to identify which of the two windows (beta decrease from 0.3-0.8 s or the increase from 1.0-1.5 s) exhibited a developmental effect. Also, for clarity, the authors should write out the equation that they used to compute the modulation index. The direction of the difference (positive vs. negative) is not always clear.

      We completely understand the referee’s point; referee 1 made a similar point. In fact, there are two limitations of our paradigm regarding the measurement of PMBR versus the task-induced beta decrease:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, as described above it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s7,8. Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline9. Here, we wanted to keep recordings relatively short for the younger participants, and so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly because the PMBR of the nth trial is still ongoing when the (n+1)th trial begins. Because of this, there is no genuine rest period, and so the stimulus induced beta decrease and subsequent rebound cannot be disentangled. This limitation has now been made clear in our discussion as follows:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation7–9. Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      To clarify our method of calculating the modulation index, we have added the following statement to the methods:

      “The beta modulation index was calculated using the equation , where , and are the average Hilbert-envelope-derived amplitudes in the stimulus (0.3-0.8s), post-stimulus (1-1.5s) and baseline (2.5-3s) windows, respectively.”
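
      To make the window-based calculation concrete, here is a minimal sketch. The equation itself did not survive in the text above, so the form used here, (post-stimulus − stimulus) / baseline, is an assumption and not the authors' stated formula; in particular, the sign convention may differ. The function names are hypothetical, and the envelope is assumed to be a trial-averaged Hilbert-envelope amplitude whose first sample is at trial onset (t = 0).

```python
def window_mean(envelope, fs, t_start, t_end):
    """Mean envelope amplitude between t_start and t_end (seconds, end exclusive)."""
    i0 = int(round(t_start * fs))
    i1 = int(round(t_end * fs))
    segment = envelope[i0:i1]
    return sum(segment) / len(segment)


def beta_modulation_index(envelope, fs,
                          stim=(0.3, 0.8), post=(1.0, 1.5), base=(2.5, 3.0)):
    """Assumed form: (A_post - A_stim) / A_base.

    The windows follow the values quoted in the response; the way they are
    combined is an illustrative guess, not the authors' equation.
    """
    a_stim = window_mean(envelope, fs, *stim)
    a_post = window_mean(envelope, fs, *post)
    a_base = window_mean(envelope, fs, *base)
    return (a_post - a_stim) / a_base
```

      Under this assumed convention, a larger positive value means a deeper drop during stimulation relative to the post-stimulus window, scaled by the baseline amplitude.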

      Another complication of using a somatosensory task is that the literature on bursting is much more limited and it is unclear what the expectations would be. Overall, the burst probability appears to be relatively flat across the trial, except that there is a sharp decrease during the beta decrease (.3-.8 s). This matches the conventional trial-averaging analysis, which is good to see. However, how the bursting observed here relates to the motor literature and the PMBR versus beta ERD is unclear.

      Again, we agree completely; a motor task would have better framed the study in the context of existing burst literature – but as mentioned above, making 2-year-olds comply with the instructions for a motor task would have been difficult. Interestingly in a recent paper, Rayson et al. used EEG to investigate burst activity in infants (9 and 12 months) and adults during observed movement execution, with results showing stimulus induced decrease in beta burst rate at all ages, with the largest effects in adults21. This paper was not yet published when we submitted our article but does help us to frame our burst results since there is strong agreement between their study and ours. We now mention this study in both our introduction and discussion. 

      Another weakness is that all participants completed 42 trials, but 19% of the trials were excluded in children and 9% were excluded in adults. The number of trials is proportional to the signal-to-noise ratio. Thus, the developmental differences observed in response amplitude could reflect differences in the number of trials that went into the final analyses.

      This is an important observation and we thank the reviewer for raising the issue. We have now re-analysed all of our data, removing trials in the adults such that the overall number of trials was the same as for the children. All effects with age remained significant. We chose to keep the Figures in the main manuscript with all good trials (as previously) and present the additional analyses (with matched trial numbers) in supplementary information. However, if the reviewer feels strongly, we could do it the other way around (there is very little difference between the results).
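
      The trial-matching step can be sketched as follows. How trials were removed is not stated in this response, so random subsampling without replacement is an assumption, and `match_trial_count` is a hypothetical helper name.

```python
import random


def match_trial_count(trials, n_keep, seed=0):
    """Randomly keep n_keep trials, preserving their original order.

    Assumed approach (not confirmed by the source): trials are dropped at
    random so that an adult recording contributes the same number of trials
    as a typical child recording.
    """
    if n_keep >= len(trials):
        return list(trials)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(trials)), n_keep))
    return [trials[i] for i in keep]
```

      For example, with 42 trials recorded and roughly 19% rejected in children versus 9% in adults, an adult dataset of about 38 good trials would be cut down to about 34 before re-running the age analyses.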

      Reviewer #3 (Public Review):

      This study demonstrated the application of OPM-MEG in neurodevelopment studies of somatosensory beta oscillations and connections with children as young as 2 years old. It provides a new functional neuroimaging method that has a high spatial-temporal resolution as well as being wearable, which makes it a useful new tool for studies in young children. They have constructed a 192-channel wearable OPM-MEG system that includes field compensation coils which allow free head movement scanning with a relatively high ratio of usable trials. Beta band oscillations during somatosensory tasks are well localized and the modulation with age is found in the amplitude, connectivity, and panspectral burst probability. It is demonstrated that the wearable OPM-MEG could be used in children as a quite practical and easy-to-deploy neuroimaging method with performance as good as conventional MEG. With both good spatial (several millimeters) and temporal (milliseconds) resolution, it provides a novel and powerful technology for neurodevelopment research and clinical applications not limited to somatosensory areas.

      We thank the reviewer for their summary, and their time in reviewing our manuscript.

      The conclusions of this paper are mostly well supported by data acquired under the proper method. However, some aspects of data analysis need to be improved and extended.

      (1) The colour bars selected for the pseudo-T-statistic pictures of beta modulation in Figures 2 and 3, which are blue/black and red/black, are not easily distinguished from the anatomical images which are grey-scale. A colour bar without black/white would make these figures better. The peak point locations are also suggested to be marked in Figure 2 and averaged locations in Figure 3 with an error bar.

      Thank you for this comment which we certainly agree with. The colour scheme used has now been changed to avoid black. We have also added peak locations. 

      (2) The data points in plots are not constant across figures. In Figures 3 and 5, they are classified into triangles and circles for children and adults, but all are circles in Figures 4 and 6.

      Thank you! We apologise for the confusion. Data points are now consistent across plots.

      (3) Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward modelling may still be impacted by the small head profile. Add more information about source localization accuracy and stability across ages or head size.

      This is an excellent point. We have added to our discussion relating to the accuracy of the forward model. 

      “…We failed to see a significant difference in the spatial location of the cortical representations of the index and little finger; there are three potential reasons for this. First, the system was not designed to look for such a difference – sensors were sparsely distributed to achieve whole head coverage (rather than packed over sensory cortex to achieve the best spatial resolution in one area22). Second, our “pseudo-MRI” approach to head modelling (see Methods) is less accurate than acquisition of participant-specific MRIs, and so may mask subtle spatial differences. Third, we used a relatively straightforward technique for modelling magnetic fields generated by the brain (a single shell forward model). Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward model may still be impacted by the small head profile. This may diminish spatial resolution and future studies might look to implement more complex models based on e.g. finite element modelling23. Finally, previous work 24 suggested that, for a motor paradigm in adults, only the beta rebound, and not the power reduction during stimulation, mapped motortopically. This may also be the case for purely sensory stimulation. Nevertheless, it remains the case that by placing sensors closer to the scalp, OPM-MEG should offer improved spatial resolution in children and adults; this should be the topic of future work…”

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Major items to further test include the differing number of trials, the windowing issue, and the focus on motor findings in the intro and discussion. First, I would recommend the authors adjust the number of trials in adults to equate them between groups; this will make their developmental effects easier to interpret.  

      Thank you for raising this important point. This has now been done and appears in our supplementary information as discussed above.

      Second, to discern which responses are exhibiting developmental effects, the authors need to contrast the 0.3-0.8 window with the later window (2.5-3.0), not the window that appears to have the PMBR-like response. This artificially accentuates the response. I also think they should image the 1.0-1.5 vs 2.5-3.0s window to determine whether the response in this time window is in the same location as the decrease and then contrast this for beta differences. 

      We completely understand this point, which relates to separating the reduction in beta amplitude during stimulation and the rebound post stimulation. However, as explained above, doing so unambiguously would require the use of much longer trials. Here we were only able to measure stimulus induced beta modulation (distinct from the separate contributions of the task induced beta power reduction and rebound). It may be that future studies, with >10 s trial length, could probe the role of the PMBR, but such studies require long paradigms which are challenging to implement with children.

      Third, changing the framing of the study to highlight the somatosensory developmental literature would also be an improvement.

      We have added to our introduction as stated in the responses above.

      Finally, the connectivity analysis on data from a somatosensory task did not make sense given the focus of the study and should be removed in my opinion. It is very difficult to interpret given past studies used resting state data and one would expect the networks to dynamically change during different parts of the current task (i.e., stimulation versus baseline).

      We appreciate the point regarding connectivity. However, it was our intention to examine the developmental trajectory of beta oscillations, and a major role of beta oscillations is in mediating connectivity. It is true that most studies are conducted in the resting state (or more recently – particularly in children – during movie watching). The fact that we had a sensory task running is a confound; nevertheless, the connectivity we derived in adults bears a marked similarity to that from previous papers (e.g. 25) and we do see significant changes with age. We therefore believe this to be an important addition to the paper and we would prefer to keep it.

      References

      (1) Holmes, N., Bowtell, R., Brookes, M. J. & Taulu, S. An Iterative Implementation of the Signal Space Separation Method for Magnetoencephalography Systems with Low Channel Counts. Sensors 23, 6537 (2023).

      (2) Boto, E. et al. Moving magnetoencephalography towards real-world applications with a wearable system. Nature (2018) doi:10.1038/nature26147.

      (3) Holmes, N. et al. A bi-planar coil system for nulling background magnetic fields in scalp mounted magnetoencephalography. NeuroImage 181, 760–774 (2018).

      (4) Seymour, R. A. et al. Using OPMs to measure neural activity in standing, mobile participants. NeuroImage 244, 118604 (2021).

      (5) Rea, M. et al. A 90-channel triaxial magnetoencephalography system using optically pumped magnetometers. Annals of the New York Academy of Sciences 1517, https://doi.org/10.1111/nyas.14890 (2022).

      (6) Holmes, N. et al. Enabling ambulatory movement in wearable magnetoencephalography with matrix coil active magnetic shielding. NeuroImage 274, 120157 (2023).

      (7) Pakenham, D. O. et al. Post-stimulus beta responses are modulated by task duration. NeuroImage 206, 116288 (2020).

      (8) Fry, A. et al. Modulation of post-movement beta rebound by contraction force and rate of force development. Human Brain Mapping 37, 2493–2511 (2016).

      (9) Pfurtscheller, G. & Lopes da Silva, F. H. Event-related EEG/MEG synchronization and desynchronization: Basic principles. Clin Neurophysiol 110, 1842–1857 (1999).

      (10) Seedat, Z. A. et al. The role of transient spectral ‘bursts’ in functional connectivity: A magnetoencephalography study. NeuroImage 209, 116537 (2020).

      (11) Baker, A. P. et al. Fast transient networks in spontaneous human brain activity. eLife 2014, 1867 (2014).

      (12) Vidaurre, D. et al. Spectrally resolved fast transient brain states in electrophysiological data. NeuroImage 126, 81–95 (2016).

      (13) Gaetz, W. & Cheyne, D. Localization of sensorimotor cortical rhythms induced by tactile stimulation using spatially filtered MEG. NeuroImage 30, 899–908 (2006).

      (14) Cheyne, D. et al. Neuromagnetic imaging of cortical oscillations accompanying tactile stimulation. Cognitive Brain Research 17, 599–611 (2003).

      (15) van Ede, F., Jensen, O. & Maris, E. Tactile expectation modulates pre-stimulus β-band oscillations in human sensorimotor cortex. NeuroImage 51, 867–876 (2010).

      (16) Salenius, S., Schnitzler, A., Salmelin, R., Jousmäki, V. & Hari, R. Modulation of Human Cortical Rolandic Rhythms during Natural Sensorimotor Tasks. NeuroImage 5, 221–228 (1997).

      (17) Cheyne, D. O. MEG studies of sensorimotor rhythms: A review. Experimental Neurology 245, 27–39 (2013).

      (18) Kilavik, B. E., Zaepffel, M., Brovelli, A., MacKay, W. A. & Riehle, A. The ups and downs of beta oscillations in sensorimotor cortex. Experimental Neurology 245, 15–26 (2013).

      (19) Bauer, M., Oostenveld, R., Peeters, M. & Fries, P. Tactile Spatial Attention Enhances Gamma-Band Activity in Somatosensory Cortex and Reduces Low-Frequency Activity in Parieto-Occipital Areas. J. Neurosci. 26, 490–501 (2006).

      (20) Barone, J. & Rossiter, H. E. Understanding the Role of Sensorimotor Beta Oscillations. Frontiers in Systems Neuroscience 15, (2021).

      (21) Rayson, H. et al. Bursting with Potential: How Sensorimotor Beta Bursts Develop from Infancy to Adulthood. J Neurosci 43, 8487–8503 (2023).

      (22) Hill, R. M. et al. Optimising the Sensitivity of Optically-Pumped Magnetometer Magnetoencephalography to Gamma Band Electrophysiological Activity. Imaging Neuroscience (2024) doi:10.1162/imag_a_00112.

      (23) Stenroos, M., Hunold, A. & Haueisen, J. Comparison of three-shell and simplified volume conductor models in magnetoencephalography. NeuroImage 94, 337–348 (2014).

      (24) Barratt, E. L., Francis, S. T., Morris, P. G. & Brookes, M. J. Mapping the topological organisation of beta oscillations in motor cortex using MEG. NeuroImage 181, 831–844 (2018).

      (25) Rier, L. et al. Test-Retest Reliability of the Human Connectome: An OPM-MEG study. Imaging Neuroscience (2023) doi:10.1162/imag_a_00020.